 # Optimization algorithm selection for process applications

• By R. Russell Rhinehart
• Process Automation
##### How to match the optimizer to application attributes By R. Russell Rhinehart

Optimization seeks to find the best. It could be to design a process that minimizes capital or maximizes material conversion, to choose operating conditions that maximize throughput or minimize waste, to tune a controller that minimizes rise time, or to determine a future control sequence that minimizes deviation from the set point while avoiding a constraint.

In any application, there are usually several desirables to be maximized and several undesirables to be minimized. The objective function (OF) is the expression used to calculate a value for the overall desirability. Elements of the OF could include the sum of squared deviations in model regression, a controlled variable variance in steady periods, the profitability index in a plant design, or the probability of a worst-case scenario when there is uncertainty.

The optimizer seeks to adjust decision variable (DV) values to result in the best OF value. Usually the DVs are iteratively moved toward the optimum, and each DV choice is termed a trial solution (TS). The search is initialized with a TS, often user defined. The optimum values are represented by DV* and OF*.

Because the optimization procedure incrementally approaches the DV* values, and is stopped when close enough, the DV* and OF* values are not the ideal best. They are in close proximity to the ideal. There is no sense in trying to find the perfect solution. If, for example, the molecular weight of water of 18 is used, the DV* values will change if the more perfect value of 18.0153 is used. Models are imperfect. Users need to understand model imperfection, then they can determine what is justified to claim convergence.

As a simple example of this terminology, the objective is to minimize this function z = (x−2.345)2 + (y−3.210)2 by changing the x and y values. In optimization terminology, the independent variables, x and y, are the DVs; the calculation defined by (x−2.345)+ (y−3.210)2 is the OF; and the value of the response variable, z, is the OF value. There might be a constraint, for instance x must be greater than y. Putting all this together in standard notation, the statement is:

min

{x,y} J = (x−2.345)+ (y−3.210)2         (1)

S.T: x > y

Here, J is the OF. The term min means to seek the minimum. The DVs are listed in brackets under the min term, and S.T: means “subject to,” which states the constraints. Any relevant application will be much more complicated, but it has the same elements.

Equation (1) is not the solution. Nor does it reveal how to obtain the DV* values. It is simply a standard presentation of the elements.

Is the coefficient 2.345 exactly right? Model coefficients associated with viscosity, density, heat transfer, feed composition, etc., all have uncertainty. Uncertainty in the “givens” will propagate to uncertainty of the optimized results. Mathematical perfection in optimization is only as justified as the OF calculation is complete and true.

The user needs to choose the optimizer, and the application characteristics should drive the optimization algorithm selection, not the mathematical appeal, academic fashion, or traditional company choice. Further, the user needs to define the OF, choose the DVs, choose a method of initializing the TS, decide how to handle constraints, decide on convergence criteria, etc. Again, the choices need to be right for the application.

#### Common optimization applications

Regression with phenomenological models: Models derived from a first principles, mechanistic view of the cause-and-effect relationships of a process or product could be termed engineering models. Some model coefficient values have uncertain values, and regression (adjusting model coefficient values to match the data) is a nonlinear optimization with a quadratic-ish OF.

Empirical modeling: These models do not have a first-principles grounding. They could be power series relations, neural networks, first-order-plus-delay, finite impulse response models, or such. Again, nonlinear optimization is used with a quadratic-ish OF.

Trends and associations: The new era of big data and machine learning seeks to identify trends or relations within data using empirical models. The OF response is usually quadratic.

Design: In process design, optimize profitability, flexibility, reliability, safety, etc. In product design, minimize product cost, while meeting functional specifications. These models are usually nonlinear, and the DVs often include class or discretized variables (tray or packed tower distillation, or number of alternating layers, integration time steps). Additionally, coefficient values (future prices) are often uncertain.

Process control: In advanced process control applications, an optimizer defines future control action to best keep the controlled variables at set points while avoiding constraints. Here models are generally linear, but, supporting nonlinear applications include data reconciliation and identification of events/faults.

Scheduling, blending, and managing growth: These relate to allocation or timing, and generally have OF models and constraints that are both linear responses to the DVs.

Process operation optimization: Real-time optimization seeks to determine set points for the controllers to maximize efficiency, conservation, and throughput, while minimizing expenses, waste, variability, risk, and constraints. The models may either be linear or nonlinear.

#### OF model classification

Linear models: If the mathematical form of the DVs in the OF and in the constraints is linear, then the optimal solution will be on an intersection of constraints.

Quadratic models: If the mathematical form of the DVs in the models is independent and linear, and the OF is a quadratic (squared) function of the model, then the solution is easy with classic methods.

Nonlinear: There are many ways the OF might not be a linear response to the DVs. Most process models are nonlinear and nonquadratic. This creates a DV-to-OF relation that confounds linear or second-order optimization techniques.

Deterministic or stochastic: Most OF responses are deterministic, meaning the model returns the same OF value for a particular TS value. The answer to a deterministic question such as “What is 3 × 4?” is always 12. By contrast, experimental noise and other process vagaries lead to continually varying results. When replicate tests give different values, the response is termed stochastic. The issue is that when the TS is moving in the right direction, the improvement may be masked by the fluctuation, and make it seem like the wrong direction. Stochastic responses also arise when optimizing deterministic models in the presence of coefficient uncertainty.

Integer or discretized: Whether the DV or the OF is discretized, the OF response to the DV has flat spots surrounded by vertical cliffs.

#### Difficulties for optimizers

Flat spots, or nearly so: If the OF response to the DV has flat spots, these are places where any small change in the DV has no impact on the OF. Here most optimizers will think that they have found the optimum, will converge on any of the flat spot locations, and misrepresent the situation. Figure 1 indicates that the minimum is at about DV=5, but since the OF has flat spots, the optimizer might claim convergence at DV values of 1.5, 2, 9, or many others.

Ridges, and gently sloping floor within steep walls: Imagine a deep river gorge with steep walls surrounding the river. Many applications have such a relation between DV values and the OF. Optimizers that use the steepest descent (gradient-based) will tend to zigzag across the river, not follow it downstream.

Discontinuities: Conditionals (IF-THEN-ELSE relations) in a function cause discontinuity. A common one is the transition in pressure drop models from laminar to turbulent flow. The conditionals may be related to any number of choices, for instance selecting the larger size when determined by several criteria.

Striations: Imagine a smooth surface, a pasture with gentle hills, which has a downward slope to the stream. Then be a farmer and plant rows of corn in a contour manner. Rainwater will not go across the striations toward the stream, but will stay in local valleys as defined by the rows. In optimization, there may be a concept of a continuum surface, but numerical discretization in either integrating or solving differential equations will create such striations on the surface, and many optimizers will follow the rows to local minima on the hill, not cross over them to find the true optimum.

Figure 2a represents a contour map of a 2-DV case. The minimum is about DV1=5, DV4=4. But the calculation is of a time-discretized simulation and has striations. These are somewhat recognizable from the kinks in the contour lines. The close-up 3-D view in figure 2b shows the ridges on the surface. The connect-the-dots vertical line in figure 2a is the path a Levenberg-Marquardt optimizer took when starting at DV1=4.2. It could not cross up over the local striation to get to the minimum at about (5.5, 3.5).

Planar, or nearly so: Many of the “best” optimizers presume that the entire surface is nearly quadratic (proportional to the DV squared). These include Newton’s and successive quadratic methods. But if the OF response to the DVs is linear in some region, or nearly so, then these methods will tend to jump to extreme values.

Multi-optima: Imagine hail pings on what should be a smooth hood of a car. Normally, rainwater will run off the hood, but the dings will trap water in local minima. Many nonlinear functions have local optima, that trap an optimizer. In the local optima, any direction is worse, and an optimizer will claim convergence, misrepresenting the big picture. Run an optimizer many times from randomized initializations to detect local traps.

Underspecified: In a new application, it is not uncommon for a user to have redundant DVs. As a very simple illustration, consider determining x and y values to minimize (x+y−10)2. Solutions for (x,y) include (1,9), (5,5), and (50,−40). All give the identical best OF value of zero. This means that the user can make a choice between solutions, and if one is better than another, then that additional concept of better should be included in the OF statement. Run an optimizer many times from randomized DV initialization values, and if there are different DV* solutions with the same OF* value, then you have underspecified the statement.

Stochastic: Contrasting a deterministic function that always returns the same value for the same input, a stochastic function returns a random value. When we are seeking to optimize within business or future uncertainty, or if the objective function is determined by an experiment, the surface can be stochastic. The probability distribution may be analyzed statistically, but the outcome may not be predicted precisely. Averaging replicate response values will temper, but not eliminate, the misdirection.

Inflections, saddles, asymptotic leveling: At the minimum, the derivatives of OF w.r.t. DVs are zero. Second-order optimizers seek this point. But they can also be misdirected to points of maximum, saddle points, or to a faraway asymptotic leveling. When using Newton-based or successive quadratic methods, check to see if the solution is a max, min, saddle, or asymptotic extreme. Figure 3 illustrates such a case. The minimum is at a DV value of about 1, but the function asymptotically approaches a constant at DV values of 10 or more. Optimizers that are seeking a point where the derivative is zero often find the high DV values, misrepresenting the solution, or encountering constraints.

Constraints: These may be inequality or equality conditions on DVs, OF, or auxiliary variables. If they are “hard” constraints, they must not be violated. For instance, do not exceed the lower explosive level, or do not ask the computer to take the square root of a negative number. But if they are “soft,” a bit of violation is permitted. For instance, keep the level under 80 percent full. Here, a penalty for constraint violation is usually added to the OF. Hard constraints are a confounding barrier to a steepest descent algorithm. It would want to cross over the constraint. Figure 1. Illustration of flat spots.

#### Algorithm classification

Gradient: Gradient-based means that the search logic is based on a model of the slope of the surface. Steepest descent, successive quadratic, and Newton-type methods are of these types.

Direct search: By contrast, heuristic, or direct search methods, use improvement in trial solutions to guide the next trial solution without a mathematical model of the surface. Hooke-Jeeves, Nelder-Mead, particle swarm, and leapfrogging are of these types.

Steepest descent: Steepest descent algorithms take incremental steps downhill. These include many of the direct search algorithms as well as some gradient-based (Cauchy’s sequential line search and incremental steepest descent).

Second-order: By contrast, second-order algorithms model the surface with a quadratic surrogate model then jump to the model minimum. These include Newton-like or successive quadratic algorithms. If the application matches the surrogate model, this leap can be very accurate, but second-order methods can jump to absurd locations. Levenberg-Marquardt is one of many approaches that blend the reliable (but slower) steepest descent with the impetuous second-order jump.

Conjugate gradient: To temper zig-zag movement across a ridge or steep valley, this approach uses an average of recent slopes.

Constrained: Constraints might be on the DV, the OF, an auxiliary variable (AV), the rate of change, or a future value. Constraints block the optimizer TS path, and one solution is to convert a constraint violation to a penalty in the OF. This soft constraint can permit a bit of constraint violation, the magnitude of which is dependent on the weighting and functionality assigned to the penalty. Some algorithms search along the constraint. Generalized reduced gradient is one. It linearizes the constraint about the current TS, to permit an analytical solution for the next TS, then relinearize. This too, can permit some constraint violation.

Unconstrained: Classic algorithms (Newton-types) are designed for unconstrained applications.

Single TS: A single TS optimizer evolves that single value with each iteration, and may move to a local optimum, not the global. To solve this issue, the optimization can be redone from many randomized initial TS values to see the distribution of local optima and to increase the chance of finding the global.

Multiplayer: A multiplayer algorithm uses many simultaneous TS points to direct the next player move. Multiplayer algorithms (particle swarm, leapfrogging, differential evolution) broadly explore the surface, increasing the probability that the global will be discovered.

Interior/exterior point: Linear programming (LP) is a very efficient search. It presumes that the optimal solution is on a confluence of constraints, on the exterior boundary. However, if the optimum is in the interior of the DV range, LP will not find it. Figure 2a. Contour and optimizer path.

#### Initialization

If there are multiple optima, one run of the optimizer might find a local, not global optima. If you know the proximity of the global solution, initialize the TS there. Otherwise use multiple runs from random TS initializations and take the best of N trials as the solution. To be c confident that one of the best fraction, f, of all possible solutions will be found, calculate N=Ln(1−c)/Ln(1−f). For example, to be 99 percent confident (c = 0.99) that after N randomized trials at least one result will be in the best 5 percent of all possible results (f = 0.05), N = Ln(1−.99)/Ln(1−.05) ≅ 90 trials. Figure 2b. 3-D detail of the central part of figure 2a.

#### Perspective on what is important

Critically important to optimization are aspects of the problem statement: the fidelity of the model and the constraints to the application, handling constraints (method and weighting), and appropriately combining the desirables and undesirables to structure the OF, including uncertainty.

Additionally, as important as the choice of optimizer are choosing a criterion for convergence that makes the answer close enough without excessive iterations, initializing the search perhaps in the proximity of the prior solution to minimize iterations, and ensuring that the procedure is not trapped in a local optimum.

The solution validity depends on those choices. A user should not think that a true perfect solution is found. When you see the DV* and OF* results, reconsider the problem statement. Often, reflection will guide you to alter your choice of “givens,” DVs, constraints, models, desirables and undesirables, and convergence criterion. Do not look at optimization as a mathematical challenge-game. Return to the business focus, and see how to remove/diminish constraints, alter the OF, change the givens, restructure the approach, etc. Optimize your optimization exercise! Figure 3. Asymptotic leveling.

#### Matching algorithms to application characteristics

Within any application category or set of characteristics there are many optimization algorithms that are equivalently effective. Criteria for algorithm preference include robustness to surface features, probability of finding the global optimum, simplicity for user setup, number of function evaluations (speed) of finding the optimum, and answer precision. My experience indicates that multiplayer direct search algorithms are generally best, but I understand that convenience of use and familiarity may lead to alternate preferences. Here is a table that matches optimization algorithms best suited to particular application classes. Table 1. Matching optimizer with application

#### Acknowledgment

The author appreciates comments from Jeffrey Arbogast (Group Manager, Computational & Data Science R&D, Air Liquide) that shaped the content of this article.