Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

CMA-ES

From Wikipedia, the free encyclopedia
Evolutionary algorithm

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy fornumerical optimization.Evolution strategies (ES) arestochastic,derivative-free methods fornumerical optimization of non-linear or non-convexcontinuous optimization problems. They belong to the class ofevolutionary algorithms andevolutionary computation. Anevolutionary algorithm is broadly based on the principle ofbiological evolution, namely the repeated interplay of variation (viarecombination andmutation) and selection: in each generation (iteration) new individuals (candidate solutions, denoted asx{\displaystyle x}) are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness orobjective function valuef(x){\displaystyle f(x)}. Like this, individuals with better and betterf{\displaystyle f}-values are generated over the generation sequence.

In anevolution strategy, new candidate solutions are usually sampled according to amultivariate normal distribution inRn{\displaystyle \mathbb {R} ^{n}}. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by acovariance matrix. The covariance matrix adaptation (CMA) is a method to update thecovariance matrix of this distribution. This is particularly useful if the functionf{\displaystyle f} isill-conditioned.

Adaptation of thecovariance matrix amounts to learning a second order model of the underlyingobjective function similar to the approximation of the inverseHessian matrix in thequasi-Newton method in classicaloptimization. In contrast to most classical methods, fewer assumptions on the underlying objective function are made. Because only a ranking (or, equivalently, sorting) of candidate solutions is exploited, neither derivatives nor even an (explicit) objective function is required by the method. For example, the ranking could come about from pairwise competitions between the candidate solutions in aSwiss-system tournament.

Principles

[edit]
Illustration of an actual optimization run with covariance matrix adaptation on a simple two-dimensional problem. The spherical optimization landscape is depicted with solid lines of equalf{\displaystyle f}-values. The population (dots) is much larger than necessary, but clearly shows how the distribution of the population (dotted line) changes during the optimization. On this simple problem, the population concentrates over the global optimum within a few generations.

Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.

First, amaximum-likelihood principle predicated on the idea that increasing (though not necessarily maximizing) the sample probability ofsuccessful candidate solutions or search directions is beneficial. The mean of the distribution is updated such that thelikelihood of previously successful candidate solutions is maximized. Thecovariance matrix of the distribution is updated (incrementally) such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as anatural gradient descent. Also, in consequence, the CMA conducts an iteratedprincipal components analysis of successful search steps while retainingall principal axes.Estimation of distribution algorithms and theCross-Entropy Method are based on very similar ideas, but generally estimate (non-incrementally) the covariance matrix by maximizing the likelihood of successful solutionpoints rather than successful searchsteps.

Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase of favorable directions. The other path is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively preventspremature convergence yet allowing fast convergence to an optimum.

Algorithm

[edit]

In the following the most commonly used (μ/μwλ)-CMA-ES is outlined, where in each iteration step a weighted combination of theμ best out ofλ new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. Apseudocode of the algorithm looks as follows.

setλ{\displaystyle \lambda }  // number of samples per iteration, at least two, generally > 4initializem{\displaystyle m},σ{\displaystyle \sigma },C=I{\displaystyle C=I},pσ=0{\displaystyle p_{\sigma }=0},pc=0{\displaystyle p_{c}=0}  // initialize state variableswhilenot terminatedo  // iteratefori{\displaystyle i}in{1λ}{\displaystyle \{1\ldots \lambda \}}do  // sampleλ{\displaystyle \lambda } new solutions and evaluate themxi={\displaystyle x_{i}={}}sample_multivariate_normal(mean=m{\displaystyle {}=m}, covariance_matrix=σ2C{\displaystyle {}=\sigma ^{2}C})fi=fitness(xi){\displaystyle f_{i}=\operatorname {fitness} (x_{i})}x1λ{\displaystyle x_{1\ldots \lambda }}xs(1)s(λ){\displaystyle x_{s(1)\ldots s(\lambda )}} withs(i)=argsort(f1λ,i){\displaystyle s(i)=\operatorname {argsort} (f_{1\ldots \lambda },i)} // sort solutionsm=m{\displaystyle m'=m}  // we need latermm{\displaystyle m-m'} andxim{\displaystyle x_{i}-m'}m{\displaystyle m} ← update_m(x1,,xλ){\displaystyle (x_{1},\ldots ,x_{\lambda })}  // move mean to better solutionspσ{\displaystyle p_{\sigma }} ← update_ps(pσ,σ1C1/2(mm)){\displaystyle (p_{\sigma },\sigma ^{-1}C^{-1/2}(m-m'))}  // update isotropic evolution pathpc{\displaystyle p_{c}} ← update_pc(pc,σ1(mm),pσ){\displaystyle (p_{c},\sigma ^{-1}(m-m'),\|p_{\sigma }\|)}  // update anisotropic evolution pathC{\displaystyle C} ← update_C(C,pc,(x1m)/σ,,(xλm)/σ){\displaystyle (C,p_{c},(x_{1}-m')/\sigma ,\ldots ,(x_{\lambda }-m')/\sigma )}  // update covariance matrixσ{\displaystyle \sigma } ← update_sigma(σ,pσ){\displaystyle (\sigma ,\|p_{\sigma }\|)}  // update step-size using isotropic path lengthreturnm{\displaystyle m} orx1{\displaystyle x_{1}}

The order of the five update assignments is relevant:m{\displaystyle m} must be updated first,pσ{\displaystyle p_{\sigma }} andpc{\displaystyle p_{c}} must be updated beforeC{\displaystyle C}, andσ{\displaystyle \sigma } must be updated last. The update equations for the five state variables are specified in the following.

Given are the search space dimensionn{\displaystyle n} and the iteration stepk{\displaystyle k}. The five state variables are

The iteration starts with samplingλ>1{\displaystyle \lambda >1} candidate solutionsxiRn{\displaystyle x_{i}\in \mathbb {R} ^{n}} from amultivariate normal distributionN(mk,σk2Ck){\displaystyle \textstyle {\mathcal {N}}(m_{k},\sigma _{k}^{2}C_{k})}, i.e. fori=1,,λ{\displaystyle i=1,\ldots ,\lambda }

xi  N(mk,σk2Ck) mk+σk×N(0,Ck){\displaystyle {\begin{aligned}x_{i}\ &\sim \ {\mathcal {N}}(m_{k},\sigma _{k}^{2}C_{k})\\&\sim \ m_{k}+\sigma _{k}\times {\mathcal {N}}(0,C_{k})\end{aligned}}}

The second line suggests the interpretation as unbiased perturbation (mutation) of the current favorite solution vectormk{\displaystyle m_{k}} (the distribution mean vector). The candidate solutionsxi{\displaystyle x_{i}} are evaluated on the objective functionf:RnR{\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} } to be minimized. Denoting thef{\displaystyle f}-sorted candidate solutions as

{xi:λi=1λ}={xii=1λ} and f(x1:λ)f(xμ:λ)f(xμ+1:λ),{\displaystyle \{x_{i:\lambda }\mid i=1\dots \lambda \}=\{x_{i}\mid i=1\dots \lambda \}{\text{ and }}f(x_{1:\lambda })\leq \dots \leq f(x_{\mu :\lambda })\leq f(x_{\mu +1:\lambda })\leq \cdots ,}

the new mean value is computed as

mk+1=i=1μwixi:λ=mk+i=1μwi(xi:λmk){\displaystyle {\begin{aligned}m_{k+1}&=\sum _{i=1}^{\mu }w_{i}\,x_{i:\lambda }\\&=m_{k}+\sum _{i=1}^{\mu }w_{i}\,(x_{i:\lambda }-m_{k})\end{aligned}}}

where the positive (recombination) weightsw1w2wμ>0{\displaystyle w_{1}\geq w_{2}\geq \dots \geq w_{\mu }>0} sum to one. Typically,μλ/2{\displaystyle \mu \leq \lambda /2} and the weights are chosen such thatμw:=1/i=1μwi2λ/4{\displaystyle \textstyle \mu _{w}:=1/\sum _{i=1}^{\mu }w_{i}^{2}\approx \lambda /4}. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indicesi:λ{\displaystyle i:\lambda }.

The step-sizeσk{\displaystyle \sigma _{k}} is updated usingcumulative step-size adaptation (CSA), sometimes also denoted aspath length control. The evolution path (or search path)pσ{\displaystyle p_{\sigma }} is updated first.

pσ(1cσ)discount factorpσ+1(1cσ)2complements for discounted varianceμwCk1/2mk+1mkdisplacement of mσkdistributed as N(0,I) under neutral selection{\displaystyle p_{\sigma }\gets \underbrace {(1-c_{\sigma })} _{\!\!\!\!\!{\text{discount factor}}\!\!\!\!\!}\,p_{\sigma }+\overbrace {\sqrt {1-(1-c_{\sigma })^{2}}} ^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{complements for discounted variance}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}\underbrace {{\sqrt {\mu _{w}}}\,C_{k}^{\;-1/2}\,{\frac {\overbrace {m_{k+1}-m_{k}} ^{\!\!\!{\text{displacement of }}m\!\!\!}}{\sigma _{k}}}} _{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{distributed as }}{\mathcal {N}}(0,I){\text{ under neutral selection}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}}σk+1=σk×exp(cσdσ(pσEN(0,I)1)unbiased about 0 under neutral selection){\displaystyle \sigma _{k+1}=\sigma _{k}\times \exp {\bigg (}{\frac {c_{\sigma }}{d_{\sigma }}}\underbrace {\left({\frac {\|p_{\sigma }\|}{\operatorname {E} \|{\mathcal {N}}(0,I)\|}}-1\right)} _{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{unbiased about 0 under neutral selection}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}{\bigg )}}

where

The step-sizeσk{\displaystyle \sigma _{k}} is increased if and only ifpσ{\displaystyle \|p_{\sigma }\|} is larger than theexpected value

EN(0,I)=2Γ((n+1)/2)Γ(n/2)n(114n+121n2){\displaystyle {\begin{aligned}\operatorname {E} \|{\mathcal {N}}(0,I)\|&={\sqrt {2}}\;{\frac {\Gamma ((n+1)/2)}{\Gamma (n/2)}}\\[1ex]&\approx {\sqrt {n}}\,\left(1-{\frac {1}{4n}}+{\frac {1}{21\,n^{2}}}\right)\end{aligned}}}

and decreased if it is smaller. For this reason, the step-size update tends to make consecutive stepsCk1{\displaystyle C_{k}^{-1}}-conjugate, in that after the adaptation has been successful(mk+2mk+1σk+1)TCk1mk+1mkσk0{\displaystyle \textstyle \left({\frac {m_{k+2}-m_{k+1}}{\sigma _{k+1}}}\right)^{T}\!C_{k}^{-1}{\frac {m_{k+1}-m_{k}}{\sigma _{k}}}\approx 0}.[1]

Finally, thecovariance matrix is updated, where again the respective evolution path is updated first.

pc(1cc)discount factorpc+1[0,αn](pσ)indicator function1(1cc)2complements for discounted varianceμwmk+1mkσkdistributed asN(0,Ck)under neutral selection{\displaystyle p_{c}\gets \underbrace {(1-c_{c})} _{\!\!\!\!\!{\text{discount factor}}\!\!\!\!\!}\,p_{c}+\underbrace {\mathbf {1} _{[0,\alpha {\sqrt {n}}]}(\|p_{\sigma }\|)} _{\text{indicator function}}\overbrace {\sqrt {1-(1-c_{c})^{2}}} ^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{complements for discounted variance}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}\underbrace {{\sqrt {\mu _{w}}}\,{\frac {m_{k+1}-m_{k}}{\sigma _{k}}}} _{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{distributed as}}\;{\mathcal {N}}(0,C_{k})\;{\text{under neutral selection}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}}

Ck+1=(1c1cμ+cs)discount factorCk+c1pcpcTrank one matrix+cμi=1μwixi:λmkσk(xi:λmkσk)Trankmin(μ,n) matrix{\displaystyle C_{k+1}=\underbrace {(1-c_{1}-c_{\mu }+c_{s})} _{\!\!\!\!\!{\text{discount factor}}\!\!\!\!\!}\,C_{k}+c_{1}\underbrace {p_{c}p_{c}^{T}} _{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{rank one matrix}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}+\,c_{\mu }\underbrace {\sum _{i=1}^{\mu }w_{i}{\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\left({\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\right)^{T}} _{\operatorname {rank} \min(\mu ,n){\text{ matrix}}}}

whereT{\displaystyle T} denotes the transpose and

Thecovariance matrix update tends to increase thelikelihood forpc{\displaystyle p_{c}} and for(xi:λmk)/σk{\displaystyle (x_{i:\lambda }-m_{k})/\sigma _{k}} to be sampled fromN(0,Ck+1){\displaystyle {\mathcal {N}}(0,C_{k+1})}. This completes the iteration step.

The number of candidate samples per iteration,λ{\displaystyle \lambda }, is not determined a priori and can vary in a wide range. Smaller values, for exampleλ=10{\displaystyle \lambda =10}, lead to more local search behavior. Larger values, for exampleλ=10n{\displaystyle \lambda =10n} with default valueμwλ/4{\displaystyle \mu _{w}\approx \lambda /4}, render the search more global. Sometimes the algorithm is repeatedly restarted with increasingλ{\displaystyle \lambda } by a factor of two for each restart.[2] Besides of settingλ{\displaystyle \lambda } (or possiblyμ{\displaystyle \mu } instead, if for exampleλ{\displaystyle \lambda } is predetermined by the number of available processors), the above introduced parameters are not specific to the given objective function and therefore not meant to be modified by the user.

Example code in MATLAB/Octave

[edit]
functionxmin=purecmaes   %(mu/mu_w, lambda)-CMA-ES% --------------------  Initialization --------------------------------% User defined input parameters (need to be edited)strfitnessfct='frosenbrock';% name of objective/fitness functionN=20;% number of objective variables/problem dimensionxmean=rand(N,1);% objective variables initial pointsigma=0.3;% coordinate wise standard deviation (step size)stopfitness=1e-10;% stop if fitness < stopfitness (minimization)stopeval=1e3*N^2;% stop after stopeval number of function evaluations% Strategy parameter setting: Selectionlambda=4+floor(3*log(N));% population size, offspring numbermu=lambda/2;% number of parents/points for recombinationweights=log(mu+1/2)-log(1:mu)';% muXone array for weighted recombinationmu=floor(mu);weights=weights/sum(weights);% normalize recombination weights arraymueff=sum(weights)^2/sum(weights.^2);% variance-effectiveness of sum w_i x_i% Strategy parameter setting: Adaptationcc=(4+mueff/N)/(N+4+2*mueff/N);% time constant for cumulation for Ccs=(mueff+2)/(N+mueff+5);% t-const for cumulation for sigma controlc1=2/((N+1.3)^2+mueff);% learning rate for rank-one update of Ccmu=min(1-c1,2*(mueff-2+1/mueff)/((N+2)^2+mueff));% and for rank-mu updatedamps=1+2*max(0,sqrt((mueff-1)/(N+1))-1)+cs;% damping for sigma% usually close to 1% Initialize dynamic (internal) strategy parameters and constantspc=zeros(N,1);ps=zeros(N,1);% evolution paths for C and sigmaB=eye(N,N);% B defines the coordinate systemD=ones(N,1);% diagonal D defines the scalingC=B*diag(D.^2)*B';% covariance matrix CinvsqrtC=B*diag(D.^-1)*B';% C^-1/2eigeneval=0;% track update of B and DchiN=N^0.5*(1-1/(4*N)+1/(21*N^2));% expectation of%   ||N(0,I)|| == norm(randn(N,1))% -------------------- Generation Loop --------------------------------counteval=0;% the next 40 lines contain the 20 lines of interesting codewhilecounteval<stopeval% Generate and evaluate lambda offspringfork=1:lambdaarx(:,k)=xmean+sigma*B*(D.*randn(N,1));% m + sig * Normal(0,C)arfitness(k)=feval(strfitnessfct,arx(:,k));% objective function callcounteval=counteval+1;end% Sort by fitness and compute weighted mean into xmean[arfitness,arindex]=sort(arfitness);% minimizationxold=xmean;xmean=arx(:,arindex(1:mu))*weights;% recombination, new mean value% Cumulation: Update evolution pathsps=(1-cs)*ps...+sqrt(cs*(2-cs)*mueff)*invsqrtC*(xmean-xold)/sigma;hsig=norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN<1.4+2/(N+1);pc=(1-cc)*pc...+hsig*sqrt(cc*(2-cc)*mueff)*(xmean-xold)/sigma;% Adapt covariance matrix Cartmp=(1/sigma)*(arx(:,arindex(1:mu))-repmat(xold,1,mu));C=(1-c1-cmu)*C...                  % regard old matrix+c1*(pc*pc'...                 % plus rank one update+(1-hsig)*cc*(2-cc)*C)... % minor correction if hsig==0+cmu*artmp*diag(weights)*artmp';% plus rank mu update% Adapt step size sigmasigma=sigma*exp((cs/damps)*(norm(ps)/chiN-1));% Decomposition of C into B*diag(D.^2)*B' (diagonalization)ifcounteval-eigeneval>lambda/(c1+cmu)/N/10% to achieve O(N^2)eigeneval=counteval;C=triu(C)+triu(C,1)';% enforce symmetry[B,D]=eig(C);% eigen decomposition, B==normalized eigenvectorsD=sqrt(diag(D));% D is a vector of standard deviations nowinvsqrtC=B*diag(D.^-1)*B';end% Break, if fitness is good enough or condition exceeds 1e14, better termination methods are advisableifarfitness(1)<=stopfitness||max(D)>1e7*min(D)break;endend% while, end generation loopxmin=arx(:,arindex(1));% Return best point of last iteration.% Notice that xmean is expected to be even% better.end% ---------------------------------------------------------------functionf=frosenbrock(x)ifsize(x,1)<2error('dimension must be greater one');endf=100*sum((x(1:end-1).^2-x(2:end)).^2)+sum((x(1:end-1)-1).^2);end

Theoretical foundations

[edit]

Given the distribution parameters—mean, variances and covariances—thenormal probability distribution for sampling new candidate solutions is themaximum entropy probability distribution overRn{\displaystyle \mathbb {R} ^{n}}, that is, the sample distribution with the minimal amount of prior information built into the distribution. More considerations on the update equations of CMA-ES are made in the following.

Variable metric

[edit]

The CMA-ES implements a stochasticvariable-metric method. In the very particular case of a convex-quadratic objective function

f(x)=12(xx)TH(xx){\displaystyle f(x)={\textstyle {\frac {1}{2}}}(x-x^{*})^{T}H(x-x^{*})}

the covariance matrixCk{\displaystyle C_{k}} adapts to the inverse of theHessian matrixH{\displaystyle H},up to a scalar factor and small random fluctuations. More general, also on the functiongf{\displaystyle g\circ f}, whereg{\displaystyle g} is strictly increasing and therefore order preserving, the covariance matrixCk{\displaystyle C_{k}} adapts toH1{\displaystyle H^{-1}},up to a scalar factor and small random fluctuations. For selection ratioλ/μ{\displaystyle \lambda /\mu \to \infty } (and hence population sizeλ{\displaystyle \lambda \to \infty }), theμ{\displaystyle \mu } selected solutions yield an empirical covariance matrix reflective of the inverse-Hessian even in evolution strategies without adaptation of the covariance matrix. This result has been proven forμ=1{\displaystyle \mu =1} on a static model, relying on the quadratic approximation.[3]

Maximum-likelihood updates

[edit]

The update equations for mean and covariance matrix maximize alikelihood while resembling anexpectation–maximization algorithm. The update of the mean vectorm{\displaystyle m} maximizes a log-likelihood, such that

mk+1=argmaxmi=1μwilogpN(xi:λm){\displaystyle m_{k+1}=\arg \max _{m}\sum _{i=1}^{\mu }w_{i}\log p_{\mathcal {N}}(x_{i:\lambda }\mid m)}

where

logpN(x)=12logdet(2πC)12(xm)TC1(xm){\displaystyle \log p_{\mathcal {N}}(x)=-{\tfrac {1}{2}}\log \det(2\pi C)-{\tfrac {1}{2}}(x-m)^{T}C^{-1}(x-m)}

denotes the log-likelihood ofx{\displaystyle x} from a multivariate normal distribution with meanm{\displaystyle m} and any positive definite covariance matrixC{\displaystyle C}. To see thatmk+1{\displaystyle m_{k+1}} is independent ofC{\displaystyle C} remark first that this is the case for any diagonal matrixC{\displaystyle C}, because the coordinate-wise maximizer is independent of a scaling factor. Then, rotation of the data points or choosingC{\displaystyle C} non-diagonal are equivalent.

The rank-μ{\displaystyle \mu } update of the covariance matrix, that is, the right most summand in the update equation ofCk{\displaystyle C_{k}}, maximizes a log-likelihood in that

i=1μwixi:λmkσk(xi:λmkσk)T=argmaxCi=1μwilogpN(xi:λmkσk|C){\displaystyle \sum _{i=1}^{\mu }w_{i}{\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\left({\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\right)^{T}=\arg \max _{C}\sum _{i=1}^{\mu }w_{i}\log p_{\mathcal {N}}\left(\left.{\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\right|C\right)}

forμn{\displaystyle \mu \geq n} (otherwiseC{\displaystyle C} is singular, but substantially the same result holds forμ<n{\displaystyle \mu <n}). Here,pN(x|C){\displaystyle p_{\mathcal {N}}(x|C)} denotes the likelihood ofx{\displaystyle x} from a multivariate normal distribution with zero mean and covariance matrixC{\displaystyle C}. Therefore, forc1=0{\displaystyle c_{1}=0} andcμ=1{\displaystyle c_{\mu }=1},Ck+1{\displaystyle C_{k+1}} is the abovemaximum-likelihood estimator. Seeestimation of covariance matrices for details on the derivation.

Natural gradient descent in the space of sample distributions

[edit]

Akimotoet al.[4] and Glasmacherset al.[5] discovered independently that the update of the distribution parameters resembles the descent in direction of a samplednatural gradient of the expected objective function valueEf(x){\displaystyle Ef(x)} (to be minimized), where the expectation is taken under the sample distribution. With the parameter setting ofcσ=0{\displaystyle c_{\sigma }=0} andc1=0{\displaystyle c_{1}=0}, i.e. without step-size control and rank-one update, CMA-ES can thus be viewed as an instantiation ofNatural Evolution Strategies (NES).[4][5]Thenatural gradient is independent of the parameterization of the distribution. Taken with respect to the parametersθ of the sample distributionp, the gradient ofEf(x){\displaystyle Ef(x)} can be expressed as

θE(f(x)θ)=θRnf(x)p(x)dx=Rnf(x)θp(x)dx=Rnf(x)p(x)θlnp(x)dx=E(f(x)θlnp(xθ)){\displaystyle {\begin{aligned}{\nabla }_{\!\theta }E(f(x)\mid \theta )&=\nabla _{\!\theta }\int _{\mathbb {R} ^{n}}f(x)p(x)\,\mathrm {d} x\\&=\int _{\mathbb {R} ^{n}}f(x)\nabla _{\!\theta }p(x)\,\mathrm {d} x\\&=\int _{\mathbb {R} ^{n}}f(x)p(x)\nabla _{\!\theta }\ln p(x)\,\mathrm {d} x\\&=\operatorname {E} (f(x)\nabla _{\!\theta }\ln p(x\mid \theta ))\end{aligned}}}

wherep(x)=p(xθ){\displaystyle p(x)=p(x\mid \theta )} depends on the parameter vectorθ{\displaystyle \theta }. The so-calledscore function,θlnp(xθ)=θp(x)p(x){\displaystyle \nabla _{\!\theta }\ln p(x\mid \theta )={\frac {\nabla _{\!\theta }p(x)}{p(x)}}}, indicates the relative sensitivity ofp w.r.t.θ, and the expectation is taken with respect to the distributionp. Thenatural gradient ofEf(x){\displaystyle Ef(x)}, complying with theFisher information metric (an informational distance measure between probability distributions and the curvature of therelative entropy), now reads

~E(f(x)θ)=Fθ1θE(f(x)θ){\displaystyle {\begin{aligned}{\tilde {\nabla }}\operatorname {E} (f(x)\mid \theta )&=F_{\theta }^{-1}\nabla _{\!\theta }\operatorname {E} (f(x)\mid \theta )\end{aligned}}}

where theFisher information matrixFθ{\displaystyle F_{\theta }} is the expectation of theHessian of−lnp and renders the expression independent of the chosen parameterization. Combining the previous equalities we get

~E(f(x)θ)=Fθ1E(f(x)θlnp(xθ))=E(f(x)Fθ1θlnp(xθ)){\displaystyle {\begin{aligned}{\tilde {\nabla }}\operatorname {E} (f(x)\mid \theta )&=F_{\theta }^{-1}\operatorname {E} (f(x)\nabla _{\!\theta }\ln p(x\mid \theta ))\\&=\operatorname {E} (f(x)F_{\theta }^{-1}\nabla _{\!\theta }\ln p(x\mid \theta ))\end{aligned}}}

A Monte Carlo approximation of the latter expectation takes the average overλ samples fromp

~E^θ(f):=i=1λwipreference weightFθ1θlnp(xi:λθ)candidate direction from xi:λwith wi=f(xi:λ)/λ{\displaystyle {\tilde {\nabla }}{\widehat {E}}_{\theta }(f):=-\sum _{i=1}^{\lambda }\overbrace {w_{i}} ^{\!\!\!\!{\text{preference weight}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}\underbrace {F_{\theta }^{-1}\nabla _{\!\theta }\ln p(x_{i:\lambda }\mid \theta )} _{\!\!\!\!\!{\text{candidate direction from }}x_{i:\lambda }\!\!\!\!\!}\quad {\text{with }}w_{i}=-f(x_{i:\lambda })/\lambda }

where the notationi:λ{\displaystyle i:\lambda } from above is used and thereforewi{\displaystyle w_{i}} are monotonically decreasing ini{\displaystyle i}.

Ollivieret al.[6]finally found a rigorous derivation for the weights,wi{\displaystyle w_{i}}, as they are defined in the CMA-ES. The weights are anasymptotically consistent estimator of theCDF off(X){\displaystyle f(X)} at the points of thei{\displaystyle i}thorder statisticf(xi:λ){\displaystyle f(x_{i:\lambda })}, as defined above, whereXp(.|θ){\displaystyle X\sim p(.|\theta )}, composed with a fixed monotonically decreasing transformationw{\displaystyle w}, that is,

wi=w(rank(f(xi:λ))1/2λ).{\displaystyle w_{i}=w\left({\frac {{\mathsf {rank}}(f(x_{i:\lambda }))-1/2}{\lambda }}\right).}

These weights make the algorithm insensitive to the specificf{\displaystyle f}-values. More concisely, using theCDF estimator off{\displaystyle f} instead off{\displaystyle f} itself let the algorithm only depend on the ranking off{\displaystyle f}-values but not on their underlying distribution. This renders the algorithm invariant to strictly increasingf{\displaystyle f}-transformations. Now we define

θ=[mkTvec(Ck)Tσk]TRn+n2+1{\displaystyle \theta =[m_{k}^{T}\operatorname {vec} (C_{k})^{T}\sigma _{k}]^{T}\in \mathbb {R} ^{n+n^{2}+1}}

such thatp(θ){\displaystyle p(\cdot \mid \theta )} is the density of themultivariate normal distributionN(mk,σk2Ck){\displaystyle {\mathcal {N}}(m_{k},\sigma _{k}^{2}C_{k})}. Then, we have an explicit expression for the inverse of the Fisher information matrix whereσk{\displaystyle \sigma _{k}} is fixed

Fθσk1=[σk2Ck002CkCk]{\displaystyle F_{\theta \mid \sigma _{k}}^{-1}=\left[{\begin{array}{cc}\sigma _{k}^{2}C_{k}&0\\0&2C_{k}\otimes C_{k}\end{array}}\right]}

and for

lnp(xθ)=lnp(xmk,σk2Ck)=12(xmk)Tσk2Ck1(xmk)12lndet(2πσk2Ck){\displaystyle {\begin{aligned}\ln p(x\mid \theta )&=\ln p(x\mid m_{k},\sigma _{k}^{2}C_{k})\\[1ex]&=-{\tfrac {1}{2}}(x-m_{k})^{T}\sigma _{k}^{-2}C_{k}^{-1}(x-m_{k})-{\tfrac {1}{2}}\ln \det(2\pi \sigma _{k}^{2}C_{k})\end{aligned}}}

and, after some calculations, the updates in the CMA-ES turn out as[4]

mk+1=mk[~E^θ(f)]1,,nnatural gradient for mean=mk+i=1λwi(xi:λmk){\displaystyle {\begin{aligned}m_{k+1}&=m_{k}-\underbrace {[{\tilde {\nabla }}{\widehat {E}}_{\theta }(f)]_{1,\dots ,n}} _{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{natural gradient for mean}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}\\&=m_{k}+\sum _{i=1}^{\lambda }w_{i}(x_{i:\lambda }-m_{k})\end{aligned}}}

and

Ck+1=Ck+c1(pcpcTCk)cμmat([~E^θ(f)]n+1,,n+n2natural gradient for covariance matrix)=Ck+c1(pcpcTCk)+cμi=1λwi(xi:λmkσk(xi:λmkσk)TCk){\displaystyle {\begin{aligned}C_{k+1}&=C_{k}+c_{1}(p_{c}p_{c}^{T}-C_{k})-c_{\mu }\operatorname {mat} (\overbrace {[{\tilde {\nabla }}{\widehat {E}}_{\theta }(f)]_{n+1,\dots ,n+n^{2}}} ^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\text{natural gradient for covariance matrix}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!})\\&=C_{k}+c_{1}(p_{c}p_{c}^{T}-C_{k})+c_{\mu }\sum _{i=1}^{\lambda }w_{i}\left({\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\left({\frac {x_{i:\lambda }-m_{k}}{\sigma _{k}}}\right)^{T}-C_{k}\right)\end{aligned}}}

where mat forms the proper matrix from the respective natural gradient sub-vector. That means, settingc1=cσ=0{\displaystyle c_{1}=c_{\sigma }=0}, the CMA-ES updates descend in direction of the approximation~E^θ(f){\displaystyle {\tilde {\nabla }}{\widehat {E}}_{\theta }(f)} of the natural gradient while using different step-sizes (learning rates 1 andcμ{\displaystyle c_{\mu }}) for theorthogonal parametersm{\displaystyle m} andC{\displaystyle C} respectively. More recent versions allow a different learning rate for the meanm{\displaystyle m} as well.[7] The most recent version of CMA-ES also use a different functionw{\displaystyle w} form{\displaystyle m} andC{\displaystyle C} with negative values only for the latter (so-called active CMA).

Stationarity or unbiasedness

[edit]

It is comparatively easy to see that the update equations of CMA-ES satisfy some stationarity conditions, in that they are essentially unbiased. Under neutral selection, wherexi:λN(mk,σk2Ck){\displaystyle x_{i:\lambda }\sim {\mathcal {N}}(m_{k},\sigma _{k}^{2}C_{k})}, we find that

E(mk+1mk)=mk{\displaystyle \operatorname {E} (m_{k+1}\mid m_{k})=m_{k}}

and under some mild additional assumptions on the initial conditions

E(logσk+1σk)=logσk{\displaystyle \operatorname {E} (\log \sigma _{k+1}\mid \sigma _{k})=\log \sigma _{k}}

and with an additional minor correction in the covariance matrix update for the case where the indicator function evaluates to zero, we find

E(Ck+1Ck)=Ck{\displaystyle \operatorname {E} (C_{k+1}\mid C_{k})=C_{k}}

Invariance

[edit]

Invariance properties imply uniform performance on a class of objective functions. They have been argued to be an advantage, because they allow to generalize and predict the behavior of the algorithm and therefore strengthen the meaning of empirical results obtained on single functions. The following invariance properties have been established for CMA-ES.

Any serious parameter optimization method should be translation invariant, but most methods do not exhibit all the above described invariance properties. A prominent example with the same invariance properties is theNelder–Mead method, where the initial simplex must be chosen respectively.

Convergence

[edit]

Conceptual considerations like the scale-invariance property of the algorithm, the analysis of simplerevolution strategies, and overwhelming empirical evidence suggest that the algorithm converges on a large class of functions fast to the global optimum, denoted asx{\displaystyle x^{*}}. On some functions, convergence occurs independently of the initial conditions with probability one. On some functions the probability is smaller than one and typically depends on the initialm0{\displaystyle m_{0}} andσ0{\displaystyle \sigma _{0}}. Empirically, the fastest possible convergence rate ink{\displaystyle k} for rank-based direct search methods can often be observed (depending on the context denoted aslinear convergence orlog-linear orexponential convergence). Informally, we can write

mkxm0x×eck{\displaystyle \|m_{k}-x^{*}\|\;\approx \;\|m_{0}-x^{*}\|\times e^{-ck}}

for somec>0{\displaystyle c>0}, and more rigorously

1ki=1klogmixmi1x=1klogmkxm0xc<0for k,{\displaystyle {\frac {1}{k}}\sum _{i=1}^{k}\log {\frac {\|m_{i}-x^{*}\|}{\|m_{i-1}-x^{*}\|}}\;=\;{\frac {1}{k}}\log {\frac {\|m_{k}-x^{*}\|}{\|m_{0}-x^{*}\|}}\;\to \;-c<0\quad {\text{for }}k\to \infty \;,}

or similarly,

Elogmkxmk1xc<0for k.{\displaystyle \operatorname {E} \log {\frac {\|m_{k}-x^{*}\|}{\|m_{k-1}-x^{*}\|}}\;\to \;-c<0\quad {\text{for }}k\to \infty \;.}

This means that on average the distance to the optimum decreases in each iteration by a "constant" factor, namely byexp(c){\displaystyle \exp(-c)}. The convergence ratec{\displaystyle c} is roughly0.1λ/n{\displaystyle 0.1\lambda /n}, givenλ{\displaystyle \lambda } is not much larger than the dimensionn{\displaystyle n}. Even with optimalσ{\displaystyle \sigma } andC{\displaystyle C}, the convergence ratec{\displaystyle c} cannot largely exceed0.25λ/n{\displaystyle 0.25\lambda /n}, given the above recombination weightswi{\displaystyle w_{i}} are all non-negative. The actual linear dependencies inλ{\displaystyle \lambda } andn{\displaystyle n} are remarkable and they are in both cases the best one can hope for in this kind of algorithm. Yet, a rigorous proof of convergence is missing.

Interpretation as coordinate-system transformation

[edit]

Using a non-identity covariance matrix for themultivariate normal distribution inevolution strategies is equivalent to a coordinate system transformation of the solution vectors,[8] mainly because the sampling equation

xi mk+σk×N(0,Ck) mk+σk×Ck1/2N(0,I){\displaystyle {\begin{aligned}x_{i}&\sim \ m_{k}+\sigma _{k}\times {\mathcal {N}}(0,C_{k})\\&\sim \ m_{k}+\sigma _{k}\times C_{k}^{1/2}{\mathcal {N}}(0,I)\end{aligned}}}

can be equivalently expressed in an "encoded space" asCk1/2xirepresented in the encode space Ck1/2mk+σk×N(0,I){\displaystyle \underbrace {C_{k}^{-1/2}x_{i}} _{{\text{represented in the encode space}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}\sim \ \underbrace {C_{k}^{-1/2}m_{k}} {}+\sigma _{k}\times {\mathcal {N}}(0,I)}

The covariance matrix defines abijective transformation (encoding) for all solution vectors into a space, where the sampling takes place with identity covariance matrix. Because the update equations in the CMA-ES are invariant under linear coordinate system transformations, the CMA-ES can be re-written as an adaptive encoding procedure applied to a simpleevolution strategy with identity covariance matrix.[8]This adaptive encoding procedure is not confined to algorithms that sample from a multivariate normal distribution (like evolution strategies), but can in principle be applied to any iterative search method.

Performance in practice

[edit]

In contrast to most otherevolutionary algorithms, the CMA-ES is, from the user's perspective, quasi-parameter-free. The user has to choose an initial solution point,m0Rn{\displaystyle m_{0}\in \mathbb {R} ^{n}}, and the initial step-size,σ0>0{\displaystyle \sigma _{0}>0}. Optionally, the number of candidate samples λ (population size) can be modified by the user in order to change the characteristic search behavior (see above) and termination conditions can or should be adjusted to the problem at hand.

The CMA-ES has been empirically successful in various applications and is considered to be particularly useful on non-convex, non-separable, ill-conditioned, multi-modal or noisy objective functions.[9] One survey of Black-Box optimizations found it outranked 31 other optimization algorithms, performing especially strongly on "difficult functions" or larger-dimensional search spaces.[10]

The search space dimension ranges typically between two and a few hundred. Assuming a black-box optimization scenario, where gradients are not available (or not useful) and function evaluations are the only considered cost of search, the CMA-ES method is likely to be outperformed by other methods in the following conditions:

On separable functions, the performance disadvantage is likely to be most significant in that CMA-ES might not be able to find at all comparable solutions. On the other hand, on non-separable functions that are ill-conditioned or rugged or can only be solved with more than100n{\displaystyle 100n} function evaluations, the CMA-ES shows most often superior performance.

Variations and extensions

[edit]

The (1+1)-CMA-ES[11] generates only one candidate solution per iteration step which becomes the new distribution mean if it is better than the current mean. Forcc=1{\displaystyle c_{c}=1} the (1+1)-CMA-ES is a close variant ofGaussian adaptation. SomeNatural Evolution Strategies are close variants of the CMA-ES with specific parameter settings. Natural Evolution Strategies do not utilize evolution paths (that means in CMA-ES settingcc=cσ=1{\displaystyle c_{c}=c_{\sigma }=1}) and they formalize the update of variances and covariances on aCholesky factor instead of a covariance matrix. The CMA-ES has also been extended tomultiobjective optimization as MO-CMA-ES.[12] Another remarkable extension has been the addition of a negative update of the covariance matrix with the so-called active CMA.[13]Using the additional active CMA update is considered as the default variant nowadays.[7]

See also

[edit]

References

[edit]
  1. ^Hansen, N. (2006), "The CMA evolution strategy: a comparing review",Towards a new evolutionary computation. Advances on estimation of distribution algorithms, Springer, pp. 1769–1776,CiteSeerX 10.1.1.139.7369
  2. ^Auger, A.; N. Hansen (2005)."A Restart CMA Evolution Strategy With Increasing Population Size"(PDF).2005 IEEE Congress on Evolutionary Computation, Proceedings. IEEE. pp. 1769–1776. Archived fromthe original(PDF) on 2016-03-04. Retrieved2012-07-13.
  3. ^Shir, O.M.; A. Yehudayoff (2020)."On the covariance-Hessian relation in evolution strategies".Theoretical Computer Science.801. Elsevier:157–174.arXiv:1806.03674.doi:10.1016/j.tcs.2019.09.002.
  4. ^abcAkimoto, Y.; Y. Nagata; I. Ono; S. Kobayashi (2010)."Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies".Parallel Problem Solving from Nature, PPSN XI. Springer. pp. 154–163.
  5. ^abGlasmachers, T.; T. Schaul; Y. Sun; D. Wierstra; J. Schmidhuber (2010)."Exponential Natural Evolution Strategies"(PDF).Genetic and Evolutionary Computation Conference GECCO. Portland, OR.
  6. ^Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. (2017)."Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles"(PDF).Journal of Machine Learning Research.18 (18): 1−65.
  7. ^abHansen, N. (2016). "The CMA Evolution Strategy: A Tutorial".arXiv:1604.00772 [cs.LG].
  8. ^abHansen, N. (2008)."Adaptive Encoding: How to Render Search Coordinate System Invariant".Parallel Problem Solving from Nature, PPSN X. Springer. pp. 205–214.
  9. ^"References to CMA-ES Applications"(PDF).
  10. ^Hansen, Nikolaus (2010)."Comparing Results of 31 Algorithms from the Black-Box Optimization Benchmarking BBOB-2009"(PDF).
  11. ^Igel, C.; T. Suttorp; N. Hansen (2006)."A Computational Efficient Covariance Matrix Update and a (1+1)-CMA for Evolution Strategies"(PDF).Proceedings of the Genetic and Evolutionary Computation Conference (GECCO). ACM Press. pp. 453–460.
  12. ^Igel, C.; N. Hansen; S. Roth (2007). "Covariance Matrix Adaptation for Multi-objective Optimization".Evolutionary Computation.15 (1):1–28.doi:10.1162/evco.2007.15.1.1.PMID 17388777.S2CID 7479494.
  13. ^Jastrebski, G.A.; D.V. Arnold (2006). "Improving Evolution Strategies through Active Covariance Matrix Adaptation".2006 IEEE World Congress on Computational Intelligence, Proceedings. IEEE. pp. 9719–9726.doi:10.1109/CEC.2006.1688662.

Bibliography

[edit]
  • Hansen N, Ostermeier A (2001). Completely derandomized self-adaptation in evolution strategies.Evolutionary Computation,9(2) pp. 159–195.[1]
  • Hansen N, Müller SD, Koumoutsakos P (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES).Evolutionary Computation,11(1) pp. 1–18.[2]
  • Hansen N, Kern S (2004). Evaluating the CMA evolution strategy on multimodal test functions. In Xin Yao et al., editors,Parallel Problem Solving from Nature – PPSN VIII, pp. 282–291, Springer.[3]
  • Igel C, Hansen N, Roth S (2007). Covariance Matrix Adaptation for Multi-objective Optimization.Evolutionary Computation,15(1) pp. 1–28.[4]

External links

[edit]
Main Topics
Algorithms
Related techniques
Metaheuristic methods
Related topics
Organizations
Conferences
Journals
Retrieved from "https://en.wikipedia.org/w/index.php?title=CMA-ES&oldid=1318410257"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2025 Movatter.jp