Abstract
Motivated by some applications in signal processing and machine learning, we consider two convex optimization problems where, given a cone \(K\), a norm \(\Vert \cdot \Vert \) and a smooth convex function \(f\), we want either (1) to minimize the norm over the intersection of the cone and a level set of \(f\), or (2) to minimize over the cone the sum of \(f\) and a multiple of the norm. We focus on the case where (a) the dimension of the problem is too large to allow for interior point algorithms, and (b) \(\Vert \cdot \Vert \) is “too complicated” to allow for the computationally cheap Bregman projections required in first-order proximal gradient algorithms. On the other hand, we assume that it is relatively easy to minimize linear forms over the intersection of \(K\) and the unit \(\Vert \cdot \Vert \)-ball. Motivating examples are given by the nuclear norm with \(K\) being the entire space of matrices, or the positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on the space of 2D images. We discuss versions of the Conditional Gradient algorithm capable of handling our problems of interest, provide the related theoretical efficiency estimates and outline some applications.
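To fix ideas, here is a minimal Python sketch, written for this summary rather than taken from the paper, of the generic linear-minimization-oracle scheme the abstract alludes to: each iteration calls only an oracle minimizing a linear form over the feasible set, and uses the standard step size \(\gamma _t=2/(t+1)\). The toy \(\ell _1\)-ball oracle below merely stands in for the nuclear-norm or TV oracles discussed in the paper.

```python
import numpy as np

def conditional_gradient(grad_f, lmo, x0, n_iters=200):
    """Generic conditional gradient (Frank-Wolfe) scheme over a convex set X.

    grad_f -- gradient of the smooth convex objective f
    lmo    -- linear minimization oracle: lmo(g) returns argmin_{u in X} <g, u>
    x0     -- a starting point in X
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_iters + 1):
        u = lmo(grad_f(x))            # the only nontrivial operation per iteration
        gamma = 2.0 / (t + 1)         # standard open-loop step size
        x = (1 - gamma) * x + gamma * u
    return x

def l1_ball_lmo(g, radius=1.0):
    """Minimize <g, u> over the l1 ball of the given radius (a toy oracle)."""
    i = int(np.argmax(np.abs(g)))
    u = np.zeros_like(g, dtype=float)
    u[i] = -radius * np.sign(g[i])
    return u

# toy run: minimize (1/2)||x - b||^2 over the unit l1 ball
b = np.array([0.8, -0.3, 0.1])
x_hat = conditional_gradient(lambda x: x - b, l1_ball_lmo, np.zeros(3))
```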
Notes
i.e., the Lipschitz constant of\(f\) w.r.t.\(\Vert \cdot \Vert \) in the nonsmooth case, or the Lipschitz constant of the gradient mapping\(x\mapsto f'(x)\) w.r.t. the norm\(\Vert \cdot \Vert \) on the argument and the conjugate of this norm on the image spaces in the smooth case.
Recall that the dual, a.k.a. conjugate, to a norm\(\Vert \cdot \Vert \) on a Euclidean space\(E\) is the norm on\(E\) defined as\(\Vert \xi \Vert _*=\max _{x\in E,\Vert x\Vert \le 1}\langle \xi ,x\rangle \), where\(\langle \cdot ,\cdot \rangle \) is the inner product on\(E\).
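As a small numerical illustration (ours, not from the paper) of this definition: for the \(\ell _1\) norm on \({\mathbf {R}}^n\) the maximum is attained at a signed coordinate vector, so the dual norm is the \(\ell _\infty \) norm.

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.standard_normal(5)
i = int(np.argmax(np.abs(xi)))
x_star = np.zeros(5)
x_star[i] = np.sign(xi[i])                         # ||x_star||_1 = 1
assert np.isclose(xi @ x_star, np.abs(xi).max())   # <xi, x*> = ||xi||_inf
```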
In fact,\(x^b_t\) may be substituted for\(x^+_t\) in the recurrence (15), and the resulting approximate solutions\(x_t\) along with the lower bounds\(f_*^t\ge \max _{1\le \tau \le t} f^b_\tau (x^b_\tau )\) will still satisfy the bound (18) of Theorem 1. Indeed, the analysis of the proof of the theorem reveals that a point\(\xi \in X\) can be substituted for\(x_t^+=x_X[f'(x_t)]\) as soon as the inequality
$$\begin{aligned} \langle f'(x_t),x_t-\xi \rangle \ge f(x_t)-f^t_{*}, \end{aligned}$$(21) holds, where \(f^t_{*}\) is the best currently available lower bound for the optimal value \(f_*\) of (13). In other words, for the result of Theorem 1 to hold, one can substitute \(x_t^+=x_X[f'(x_t)]\) with any vector \(\xi \) satisfying (21). Now assume that \(x_t\in X^b_t\). By convexity of \(f^b_t\) (obviously, \(f'(x_t)\in \partial f^b_t(x_t)\), where \(\partial f^b_t(x)\) is the subdifferential of \(f^b_t\) at \(x\)), we have
$$\begin{aligned} \langle f'(x_t),x_t-x^b_t\rangle \ge f(x_t)-f^b_t(x^b_t)\ge f(x_t)-f_*^t, \end{aligned}$$which is exactly (21).
The assumption that (20) can be solved exactly is an idealization, but it is basically as “tolerable” as the standard assumption in continuous optimization that one can use exact real arithmetic or compute exactly the eigenvalues/eigenvectors of symmetric matrices. The outlined “real life” considerations can be replaced with a rigorous error analysis which shows that, in order to maintain the efficiency estimates from Theorem 1, it suffices to solve the \(t\)-th auxiliary problem within a properly selected positive inaccuracy, and this can be achieved in \(O(\ln (t))\) computations of \(f\) and \(f'\).
The iterates \(x_t\), as well as the other quantities indexed by \(t\) that participate in the description of the algorithm, in fact depend on both \(t\) and the stage number \(s\). To avoid cumbersome notation, we suppress \(s\) when speaking about a particular stage.
This property is an immediate corollary of the fact that, in the situation in question, by the description of the algorithm \(x_t\) is a convex combination of \(t\) points of the form \(x[\cdot ]\).
When \(f\) is more complicated, optimal adjustment of the mean \(t\) of the image reduces, by bisection in \(t\), to solving a small series of problems of the same structure as (5), (9) in which the mean of the image \(x\) is fixed; consequently, these problems reduce to those with \(x\in M^n_0\) by shifting \(b\).
Which one of these two options takes place depends on the type of the algorithm.
On closer inspection, the “complex geometry” of the \({\hbox {TV}}\)-norm stems from the fact that after parameterizing a zero mean image by its discrete gradient field and treating this field \((g=\nabla _i x,h=\nabla _jx)\) as our new design variable, the unit ball of the \({\hbox {TV}}\)-norm becomes the intersection of a simple set in the space of pairs \((g,h)\in F={\mathbf {R}}^{(n-1)\times n}\times {\mathbf {R}}^{n\times (n-1)}\) (the \(\ell _1\) ball \(\Delta \) given by \(\Vert g\Vert _1+\Vert h\Vert _1\le 1\)) with a linear subspace \(P\) of \(F\) comprised of potential vector fields \((g,h)\)—those which indeed are discrete gradient fields of images. Both the dimension and the codimension of \(P\) are of order \(n^2\), which makes it difficult to minimize nonlinear, even simple, convex functions over \(\Delta \cap P\), and this is exactly what is needed in proximal methods.
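For concreteness, here is a short sketch (ours; the forward-difference convention for \(\nabla _i,\nabla _j\) is an assumption) of the discrete gradient field \((g,h)\) and of the resulting \({\hbox {TV}}\) norm \(\Vert \nabla _ix\Vert _1+\Vert \nabla _jx\Vert _1\).

```python
import numpy as np

def discrete_gradients(x):
    """Forward-difference discrete gradient field (g, h) of an n x n image x."""
    g = x[1:, :] - x[:-1, :]          # nabla_i x, shape (n-1, n)
    h = x[:, 1:] - x[:, :-1]          # nabla_j x, shape (n, n-1)
    return g, h

def tv_norm(x):
    """Anisotropic TV norm: ||nabla_i x||_1 + ||nabla_j x||_1."""
    g, h = discrete_gradients(x)
    return np.abs(g).sum() + np.abs(h).sum()

x = np.random.default_rng(1).standard_normal((8, 8))
x -= x.mean()                          # a zero mean image
print(tv_norm(x))
```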
That is, the rows of\(P\) are indexed by the nodes, the columns are indexed by the arcs, and in the column indexed by an arc\(\gamma \) there are exactly two nonzero entries: entry 1 in the row indexed by the starting node of\(\gamma \), and entry\(-1\) in the row indexed by the terminal node of\(\gamma \).
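The structure of such a node-arc incidence matrix is easy to reproduce on a toy graph (our own illustrative sketch, not the paper's grid network); note that every column of \(P\) sums to zero, a fact used in the proofs for Section 6.

```python
import numpy as np

def incidence_matrix(n_nodes, arcs):
    """Node-arc incidence matrix: +1 at the starting node of an arc, -1 at its terminal node."""
    P = np.zeros((n_nodes, len(arcs)))
    for k, (start, end) in enumerate(arcs):
        P[start, k] = 1.0
        P[end, k] = -1.0
    return P

P = incidence_matrix(3, [(0, 1), (1, 2), (2, 0)])   # a 3-node directed cycle
assert np.allclose(P.sum(axis=0), 0.0)              # column sums vanish
```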
From the Sobolev embedding theorem it follows that for a smooth function \(f(x,y)\) on the unit square one has \(\Vert f\Vert _{L_2}\le O(1)\Vert \nabla f\Vert _1\), \(\Vert \nabla f\Vert _1{:=}\Vert f'_x\Vert _1+\Vert f'_y\Vert _1\), provided that \(f\) has zero mean. Denoting by \(f^n\) the restriction of the function onto an \(n\times n\) regular grid in the square, we conclude that \(\Vert f^n\Vert _2/{\hbox {TV}}(f^n)\rightarrow \Vert f\Vert _{L_2}/\Vert \nabla f\Vert _1\le O(1)\) as \(n\rightarrow \infty \). Note that the convergence in question takes place only in the 2-dimensional case.
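A quick numerical illustration (ours) of this footnote: for a fixed smooth zero mean function on the unit square, the ratio \(\Vert f^n\Vert _2/{\hbox {TV}}(f^n)\) stabilizes at a constant of order 1 as the grid is refined.

```python
import numpy as np

def ratio(n):
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    f = np.sin(2 * np.pi * i / n) * np.cos(2 * np.pi * j / n)     # smooth, zero mean
    tv = np.abs(np.diff(f, axis=0)).sum() + np.abs(np.diff(f, axis=1)).sum()
    return np.sqrt((f ** 2).sum()) / tv

for n in (32, 64, 128, 256):
    print(n, ratio(n))   # the ratios approach a fixed constant (about 0.098 for this f)
```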
For comparison: solving on the same platform problem (36) corresponding to Experiment A (\(256\times 256\) image) by the state-of-the-art commercial interior point solver mosekopt 6.0 took as much as 3,727 sec, and this for a single value of the penalty (in this case there is no clear way to get from a single run approximate solutions for a set of values of the penalty).
References
Andersen, E.D., Andersen, K.D.: The MOSEK optimization tools manual. http://www.mosek.com/fileadmin/products/6_0/tools/doc/pdf/tools.pdf
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Convex optimization with sparsity-inducing norms. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 19–53. MIT Press, Cambridge
Cai, J.-F., Candes, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2008)
Candès, E., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 1–38 (2013). doi:10.1007/s10107-013-0725-1
Demyanov, V., Rubinov, A.: Approximate Methods in Optimization Problems. American Elsevier, Amsterdam (1970)
Dudik, M., Harchaoui, Z., Malick, J.: Lifted coordinate descent for learning with trace-norm regularization. In: AISTATS (2012)
Dunn, J.C., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)
Goldfarb, D., Ma, S., Wen, Z.: Solving low-rank matrix completion problems efficiently. In: Proceedings of 47th Annual Allerton Conference on Communication, Control, and Computing (2009)
Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J.: Large-scale image classification with trace-norm regularization. In: CVPR (2012)
Harchaoui, Z., Juditsky, A., Nemirovski, A.: Conditional gradient algorithms for machine learning. In: NIPS Workshop on Optimization for Machine Learning. http://opt.kyb.tuebingen.mpg.de/opt12/papers.html (2012)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, Berlin (2008)
Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Proceedings of the 8th Latin American Conference on Theoretical Informatics, pp. 306–316 (2008)
Hearn, D., Lawphongpanich, S., Ventura, J.: Restricted simplicial decomposition: computation and extensions. Math. Program. Stud. 31, 99–118 (1987)
Holloway, C.: An extension of the Frank-Wolfe method of feasible directions. Math. Program. 6, 14–27 (1974)
Jaggi, M.: Revisiting Frank-Wolfe: projection-free sparse convex optimization. In: ICML (2013)
Jaggi, M., Sulovsky, M.: A simple algorithm for nuclear norm regularized problems. In: ICML (2010)
Juditsky, A., Karzan, F.K., Nemirovski, A.: Randomized first order algorithms with applications to \(\ell _1\)-minimization. Math. Program. 142(1–2), 269–310 (2013)
Juditsky, A., Nemirovski, A.: First order methods for nonsmooth large-scale convex minimization, I: general purpose methods; II: utilizing problem’s structure. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 121–184. MIT Press, Cambridge (2012)
Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1–2), 365–397 (2012)
Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1–3), 111–147 (1995)
Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128, 321–353 (2011)
Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)
Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Berlin (2003)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y., Nemirovski, A.: On first-order algorithms for \(\ell _1\)/nuclear norm minimization. Acta Numer. 22, 509–575 (2013)
Pshenichnyj, B., Danilin, Y.: Numerical Methods in Extremal Problems. Mir, Moscow (1978)
Recht, B., Fazel, M., Parrilo, P.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)
Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)
Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D 60 (1992)
Shalev-Shwartz, S., Gonen, A., Shamir, O.: Large-scale convex minimization with a low-rank constraint. In: ICML (2011)
Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2010)
Srebro, N., Shraibman, A.: Rank, trace-norm and max-norm. In: COLT (2005)
Ventura, J.A., Hearn, D.W.: Restricted simplicial decomposition for convex constrained problems. Math. Program. 59, 71–85 (1993)
Yang, J., Yuan, X.: Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math. Comput. 82(281), 301–329 (2013)
Zhang, X., Yu, Y., Schuurmans, D.: Accelerated training for matrix-norm regularization: a boosting approach. In: NIPS, pp. 2915–2923 (2012)
Zibulevski, M., Narkiss, G.: Sequential subspace optimization method for large-scale unconstrained problems. Technical Report CCIT No. 559, Faculty of Electrical Engineering, Technion (2005)
Author information
Authors and Affiliations
LJK, Inria, 655 Avenue de l’Europe, Montbonnot, 38334 Saint-Ismier, France
Zaid Harchaoui
LJK, Université Grenoble Alpes, B.P. 53, 38041 Grenoble Cedex 9, France
Anatoli Juditsky
Georgia Institute of Technology, Atlanta, GA, 30332, USA
Arkadi Nemirovski
Corresponding author
Correspondence to Anatoli Juditsky.
Additional information
Research of the first and second authors was supported by the CNRS-Mastodons project GARGANTUA, and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025). Research of the third author was supported by the ONR Grant N000140811104 and NSF Grants DMS 0914785, CMMI 1232623.
Appendix
1.1 Proof of Theorem 1
Define \(\epsilon _t=f(x_t)-f^t_*\). Invoking convexity of \(f\) and the definition (16), (17) of \(f^t_*\), we have
Observing that for a generic CG algorithm we have (cf. (14), (15)) \(f(x_{t+1})\le f(x_t+\gamma _t(x_t^+-x_t))\), \(\gamma _t={2\over t+1}\), and invoking (12), we have
where the concluding\(\le \) is due to (38). Since\(f_*^{t+1}\ge f_*^{t}\), it follows that
whence
where, by convention,\(\prod _{k=t+1}^t=1\). Noting that\( \prod _{k=i+1}^t(1-{2\over k+1})=\prod _{k=i+1}^t {k-1\over k+1}= {i(i+1)\over t(t+1)},\;\;\;i=1,\ldots ,t, \) we get
which is exactly (18).\(\square \)
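For the reader's convenience, here is how the product identity used above yields an \(O(L/t)\) bound; we assume (the displays not being reproduced here) that the recurrence established in the proof has the standard form \(\epsilon _{t+1}\le (1-\gamma _t)\epsilon _t+{L\over 2}\gamma _t^2\) with \(\gamma _t={2\over t+1}\). Unrolling it, using \(\gamma _1=1\) and the identity \(\prod _{k=i+1}^t\big (1-{2\over k+1}\big )={i(i+1)\over t(t+1)}\), one gets
$$\begin{aligned} \epsilon _{t+1}\le \sum _{i=1}^{t}\left[ \prod _{k=i+1}^{t}\Big (1-{2\over k+1}\Big )\right] {L\over 2}\,{4\over (i+1)^2} ={2L\over t(t+1)}\sum _{i=1}^{t}{i\over i+1}\le {2L\over t+1}, \end{aligned}$$
a bound of the same \(O(L/t)\) form as (18).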
1.2 Proof of Theorem 2
The proof, up to minor modifications, goes back to [22], see also [19,26]; we provide it here to make the paper self-contained. W.l.o.g. we can assume that we are in the nontrivial case (see description of the algorithm).
\(\mathbf{1}^0\). As explained when describing the method, whenever stage \(s\) takes place, we have \(0<\rho _1\le \rho _s\le \rho _*\), and \(\rho _{s-1}<\rho _s\) provided \(s>1\). Therefore, by the termination rule, the output \(\bar{\rho }, \bar{x}\) of the algorithm, if any, satisfies \(\bar{\rho }\le \rho _*\), \(f(\bar{x})\le \epsilon \). Thus, (i) holds true, provided that the algorithm does terminate; hence all we need is to verify (ii) and (iii).
\(\mathbf{2}^0\). Let us prove (ii). Let \(s\ge 1\) be such that stage \(s\) takes place. Setting \(X=K[\rho _s]\), observe that \(X-X\subset \{x\in E:\Vert x\Vert \le 2\rho _s\}\), whence \(\Vert \cdot \Vert \le 2\rho _s\Vert \cdot \Vert _X\), and therefore relation (4) implies the validity of (12) with \(L=4\rho _s^2L_f\). Now, if stage \(s\) does not terminate in the course of some number \(t\) of steps, then, in the notation from the description of the algorithm, \(f(\bar{x}_t)>\epsilon \) and \(f_*^t<{3\over 4}f(\bar{x}_t)\), whence \(f(\bar{x}_t)-f_*^t> \epsilon /4\). By Theorem 1.ii, the latter is possible only when \(4.5L/(t-2)>\epsilon /4\). Thus, \(t\le \max \left[ 5,2+{72\rho _s^2L_f\over \epsilon }\right] \). Taking into account that \(\rho _s\le \rho _*\), (ii) follows.
\(\mathbf{3}^0\). Let us prove (iii). The statement is trivially true when the number of stages is 1, so assume that this is not the case, and let \(S\ge 1\) be such that stage \(S+1\) takes place. For every \(s=1,\ldots ,S\), let \(t_s\) be the last step of stage \(s\), and let \(u_s, \ell ^s(\cdot )\) be what in the notation from the description of stage \(s\) was denoted \(f(\bar{x}_{t_s})\) and \(\ell ^{t_s}(\rho )\). Thus, \(u_s>\epsilon \) is an upper bound on \({\mathop {\hbox {Opt}}}(\rho _s)\), \(\ell _s{:=}\ell ^s(\rho _s)\) is a lower bound on \({\mathop {\hbox {Opt}}}(\rho _s)\) satisfying \(\ell _s\ge 3u_s/4\), \(\ell ^s(\cdot )\) is a convex piecewise linear lower bound on \({\mathop {\hbox {Opt}}}(\rho )\), \(\rho \ge 0\), and \(\rho _{s+1}>\rho _s\) is the smallest positive root of \(\ell ^s(\cdot )\). Let also \(-g_s\) be a subgradient of \(\ell ^s(\cdot )\) at \(\rho _s\). Note that \(g_s>0\) due to \(\rho _{s+1}>\rho _s\) combined with \(\ell ^s(\rho _s)>0\), \(\ell ^s(\rho _{s+1})=0\), and for the same reasons, combined with convexity of \(\ell ^s(\cdot )\), we have
$$\begin{aligned} \rho _{s+1}-\rho _s\ge \ell _s/g_s, \end{aligned}$$(41) and, as we have seen,
$$\begin{aligned} 1\le s\le S\Rightarrow \left\{ \begin{array}{ll} (a)&{}u_s>\epsilon ,\\ (b)&{}u_s\ge {\mathop {\hbox {Opt}}}(\rho _s)\ge \ell _s\ge {3\over 4}u_s,\\ (c)&{}\ell _s-g_s(\rho -\rho _s) \le {\mathop {\hbox {Opt}}}(\rho ),\,\rho \ge 0.\\ \end{array}\right. \!. \end{aligned}$$(42)
Assuming\(1<s\le S\) and applying (41), we get\(\rho _s-\rho _{s-1}\ge {3\over 4}u_{s-1}/g_{s-1}\), whence, invoking (42),
The resulting inequality implies that\({u_s\over u_{s-1}}+{g_s\over g_{s-1}}\le {4\over 3}\), whence\({u_sg_s\over u_{s-1}g_{s-1}}\le (1/4)(4/3)^2=4/9\). It follows that
Now, since the first iterate of the first stage is\(0\), we have\(u_1\le f(0)\), while (42) applied with\(s=1\) implies that\(f(0)={\mathop {\hbox {Opt}}}(0)\ge \ell _1+\rho _1g_1\ge \rho _1g_1\), whence\(u_1g_1\le f(0)/\rho _1=d\). Further, by (41)\(g_s\ge \ell _s/(\rho _{s+1}-\rho _s)\ge \ell _s/\rho _*\ge {3\over 4} u_s/\rho _*\), where the concluding inequality is given by (42). We see that\(u_sg_s\ge {3\over 4}u_s^2/\rho _*\ge {3\over 4}\epsilon ^2/\rho _*\). This lower bound on\(u_sg_s\) combines with the bound\(u_1g_1\le d\) and with (43) to imply that
Finally observe that by the definition of\(\rho _*\) and due to the fact that\(\Vert x[f'(0)]\Vert =1\) in the nontrivial case, we have
(we have used (4) and the definition of\(d\)), whence\(\rho _*d\le f(0)+{\small \frac{1}{2}}L_f\rho _*^2\) and therefore
Since this relation holds true for every\(S\ge 1\) such that the stage\(S+1\) takes place, (iii) follows.\(\square \)
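A small detail used in \(\mathbf{3}^0\): the passage from \({u_s\over u_{s-1}}+{g_s\over g_{s-1}}\le {4\over 3}\) to \({u_sg_s\over u_{s-1}g_{s-1}}\le {4\over 9}\) is just the arithmetic-geometric mean inequality:
$$\begin{aligned} {u_sg_s\over u_{s-1}g_{s-1}}={u_s\over u_{s-1}}\cdot {g_s\over g_{s-1}}\le \left( {1\over 2}\left[ {u_s\over u_{s-1}}+{g_s\over g_{s-1}}\right] \right) ^2\le {1\over 4}\left( {4\over 3}\right) ^2={4\over 9}. \end{aligned}$$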
1.3 Proof of Theorem 3
By definition of \(z_t\) we have \(z_t\in K^+\) for all \(t\) and \(F(0)=F(z_1)\ge F(z_2)\ge \ldots \), whence \(r_t\le D_*\) for all \(t\) by Assumption A. Besides this, \(r_*\le D_*\) as well. Now let \(\epsilon _t=F(z_t)-F_*\), \(z_t=[x_t;r_t]\), and let \(z^+_t=[x^+_t;r^+_t]\) be a minimizer, as given by Lemma 1, of the linear form \(\langle F'(z_t),z\rangle \) of \(z\in E^+\) over the set \(K^+[r_*]=\{[x;r]: x\in K,\Vert x\Vert \le r\le r_*\}\). Recalling that \(F'(z_t)=[f'(x_t);\kappa ]\) and that \(r_t\le D_*\le \bar{D}\), Lemma 1 implies that \(z^+_t\in \Delta (z_t)\). By definition of \(z_t^+\) and convexity of \(F\) we have
Invoking (12), it follows that for\(0\le s\le 1\) one has
using that \(\Vert x(z_t^+)\Vert \le r_t^+\) and \(\Vert x(z_t)\Vert \le r_t\) due to \(z_t^+,z_t\in K^+\), and that \(r_t^+\le r_*\le D_*\). By (24) we have
and we arrive at the recurrence
When\(t=1\), this recurrence, in view of\(z_1=0\), implies that\(\epsilon _2\le {\small \frac{1}{2}}L_fD_*^2\). Let us show by induction in\(t\ge 2\) that
thus completing the proof. We have already seen that (45) is valid for\(t=2\). Assuming that (45) holds true for\(t=k\ge 2\), we have\(\epsilon _{k}\le {\small \frac{1}{2}}L_f D_*^2\) and therefore\(\epsilon _{k+1}\le \epsilon _{k}-{1\over 8L_f D_*^2}\epsilon _{k}^2\) by (44) combined with\(0\le r_k\le D_*\). Now, the function\( s-{1\over 8L_f D_*^2}s^2\) is nondecreasing on the segment\(0\le s\le 4L_f D_*^2\) which contains\(\bar{\epsilon }_k\) and\(\epsilon _k\le \bar{\epsilon }_k\), whence
so that (45) holds true for\(t=k+1\).\(\square \)
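For completeness, here is the standard way to pass from the recurrence \(\epsilon _{k+1}\le \epsilon _k-{1\over 8L_fD_*^2}\epsilon _k^2\) used in the induction above to an \(O(1/t)\) bound; we do not claim it reproduces the exact form of \(\bar{\epsilon }_t\) in (45), only a bound of the same order. Writing \(C=8L_fD_*^2\) and assuming \(\epsilon _{k+1}>0\) (otherwise there is nothing to prove), dividing the recurrence by \(\epsilon _k\epsilon _{k+1}\) and using \(\epsilon _{k+1}\le \epsilon _k\) gives \({1\over \epsilon _{k+1}}\ge {1\over \epsilon _k}+{1\over C}\) for \(k\ge 2\), whence
$$\begin{aligned} \epsilon _t\le {1\over {1\over \epsilon _2}+{t-2\over C}}={C\over t-2+C/\epsilon _2}\le {8L_fD_*^2\over t+14},\qquad t\ge 2, \end{aligned}$$
where the last inequality uses \(\epsilon _2\le {1\over 2}L_fD_*^2\), i.e., \(C/\epsilon _2\ge 16\).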
1.4 Proofs for Section 6
As we have already explained, (31) is solvable, so that\(z\) is well defined. Denoting by\((s_*,r_*)\) an optimal solution to (31) produced, along with\(z\), by our solver, note that the characteristic property of\(z\) is the relation
Since the column sums in\(P\) are zeros and the sum of entries in\(\eta \) is zero, the above characteristic property of\(z\) is preserved when passing from\(z\) to\(\bar{z}\), so that we may assume from the very beginning that\(z=\bar{z}\) is a zero mean image. Now,\(P=[Q,-Q]\), where\(Q\) is the incidence matrix of the network obtained from\(G\) by eliminating backward arcs. Representing a flow\(r\) as\([r_f;r_b]\), where the blocks are comprised, respectively, of flows in the forward and backward arcs, and passing from\(r\) to\(\rho =r_f-r_b\), our characteristic property of\(z\) clearly implies the relation
By the optimality conditions in linear programming it follows that
Indeed, \((a)\) stems from the fact that \(\psi (s,\rho )\), which is affine in \(s\), is bounded from above, so that the coefficient of \(s\) in \(\psi \) should be zero; \((b)\) is the constraint in the maximization problem in (46) to which \((s_*,\rho _*)\) is an optimal solution; \((c)\) is the optimality condition for the same problem w.r.t. the \(\rho \)-variable; and \((d)\) expresses the fact that \((s_*,r_*)\) is feasible for (31). (47.\(d\)) and (47.\(a\)) imply that \(\langle Q^Tz,\rho _*\rangle =s_*\), while (47.\(c\)) says that \(\langle Q^Tz,\rho _*\rangle = \Vert Q^Tz\Vert _1\), whence \(s_*=\Vert Q^Tz\Vert _1\). By (47.\(a\)), \(z\ne 0\), and thus \(z\) is a nonzero image with zero mean; recalling what \(Q\) is, the first \(n(n-1)\) entries in \(Q^Tz\) form \(\nabla _i z\), and the last \(n(n-1)\) entries form \(\nabla _jz\), so that \(\Vert Q^Tz\Vert _1={\hbox {TV}}(z)\). The gradient field of a nonzero image with zero mean cannot be identically zero, whence \({\hbox {TV}}(z)=\Vert Q^Tz\Vert _1=s_*>0\). Thus \(x[\eta ]=-z/{\hbox {TV}}(z)=-z/s_*\) is well defined and \({\hbox {TV}}(x[\eta ])=1\), while by (47.\(a\)) we have \(\langle x[\eta ],\eta \rangle =-1/s_*\). Finally, let \(x\in {\mathcal{T}\mathcal{V}}\), implying that \(Q^Tx\) is the concatenation of \(\nabla _ix\) and \(\nabla _jx\) and thus \(\Vert Q^Tx\Vert _1={\hbox {TV}}(x)\le 1\). Invoking (47.\(b,d\)), we get \(-1\le \langle Q^Tx,\rho _*\rangle =\langle x, Q\rho _*\rangle =s_*\langle x,\eta \rangle \), whence \(\langle x,\eta \rangle \ge -1/s_*=\langle x[\eta ],\eta \rangle \), meaning that \(x[\eta ]\in {\mathcal{T}\mathcal{V}}\) is a minimizer of \(\langle \eta ,x\rangle \) over \(x\in {\mathcal{T}\mathcal{V}}\).\(\square \)
In the sequel, for a real-valued function\(x\) defined on a finite set (e.g., for an image),\(\Vert x\Vert _p\) stands for the\(L_p\) norm of the function corresponding to the counting measure on the set (the mass of every point from the set is 1). Let us fix\(n\) and\(x\in M^n_0\) with\({\hbox {TV}}(x)\le 1\); we want to prove that
with an appropriately selected absolute constant \(\mathcal{C}\).
\(\mathbf{1}^0\). Let\(\oplus \) stand for addition, and\(\ominus \) for subtraction of integers modulo\(n\);\(p\oplus q=(p+q) \,\hbox {mod} \,n\in \{0,1,\ldots ,n-1\}\) and similarly for\(p\ominus q\). Along with discrete partial derivatives\(\nabla _i x, \nabla _jx\), let us define their periodic versions\(\widehat{\nabla }_ix, \widehat{\nabla }_jx\):
as well as the periodic Laplacian \(\widehat{\Delta } x\):
For every \(j\), \(0\le j<n\), we have \(\sum _{i=0}^{n-1} \widehat{\nabla }_ix(i,j)=0\) and \(\nabla _ix(i,j)=\widehat{\nabla }_ix(i,j)\) for \(0\le i<n-1\), whence \(\sum _{i=0}^{n-1}|\widehat{\nabla }_ix(i,j)|\le 2\sum _{i=0}^{n-1}|\nabla _ix(i,j)|\) for every \(j\), and thus \(\Vert \widehat{\nabla }_ix\Vert _1\le 2\Vert \nabla _ix\Vert _1\). Similarly, \(\Vert \widehat{\nabla }_jx\Vert _1\le 2\Vert \nabla _jx\Vert _1\), and we conclude that
\(\mathbf{2}^0\). Now observe that for\(0\le i,j<n\) we have
whence
Now consider the following linear mapping from\(M^n\times M^n\) into\(M^n\):
From this definition and (50) it follows that
For \(u\in M^n\), let \({\hbox {DFT}}[u]\) stand for the 2D Discrete Fourier Transform of \(u\):
Note that every image \(u\) with zero mean is the periodic Laplacian of another, uniquely defined, image \(X[u]\) with zero mean, with \(X[u]\) given by its Fourier transform
Indeed, representing an\(n\times n\) image\(x(\mu ,\nu ),\)\(0\le \mu ,\nu <n\), as a Fourier sum
we get
where
In other words, in the Fourier domain, passing from an image to its periodic Laplacian amounts to multiplying the Fourier coefficient \(c_{p,q}\) by \(D(p,q)\). From the expression for \(D(p,q)\) it is immediately seen that \(D(p,q)\) is nonzero whenever \((p,q)\) with \(0\le p,q<n\) is nonzero. Consequently, the periodic Laplacian of any \(n\times n\) zero mean image is an image with zero mean (zero Fourier coefficient \((0,0)\)), and every image of this type is the periodic Laplacian of the zero mean image described in (53).
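A quick sanity check of this Fourier-domain bookkeeping (ours; we assume the unnormalized DFT convention, which is consistent with the Parseval identity \(\Vert {\hbox {DFT}}[x]\Vert _2=n\Vert x\Vert _2\) invoked below): the \((0,0)\) coefficient of a zero mean image vanishes, and the norms scale as stated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal((n, n))
x -= x.mean()                       # zero mean image
X = np.fft.fft2(x)                  # unnormalized 2D DFT (assumed convention)
assert np.isclose(X[0, 0], 0.0)                              # zero mean <=> zero (0,0) coefficient
assert np.isclose(np.linalg.norm(X), n * np.linalg.norm(x))  # ||DFT[x]||_2 = n ||x||_2
```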
In particular, invoking (52), we get
By the Parseval identity, \(\Vert {\hbox {DFT}}[x]\Vert _2=n\Vert x\Vert _2\), whence
Combining this observation with (49), we see that in order to prove (48), it suffices to check that
(!) Whenever \(g,h\in M^n\) are such that
we have
\(\mathbf{4}^0\). The good news about (!) is that \(Y[B[g,h]]\) is linear in \((g,h)\). Therefore, in order to justify (!), it suffices to prove that (54) holds true at the extreme points of \(G\), i.e., (a) for pairs where \(h\equiv 0\) and \(g\) is an image which is equal to 2 at some point of \(\Gamma _{n,n}\) and vanishes outside of this point, and (b) for pairs where \(g\equiv 0\) and \(h\) is an image which is equal to 2 at some point of \(\Gamma _{n,n}\) and vanishes outside of this point. Task (b) clearly reduces to task (a) by swapping the coordinates \(i,j\) of points from \(\Gamma _{n,n}\), so that we may focus solely on task (a). Thus, assume that \(g\) is a cyclic shift of the image \(2\delta \):
From (51) it follows that then \(B[g,0]\) is a cyclic shift of \(B[2\delta ,0]\), whence \(|{\hbox {DFT}}[B[g,0]](p,q)|=|{\hbox {DFT}}[B[2\delta ,0]](p,q)|\) for all \([p;q]\in \Gamma _{n,n}\), which, by (53), implies that \(|Y[B[g,0]](p,q)|=|Y[B[2\delta ,0]](p,q)|\) for all \([p;q]\in \Gamma _{n,n}\). The bottom line is that all we need is to verify that (54) holds true for \(g=2\delta ,h=0\), or, which is the same, that with
where the right hand side by definition is\(0\) at\(p=q=0\), it holds
Now, (55) makes sense for all \([p;q]\in {\mathbf {Z}}^2\) (provided that we define the right hand side as zero at all points of \({\mathbf {Z}}^2\) where the denominator in (55) vanishes, that is, at all points where \(p,q\) are integer multiples of \(n\)) and defines \(y\) as a doubly periodic function of \([p;q]\), with period \(n\) in \(p\) and in \(q\). Therefore, setting \(m=\lfloor n/2\rfloor \ge 1\) and \(W=\{[p;q]\in {\mathbf {Z}}^2: -m\le p,q<n-m\}\), we have
Setting\(\rho (p,q)=\sqrt{p^2+q^2}\), observe that when\(0\ne [p;q]\in W\), we have\(|1-\exp \{2\pi \imath p/n\}|\le C_1n^{-1}\rho (p,q)\) and\(2[1-{1\over 2}[\cos (2\pi p/n)+\cos (2\pi q/n)]]\ge C_2n^{-2}\rho ^2(p,q)\) with positive absolute constants\(C_1,C_2\), whence
With an appropriately selected absolute constant \(C_3\) we have
Thus,\(C_n\le (C_1/C_2)^2C_3n^2\ln (n)\), meaning that (54), and thus (48), holds true with\(\mathcal{C}=\sqrt{C_3}C_1/C_2\).\(\square \)
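The logarithmic factor comes from a standard lattice-sum estimate, which we believe is what hides behind the constant \(C_3\): for every integer \(k\ge 1\) the number of points \([p;q]\in {\mathbf {Z}}^2\) with \(k\le \rho (p,q)<k+1\) is at most \(C'k\) for an absolute constant \(C'\), and \(\rho (p,q)\le 2n\) on \(W\), so that
$$\begin{aligned} \sum _{0\ne [p;q]\in W}{1\over \rho ^2(p,q)}\le \sum _{k=1}^{2n}{C'k\over k^2}\le C'\big (1+\ln (2n)\big )\le C_3\ln (n),\qquad n\ge 2. \end{aligned}$$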
We are very grateful to the Referees for their reading of the manuscript and thoughtful comments.
We have included after Theorem 2 a comment comparing the proposed parametric optimization algorithm with one based on bisection. However, we finally decided not to include the numerical comparison: such an experiment would require explaining in some detail what the “bisection” algorithm is and how it is implemented, which would make the manuscript, already long with respect to the size requirements of Mathematical Programming, even longer.
We would like to thank Referee 2 for the detailed and clarifying explanations of his comments on the initial manuscript. Some comments of Referee 2 concern notation that we have not modified in our revision. Since his requests do not seem to be compulsory and the final decision is left to us, we have modified some notation (e.g., we replaced \(D^+\) with \(\bar{D}\)) but preferred to keep the rest unchanged.
In what follows we provide our answers to other questions raised in his detailed report.
\(6.16^*\)...I would suggest that the authors simply say, when introducing (12), that it is implied by (4) but more general, to give some hint about why they are mentioning it
We have added a small comment after display (12) to make clear the choice of the norm \(\Vert \cdot \Vert _X\) there.
\(11.9^*\)...some readers might be more familiar with a statement like “\(\mathcal{A}\)has full column rank,” or “\(\mathcal{A}\)has a trivial nullspace,” etc.
We think that for a linear operator (which, in our case, may not be a matrix) the commonly adopted terminology is indeed “\(\mathcal{A}\) has a trivial nullspace” or “\(\mathcal{A}\) has a trivial kernel,” which is the exact wording of \(\mathrm{Ker}(\mathcal{A})=\{0\}\).
\(11.19^*\)...but I think it would suffice to simply note in the text that the given value\(\sum |\lambda _\zeta |\rho \)is an upper bound on\(\Vert x\Vert \)for\(x\)in the given form.
Thank you, we have added a comment in this sense after display (29).
\(30.5^*\)...I’ll say that the authors are practicing a false economy in not giving a more complete explanation, and that at the least, “we immediately see” should be “one can show”
Thank you for insisting; you are probably right. We have reproduced in the text our answer to the corresponding comment on the previous version of the manuscript.
About this article
Cite this article
Harchaoui, Z., Juditsky, A. & Nemirovski, A. Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program. 152, 75–112 (2015). https://doi.org/10.1007/s10107-014-0778-9