Regularized Distributed Optimization:
A CommunicationEfficient PrimalDual Framework
Abstract
Despite the importance of sparsity in many largescale applications, there are few methods for distributed optimization of sparsityinducing objectives. In this paper, we present a communicationefficient framework for regularized optimization in the distributed environment. By viewing classical objectives in a more general primaldual setting, we develop a new class of methods that can be efficiently distributed and applied to common sparsityinducing models, such as Lasso, sparse logistic regression, and elastic netregularized problems. We provide theoretical convergence guarantees for our framework, and demonstrate its efficiency and flexibility with a thorough experimental comparison on Amazon EC2. Our proposed framework yields speedups of up to 50 as compared to current stateoftheart methods for distributed regularized optimization.
1 Introduction
In this paper, we consider standard regularized loss minimization problems, including as our main focus regularized optimization problems of the form
where is the weight vector, is a given data matrix, and is a regularization parameter. This formulation includes many popular regularized classification and regression models, such as Lasso and sparse logistic regression, and is easily extended to other separable regularizers like elastic net. Models of this form are particularly useful in highdimensional settings because of their tendency to bias learning towards sparse solutions. However, despite their importance, few methods currently exist to efficiently fit such sparsityinducing models in the distributed environment.
One promising distributed method is CoCoA [12, 17], a recently proposed primaldual framework that demonstrates competitive performance, provides a flexible communication scheme, and enables the use of offtheshelf singlemachine solvers internally. However, by solving the problem in the dual, CoCoA (like SDCA, proxSDCA, and numerous other primaldual methods [26, 27, 30, 35, 36]) is only equipped to handle strongly convex regularizers, which prevents it from being directly applied to regularized objectives. Moreover, by requiring the data to be distributed by data point rather than by feature, communication can become a prohibitive bottleneck for CoCoA as the number of features grows large, which is precisely the setting of interest for regularization.
In this work, we take a different perspective and propose a framework that can run either in the dual, or on the primal directly. From this change in perspective we derive several new primaldual distributed optimization methods, in particular for sparsityinducing regularizers. Our approach uses ideas from CoCoA, though leveraging these ideas in this new setting requires significant theoretical and algorithmic modifications, particularly in handling nonstrongly convex regularizers. The proposed primaldual framework and associated rates are novel contributions even in the nondistributed case.
1.1 Contributions
Generalized framework.
By building on the CoCoA framework, proxCoCoA comes with several benefits, including the use of arbitrary local solvers on each machine, and the analysis of and ability to solve subproblems to arbitrary accuracies. However in contrast to CoCoA, we consider a much broader class of optimization problems. This results in a more general framework that: (1) specifically incorporates the case of regularization; (2) allows for the flexibility of distributing the data by either feature or data point; and (3) can be run on either the primal or dual formulation, which we show to have significant theoretical and practical implications.
Analysis of nonstrongly convex regularizers and losses.
We derive convergence rates for the general class of problems considered in this work, leveraging a novel approach in the analysis of primaldual rates for nonstrongly convex regularizers. The proposed technique is a significant improvement over simple smoothing techniques used in, e.g., [22, 27, 35] that enforce strong convexity by adding a small term to the objective. Our results include primaldual rates and certificates for both strongly convex and nonstrongly convex regularizers and losses, and we show how earlier rates of CoCoA and CoCoA can be derived as a special case of our new rates / methods.
Experimental comparison.
The proposed framework yields orderofmagnitude speedups (as much as 50 faster) as compared to other stateoftheart methods for regularized optimization. We demonstrate these performance gains in an extensive experimental comparison on realworld distributed datasets. We additionally show significant improvements over CoCoA when considering strongly convex objectives. All algorithms for comparison are implemented in Apache Spark and run on Amazon EC2 clusters. Our code is available at: github.com/gingsmith/proxcocoa.
2 Setup
A great variety of methods in machine learning and signal processing are posed as the minimization of a weighted sum of two convex functions, where the first term is a convex function of a linear predictor and the second term is a regularizer:
(A) 
Here is the parameter vector, and is a data matrix with column vectors , and row vectors , . Our central assumption will be that is separable, meaning that
for convex functions . Furthermore, we assume is smooth for .
Examples.
The above setting encompasses all convex loss functions depending on linear predictors , together with most common convex regularizers, including all separable functions, such as  or general norms, or the elastic net given by .
Data partitioning.
To map this setup to the distributed environment, we suppose that the dataset is distributed over machines according to a partition of the columns of . We denote the size of the partition on machine by . For and , we define as the vector with elements if and otherwise.
3 The proxCoCoA Algorithmic Framework
The proxCoCoA framework is given in Algorithm 1. This framework builds on the recent CoCoA framework [12, 17], though with a more general objective, a modified subproblem, and where we allow the method to be applied to either the primal or dual formulation. To distribute the method, we assign each machine to work only on local coordinates of the weight vector , and access only data that is stored locally. Machines share state through the vector . This vector is communicated at each round after using local solvers in parallel to find (possibly) approximate solutions to the subproblems defined in (2). Solving the primal problem (A) directly with proxCoCoA will result in distributing the data columnwise (by feature), and having the vector be of length equal to the number of data points. This can greatly reduce communication costs as the number of features grows (see Section 6). Most importantly, the proposed setup will prepare us to handle nonstrongly convex regularizers in both theory and practice, as we further explain in the following sections.
Datalocal quadratic subproblems.
For each machine, we define a datalocal subproblem of the original optimization problem (A). This simpler problem can be solved on machine and only requires accessing data which is already available locally, i.e., columns such that . The subproblem depends only on the previous shared vector and the local data:
(1) 
where
(2) 
with . We denote the change of local variables for indices as . For a given aggregation parameter , the subproblem relaxation parameter will be set as , but can also be improved in a datadependent way as we discuss in Appendix E.
Reusability of existing singlemachine solvers.
Our local subproblems have the appealing property of being very similar in structure to the global problem (A), with the main difference being that they are defined on a smaller (local) subset of the data. For the user of our framework, this presents a major advantage in that existing single machinesolvers can be directly reused in our distributed framework (Algorithm 1) by employing them on the subproblems . Therefore, problemspecific tuned solvers which have already been developed, along with associated speed improvements (such as multicore implementations), can be easily leveraged in the distributed setting. We quantify the dependence on local solver performance in more detail in our convergence analysis (Section 4).
Interpretation.
The above definition of the local objective functions are such that they closely approximate the global objective in (A) as the “local” variable varies, which we will see in the analysis (Lemma 8 in the appendix). In fact, if the subproblem were solved exactly, this could be interpreted as a datadependent, blockseparable proximal step, applied to the part of the objective (A) as follows:
where
However, note that in contrast to traditional proximal methods, our algorithm does not assume that the prox subproblems be solved to high accuracy, as we instead allow the use of local solvers of any approximation quality . This notion is made precise with the following assumption.
Assumption 1 (approximate solution, see [17]).
We assume that there exists such that , the local solver at any outer iteration produces a (possibly) randomized approximate solution , which satisfies
(3) 
where
(4) 
3.1 PrimalDual Context
Exploiting primaldual structure is not a requirement to optimize (A); indeed, we have shown above how to solve this optimization problem directly. However, noting the relationship between primal and dual objectives has many benefits, including computation of the duality gap, which allows us to have a certificate of approximation quality. It is also useful as an analysis tool and helps relate this work to the prior work of [30, 12, 17]. To leverage this structure, starting from our original formulation (A) with objective function , the dual problem is given by
(B) 
Here is a weight vector and are columns of the data matrix . The functions are the convex conjugates of in the original problem (A). This duality structure is known as FenchelRockafellar Duality (see [4, Theorem 4.4.2] or a selfcontained derivation in the appendix).
Given in the context of (A), a corresponding primal vector for problem (B) is obtained by:
(5) 
This mapping is given by the firstorder optimality conditions for the part of the objective. (Recall that we assumed are arbitrary closed convex functions, is smooth.) The duality gap, given by:
(6) 
acts as a certificate of approximation quality, as the distance to the true optimum is always bounded above by the duality gap. A globally defined and finite duality gap for any problem (A) can be obtained by bounding the region of interest for the iterates . This “Lipschitzing” trick will make the conjugates globally defined and Lipschitz [8], as we prove in Section 4.
Primal vs. Dual.
Previous work of CoCoA mapped machine learning tasks to (B), and then solved this problem in the dual. While this can still be accomplished with the machinery of proxCoCoA (see Section F), here our main focus is to instead solve the original objective (A) directly. This can have a large practical impact for the described applications in the distributed setting, as it implies that we can distribute the data by feature rather than by data point. Further, we will communicate a vector equal in size to the number of data points, as opposed to the number of features. When the number of features is high (as is common in sparsityinducing models) this can significantly reduce communication and improve overall performance, as we demonstrate in Section 6. Further, it allows us to directly leverage stateoftheart coordinatewise primal methods, such as glmnet [11] and extensions [34, 13]. From a theoretical perspective, solving will allow us to consider nonstrongly convex regularizers, which were not covered in CoCoA, as we discuss in Section 4.
4 Convergence Analysis
In this section we provide convergence rates for the proposed framework, and introduce an important theoretical technique in analyzing nonstrongly convex terms in the primaldual setting. For simplicity of presentation, we assume in the analysis that the data partition is balanced; i.e., for all . Furthermore, we assume that the columns of A satisfy for all . We present results for the case where in Algorithm 1, and where the subproblems (2) are defined using the corresponding safe bound . This case delivers the fastest convergence rates in the distributed setting, which in particular don’t degrade as the number of machines grows and remains fixed.
4.1 General Convex
Our first main theorem provides convergence guarantees for objectives with nonstrongly convex regularizers, including models such as Lasso and sparse logistic regression. Providing primaldual rates and globally defined primaldual accuracy certificates requires a theoretical technique that we introduce below, in which we show how to satisfy the following notion of bounded support.
Definition 1 (Bounded Support).
A function has bounded support if its effective domain is bounded by , i.e.,
(7) 
As we explain in Section F of the appendix, our assumption of bounded support for the functions can be interpreted as an assumption that their conjugates are globally Lipschitz.
Theorem 1.
Bounded support modification.
Note that the absolute value function for regularization does not have bounded support, and thus violates the assumptions yielding convergence in Theorem 1. Its dual, the indicator function of the interval, is not defined globally, and thus does not always allow a finite duality gap. To address this, existing approaches typically use a simple smoothing technique as in [22]: by adding a small amount of to the norm, it becomes strongly convex; see, e.g., [27]. This Nesterov smoothing technique is undesirable in practice, as it changes the iterates, the convergence rate, and the tightness of the resulting duality gap. Further, the amount of smoothing can be difficult to tune and can have a large influence on the performance of the method at hand. We show examples of this issue with experiments in Section 6.
In contrast, our approach preserves all solutions of the original objective, leaves the iterate sequence unchanged, and allows for direct reusability of existing solvers. It also removes the need for additional parameter tuning. To achieve this, we modify the function by imposing an additional weak constraint that is inactive in our region of interest. Formally, we replace by
For large enough , this problem yields the same solution as the original regularized objective. Note that this only affects convergence theory, in that it allows us to present a strong primaldual rate (Theorem 1 for =). The modification of does not affect the algorithms for the original problems. Whenever a monotone optimizer is used, we will never leave the level set defined by the objective at the starting point. We provide further details on this technique in Section D.3, and illustrate how to leverage it for a variety of applications (see Section C of the appendix and also [8]).
4.2 Strongly Convex
For the case of strongly convex , including elastic netregularized objectives, we obtain the following faster geometric convergence rate.
5 Related Work
Singlemachine coordinate solvers.
For strongly convex regularizers, current stateoftheart for empirical loss minimization is randomized coordinate ascent on the dual (SDCA) [26] and its accelerated variants, e.g., [27]. In contrast to primal stochastic gradient descent (SGD) methods, the SDCA family is often preferred as it is free of learningrate parameters and has faster (geometric) convergence guarantees. Interestingly, a similar trend in coordinate solvers has been observed in recent Lasso literature, but with the roles of primal and dual reversed. For those problems, coordinate descent methods on the primal have become stateoftheart, as in glmnet [11] and extensions [34]; see, e.g., the overview in [33]. However, primaldual convergence rates for unmodified coordinate algorithms have to our knowledge been obtained only for strongly convex regularizers to date [27, 35].
Connection to coordinatewise Newton methods.
Coordinate descent on regularized problems (A) with can be interpreted as the iterative minimization of a quadratic approximation of the smooth part of the objective (as in a onedimensional Newton step), followed by a shrinkage step resulting from the part. In the singlecoordinate update case, this is at the core of glmnet [11, 33], and widely used in, e.g., solvers based on the primal formulation of regularized objectives [25, 34, 3, 9, 28]. When changing more than one coordinate at a time, again employing a quadratic upper bound on the smooth part, this results in a twoloop method as in glmnet [11] for the special case of logistic regression. This idea is crucial for the distributed setting.
Parallel coordinate descent.
Parallel coordinate descent for regularized objectives (with and without using minibatches) was proposed in [7] (Shotgun) and generalized in [3] , and is among the best performing solvers in the parallel setting. Our framework reduces to Shotgun as a special case when the internal solver is a single coordinate update on the subproblem (2), , and for a suitable . However, Shotgun is not covered by our convergence theory, since it uses a potentially unsafe upper bound instead of , which isn’t guaranteed to satisfy the condition (21). Other parallel coordinate descent methods on the objective have recently been analyzed in [9, 28, 21], but not in the communicationefficient or distributed setting.
Distributed solvers.
The methods most closely related to our approach are distributed variants of glmnet as in [18]. Inspired by glmnet and [34], the work of [3, 18] introduced the idea of a blockdiagonal Hessian upper approximation in the distributed context. The later work of [29] specialized this approach to sparse logistic regression.
If hypothetically each of our quadratic subproblems as defined in (2) were to be minimized exactly, the resulting steps could be interpreted as blockwise Newtontype steps on each coordinate block , where the Newtonsubproblem is modified to also contain the regularizer [18, 34, 23]. While [18] allows a fixed accuracy for these subproblems—but not arbitrary approximation quality as in our framework—the work of [29, 34, 31] assumes that the quadratic subproblems are solved exactly. Therefore, these methods are not able to freely trade off communication and computation. Also, they do not allow the reuse of arbitrary local solvers. On the theoretical side, the rate results provided by [18, 29, 34] are not explicit convergence rates but only asymptotic, as the quadratic upper bounds are not explicitly controlled for safety as with our .
Batch solvers.
ADMM [5], proximal gradient descent, and quasiNewton methods such as LBFGS and are also often used in distributed environments because of their relatively low communication requirements. However, they require at least a full (distributed) batch gradient computation at each round, and therefore do not allow the gradual tradeoff between communication and computation provided by proxCoCoA. The works of [19] and [15] have obtained encouraging results for distributed systems employing coordinate descent variants on problems. The latter approach distributes both columns and rows of the data matrix and can be extended to Lasso. However it only provides asymptotic improvement per step, and no convergence rate. We include experimental comparisons with ADMM, proxGD, and orthantwise limited memory quasiNewton (OWLQN) [1], an LBFGS variant that can handle regularization [32], but which has no convergence rate.
Finally, we note that while the provided convergence rates for proxCoCoA mirror the convergence class of classical batch gradient methods in terms of the number of outer rounds, existing batch proximal gradient methods come with a weaker theory, as they do not allow general inexactness for the local subproblem (2). In contrast, our shown convergence rates incorporate this approximation directly, and, moreover, hold for arbitrary local solvers of much cheaper cost than batch methods (where in each round, every machine has to process exactly a full pass through the local data). This makes proxCoCoA more flexible in the distributed setting, as it can adapt to varied communication costs on real systems. We will see in the following section that this flexibility results in significant performance gains over the competing methods.
6 Experimental Results
In this section we compare proxCoCoA to numerous stateoftheart methods for largescale regularized optimization, including:

[leftmargin=0.3in]

MbSGD: minibatch stochastic gradient
descent with an prox

ProxGD: full proximal gradient descent

OWLQN: orthantwise limited quasiNewton

ADMM: alternating direction method
of multipliers

MbCD: minibatch parallel coordinate
descent, incl. Shotgun
The first three methods are optimized and implemented in Apache Spark’s MLlib (v1.5.0) [20]. We employ coordinate descent as a local solver for proxCoCoA, and apply proxCoCoA directly to the primal formulation of Lasso and elastic net, thereby mapping the problem to (A) and solving this objective directly. A comparison with Shotgun is provided as an extreme case to highlight the detrimental effects of frequent communication in the distributed environment.
We test the performance of each method in largescale experiments fitting Lasso and elastic net regression models to the datasets shown in Table 1. All code is written in Apache Spark and experiments are run on public cloud Amazon EC2 m3.xlarge machines with one core per machine. For MbCD, Shotgun, and proxCoCoA in the primal, datasets are distributed by feature, whereas for MbSGD, ProxGD, OWLQN, ADMM, and CoCoA they are distributed by datapoint.
Dataset  Training  Features  Sparsity 

url  2 M  3 M  3.5e5 
epsilon  400 K  2 K  1.0 
kddb  19 M  29 M  9.8e7 
webspam  350 K  16 M  2.0e4 
We carefully tune each competing method for best performance. ADMM requires the most tuning, both in selecting the penalty parameter and in solving the subproblems. Solving the subproblems to completion for ADMM is prohibitively slow, and we thus use iterations of conjugate gradient and improve performance by allowing early stopping. We also use a varying penalty parameter — practices described in [5, Sec. 4.3, 3.4.1]. For MbSGD, we tune the step size and minibatch size parameters. For MbCD, we scale the updates at each round by for minibatch size and , and tune both parameters and . Further implementation details for all methods are given in the appendix (Section G).
Comparison with methods.
In analyzing the performance of each algorithm (Figure 1), we measure the improvement to the primal objective given in (A) in terms of wallclock time in seconds. We see that both MbSGD and MbCD are slow to converge, and come with the additional burden of having to tune extra parameters (though MbCD makes clear improvements over MbSGD). As expected, naively distributing Shotgun [7] (single coordinate updates per machine) does not perform well, as it is tailored to sharedmemory systems and requires communicating too frequently. OWLQN performs the best of all compared methods, but is still much slower to converge than proxCoCoA, converging, e.g., 50 more slowly for the webspam dataset. The optimal performance of proxCoCoA is particularly evident in datasets with large numbers of features (e.g., url, kddb, webspam), which are exactly the datasets of interest for regularization.
Results are shown for regularization parameters such that the resulting weight vector is sparse. However, our results are robust to varying values of as well as to various problem settings, as we illustrate in Figure 2.
We note that in contrast to the compared methods, proxCoCoA comes with the benefit of having only a single parameter to tune: the subproblem approximation quality, , which can be controlled via the number of local subproblem iterations, . We further explore the effect of this parameter in Figure 3, and provide a general guideline for choosing it in practice (see Remark 1). In particular, we see that while increasing always results in better performance in terms of rounds, smaller or larger values of may result in better performance in terms of wallclock time, depending on the cost of communication and computation. The flexibility to tune is one of the reasons for proxCoCoA’s significant performance gains.
Comparison with CoCoA.
Finally, we point out several important ways in which proxCoCoA improves upon the CoCoA framework [17]. First, CoCoA cannot be included in the set of experiments in Figure 1 because it cannot be directly applied to the Lasso objective (CoCoA only allows for strongly convex regularizers^{1}^{1}1CoCoA in [17] is in fact limited to the case where the regularizer is equal to the norm , though the extension to strongly convex regularizers is covered as a special case in our analysis.). Second, as shown in Figure 4, the performance of CoCoA degrades drastically when considering datasets with large numbers of features, such as the webspam dataset. One reason for this is that CoCoA distributes data by data point, which necessitates communicating a vector of length equal to the feature size. When the feature size is large, this can become expensive. The results shown hold despite the fact that we have tuned (the number of local solver iterations) separately for both proxCoCoA and CoCoA.
Beyond communication, we also see that CoCoA is slower to converge as the regularizer becomes less strongly convex (Figure 4a). Indeed, even when the number of features is relatively low such as for the epsilon dataset, we see that the performance of CoCoA degrades significantly as the regularizer approaches pure . In Figure 4, we illustrate this by implementing the Nesterov smoothing technique used in, e.g., [27, 35] — adding a small amount of strong convexity to the objective for Lasso regression. We show results for decreasing levels of . As decreases, the final sparsity of the problem starts to match that of running pure (Figure 4c), but the performance also degrades (Figure 4b). We note again that through the modification presented in Section 4, we can deliver strong rates without having to make these fundamental alterations to the problem of interest.
Acknowledgments
We thank Michael P. Friedlander and Martin Takáč for fruitful discussions.
References
 [1] G. Andrew and J. Gao. Scalable training of L1regularized loglinear models. In ICML, 2007.
 [2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer New York, New York, NY, 2011.
 [3] Y. Bian et al. Parallel coordinate descent newton method for efficient regularized minimization. arXiv.org, 2013.
 [4] J. M. Borwein and Q. Zhu. Techniques of Variational Analysis and Nonlinear Optimization. Canadian Mathematical Society Books in Math, Springer New York, 2005.
 [5] S. Boyd et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010.
 [6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [7] J. K. Bradley et al. Parallel coordinate descent for l1regularized loss minimization. In ICML, 2011.
 [8] C. Dünner et al. PrimalDual Rates and Certificates. In ICML, 2016.
 [9] O. Fercoq and P. Richtárik. Accelerated, Parallel, and Proximal Coordinate Descent. SIAM Journal on Optimization, 25(4):1997–2023, Oct. 2015.
 [10] S. Forte. Distributed Optimization for NonStrongly Convex Regularizers. Master’s thesis, ETH Zürich, Sept. 2015.
 [11] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
 [12] M. Jaggi et al. Communicationefficient distributed dual coordinate ascent. In NIPS, 2014.
 [13] T. Johnson and C. Guestrin. Blitz: A Principled MetaAlgorithm for Scaling Sparse Optimization. In ICML, 2015.
 [14] S. M. Kakade, S. ShalevShwartz, and A. Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Technical report, TTI, 2009.
 [15] Kang et al. Data/feature distributed stochastic coordinate descent for logistic regression. In CIKM, 2014.
 [16] Z. Lu and L. Xiao. On the complexity analysis of randomized blockcoordinate descent methods. arXiv.org, 2013.
 [17] C. Ma et al. Adding vs. averaging in distributed primaldual optimization. In ICML, 2015.
 [18] D. Mahajan, S. S. Keerthi, and S. Sundararajan. A distributed block coordinate descent method for training regularized linear classifiers. arXiv.org, 2014.
 [19] H. B. McMahan et al. Ad click prediction: a view from the trenches. In KDD, 2013.
 [20] X. Meng et al. Mllib: Machine learning in apache spark. arXiv.org, 2015.
 [21] I. Necoara and D. Clipici. Parallel Random Coordinate Descent Method for Composite Minimization: Convergence Analysis and Error Bounds. SIAM Journal on Optimization, 26(1):197–226, Jan. 2016.
 [22] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103(1):127–152, 2005.
 [23] Z. Qu et al. SDNA: Stochastic dual newton ascent for empirical risk minimization. arXiv.org, 2015.
 [24] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1997.
 [25] S. ShalevShwartz and A. Tewari. Stochastic methods for lregularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.
 [26] S. ShalevShwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
 [27] S. ShalevShwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, Series A:1–41, 2014.
 [28] R. Tappenden and P. Richtárik. On the complexity of parallel coordinate descent. arXiv.org, 2015.
 [29] I. Trofimov and A. Genkin. Distributed coordinate descent for l1regularized logistic regression. arXiv.org, 2014.
 [30] T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, 2013.
 [31] I. E.H. Yen, S.W. Lin, and S.D. Lin. A Dual Augmented Block Minimization Framework for Learning with Limited Memory. In NIPS, 2015.
 [32] J. Yu, S. Vishwanathan, S. Günter, and N. N. Schraudolph. A quasinewton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 2010.
 [33] G.X. Yuan et al. A comparison of optimization methods and software for largescale l1regularized linear classification. Journal of Machine Learning Research, 2010.
 [34] G.X. Yuan, C.H. Ho, and C.J. Lin. An improved glmnet for l1regularized logistic regression. Journal of Machine Learning Research, 2012.
 [35] Y. Zhang and X. Lin. Stochastic PrimalDual Coordinate Method for Regularized Empirical Risk Minimization. In ICML, 2015.
 [36] S. Zheng et al. A general distributed dual coordinate optimization framework for regularized loss minimization. In arXiv.org, 2016.
Appendix
Appendix A Definitions
Definition 2 (Lipschitz Continuity).
A function is Lipschitz continuous if , we have
(10) 
Definition’ 1 (Bounded Support).
A function has bounded support if its effective domain is bounded by , i.e.,
(11) 
Definition 3 (Smoothness).
A function is called smooth, for , if it is differentiable and its derivative is Lipschitz continuous, or equivalently
(12) 
Definition 4 (Strong Convexity).
A function is called strongly convex, for , if
(13) 
And analogously if the same holds for all subgradients, in the case of a general closed convex function .
Appendix B Convex Conjugates
The convex conjugate of a function is defined as
(14) 
Below we list several useful properties of conjugates (see, e.g., [6, Section 3.3.2]):

Double conjugate: if is closed and convex.

Value Scaling: (for )

Argument Scaling: (for )

Conjugate of a separable sum:
Lemma 3 (Duality between Lipschitzness and LBounded Support, [24, Corollary 13.3.3]).
Given a proper convex function , it holds that is Lipschitz if and only if has bounded support.
Lemma 4 (Duality between Smoothness and Strong Convexity, [14, Theorem 6]).
Given a closed convex function , it holds that is strongly convex w.r.t. the norm if and only if is smooth w.r.t. the dual norm .
Appendix C Applications
c.1 and General NonStrongly Convex Regularizers
regularization is obtained in the objective (A) by letting . Primaldual convergence can be obtained by using the modification introduced in Section 4, which will guarantee bounded support. Formally, we replace by
For large enough , this problem yields the same solution as the original objective. We provide a detailed proof and description of this technique in Section D.3. Note that this only affects convergence theory, in that it allows us to present a strong primaldual rate (Theorem 1 for =).
c.2 Elastic Net and General Strongly Convex Regularizers
Another application we can consider is elastic net regularization, , for fixed parameter , which is obtained by setting in (A). For the special case , we obtain the norm. For elasticnetregularized problems of the form (A), Theorem 2 gives a global linear (geometric) convergence rate, since is strongly convex. This holds as long as the datafit function is smooth (see Section C.4), and directly yields a primaldual algorithm and corresponding rate.
c.3 Local Solvers for and Elastic Net
For the regularizer in the primal setting, the local subproblem (2) becomes a simple quadratic problem on the local data, with regularization applied only to local variables . Therefore, existing fast solvers for the singlemachine case, such as glmnet variants [11] or blitz [13] can be directly applied to each local subproblem within Algorithm 1. The sparsity induced on the subproblem solutions of each machine naturally translates into the sparsity of the global solution, since the local variables will be concatenated.
In terms of the approximation quality parameter for the local problems (Assumption 1), we can apply existing recent convergence results from the single machine case. For example, for randomized coordinate descent (as part of glmnet), [16, Theorem 1] gives a approximation quality for any separable regularizer, including and elastic net; see also [28, 25].
c.4 Smooth DataFit Functions
To illustrate the role of as a smooth datafit function in this section—contrasting with its role as a regularizer in traditional CoCoA as we discuss in Section F—we consider the following examples.
Least squares loss.
Let be labels or response values, and consider the least squares objective , which is smooth. We obtain the familiar leastsquares regression objective in our optimization problem (A), using
Observing that the gradient of is , the dualtoprimal mapping is given by: , which is well known as the residual vector in leastsquares regression.
Logistic regression loss.
For classification problems, we consider a logistic regression model with training examples for collected as the rows of the data matrix . For each training example, we are given a binary label, which we collect in the vector . Formally, the objective is defined as , which is again a separable function. The classifier loss is given by
(15) 
where is the parameter vector. It is not hard to show that is smooth if the labels satisfy ; see e.g. Lemma 5 below. The primaldual mapping is given by
Appendix D Proofs of PrimalDual Relationship
In the following subsections we provide derivations of the primaldual relationship of the general objectives (A) and (B), and then show how to derive this primaldual setup for various applications.
d.1 PrimalDual Relationship
The relation of our original formulation (A) to its dual formulation (B) is standard in convex analysis, and is a special case of the concept of Fenchel Duality. Using the combination with the linear map as in our case, the relationship is called FenchelRockafellar Duality, see e.g. [4, Theorem 4.4.2] or [2, Proposition 15.18]. For completeness, we illustrate this correspondence with a selfcontained derivation of the duality.
Starting with the original formulation (A), we introduce a helper variable vector representing . Then optimization problem (A) becomes:
(16) 
Introducing Lagrange multipliers , the Lagrangian is given by:
The dual problem of (A) follows by taking the infimum with respect to both and :
(17) 
We change signs and turn the maximization of the dual problem (17) into a minimization and thus we arrive at the dual formulation as claimed:
d.2 Conjugates and Smoothness of Functions of Interest
Lemma 5 (Conjugate and Smoothness of the Logistic Loss).
The logistic classifier loss function
(see also (15) above) is the conjugate of , where
(18) 
with the box constraint .
Furthermore, is strongly convex over its domain if the labels satisfy .
Proof of Lemma 5.
By separability of , the conjugate of is . For the losses, the conjugate pairs are , and with , see e.g. [26, Page 577].
For the strong convexity, we show 1strong smoothness of the conjugate , which is an equivalent property, see Lemma 4. Using the second derivative of the function , we have that , so is 1smooth w.r.t. the Euclidean norm. ∎
d.3 Conjugates of Common Regularizers
Lemma 6 (Conjugate of the Elastic Net Regularizer).
For , the elastic net function is the convex conjugate of
where is the positive part operator, for , and zero otherwise. Furthermore, this is smooth, i.e. has Lipschitz continuous gradient with constant .
Proof.
We start by applying the definition of convex conjugate, that is:
We now distinguish two cases for the optimal: , . For the first case we get that
Setting the derivative to we get . To satisfy , we must have . Replacing with we thus get:
Similarly we can show that for
Finally, by the fact that is convex, always positive, and , it follows that for every .
For the smoothness properties, we consider the derivative of this function and see that is smooth, i.e. has Lipschitz continuous gradient with constant , assuming . ∎
Continuous conjugate modification for indicator functions.
To apply the theoretical convergence result from Theorem 1 to objectives with norms, we modify the function by imposing an additional constraint. Consider replacing by
With this modified regularizer, the optimization problem (A) with regularization parameter becomes
(19) 
For large enough choice of the value , this problems yields the same solution as the original objective:
(20) 
As we can see, the is nothing more than a constrained version of the absolute value to the interval . Therefore by setting to a large enough value that the interesting values of will never reach, we can have continuous and at the same time make (19) equivalent to (20).
Formally, a simple way to obtain a large enough value of , so that all solutions of (20) are unaffected is the following. Note that we start the algorithm at . For every solution encountered during the execution of the algorithm, the objective values should never become worse than . In other words, we restrict the optimization problem to the level set given by the initial starting value. Formally, this means that for every , we will always require:
(Note that holds without loss of generality). We can thus set the value of to be .
Lemma 7 (Conjugate of the modified norm).
The convex conjugate of as defined above is
and is Lipschitz.
Proof.
We start by applying the definition of convex conjugate:
We begin by looking at the case in which ; in this case it’s easy to see that when , we have:
as . The case holds analogously. We’ll now look at the case ; in this case it is clear we must have . It also must hold that , since
for every . Therefore the maximization becomes
which has maximum at . The remaining case follows in similar fashion. ∎
Appendix E Convergence Proofs
In this section we provide proofs of our main convergence results. The results are motivated by [17], but where we have significantly generalized the problem of interest, and where we derive separate meaning by applying the problem directly to (A). We provide full details of Lemma 8 as a proof of concept, but omit details in later proofs that can be derived using the arguments in [17] or earlier work of [26], and instead outline the proof strategy and highlight sections where the theory deviates.
e.1 Approximation of by the Local Subproblems
We begin with a definition of the datadependent aggregation parameter for proxCoCoA, , which we will use in the throughout our convergence results.
Definition 5 (Datadependent aggregation parameter).
In Algorithm 1, the aggregation parameter controls the level of adding () versus averaging () of the partial solutions from all machines. For the convergence results discussed below to hold, the subproblem parameter must be chosen not smaller than
(21) 
The simple choice of is valid for (21), i.e.,
In some cases, it will be possible to give better (datadependent) choices for , closer to the actual bound given in .
Our first lemma in the overall proof of convergence helps to relate change in local subproblems to the global objective .
Lemma 8.
For any dual , , and real values satisfying (21), it holds that
(22) 
Proof.
In this proof we follow the line of reasoning in [17, Lemma 4] with a more general smoothness assumption on . An outer iteration of proxCoCoA performs the following update:
(23) 
We bound the terms and separately. First we bound A using smoothness of :