== Computing the SVM classifier ==
Computing the (soft-margin) SVM classifier amounts to minimizing an expression of the form
{{NumBlk||<math display="block">\left[\frac 1 n \sum_{i=1}^n \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right) \right] + \lambda \|\mathbf{w}\|^2. </math>|{{EquationRef|2}}}}

We focus on the soft-margin classifier since, as noted above, choosing a sufficiently small value for <math>\lambda</math> yields the hard-margin classifier for linearly separable input data. The classical approach, which involves reducing {{EquationNote|2|(2)}} to a [[quadratic programming]] problem, is detailed below. More recent approaches, such as sub-gradient descent and coordinate descent, are discussed afterwards.
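
As an illustration, expression {{EquationNote|2|(2)}} can be evaluated directly for a given <math>\mathbf{w}</math> and <math>b</math>. The following is a minimal sketch in plain Python; the function name <code>svm_objective</code> and the list-based data layout are assumptions made for this example only.

```python
def svm_objective(w, b, X, y, lam):
    """Evaluate expression (2): mean hinge loss plus lam * ||w||^2.

    X is a list of feature tuples, y a list of labels in {+1, -1}.
    (Hypothetical helper, written for illustration only.)
    """
    n = len(X)
    hinge_sum = 0.0
    for x_i, y_i in zip(X, y):
        # y_i * (w^T x_i - b), the signed margin of point i
        margin = y_i * (sum(wk * xk for wk, xk in zip(w, x_i)) - b)
        hinge_sum += max(0.0, 1.0 - margin)
    return hinge_sum / n + lam * sum(wk * wk for wk in w)
```

For two points <math>(1, 0)</math> and <math>(-1, 0)</math> with labels <math>+1</math> and <math>-1</math>, the choice <math>\mathbf{w} = (1, 0)</math>, <math>b = 0</math> gives zero hinge loss, so the value is just <math>\lambda \|\mathbf{w}\|^2 = \lambda</math>.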

=== Primal ===
Minimizing {{EquationNote|2|(2)}} can be rewritten as a constrained optimization problem with a differentiable objective function in the following way.

For each <math>i \in \{1,\,\ldots,\,n\}</math> we introduce a variable <math> \zeta_i = \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right)</math>. Note that <math> \zeta_i</math> is the smallest nonnegative number satisfying <math> y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) \geq 1 - \zeta_i.</math>

Thus we can rewrite the optimization problem as follows:

<math display="block"> \begin{align}
&\text{minimize } \frac 1 n \sum_{i=1}^n \zeta_i + \lambda \|\mathbf{w}\|^2 \\[0.5ex]
&\text{subject to } y_i\left(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b\right) \geq 1 - \zeta_i \, \text{ and } \, \zeta_i \geq 0,\, \text{for all } i.
\end{align} </math>

This is called the ''primal'' problem.

=== Dual ===
By solving for the [[Duality (optimization)|Lagrangian dual]] of the above problem, one obtains the simplified problem

<math display="block"> \begin{align}
&\text{maximize}\,\, f(c_1 \ldots c_n) =  \sum_{i=1}^n c_i - \frac 1 2 \sum_{i=1}^n\sum_{j=1}^n y_i c_i(\mathbf{x}_i^\mathsf{T} \mathbf{x}_j)y_j c_j, \\
&\text{subject to } \sum_{i=1}^n c_iy_i = 0,\,\text{and } 0 \leq c_i \leq \frac{1}{2n\lambda}\;\text{for all }i.
\end{align}</math>

This is called the ''dual'' problem. Since the dual objective is a quadratic function of the <math> c_i</math> and the constraints are linear, the problem is efficiently solvable by [[quadratic programming]] algorithms.

Here, the variables <math> c_i</math> are defined such that

<math display="block"> \mathbf{w} = \sum_{i=1}^n c_iy_i \mathbf{x}_i.</math>

Moreover, <math> c_i = 0</math> exactly when <math> \mathbf{x}_i</math> lies on the correct side of the margin, and <math> 0 < c_i <(2n\lambda)^{-1}</math>  when <math> \mathbf{x}_i</math> lies on the margin's boundary. It follows that <math>\mathbf{w}</math> can be written as a linear combination of the support vectors.
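
The passage from the primal to the dual can be sketched as follows. The multipliers <math>\alpha_i</math> and <math>\beta_i</math> are notation introduced only for this outline. With <math>\alpha_i \geq 0</math> attached to the margin constraints and <math>\beta_i \geq 0</math> attached to the constraints <math>\zeta_i \geq 0</math>, the Lagrangian of the primal problem is

<math display="block">L = \frac 1 n \sum_{i=1}^n \zeta_i + \lambda \|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i\left[y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) - 1 + \zeta_i\right] - \sum_{i=1}^n \beta_i \zeta_i.</math>

Setting the derivatives with respect to <math>\mathbf{w}</math>, <math>b</math> and <math>\zeta_i</math> to zero gives

<math display="block">\mathbf{w} = \frac{1}{2\lambda}\sum_{i=1}^n \alpha_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i + \beta_i = \frac 1 n.</math>

Substituting these back yields <math>L = \sum_{i=1}^n \alpha_i - \lambda \|\mathbf{w}\|^2</math>. With the change of variables <math>c_i = \alpha_i / (2\lambda)</math>, this equals <math>2\lambda\, f(c_1 \ldots c_n)</math>, the condition <math>\alpha_i + \beta_i = 1/n</math> with <math>\beta_i \geq 0</math> becomes <math>0 \leq c_i \leq (2n\lambda)^{-1}</math>, and the expression for <math>\mathbf{w}</math> becomes <math>\mathbf{w} = \sum_{i=1}^n c_i y_i \mathbf{x}_i</math>.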

The offset, <math> b</math>, can be recovered by finding an <math> \mathbf{x}_i</math> on the margin's boundary and solving
<math display="block"> y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) = 1 \iff b = \mathbf{w}^\mathsf{T} \mathbf{x}_i - y_i .</math>

(Note that <math>y_i^{-1}=y_i</math> since <math>y_i=\pm 1</math>.)
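
Given dual coefficients <math>c_i</math> from some quadratic programming solver, the recovery of <math>\mathbf{w}</math> and <math>b</math> described above might look as follows. This is a sketch only: the function name <code>recover_w_and_b</code>, the tolerance <code>tol</code> and the list-based data layout are assumptions, and the solver itself is not shown.

```python
def recover_w_and_b(X, y, c, upper, tol=1e-12):
    """Recover (w, b) from dual coefficients c.

    upper is the box bound 1 / (2 * n * lam). Hypothetical helper:
    it assumes c already solves the dual problem.
    """
    dim = len(X[0])
    # w = sum_i c_i * y_i * x_i
    w = [0.0] * dim
    for x_i, y_i, c_i in zip(X, y, c):
        for k in range(dim):
            w[k] += c_i * y_i * x_i[k]
    # Any point with 0 < c_i < upper lies on the margin's boundary,
    # so b = w^T x_i - y_i for that point.
    for x_i, y_i, c_i in zip(X, y, c):
        if tol < c_i < upper - tol:
            b = sum(wk * xk for wk, xk in zip(w, x_i)) - y_i
            return w, b
    raise ValueError("no point with 0 < c_i < upper bound found")
```

For the two points <math>(1, 0)</math> and <math>(-1, 0)</math> with labels <math>+1</math> and <math>-1</math> and <math>\lambda = 0.1</math>, the dual optimum is <math>c_1 = c_2 = 0.5</math>, which gives <math>\mathbf{w} = (1, 0)</math> and <math>b = 0</math>.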

=== Kernel trick ===
{{Main|Kernel method}}
[[Image:Kernel trick idea.svg|thumbnail|right|A training example of an SVM with the feature map φ((''a'', ''b'')) = (''a'', ''b'', ''a''<sup>2</sup> + ''b''<sup>2</sup>)]]

Suppose now that we would like to learn a nonlinear classification rule which corresponds to a linear classification rule for the transformed data points <math> \varphi(\mathbf{x}_i).</math> Moreover, we are given a kernel function <math> k</math> which satisfies <math> k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)</math>.
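
For the map <math>\varphi((a, b)) = (a, b, a^2 + b^2)</math> from the figure caption, one kernel satisfying this identity is <math>k(\mathbf{x}, \mathbf{z}) = \mathbf{x} \cdot \mathbf{z} + \|\mathbf{x}\|^2 \|\mathbf{z}\|^2</math>. The short sketch below checks this numerically; the function names are chosen for this example only.

```python
def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(p):
    """Feature map phi((a, b)) = (a, b, a^2 + b^2)."""
    a, b = p
    return (a, b, a * a + b * b)

def kernel(p, q):
    """k(p, q) = p.q + |p|^2 |q|^2, computed without forming phi."""
    return dot(p, q) + dot(p, p) * dot(q, q)

x, z = (1, 2), (3, -1)
# Both sides of k(x, z) = phi(x) . phi(z) evaluate to 51 here.
assert kernel(x, z) == dot(phi(x), phi(z)) == 51
```

The point of the identity is that <code>kernel</code> never constructs the transformed points, which matters when the transformed space is high-dimensional.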

We know the classification vector <math>\mathbf{w}</math> in the transformed space satisfies

<math display="block">  \mathbf{w} = \sum_{i=1}^n c_iy_i\varphi(\mathbf{x}_i),</math>

where the <math>c_i</math> are obtained by solving the optimization problem

<math display="block"> \begin{align}
\text{maximize}\,\, f(c_1 \ldots c_n) &=  \sum_{i=1}^n c_i - \frac 1 2 \sum_{i=1}^n\sum_{j=1}^n y_ic_i(\varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j))y_jc_j \\
                                      &=  \sum_{i=1}^n c_i - \frac 1 2 \sum_{i=1}^n\sum_{j=1}^n y_ic_ik(\mathbf{x}_i, \mathbf{x}_j)y_jc_j \\
\text{subject to } \sum_{i=1}^n c_i y_i &= 0,\,\text{and } 0 \leq c_i \leq \frac{1}{2n\lambda}\;\text{for all }i.
\end{align}
</math>

The coefficients <math> c_i</math> can be solved for using quadratic programming, as before. Again, we can find some index <math> i</math> such that <math> 0 < c_i <(2n\lambda)^{-1}</math>, so that <math> \varphi(\mathbf{x}_i)</math> lies on the boundary of the margin in the transformed space, and then solve

<math display="block"> \begin{align}
b = \mathbf{w}^\mathsf{T} \varphi(\mathbf{x}_i) - y_i &= \left[\sum_{j=1}^n c_jy_j\varphi(\mathbf{x}_j) \cdot \varphi(\mathbf{x}_i)\right] - y_i \\
  &= \left[\sum_{j=1}^n c_jy_jk(\mathbf{x}_j, \mathbf{x}_i)\right] - y_i.
\end{align}</math>

Finally, a new point <math>\mathbf{z}</math> is classified by computing

<math display="block"> \mathbf{z} \mapsto \sgn(\mathbf{w}^\mathsf{T} \varphi(\mathbf{z}) - b) = \sgn \left(\left[\sum_{i=1}^n c_iy_ik(\mathbf{x}_i, \mathbf{z})\right] - b\right).</math>
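
The offset and the decision rule in kernel form can be sketched as below. The function names, the tolerance <code>tol</code> and the convention of mapping a score of exactly zero to <math>+1</math> are assumptions of this example; the coefficients <math>c_i</math> are taken as given.

```python
def kernel_offset(X, y, c, upper, kernel, tol=1e-12):
    """b = sum_j c_j y_j k(x_j, x_i) - y_i for some i with 0 < c_i < upper."""
    for x_i, y_i, c_i in zip(X, y, c):
        if tol < c_i < upper - tol:
            s = sum(c_j * y_j * kernel(x_j, x_i)
                    for x_j, y_j, c_j in zip(X, y, c))
            return s - y_i
    raise ValueError("no point with 0 < c_i < upper bound found")

def kernel_classify(z, X, y, c, b, kernel):
    """Return sgn(sum_i c_i y_i k(x_i, z) - b); a zero score maps to +1 here."""
    score = sum(c_i * y_i * kernel(x_i, z)
                for x_i, y_i, c_i in zip(X, y, c)) - b
    return 1 if score >= 0 else -1

def linear_kernel(u, v):
    """k(u, v) = u . v, used only to keep the usage example simple."""
    return sum(a * b for a, b in zip(u, v))
```

With the linear kernel, training points <math>(1, 0)</math> and <math>(-1, 0)</math> labelled <math>+1</math> and <math>-1</math>, and <math>c_1 = c_2 = 0.5</math>, the offset is <math>0</math>, the point <math>(2, 0)</math> is classified as <math>+1</math>, and <math>(-3, 1)</math> as <math>-1</math>.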

=== Modern methods ===
Recent algorithms for finding the SVM classifier include sub-gradient descent and coordinate descent. Both techniques offer significant advantages over the traditional approach when dealing with large, sparse datasets: sub-gradient methods are especially efficient when there are many training examples, and coordinate descent when the dimension of the feature space is high.

==== Sub-gradient descent ====
[[Subgradient method|Sub-gradient descent]] algorithms for the SVM work directly with the expression

<math display="block">f(\mathbf{w}, b) = \left[\frac 1 n \sum_{i=1}^n \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right) \right] + \lambda \|\mathbf{w}\|^2.</math>

Note that <math>f</math> is a [[convex function]] of <math>\mathbf{w}</math> and <math>b</math>. As such, traditional [[gradient descent]] (or [[Stochastic gradient descent|SGD]]) methods can be adapted: instead of taking a step in the direction of the function's negative gradient, a step is taken in the direction of the negative of a vector selected from the function's [[Subderivative|sub-gradient]]. This approach has the advantage that, for certain implementations, the number of iterations does not scale with <math>n</math>, the number of data points.<ref>{{Cite journal |title=Pegasos: primal estimated sub-gradient solver for SVM |journal=Mathematical Programming |date=2010-10-16 |issn=0025-5610 |pages=3–30 |volume=127 |issue=1 |doi=10.1007/s10107-010-0420-4 |first1=Shai |last1=Shalev-Shwartz |first2=Yoram |last2=Singer |first3=Nathan |last3=Srebro |first4=Andrew |last4=Cotter |citeseerx=10.1.1.161.9629 |s2cid=53306004 }}</ref>
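
A plain full-batch sub-gradient method for <math>f(\mathbf{w}, b)</math> can be sketched as follows. This is not the Pegasos algorithm from the cited paper: the step size <math>\eta_t = \eta_0 / \sqrt{t}</math>, the default parameter values, the function names and the choice to return the best iterate seen are all assumptions made for this illustration.

```python
import math

def objective(w, b, X, y, lam):
    """f(w, b): mean hinge loss plus lam * ||w||^2."""
    hinge = sum(max(0.0, 1.0 - y_i * (sum(wk * xk for wk, xk in zip(w, x_i)) - b))
                for x_i, y_i in zip(X, y))
    return hinge / len(X) + lam * sum(wk * wk for wk in w)

def subgradient_descent(X, y, lam, steps=200, eta0=0.1):
    """Full-batch sub-gradient descent; returns the best (w, b, f) seen."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    best = (objective(w, b, X, y, lam), list(w), b)
    for t in range(1, steps + 1):
        # Sub-gradient of the regularizer lam * ||w||^2
        gw = [2.0 * lam * wk for wk in w]
        gb = 0.0
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(wk * xk for wk, xk in zip(w, x_i)) - b)
            if margin < 1.0:  # hinge term is active for this point
                for k in range(dim):
                    gw[k] -= y_i * x_i[k] / n
                gb += y_i / n
        eta = eta0 / math.sqrt(t)  # assumed diminishing step size
        w = [wk - eta * gk for wk, gk in zip(w, gw)]
        b -= eta * gb
        val = objective(w, b, X, y, lam)
        # Sub-gradient steps need not decrease f, so keep the best iterate.
        if val < best[0]:
            best = (val, list(w), b)
    return best[1], best[2], best[0]
```

Because a sub-gradient step is not guaranteed to decrease <math>f</math>, the sketch tracks the lowest objective value encountered instead of returning the last iterate.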

==== Coordinate descent ====
[[Coordinate descent]] algorithms for the SVM work from the dual problem

<math display="block"> \begin{align}
&\text{maximize}\,\, f(c_1 \ldots c_n) =  \sum_{i=1}^n c_i - \frac 1 2 \sum_{i=1}^n\sum_{j=1}^n y_i c_i(\mathbf{x}_i^\mathsf{T} \mathbf{x}_j)y_j c_j,\\
&\text{subject to } \sum_{i=1}^n c_iy_i = 0,\,\text{and } 0 \leq c_i \leq \frac{1}{2n\lambda}\;\text{for all }i.
\end{align}</math>

For each <math> i \in \{1,\, \ldots,\, n\}</math>, iteratively, the coefficient <math> c_i</math> is adjusted in the direction of <math> \partial f/ \partial c_i</math>. Then, the resulting vector of coefficients <math> (c_1',\,\ldots,\,c_n')</math> is projected onto the nearest vector of coefficients that satisfies the given constraints. (Typically Euclidean distances are used.) The process is then repeated until a near-optimal vector of coefficients is obtained. The resulting algorithm is extremely fast in practice, although few performance guarantees have been proven.<ref>{{Cite book |publisher=ACM |date=2008-01-01 |location=New York, NY, USA |isbn=978-1-60558-205-4 |pages=408–415 |doi=10.1145/1390156.1390208 |first1=Cho-Jui |last1=Hsieh |first2=Kai-Wei |last2=Chang |first3=Chih-Jen |last3=Lin |first4=S. Sathiya |last4=Keerthi |first5=S. |last5=Sundararajan |title=Proceedings of the 25th international conference on Machine learning - ICML '08 |chapter=A dual coordinate descent method for large-scale linear SVM |citeseerx=10.1.1.149.5594 |s2cid=7880266 }}</ref>
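
The coordinate-wise update can be illustrated with a simplified variant. The sketch below assumes the offset is fixed at <math>b = 0</math>, which removes the equality constraint <math>\textstyle\sum_i c_i y_i = 0</math> so that the projection reduces to clipping each <math>c_i</math> to <math>[0, (2n\lambda)^{-1}]</math>. It is therefore not the general projection step described above, nor a reproduction of the cited method; the function name and defaults are also assumptions.

```python
def dual_coordinate_descent(X, y, lam, epochs=50):
    """Coordinate ascent on the dual objective f, assuming b = 0.

    With no offset the only constraints are 0 <= c_i <= upper,
    so projecting onto the feasible set is a simple clip.
    """
    n, dim = len(X), len(X[0])
    upper = 1.0 / (2.0 * n * lam)
    c = [0.0] * n
    w = [0.0] * dim  # maintained as w = sum_i c_i * y_i * x_i
    for _ in range(epochs):
        for i in range(n):
            x_i, y_i = X[i], y[i]
            q_ii = sum(v * v for v in x_i)  # x_i . x_i
            if q_ii == 0.0:
                continue
            # df/dc_i = 1 - y_i * (w . x_i)
            grad = 1.0 - y_i * sum(wk * xk for wk, xk in zip(w, x_i))
            # Exact maximization along coordinate i, then clip to the box.
            new_c = min(max(c[i] + grad / q_ii, 0.0), upper)
            delta = new_c - c[i]
            if delta != 0.0:
                for k in range(dim):
                    w[k] += delta * y_i * x_i[k]
                c[i] = new_c
    return c, w
```

Keeping <math>\mathbf{w}</math> up to date incrementally means each coordinate update costs time proportional to the number of features of one training point, not a pass over the whole dataset.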