=== Risk minimization ===
In supervised learning, one is given a set of training examples <math>X_1 \ldots X_n</math> with labels <math>y_1 \ldots y_n</math>, and wishes to predict <math>y_{n+1}</math> given <math>X_{n+1}</math>. To do so one forms a [[hypothesis]], <math>f</math>, such that <math>f(X_{n+1})</math> is a "good" approximation of <math>y_{n+1}</math>. A "good" approximation is usually defined with the help of a ''[[loss function]]'', <math>\ell(y,z)</math>, which characterizes how bad <math>z</math> is as a prediction of <math>y</math>. The goal is then to choose a hypothesis that minimizes the ''[[Loss function#Expected loss|expected risk]]'':

<math display="block">\varepsilon(f) = \mathbb{E}\left[\ell(y_{n+1}, f(X_{n+1})) \right].</math>

In most cases, the joint distribution of <math>X_{n+1},\,y_{n+1}</math> is not known outright. In these cases, a common strategy is to choose the hypothesis that minimizes the ''empirical risk'':

<math display="block">\hat \varepsilon(f) = \frac 1 n \sum_{k=1}^n \ell(y_k, f(X_k)).</math>
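
The empirical risk is simply an average of per-example losses, which the following minimal Python sketch illustrates. The hinge loss, the four-point data set, and the linear hypothesis <code>f</code> are all assumptions chosen for illustration; the formula above does not prescribe any particular loss or hypothesis.

```python
def hinge_loss(y, z):
    """Hinge loss for a label y in {-1, +1} and a real-valued prediction z."""
    return max(0.0, 1.0 - y * z)


def empirical_risk(f, xs, ys, loss):
    """Average loss of hypothesis f over the training sample (xs, ys)."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical one-dimensional training set with labels in {-1, +1}.
xs = [-2.0, -1.0, 0.5, 2.0]
ys = [-1, -1, 1, 1]


def f(x):
    # Hypothetical linear hypothesis f(x) = w*x + b with w = 1 and b = 0.
    return 1.0 * x + 0.0


risk = empirical_risk(f, xs, ys, hinge_loss)
print(risk)  # 0.125
```

Only the third example, with <math>y_k f(X_k) = 0.5</math>, falls inside the margin and contributes a loss of 0.5; the other three contribute zero, so the average over four examples is 0.125.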

Under certain assumptions about the sequence of random variables <math>X_k,\, y_k</math> (for example, that they are generated by a finite Markov process), if the set of hypotheses being considered is small enough, the minimizer of the empirical risk will closely approximate the minimizer of the expected risk as <math>n</math> grows large. This approach is called ''empirical risk minimization,'' or ERM.
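
As a rough illustration of ERM over a small hypothesis set, the sketch below evaluates the empirical risk of a handful of threshold classifiers and selects the one with the lowest value. The 0–1 loss, the candidate thresholds, and the five-point data set are assumptions made for this example only, and the sketch says nothing about how close the selected hypothesis is to the minimizer of the expected risk.

```python
def zero_one_loss(y, z):
    """0-1 loss: 1 if the prediction z differs from the label y, else 0."""
    return 0.0 if y == z else 1.0


def empirical_risk(f, xs, ys, loss):
    """Average loss of hypothesis f over the training sample (xs, ys)."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)


def make_threshold(t):
    """Hypothetical hypothesis: predict +1 if x >= t, otherwise -1."""
    return lambda x: 1 if x >= t else -1


# Hypothetical training set and a small, finite set of candidate thresholds.
xs = [-2.0, -1.0, 0.5, 2.0, -0.2]
ys = [-1, -1, 1, 1, 1]
thresholds = [-1.5, -0.5, 0.0, 1.0]

# Empirical risk of every candidate hypothesis, keyed by its threshold.
risks = {
    t: empirical_risk(make_threshold(t), xs, ys, zero_one_loss)
    for t in thresholds
}

# ERM: pick the hypothesis with the smallest empirical risk.
best_t = min(risks, key=risks.get)
best_risk = risks[best_t]
print(best_t, best_risk)  # -0.5 0.0
```

Here the threshold <math>-0.5</math> classifies all five training examples correctly, so it is the empirical risk minimizer within this set of four hypotheses. Restricting the search to a small set is what the "small enough" condition above refers to; with a richer set, a hypothesis could fit the sample perfectly while generalizing poorly.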