==Generative models vs. discriminative models==
There are two broad classes of methods for determining the parameters of a linear classifier <math>\vec w</math>: [[Generative model|generative]] and [[Discriminative model|discriminative]] models.<ref>T. Mitchell, [https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression.] Draft Version, 2005</ref><ref>A. Y. Ng and M. I. Jordan. [http://www.cs.berkeley.edu/~jordan/papers/ng-jordan-nips01.ps On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes.] in NIPS 14, 2002.</ref> Methods of the former model the [[joint probability distribution]] <math>P(\vec x, {\rm class})</math>, whereas methods of the latter model the [[Conditional probability distribution|conditional density function]] <math>P({\rm class}|\vec x)</math> directly. Examples of generative algorithms include:
* [[linear discriminant analysis|Linear Discriminant Analysis]] (LDA)—assumes [[normal distribution|Gaussian]] class-conditional density models
* [[Naive Bayes classifier]] with multinomial or multivariate Bernoulli event models.

The second set of methods includes [[discriminative model]]s, which attempt to maximize the quality of the output on a [[training set]]. Additional terms in the training cost function can easily achieve [[regularization (machine learning)|regularization]] of the final model. Examples of discriminative training of linear classifiers include:
* [[Logistic regression]]—maximum likelihood estimation of <math>\vec w</math>, assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
* [[Perceptron]]—an algorithm that attempts to fix all errors encountered in the training set.
* Fisher's Linear Discriminant Analysis—an algorithm (distinct from the "LDA" above) that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of [[dimensionality reduction]] for binary classification.<ref>R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", Wiley, (2001). {{ISBN|0-471-05669-3}}</ref>
* [[Support vector machine]]—an algorithm that maximizes the [[Margin (machine learning)|margin]] between the decision hyperplane and the examples in the training set.

'''Note:''' Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear [[dimensionality reduction]] algorithm: [[principal components analysis]] (PCA). LDA is a [[supervised learning]] algorithm that utilizes the labels of the data, while PCA is an [[unsupervised learning]] algorithm that ignores the labels. To summarize, the name is a historical artifact.<ref>{{Cite book |last1=Duda |first1=Richard O. |title=Pattern classification |last2=Hart |first2=Peter E. |last3=Stork |first3=David G. |date=2001 |publisher=John Wiley & Sons, Inc |isbn=978-0-471-05669-0 |edition=Second |series=A Wiley-Interscience publication |location=New York Chichester Weinheim Brisbane Singapore Toronto |pages=117}}</ref>

Discriminative training often yields higher accuracy than modeling the class-conditional density functions{{Citation needed|date=October 2017}}. However, handling missing data is often easier with class-conditional density models{{Citation needed|date=October 2017}}.
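The distinction can be illustrated in code. The following sketch, which assumes the scikit-learn library and synthetic data purely for illustration, fits one generative (LDA) and one discriminative (logistic regression) linear classifier to the same data; both end up with a linear decision rule of the form <math>\operatorname{sgn}(\vec w \cdot \vec x + b)</math>, but they arrive at <math>\vec w</math> by different routes.

<syntaxhighlight lang="python">
# Illustrative sketch: a generative (LDA) and a discriminative (logistic
# regression) linear classifier fitted to the same synthetic data.
# scikit-learn and the synthetic dataset are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Two-class data with two informative features.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Generative: LDA fits Gaussian class-conditional densities with a shared
# covariance matrix, which yields a linear decision boundary.
lda = LinearDiscriminantAnalysis().fit(X, y)

# Discriminative: logistic regression models P(class | x) directly by
# maximum likelihood.
logreg = LogisticRegression().fit(X, y)

# Both learn a linear rule sign(w . x + b); compare the fitted parameters.
print("LDA    w =", lda.coef_[0], "b =", lda.intercept_[0])
print("LogReg w =", logreg.coef_[0], "b =", logreg.intercept_[0])
</syntaxhighlight>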
All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space <math>\varphi(\vec x)</math>, using the [[kernel trick]].

===Discriminative training===
Discriminative training of linear classifiers usually proceeds in a [[supervised learning|supervised]] way, by means of an [[optimization algorithm]] that is given a training set with desired outputs and a [[loss function]] that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form<ref name="ieee">{{cite journal |author1=Guo-Xun Yuan |author2=Chia-Hua Ho |author3=Chih-Jen Lin |title=Recent Advances of Large-Scale Linear Classification |journal=Proc. IEEE |volume=100 |issue=9 |year=2012 |url=http://dmkd.cs.vt.edu/TUTORIAL/Bigdata/Papers/IEEE12.pdf |archive-url=https://web.archive.org/web/20170610105707/http://dmkd.cs.vt.edu/TUTORIAL/Bigdata/Papers/IEEE12.pdf |archive-date=2017-06-10 |url-status=live}}</ref>

:<math>\underset{\mathbf{w}}{\arg\min} \;R(\mathbf{w}) + C \sum_{i=1}^N L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i)</math>

where
* {{math|'''w'''}} is a vector of classifier parameters,
* {{math|''L''(''y<sub>i</sub>'', '''w'''<sup>T</sup>'''x'''<sub>''i''</sub>)}} is a loss function that measures the discrepancy between the classifier's prediction and the true output {{mvar|y<sub>i</sub>}} for the {{mvar|i}}'th training example,
* {{math|''R''('''w''')}} is a [[Regularization (mathematics)|regularization]] function that prevents the parameters from getting too large (causing [[overfitting]]), and
* {{mvar|C}} is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function.

Popular loss functions include the [[hinge loss]] (for linear SVMs) and the [[log loss]] (for linear logistic regression). If the regularization function {{mvar|R}} is [[convex function|convex]], then the above is a [[convex optimization|convex problem]].{{r|ieee}} Many algorithms exist for solving such problems; popular ones for linear classification include ([[Stochastic gradient descent|stochastic]]) [[gradient descent]], [[L-BFGS]], [[coordinate descent]] and [[Newton method]]s.
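A minimal sketch of this optimization problem, taking <math>R(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2</math> and the log loss with labels <math>y_i \in \{-1, +1\}</math>, can be solved by batch [[gradient descent]] in a few lines of Python. The step size, iteration count and toy data below are illustrative assumptions, not prescribed by the references.

<syntaxhighlight lang="python">
# Sketch of:  argmin_w  R(w) + C * sum_i L(y_i, w^T x_i)
# with R(w) = 0.5 * ||w||^2 and log loss L(y, s) = log(1 + exp(-y*s)),
# labels y in {-1, +1}. Hyperparameters are illustrative assumptions.
import numpy as np

def train_linear_classifier(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Batch gradient descent on the L2-regularized log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        margins = y * (X @ w)              # y_i * w^T x_i
        # sigma(-margin) weights each example's contribution to the gradient.
        sigma = 1.0 / (1.0 + np.exp(margins))
        grad_loss = -(X * (y * sigma)[:, None]).sum(axis=0)
        grad = w + C * grad_loss           # grad of R(w) is w itself
        w -= lr * grad
    return w

# Toy usage: two Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(+1.0, 1.0, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
w = train_linear_classifier(X, y)
print("w =", w, "training accuracy =", np.mean(np.sign(X @ w) == y))
</syntaxhighlight>

Replacing the full-batch gradient with a per-example update gives stochastic gradient descent, and substituting the hinge loss for the log loss recovers a linear SVM objective, as noted above.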