==Regularization==
{{Main|Regularized least squares}}

===Tikhonov regularization===
{{Main|Tikhonov regularization}}
In some contexts, a [[Regularization (machine learning)|regularized]] version of the least squares solution may be preferable. [[Tikhonov regularization]] (or [[ridge regression]]) adds to the least squares formulation the constraint that <math>\left\|\beta\right\|_2^2</math>, the squared [[L2-norm|<math>\ell_2</math>-norm]] of the parameter vector, is not greater than a given value, leading to a constrained minimization problem. This is equivalent to the unconstrained minimization problem in which the objective function is the residual sum of squares plus a penalty term <math>\alpha \left\|\beta\right\|_2^2</math>, where <math>\alpha</math> is a tuning parameter (this is the [[Lagrange multipliers|Lagrangian]] form of the constrained minimization problem).<ref>{{cite arXiv |last=van Wieringen |first=Wessel N. |year=2021 |title=Lecture notes on ridge regression |class=stat.ME |eprint=1509.09169 }}</ref> In a [[Bayesian statistics|Bayesian]] context, this is equivalent to placing a zero-mean normally distributed [[prior distribution|prior]] on the parameter vector.

===Lasso method===
An alternative [[Regularization (machine learning)|regularized]] version of least squares is [[Lasso (statistics)|Lasso]] (least absolute shrinkage and selection operator), which uses the constraint that <math>\|\beta\|_1</math>, the [[L1-norm|L<sub>1</sub>-norm]] of the parameter vector, is no greater than a given value.<ref name=tibsh>{{cite journal |last=Tibshirani |first=R. |author-link = Rob Tibshirani | year=1996 |title=Regression shrinkage and selection via the lasso |journal=Journal of the Royal Statistical Society, Series B |volume=58|issue=1 |pages=267–288 |doi=10.1111/j.2517-6161.1996.tb02080.x |jstor=2346178}}</ref><ref name="ElementsStatLearn">{{cite book |url=http://www-stat.stanford.edu/~tibs/ElemStatLearn/ |title=The Elements of Statistical Learning |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |last3=Friedman |first3=Jerome H. |author-link1=Trevor Hastie |author-link3=Jerome H. Friedman |edition=second |date=2009 |publisher=Springer-Verlag |isbn=978-0-387-84858-7 |url-status=dead |archive-url=https://web.archive.org/web/20091110212529/http://www-stat.stanford.edu/~tibs/ElemStatLearn/ |archive-date=2009-11-10 }}</ref><ref>{{cite book|last1=Bühlmann|first1=Peter|last2=van de Geer|first2=Sara|author2-link= Sara van de Geer |title=Statistics for High-Dimensional Data: Methods, Theory and Applications|date=2011|publisher=Springer|isbn=9783642201929}}</ref> (As above, one can show using Lagrange multipliers that this is equivalent to an unconstrained minimization of the least-squares objective with the penalty <math>\alpha\|\beta\|_1</math> added.) In a [[Bayesian statistics|Bayesian]] context, this is equivalent to placing a zero-mean [[Laplace distribution|Laplace]] [[prior distribution]] on the parameter vector.<ref>{{cite journal|last1=Park|first1=Trevor|last2=Casella|first2=George|author2-link=George Casella| title=The Bayesian Lasso|journal=Journal of the American Statistical Association|date=2008|volume=103|issue=482|pages=681–686|doi=10.1198/016214508000000337|s2cid=11797924}}</ref> The optimization problem may be solved using [[quadratic programming]] or more general [[convex optimization]] methods, as well as by specific algorithms such as the [[least angle regression]] algorithm.
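For illustration, both penalized objectives can be minimized directly. The following sketch is a minimal example on synthetic data, assuming only NumPy (the data and variable names are purely illustrative): it computes the ridge estimate from its closed form and approximates the Lasso estimate with a simple proximal-gradient (soft-thresholding) iteration, one instance of the convex optimization methods mentioned above.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic data (illustrative only): n observations, p predictors,
# with a sparse "true" parameter vector.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + 0.3 * rng.standard_normal(n)

alpha = 10.0  # tuning parameter of the penalty term

# Tikhonov regularization (ridge): the penalized objective
#   ||y - X b||_2^2 + alpha * ||b||_2^2
# has the closed-form minimizer (X^T X + alpha I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Lasso: ||y - X b||_2^2 + alpha * ||b||_1 has no closed form; a simple
# convex-optimization approach is proximal gradient descent (ISTA),
# alternating a gradient step with componentwise soft-thresholding.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
step = 1.0 / lipschitz
beta_lasso = np.zeros(p)
for _ in range(5000):
    grad = 2.0 * X.T @ (X @ beta_lasso - y)
    beta_lasso = soft_threshold(beta_lasso - step * grad, step * alpha)

print("ridge:", np.round(beta_ridge, 2))  # coefficients shrunk, none exactly zero
print("lasso:", np.round(beta_lasso, 2))  # typically shows exact zeros (sparsity)
</syntaxhighlight>

The step size is taken as the reciprocal of the Lipschitz constant of the gradient of the smooth part of the objective, which guarantees convergence of the iteration.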
One key difference between Lasso and ridge regression is that, as the penalty is increased, ridge regression shrinks all parameters while leaving them non-zero, whereas Lasso drives more and more of the parameters to zero. This is an advantage of Lasso over ridge regression: driving a parameter to zero deselects the corresponding feature from the regression. Thus, Lasso automatically selects the more relevant features and discards the others, whereas ridge regression never fully discards any feature. Some [[feature selection]] techniques have been developed based on the Lasso, including Bolasso, which bootstraps samples,<ref name=Bolasso>{{cite book|last1=Bach|first1=Francis R|title=Proceedings of the 25th international conference on Machine learning - ICML '08 |chapter=Bolasso |date=2008|pages=33–40|doi=10.1145/1390156.1390161|chapter-url=http://dl.acm.org/citation.cfm?id=1390161|isbn=9781605582054|bibcode=2008arXiv0804.1302B|arxiv=0804.1302|s2cid=609778}}</ref> and FeaLect, which analyzes the regression coefficients corresponding to different values of <math>\alpha</math> in order to score all the features.<ref name=FeaLect>{{cite journal|last1=Zare|first1=Habil|title=Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis|journal=BMC Genomics|date=2013|volume=14|issue=Suppl 1 |pages=S14|doi=10.1186/1471-2164-14-S1-S14|pmid=23369194|pmc=3549810 |doi-access=free }}</ref> The ''L''<sup>1</sup>-regularized formulation is useful in some contexts because of its tendency to prefer solutions with more parameters equal to zero, that is, solutions that depend on fewer variables.<ref name=tibsh/> For this reason, the Lasso and its variants are fundamental to the field of [[compressed sensing]]. An extension of this approach is [[elastic net regularization]].
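This contrast can be seen numerically. The sketch below is again purely illustrative: it assumes scikit-learn's <code>Ridge</code> and <code>Lasso</code> estimators (whose penalty scaling differs by constant factors from the formulas above) and synthetic data, fits both models over a range of penalty strengths, and counts how many coefficients are exactly zero.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Lasso, Ridge  # assumes scikit-learn is available

# Synthetic data with a sparse true parameter vector (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
beta_true = np.array([4.0, 0.0, -3.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(100)

# As the penalty grows, ridge shrinks every coefficient but keeps it non-zero,
# whereas the Lasso drives more and more coefficients exactly to zero,
# deselecting the corresponding features.
for alpha in (0.01, 0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=50_000).fit(X, y)
    print(f"alpha={alpha:>5}: "
          f"ridge zero coefficients={int(np.sum(ridge.coef_ == 0.0))}, "
          f"lasso zero coefficients={int(np.sum(lasso.coef_ == 0.0))}")
</syntaxhighlight>

With a small penalty both models retain all features; as the penalty increases, the Lasso sets more and more coefficients exactly to zero while the ridge coefficients merely shrink.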