==Solving the least squares problem==
The [[Maxima and minima|minimum]] of the sum of squares is found by setting the [[gradient]] to zero. Since the model contains ''m'' parameters, there are ''m'' gradient equations:
<math display="block">\frac{\partial S}{\partial \beta_j}=2\sum_i r_i\frac{\partial r_i}{\partial \beta_j} = 0,\ j=1,\ldots,m,</math>
and since <math>r_i=y_i-f(x_i,\boldsymbol \beta)</math>, the gradient equations become
<math display="block">-2\sum_i r_i\frac{\partial f(x_i,\boldsymbol \beta)}{\partial \beta_j}=0,\ j=1,\ldots,m.</math>

The gradient equations apply to all least squares problems. Each particular problem requires particular expressions for the model and its [[Partial derivative|partial derivatives]].<ref name=":1">{{Cite book|title=Quantifying measurement: the tyranny of numbers|last=Williams, Jeffrey H. (Jeffrey Huw), 1956-|others=Morgan & Claypool Publishers, Institute of Physics (Great Britain)|isbn=978-1-68174-433-9|location=San Rafael [California] (40 Oak Drive, San Rafael, CA, 94903, USA)|oclc=962422324|date=November 2016}}</ref>

===Linear least squares===
{{main|Linear least squares (mathematics)|l1=Linear least squares}}
A regression model is a linear one when the model comprises a [[linear combination]] of the parameters, i.e.,
<math display="block"> f(x, \boldsymbol \beta) = \sum_{j = 1}^m \beta_j \phi_j(x),</math>
where each <math>\phi_j</math> is a function of <math>x</math>.<ref name=":1" />

Letting <math>X_{ij}= \phi_j(x_{i})</math> and putting the independent and dependent variables in matrices <math>X</math> and <math>Y</math>, respectively, we can compute the least squares estimate in the following way, where <math>D</math> is the set of all data.<ref name=":1" /><ref name=":2">{{Cite book|last1=Rencher|first1=Alvin C.|url=https://books.google.com/books?id=0g-PAuKub3QC&pg=PA19|title=Methods of Multivariate Analysis|last2=Christensen|first2=William F.|date=2012-08-15|publisher=John Wiley & Sons|isbn=978-1-118-39167-9|pages=155|language=en}}</ref>
<math display="block">L(D, \boldsymbol{\beta}) = \left\|Y - X\boldsymbol{\beta} \right\|^2 = (Y - X\boldsymbol{\beta})^\mathsf{T} (Y - X\boldsymbol{\beta}) = Y^\mathsf{T}Y- 2Y^\mathsf{T}X\boldsymbol{\beta} + \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}X\boldsymbol{\beta}</math>

The gradient of the loss is:
<math display="block">\frac{\partial L(D, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial \left(Y^\mathsf{T}Y- 2Y^\mathsf{T}X\boldsymbol{\beta} + \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}X\boldsymbol{\beta}\right)}{\partial \boldsymbol{\beta}} = -2X^\mathsf{T}Y + 2X^\mathsf{T}X\boldsymbol{\beta}</math>

Setting the gradient of the loss to zero and solving for <math>\boldsymbol{\beta}</math>, we get:<ref name=":2" /><ref name=":1" />
<math display="block">-2X^\mathsf{T}Y + 2X^\mathsf{T}X\boldsymbol{\beta} = 0 \Rightarrow X^\mathsf{T}Y = X^\mathsf{T}X\boldsymbol{\beta}</math>
<math display="block">\boldsymbol{\hat{\beta}} = \left(X^\mathsf{T}X\right)^{-1} X^\mathsf{T}Y</math>
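The closed-form estimator <math>\boldsymbol{\hat{\beta}} = \left(X^\mathsf{T}X\right)^{-1} X^\mathsf{T}Y</math> can be illustrated with a short numerical sketch. The example below is illustrative only: the data values and the straight-line basis <math>\phi_1(x)=1,\ \phi_2(x)=x</math> are hypothetical choices, and NumPy is used for the linear algebra.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample data (x_i, y_i) roughly following a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix X with X_ij = phi_j(x_i); here phi_1(x) = 1, phi_2(x) = x.
X = np.column_stack([np.ones_like(x), x])

# Normal equations (X^T X) beta = X^T y, solved directly for beta-hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route via a least-squares solver.
beta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
</syntaxhighlight>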
===Non-linear least squares===
{{main|Non-linear least squares}}
There is, in some cases, a [[closed-form solution]] to a non-linear least squares problem – but in general there is not. In the case of no closed-form solution, numerical algorithms are used to find the value of the parameters <math>\beta</math> that minimizes the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation:
<math display="block">{\beta_j}^{k+1} = {\beta_j}^k+\Delta \beta_j,</math>
where a superscript ''k'' is an iteration number, and the vector of increments <math>\Delta \beta_j</math> is called the shift vector. In some commonly used algorithms, at each iteration the model may be linearized by approximation to a first-order [[Taylor series]] expansion about <math>\boldsymbol \beta^k</math>:
<math display="block">\begin{align} f(x_i,\boldsymbol \beta) &= f^k(x_i,\boldsymbol \beta) +\sum_j \frac{\partial f(x_i,\boldsymbol \beta)}{\partial \beta_j} \left(\beta_j-{\beta_j}^k \right) \\[1ex] &= f^k(x_i,\boldsymbol \beta) +\sum_j J_{ij} \,\Delta\beta_j. \end{align}</math>
The [[Jacobian matrix and determinant|Jacobian]] '''J''' is a function of constants, the independent variable ''and'' the parameters, so it changes from one iteration to the next. The residuals are given by
<math display="block">r_i = y_i - f^k(x_i, \boldsymbol \beta)- \sum_{j=1}^{m} J_{ij}\,\Delta\beta_j = \Delta y_i- \sum_{j=1}^m J_{ij}\,\Delta\beta_j,</math>
where <math>\Delta y_i = y_i - f^k(x_i, \boldsymbol \beta)</math>. To minimize the sum of squares of <math>r_i</math>, the gradient equation is set to zero and solved for <math>\Delta \beta_j</math>:
<math display="block">-2\sum_{i=1}^n J_{ij} \left( \Delta y_i-\sum_{k=1}^m J_{ik} \, \Delta \beta_k \right) = 0,</math>
which, on rearrangement, become ''m'' simultaneous linear equations, the '''normal equations''':
<math display="block">\sum_{i=1}^{n}\sum_{k=1}^m J_{ij} J_{ik} \, \Delta \beta_k=\sum_{i=1}^n J_{ij} \, \Delta y_i \qquad (j=1,\ldots,m).</math>
The normal equations are written in matrix notation as
<math display="block">\left(\mathbf{J}^\mathsf{T} \mathbf{J}\right) \Delta \boldsymbol \beta = \mathbf{J}^\mathsf{T}\Delta \mathbf{y}.</math>
<!-- or <math display="block">\mathbf{\left(J^TWJ\right) \, \Delta \boldsymbol \beta=J^TW \, \Delta y}</math> if weights are used. -->
These are the defining equations of the [[Gauss–Newton algorithm]].
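A minimal sketch of the Gauss–Newton iteration above is given below. It assumes a particular nonlinear model, <math>f(x,\boldsymbol\beta)=\beta_1 e^{\beta_2 x}</math>, together with hypothetical data and starting values; the Jacobian entries <math>J_{ij}=\partial f(x_i,\boldsymbol\beta)/\partial\beta_j</math> are coded by hand.

<syntaxhighlight lang="python">
import numpy as np

def model(x, beta):
    # Assumed example model f(x, beta) = beta_1 * exp(beta_2 * x).
    return beta[0] * np.exp(beta[1] * x)

def jacobian(x, beta):
    # J_ij = d f(x_i, beta) / d beta_j, evaluated at the current beta^k.
    return np.column_stack([np.exp(beta[1] * x),
                            beta[0] * x * np.exp(beta[1] * x)])

# Hypothetical data and starting values beta^0.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.8, 4.4, 7.3])
beta = np.array([1.0, 0.5])

for _ in range(50):
    J = jacobian(x, beta)
    dy = y - model(x, beta)                     # Delta y_i
    dbeta = np.linalg.solve(J.T @ J, J.T @ dy)  # normal equations for the shift vector
    beta = beta + dbeta                         # beta^{k+1} = beta^k + Delta beta
    if np.linalg.norm(dbeta) < 1e-10:           # convergence criterion
        break
</syntaxhighlight>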
===Differences between linear and nonlinear least squares===
* The model function, ''f'', in LLSQ (linear least squares) is a linear combination of parameters of the form <math>f = X_{i1}\beta_1 + X_{i2}\beta_2 +\cdots</math> The model may represent a straight line, a parabola or any other linear combination of functions. In NLLSQ (nonlinear least squares) the parameters appear nonlinearly, for example as <math>\beta^2</math> or <math>e^{\beta x}</math>. If the derivatives <math>\partial f / \partial \beta_j</math> are either constant or depend only on the values of the independent variable, the model is linear in the parameters. Otherwise, the model is nonlinear (illustrated symbolically after this list).
* NLLSQ requires initial values for the parameters in order to find a solution; LLSQ does not.
* Solution algorithms for NLLSQ often require that the Jacobian can be calculated, as in LLSQ. Analytical expressions for the partial derivatives can be complicated. If analytical expressions are impossible to obtain, either the partial derivatives must be calculated by numerical approximation or an estimate must be made of the Jacobian, often via [[finite differences]].
* Non-convergence (failure of the algorithm to find a minimum) is a common phenomenon in NLLSQ, whereas the LLSQ objective is globally convex, so non-convergence is not an issue.
* Solving NLLSQ is usually an iterative process which has to be terminated when a convergence criterion is satisfied. LLSQ solutions can be computed using direct methods, although problems with large numbers of parameters are typically solved with iterative methods, such as the [[Gauss–Seidel]] method.
* In LLSQ the solution is unique, but in NLLSQ there may be multiple minima in the sum of squares.
* Under the condition that the errors are uncorrelated with the predictor variables, LLSQ yields unbiased estimates, but even under that condition NLLSQ estimates are generally biased.

These differences must be considered whenever the solution to a nonlinear least squares problem is being sought.<ref name=":1" />
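Whether a given model is linear in its parameters can be checked from the partial derivatives <math>\partial f/\partial\beta_j</math>, as noted in the list above. The following sketch applies SymPy to two hypothetical models, one linear and one nonlinear in the parameters.

<syntaxhighlight lang="python">
import sympy as sp

x, b1, b2 = sp.symbols('x beta1 beta2')

# Linear in the parameters: each derivative depends only on x.
f_linear = b1 + b2 * x**2
print([sp.diff(f_linear, b) for b in (b1, b2)])      # [1, x**2]

# Nonlinear in the parameters: the derivatives still contain parameters.
f_nonlinear = b1 * sp.exp(b2 * x)
print([sp.diff(f_nonlinear, b) for b in (b1, b2)])   # [exp(beta2*x), beta1*x*exp(beta2*x)]
</syntaxhighlight>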