== Pearson moments ==
The kurtosis is the fourth [[standardized moment]], defined as
<math display="block"> \operatorname{Kurt}[X] = \operatorname{E}\left[{\left(\frac{X - \mu}{\sigma}\right)}^4\right] = \frac{\operatorname{E}\left[(X - \mu)^4\right]}{\left(\operatorname{E}\left[(X - \mu)^2\right]\right)^2} = \frac{\mu_4}{\sigma^4}, </math>
where {{math|''μ''<sub>4</sub>}} is the fourth [[central moment]] and {{mvar|σ}} is the [[standard deviation]].

Several letters are used in the literature to denote the kurtosis. A very common choice is {{mvar|κ}}, which is fine as long as it is clear that it does not refer to a [[cumulant]]. Other choices include {{math|''γ''<sub>2</sub>}}, by analogy with the notation for skewness, although sometimes this is instead reserved for the excess kurtosis.

The kurtosis is bounded below by the squared [[skewness]] plus 1:{{r|Pearson1916|p=432}}
<math display="block"> \frac{\mu_4}{\sigma^4} \geq \left(\frac{\mu_3}{\sigma^3}\right)^2 + 1,</math>
where {{math|''μ''<sub>3</sub>}} is the third [[central moment]]. The lower bound is attained by the [[Bernoulli distribution]]. There is no upper limit to the kurtosis of a general probability distribution, and it may be infinite.

A reason why some authors favor the excess kurtosis is that cumulants are [[intensive and extensive properties|extensive]]. Formulas related to the extensive property are more naturally expressed in terms of the excess kurtosis. For example, let {{math|''X''<sub>1</sub>, ..., ''X''<sub>''n''</sub>}} be independent random variables for which the fourth moment exists, and let {{mvar|Y}} be the random variable defined by the sum of the {{math|''X''<sub>''i''</sub>}}.
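The definition and the lower bound above are easy to check numerically. The following sketch (illustrative helper names, not a library API) computes the standardized moments with the divide-by-''n'' convention, checks the skewness bound on an arbitrary sample, and checks that a symmetric two-point (Bernoulli) sample attains the bound with equality:

```python
def central_moment(xs, k):
    """k-th central moment with the divide-by-n convention."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def kurtosis(xs):
    """Fourth standardized moment, mu_4 / sigma^4."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

def skewness(xs):
    """Third standardized moment, mu_3 / sigma^3."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

# The bound kurtosis >= skewness^2 + 1 holds for any sample.
data = [2.0, 2.0, 3.0, 5.0, 9.0, 9.0]
assert kurtosis(data) >= skewness(data) ** 2 + 1

# A symmetric two-point (Bernoulli) sample attains the bound with
# equality: skewness 0 and kurtosis exactly 1.
bernoulli = [0.0, 0.0, 1.0, 1.0]
assert abs(kurtosis(bernoulli) - 1.0) < 1e-12
assert abs(skewness(bernoulli)) < 1e-12
```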
The excess kurtosis of {{mvar|Y}} is
<math display="block">\operatorname{Kurt}[Y] - 3 = \frac{1}{\left( \sum_{j=1}^n \sigma_j^{\,2}\right)^2} \sum_{i=1}^n \sigma_i^{\,4} \cdot \left(\operatorname{Kurt}\left[X_i\right] - 3\right),</math>
where <math>\sigma_i</math> is the standard deviation of {{math|''X''<sub>''i''</sub>}}. In particular, if all of the {{math|''X''<sub>''i''</sub>}} have the same variance, this simplifies to
<math display="block">\operatorname{Kurt}[Y] - 3 = \frac{1}{n^2} \sum_{i=1}^n \left(\operatorname{Kurt}\left[X_i\right] - 3\right).</math>

The reason not to subtract 3 is that the bare [[moment (statistics)|moment]] generalizes better to [[multivariate distribution]]s, especially when independence is not assumed. The [[cokurtosis]] between pairs of variables is an order-four [[tensor]]. For a bivariate normal distribution, the cokurtosis tensor has off-diagonal terms that are neither 0 nor 3 in general, so attempting to "correct" for an excess becomes confusing. It is true, however, that the joint cumulants of degree greater than two for any [[multivariate normal distribution]] are zero.

For two random variables, {{mvar|X}} and {{mvar|Y}}, not necessarily independent, the kurtosis of the sum, {{math|''X'' + ''Y''}}, is
<math display="block">\begin{align} \operatorname{Kurt}[X+Y] = \frac{1}{\sigma_{X+Y}^4} \big( & \sigma_X^4\operatorname{Kurt}[X] \\ & {} + 4\sigma_X^3 \sigma_Y \operatorname{Cokurt}[X,X,X,Y] \\[6pt] & {} + 6\sigma_X^2 \sigma_Y^2 \operatorname{Cokurt}[X,X,Y,Y] \\[6pt] & {} + 4\sigma_X \sigma_Y^3 \operatorname{Cokurt}[X,Y,Y,Y] \\[6pt] & {} + \sigma_Y^4 \operatorname{Kurt}[Y] \big). \end{align}</math>
Note that the fourth-power [[binomial coefficient]]s (1, 4, 6, 4, 1) appear in the above equation.

=== Interpretation ===
The interpretation of the Pearson measure of kurtosis (or excess kurtosis) was once debated, but it is now well-established.
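The equal-variance sum formula above can be checked without simulation: a Binomial(''n'', ''p'') variable is by definition a sum of ''n'' independent Bernoulli(''p'') variables, and both distributions have standard closed-form excess kurtoses. A minimal sketch (function names are illustrative):

```python
def bernoulli_excess_kurtosis(p):
    # Standard closed form for Bernoulli(p): (1 - 6pq) / (pq), q = 1 - p.
    q = 1.0 - p
    return (1.0 - 6.0 * p * q) / (p * q)

def binomial_excess_kurtosis(n, p):
    # Standard closed form for Binomial(n, p): (1 - 6pq) / (npq).
    q = 1.0 - p
    return (1.0 - 6.0 * p * q) / (n * p * q)

n, p = 7, 0.3
# Excess kurtosis of the sum = (1/n^2) * sum of the n individual excesses.
lhs = binomial_excess_kurtosis(n, p)
rhs = sum(bernoulli_excess_kurtosis(p) for _ in range(n)) / n ** 2
assert abs(lhs - rhs) < 1e-12
```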
As noted by Westfall in 2014,{{r|Westfall2014}} "...''its unambiguous interpretation relates to tail extremity''". Specifically, it reflects either the presence of existing outliers (for sample kurtosis) or the tendency to produce outliers (for the kurtosis of a probability distribution). The underlying logic is straightforward: kurtosis is the average (or [[expected value]]) of standardized data raised to the fourth power. Standardized values less than 1 (corresponding to data within one standard deviation of the mean, where the “peak” occurs) contribute minimally to kurtosis, because raising a number less than 1 to the fourth power brings it closer to zero. The meaningful contributors to kurtosis are data values outside the peak region, i.e., the outliers. Therefore, kurtosis primarily measures outliers and provides no information about the central "peak".

Numerous misconceptions about kurtosis relate to notions of peakedness. One such misconception is that kurtosis measures both the “peakedness” of a distribution and the [[heavy-tailed distribution|heaviness of its tail]].{{r|Balanda1988}} Other incorrect interpretations include notions like “lack of shoulders” (where the “shoulder” refers vaguely to the area between the peak and the tail, or more specifically, the region about one [[standard deviation]] from the mean) or “bimodality.”{{r|Darlington1970}}

Balanda and [[Helen MacGillivray|MacGillivray]] argue that the standard definition of kurtosis “poorly captures the kurtosis, peakedness, or tail weight of a distribution.” Instead, they propose a vague definition of kurtosis as the location- and scale-free movement of [[probability mass]] from the distribution’s shoulders into its center and tails.
{{r|Balanda1988}}

=== Moors' interpretation ===
In 1986, Moors gave an interpretation of kurtosis.{{r|Moors1986}} Let
<math display="block"> Z = \frac{ X - \mu }{ \sigma }, </math>
where {{mvar|X}} is a random variable, {{mvar|μ}} is the mean and {{mvar|σ}} is the standard deviation.

By the definition of the kurtosis <math> \kappa </math> and the well-known identity <math> \operatorname{E}\left[V^2\right] = \operatorname{var}[V] + \operatorname{E}[V]^2, </math>
<math display="block">\begin{align} \kappa & = \operatorname{E}\left[ Z^4 \right] \\ & = \operatorname{var}\left[ Z^2 \right] + \operatorname{E}{\!\left[Z^2\right]}^2 \\ & = \operatorname{var}\left[ Z^2 \right] + \operatorname{var}[Z]^2 = \operatorname{var}\left[ Z^2 \right] + 1. \end{align}</math>

The kurtosis can now be seen as a measure of the dispersion of {{math|''Z''<sup>2</sup>}} around its expectation. Alternatively, it can be seen as a measure of the dispersion of {{mvar|Z}} around {{math|+1}} and {{math|−1}}: {{mvar|κ}} attains its minimal value in a symmetric two-point distribution. In terms of the original variable {{mvar|X}}, the kurtosis is a measure of the dispersion of {{mvar|X}} around the two values {{math|''μ'' ± ''σ''}}.

High values of {{mvar|κ}} arise in two circumstances:
* where the probability mass is concentrated around the mean and the data-generating process produces occasional values far from the mean;
* where the probability mass is concentrated in the tails of the distribution.

=== Maximal entropy ===
The [[Differential entropy|entropy]] of a distribution is <math display="inline">-\!\int p(x) \ln p(x) \, dx</math>. For any <math>\mu \in \R^n</math> and positive-definite <math>\Sigma \in \R^{n\times n}</math>, among all probability distributions on <math>\R^n</math> with mean <math>\mu</math> and covariance <math>\Sigma</math>, the normal distribution <math>\mathcal N(\mu, \Sigma)</math> has the largest entropy.
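The maximum-entropy property can be illustrated in one dimension with closed-form differential entropies (the comparison distributions and scalings below are chosen only for illustration): among the normal, uniform, and Laplace distributions with the same variance, the normal has the largest entropy.

```python
import math

# Closed-form differential entropies, all scaled to unit variance.
var = 1.0
h_normal = 0.5 * math.log(2.0 * math.pi * math.e * var)   # N(0, var)
a = math.sqrt(3.0 * var)           # Uniform[-a, a] has variance a^2 / 3
h_uniform = math.log(2.0 * a)
b = math.sqrt(var / 2.0)           # Laplace(b) has variance 2 b^2
h_laplace = 1.0 + math.log(2.0 * b)

# The normal beats both competitors at equal variance.
assert h_normal > h_uniform and h_normal > h_laplace
```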
Since the mean <math>\mu</math> and covariance <math>\Sigma</math> are the first two moments, it is natural to consider extensions to higher moments. Indeed, by the [[Lagrange multiplier]] method, for any prescribed first <math>n</math> moments, if there exists some probability distribution of the form <math>p(x) \propto e^{\sum_i a_i x_i + \sum_{ij} b_{ij} x_i x_j + \cdots + \sum_{i_1 \cdots i_n} c_{i_1 \cdots i_n} x_{i_1} \cdots x_{i_n}}</math> that has the prescribed moments, then it is the maximal entropy distribution under the given constraints.<ref>{{Cite journal |last=Tagliani |first=A. |date=1990-12-01 |title=On the existence of maximum entropy distributions with four and more assigned moments |url=https://www.sciencedirect.com/science/article/abs/pii/026689209090017E |journal=Probabilistic Engineering Mechanics |volume=5 |issue=4 |pages=167–170 |doi=10.1016/0266-8920(90)90017-E |bibcode=1990PEngM...5..167T |issn=0266-8920}}</ref><ref>{{Cite journal |last1=Rockinger |first1=Michael |last2=Jondeau |first2=Eric |date=2002-01-01 |title=Entropy densities with an application to autoregressive conditional skewness and kurtosis |url=https://www.sciencedirect.com/science/article/pii/S0304407601000926 |journal=Journal of Econometrics |volume=106 |issue=1 |pages=119–142 |doi=10.1016/S0304-4076(01)00092-6 |issn=0304-4076}}</ref>

By series expansion,
<math display="block">\begin{align} & \int \frac{1}{\sqrt{2\pi}} e^{-\frac 12 x^2 - \frac 14 gx^4} x^{2n} \, dx \\[6pt] &= \frac{1}{\sqrt{2\pi}} \sum_k \frac{1}{k!} \left(-\frac{g}{4}\right)^k \int e^{-\frac 12 x^2} x^{2n+4k} \, dx \\[6pt] &= \sum_k \frac{1}{k!} \left(-\frac{g}{4}\right)^k (2n+4k-1)!! \\[6pt] &= (2n-1)!! - \tfrac{1}{4} g\,(2n+3)!! + O(g^2), \end{align}</math>
so if a random variable has probability distribution <math>p(x) = e^{-\frac 12 x^2 - \frac 14 gx^4}/Z</math>, where <math>Z</math> is a normalization constant, then its kurtosis is {{nowrap|<math>3 - 6g + O(g^2)</math>.}}<ref>{{Cite journal |last1=Bradde |first1=Serena |last2=Bialek |first2=William |date=2017-05-01 |title=PCA Meets RG |url=https://doi.org/10.1007/s10955-017-1770-6 |journal=Journal of Statistical Physics |language=en |volume=167 |issue=3 |pages=462–475 |doi=10.1007/s10955-017-1770-6 |issn=1572-9613 |pmc=6054449 |pmid=30034029|arxiv=1610.09733 |bibcode=2017JSP...167..462B }}</ref>
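The first-order result can be checked by direct numerical integration, with no series expansion at all. The sketch below (a plain Riemann sum with truncation at |''x''| = 10; the grid and tolerances are chosen for illustration) computes the kurtosis of the quartic-perturbed density and compares it with <math>3 - 6g</math> for a small <math>g</math>:

```python
import math

def kurtosis_quartic(g, lo=-10.0, hi=10.0, steps=40001):
    """Kurtosis of p(x) proportional to exp(-x^2/2 - g*x^4/4),
    computed by Riemann summation on [lo, hi]."""
    h = (hi - lo) / (steps - 1)
    z = m2 = m4 = 0.0
    for i in range(steps):
        x = lo + i * h
        w = math.exp(-0.5 * x * x - 0.25 * g * x ** 4) * h
        z += w            # normalization constant Z
        m2 += x * x * w   # unnormalized second moment
        m4 += x ** 4 * w  # unnormalized fourth moment
    return (m4 / z) / (m2 / z) ** 2

# Sanity check: g = 0 is the standard normal, whose kurtosis is 3.
assert abs(kurtosis_quartic(0.0) - 3.0) < 1e-6

# For small g, the kurtosis is close to 3 - 6g, as derived above.
g = 0.005
assert abs(kurtosis_quartic(g) - (3.0 - 6.0 * g)) < 1e-2
```

The residual discrepancy at finite <math>g</math> is the <math>O(g^2)</math> term, which shrinks quadratically as <math>g \to 0</math>.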