===Modified Thompson Tau test===
{{see also|Studentized residual#Distribution}}
The modified Thompson Tau test is a method used to determine if an outlier exists in a data set.<ref>{{Cite web |last=Wheeler |first=Donald J. |date=11 January 2021 |title=Some Outlier Tests: Part 2 |url=https://www.qualitydigest.com/inside/statistics-column/some-outlier-tests-part-2-011121.html |access-date=2025-02-09 |website=Quality Digest |language=en}}</ref> The strength of this method lies in the fact that it takes into account a data set's standard deviation and average, and provides a statistically determined rejection zone, thus giving an objective method for deciding whether a data point is an outlier.{{Citation needed|reason=Although intuitively appealing, this method appears to be unpublished (it is ''not'' described in Thompson (1985)), so one should use it with caution.|date=October 2016}}<ref>Thompson, R. (1985). "[https://www.jstor.org/stable/2345543?seq=1#page_scan_tab_contents A Note on Restricted Maximum Likelihood Estimation with an Alternative Outlier Model]". Journal of the Royal Statistical Society. Series B (Methodological), Vol. 47, No. 1, pp. 53–55.</ref>

How it works: first, the data set's average is determined. Next, the absolute deviation between each data point and the average is determined. Thirdly, a rejection region is determined using the formula:

:<math>\text{Rejection Region} = \frac{t_{\alpha/2}\left( n-1 \right)}{\sqrt{n}\sqrt{n-2+t_{\alpha/2}^2}},</math>

where <math>t_{\alpha/2}</math> is the critical value from the Student {{mvar|t}} distribution with ''n'' − 2 degrees of freedom and ''n'' is the sample size.

To determine if a value is an outlier, calculate <math>\delta = \left|X - \bar{X}\right| / s</math>, where ''s'' is the sample standard deviation. If ''δ'' > Rejection Region, the data point is an outlier; if ''δ'' ≤ Rejection Region, it is not.

The modified Thompson Tau test finds one outlier at a time (the largest value of ''δ'' is removed if it is an outlier): if a data point is found to be an outlier, it is removed from the data set and the test is applied again with a new average and rejection region. This process is continued until no outliers remain in the data set, as in the sketch below.
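The following is a minimal sketch of this iterative procedure in Python, assuming SciPy is available for the Student {{mvar|t}} critical value; the function name, variable names, and example data are illustrative only and are not part of any published implementation of the test.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

def modified_thompson_tau(data, alpha=0.05):
    """Iteratively remove suspected outliers using the modified Thompson Tau test (sketch)."""
    values = list(data)
    outliers = []
    while len(values) > 2:                      # need n - 2 > 0 degrees of freedom
        n = len(values)
        mean = np.mean(values)
        s = np.std(values, ddof=1)              # sample standard deviation
        if s == 0:                              # all remaining values identical
            break
        t = stats.t.ppf(1 - alpha / 2, n - 2)   # critical value with n - 2 d.o.f.
        tau = (t * (n - 1)) / (np.sqrt(n) * np.sqrt(n - 2 + t**2))   # rejection region
        deltas = np.abs(np.array(values) - mean) / s                 # delta for each point
        i = int(np.argmax(deltas))              # test only the most extreme point
        if deltas[i] > tau:
            outliers.append(values.pop(i))      # remove it and repeat with a new mean and tau
        else:
            break                               # no remaining outliers
    return values, outliers

# Illustrative usage with arbitrary example values:
cleaned, removed = modified_thompson_tau([48.9, 49.2, 49.2, 49.3, 49.3, 49.8, 50.1, 55.0])
</syntaxhighlight>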
Some work has also examined outliers for nominal (or categorical) data. In the context of a set of examples (or instances) in a data set, instance hardness measures the probability that an instance will be misclassified (<math>1-p(y|x)</math>, where {{mvar|y}} is the assigned class label and {{mvar|x}} represents the input attribute values for an instance in the training set {{mvar|t}}).<ref>Smith, M.R.; Martinez, T.; Giraud-Carrier, C. (2014). "[https://link.springer.com/article/10.1007%2Fs10994-013-5422-z An Instance Level Analysis of Data Complexity]". Machine Learning, 95(2): 225–256.</ref> Ideally, instance hardness would be calculated by summing over the set of all possible hypotheses {{mvar|H}}:

:<math>\begin{align}IH(\langle x, y\rangle) &= \sum_H (1 - p(y, x, h))p(h|t)\\ &= \sum_H p(h|t) - p(y, x, h)p(h|t)\\ &= 1- \sum_H p(y, x, h)p(h|t).\end{align}</math>

Practically, this formulation is infeasible, as {{mvar|H}} is potentially infinite and <math>p(h|t)</math> is unknown for many algorithms. Thus, instance hardness can be approximated using a diverse subset <math>L \subset H</math>:

:<math>IH_L (\langle x,y\rangle) = 1 - \frac{1}{|L|} \sum_{j=1}^{|L|} p(y|x, g_j(t, \alpha)),</math>

where <math>g_j(t, \alpha)</math> is the hypothesis induced by learning algorithm <math>g_j</math> trained on training set {{mvar|t}} with hyperparameters <math>\alpha</math>. Instance hardness provides a continuous value for determining whether an instance is an outlier.
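The following is a minimal sketch of this ensemble approximation in Python, assuming a small set of scikit-learn classifiers stands in for the diverse subset ''L''; the choice of learners and the use of training-set probabilities are illustrative assumptions rather than a prescribed procedure.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def instance_hardness(X, y, learners=None):
    """Approximate IH_L: 1 minus the mean probability the learners assign to each true label (sketch)."""
    if learners is None:   # illustrative stand-ins for the diverse subset L
        learners = [LogisticRegression(max_iter=1000), GaussianNB(), DecisionTreeClassifier()]
    probs = []
    for clf in learners:
        clf.fit(X, y)                            # g_j(t, alpha): learner induced from training set t
        p = clf.predict_proba(X)                 # class probabilities p(. | x, g_j(t, alpha))
        class_index = {c: i for i, c in enumerate(clf.classes_)}
        probs.append([row[class_index[label]] for row, label in zip(p, y)])
    return 1.0 - np.mean(probs, axis=0)          # IH_L for each instance

# Example (hypothetical arrays): hardness = instance_hardness(X_train, y_train)
# Values near 1 flag instances that the learners consistently misclassify.
</syntaxhighlight>

In practice, cross-validated or held-out probability estimates may be preferred over training-set probabilities, so that flexible learners do not trivially assign probability 1 to their own training instances.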