== Definitions and detection ==
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.<ref name="ZimekFilzmoser2018">{{cite journal|last1=Zimek|first1=Arthur|last2=Filzmoser|first2=Peter|title=There and back again: Outlier detection between statistical reasoning and data mining algorithms|journal=Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery|volume=8|issue=6|year=2018|pages=e1280|issn=1942-4787|doi=10.1002/widm.1280|s2cid=53305944|url=https://findresearcher.sdu.dk:8443/ws/files/153197807/There_and_Back_Again.pdf|access-date=2019-12-11|archive-date=2021-11-14|archive-url=https://web.archive.org/web/20211114121638/https://findresearcher.sdu.dk:8443/ws/files/153197807/There_and_Back_Again.pdf|url-status=dead}}</ref> There are various methods of outlier detection, some of which are treated as synonymous with novelty detection.<ref>Pimentel, M. A.; Clifton, D. A.; Clifton, L.; Tarassenko, L. (2014). "A review of novelty detection". ''Signal Processing'', 99, 215–249.</ref><ref>{{citation |last1=Rousseeuw |first1=P. |author1-link=Peter Rousseeuw |last2=Leroy |first2=A. |year=1996 |title=Robust Regression and Outlier Detection |publisher=John Wiley & Sons |edition=3rd |title-link=Robust Regression and Outlier Detection}}</ref><ref>{{citation |first1=Victoria J. |last1=Hodge |first2=Jim |last2=Austin |title=A Survey of Outlier Detection Methodologies |journal=Artificial Intelligence Review |volume=22 |issue=2 |pages=85–126 |doi=10.1023/B:AIRE.0000045502.10941.a9 |year=2004 |citeseerx=10.1.1.109.1943 |s2cid=3330313}}</ref><ref>{{Citation |last1=Barnett |first1=Vic |last2=Lewis |first2=Toby |year=1994 |orig-year=1978 |title=Outliers in Statistical Data |edition=3 |publisher=Wiley |isbn=978-0-471-93094-5}}</ref><ref name="subspace" /> Some are graphical, such as [[normal probability plot]]s; others are model-based. [[Box plot]]s are a hybrid.

Model-based methods which are commonly used for identification assume that the data are from a normal distribution, and identify observations which are deemed "unlikely" based on mean and standard deviation (a minimal sketch of this shared pattern follows the list):
* [[Chauvenet's criterion]]
* [[Grubbs's test for outliers]]
* [[Dixon's Q test|Dixon's ''Q'' test]]
* [[ASTM]] E178: Standard Practice for Dealing With Outlying Observations<ref>[https://www.nrc.gov/docs/ML1023/ML102371244.pdf E178: Standard Practice for Dealing With Outlying Observations]</ref>
* [[Mahalanobis distance]] and [[leverage (statistics)|leverage]] are often used to detect outliers, especially in the development of linear regression models.
* Subspace- and correlation-based techniques for high-dimensional numerical data<ref name="subspace">{{cite journal |last1=Zimek |first1=A. |last2=Schubert |first2=E. |last3=Kriegel |first3=H.-P. |author-link3=Hans-Peter Kriegel |title=A survey on unsupervised outlier detection in high-dimensional numerical data |doi=10.1002/sam.11161 |journal=Statistical Analysis and Data Mining |volume=5 |issue=5 |pages=363–387 |year=2012 |s2cid=6724536}}</ref>
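These tests all standardize an observation's deviation from the sample mean by the sample standard deviation and compare it against a cutoff; they differ in how the cutoff is chosen. A minimal sketch of that shared pattern in Python, assuming a fixed illustrative threshold <code>z_max</code> rather than the sample-size-dependent cutoff of any particular named test:

<syntaxhighlight lang="python">
from statistics import mean, stdev

def flag_by_z_score(data, z_max=2.5):
    """Flag observations more than z_max sample standard deviations
    from the sample mean. This is the generic pattern behind tests
    such as Chauvenet's criterion and Grubbs's test, each of which
    sets its threshold differently (typically as a function of n)."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > z_max * s]

# Note that the outlier itself inflates the standard deviation,
# which is one reason formal tests adjust the cutoff for sample size.
print(flag_by_z_score([2.1, 2.3, 1.9, 2.2, 2.0, 2.4, 1.8, 2.2, 2.1, 9.7]))  # [9.7]
</syntaxhighlight>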
===Peirce's criterion===
{{main|Peirce's criterion}}
<blockquote>It is proposed to determine in a series of <math>m</math> observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as <math>n</math> such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations. (Quoted in the editorial note on page 516 to Peirce (1982 edition) from ''A Manual of Astronomy'' 2:558 by Chauvenet.)<ref>[[Benjamin Peirce]], [http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..161P;data_type=PDF_HIGH "Criterion for the Rejection of Doubtful Observations"], ''Astronomical Journal'' II 45 (1852) and [http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..176P;data_type=PDF_HIGH Errata to the original paper].</ref><ref>{{cite journal |title=On Peirce's criterion |author-link=Benjamin Peirce |first=Benjamin |last=Peirce |journal=Proceedings of the American Academy of Arts and Sciences |volume=13 |date=May 1877 – May 1878 |pages=348–351 |jstor=25138498 |doi=10.2307/25138498}}</ref><ref>{{cite journal |first=Charles Sanders |last=Peirce |author-link=Charles Sanders Peirce |title=Appendix No. 21. On the Theory of Errors of Observation |journal=Report of the Superintendent of the United States Coast Survey Showing the Progress of the Survey During the Year 1870 |orig-year=1870 |year=1873 |pages=200–224}}. NOAA [http://docs.lib.noaa.gov/rescue/cgs/001_pdf/CSC-0019.PDF#page=215 PDF Eprint] (goes to Report p. 200, PDF's p. 215).</ref><ref>{{cite book |first=Charles Sanders |last=Peirce |author-link=Charles Sanders Peirce |contribution=On the Theory of Errors of Observation |title=Writings of Charles S. Peirce: A Chronological Edition |volume=3, 1872–1878 |editor=Kloesel, Christian J. W. |display-editors=etal |publisher=Indiana University Press |location=Bloomington, Indiana |orig-year=1982 |year=1986 <!-- copyright=1986, but publication is listed as 1982 --> |pages=[https://archive.org/details/writingsofcharle0002peir/page/140 140–160] |isbn=978-0-253-37201-7 |url=https://archive.org/details/writingsofcharle0002peir/page/140}} – Appendix 21, according to the editorial note on page 515</ref></blockquote>

===Tukey's fences===
Other methods flag observations based on measures such as the [[interquartile range]]. For example, if <math>Q_1</math> and <math>Q_3</math> are the lower and upper [[quartile]]s respectively, then one could define an outlier to be any observation outside the range:
:<math>\big[ Q_1 - k (Q_3 - Q_1), \; Q_3 + k (Q_3 - Q_1) \big]</math>
for some nonnegative constant <math>k</math>. [[John Tukey]] proposed this test, where <math>k=1.5</math> indicates an "outlier", and <math>k=3</math> indicates data that is "far out".<ref>{{cite book |last=Tukey |first=John W |title=Exploratory Data Analysis |year=1977 |publisher=Addison-Wesley |isbn=978-0-201-07616-5 |oclc=3058187 |url=https://archive.org/details/exploratorydataa00tuke_0}}</ref>
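A minimal sketch of Tukey's fences; the data set is illustrative, and real implementations differ slightly in how they estimate the quartiles:

<syntaxhighlight lang="python">
def tukey_fences(data, k=1.5):
    """Return the (lower, upper) Tukey fences Q1 - k*IQR and Q3 + k*IQR.
    k=1.5 marks "outliers"; k=3 marks "far out" observations."""
    xs = sorted(data)

    def quantile(q):
        # Linear interpolation between order statistics; quartile
        # conventions vary slightly between implementations.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [52, 56, 53, 57, 51, 59, 54, 110]
low, high = tukey_fences(data)
print([x for x in data if x < low or x > high])  # [110]
</syntaxhighlight>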
=== In anomaly detection ===
{{main|Anomaly detection}}
In various domains such as, but not limited to, [[statistics]], [[signal processing]], [[finance]], [[econometrics]], [[manufacturing]], [[Network science|networking]] and [[data mining]], the task of ''anomaly detection'' may take other approaches. Some of these may be distance-based<ref>{{Cite journal |doi=10.1007/s007780050006 |title=Distance-based outliers: Algorithms and applications |journal=The VLDB Journal |volume=8 |issue=3–4 |pages=237 |year=2000 |last1=Knorr |first1=E. M. |last2=Ng |first2=R. T. |last3=Tucakov |first3=V. |citeseerx=10.1.1.43.1842 |s2cid=11707259}}</ref><ref>{{Cite conference |doi=10.1145/342009.335437 |title=Efficient algorithms for mining outliers from large data sets |conference=Proceedings of the 2000 ACM SIGMOD international conference on Management of data - SIGMOD '00 |pages=427 |year=2000 |last1=Ramaswamy |first1=S. |last2=Rastogi |first2=R. |last3=Shim |first3=K. |isbn=1581132174}}</ref> and density-based, such as the [[Local Outlier Factor]] (LOF).<ref>{{Cite conference |doi=10.1145/335191.335388 |title=LOF: Identifying Density-based Local Outliers |year=2000 |last1=Breunig |first1=M. M. |last2=Kriegel |first2=H.-P. |author-link2=Hans-Peter Kriegel |last3=Ng |first3=R. T. |last4=Sander |first4=J. |work=Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data |series=[[SIGMOD]] |isbn=1-58113-217-4 |pages=93–104 |url=http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf}}</ref> Some approaches may use the distance to the [[k-nearest neighbor]]s to label observations as outliers or non-outliers.<ref>{{Cite journal |last1=Schubert |first1=E. |last2=Zimek |first2=A. |last3=Kriegel |first3=H.-P. |author-link3=Hans-Peter Kriegel |doi=10.1007/s10618-012-0300-z |title=Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection |journal=Data Mining and Knowledge Discovery |volume=28 |pages=190–237 |year=2012 |s2cid=19036098}}</ref>

===Modified Thompson Tau test===
{{see also|Studentized residual#Distribution}}
The modified Thompson Tau test is a method used to determine if an outlier exists in a data set.<ref>{{Cite web |last=Wheeler |first=Donald J. |date=11 January 2021 |title=Some Outlier Tests: Part 2 |url=https://www.qualitydigest.com/inside/statistics-column/some-outlier-tests-part-2-011121.html |access-date=2025-02-09 |website=Quality Digest |language=en}}</ref> The strength of this method lies in the fact that it takes into account a data set's standard deviation and average, and provides a statistically determined rejection zone, thus giving an objective method for deciding whether a data point is an outlier.{{Citation needed|reason=Although intuitively appealing, this method appears to be unpublished (it is ''not'' described in Thompson (1985)) so one should use it with caution.|date=October 2016}}<ref>Thompson, R. (1985). "[https://www.jstor.org/stable/2345543?seq=1#page_scan_tab_contents A Note on Restricted Maximum Likelihood Estimation with an Alternative Outlier Model]". ''Journal of the Royal Statistical Society''. Series B (Methodological), Vol. 47, No. 1, pp. 53–55.</ref>

How it works: first, a data set's average is determined. Next, the absolute deviation between each data point and the average is determined. Third, a rejection region is determined using the formula:
:<math>\text{Rejection Region} = \frac{t_{\alpha/2}\,(n-1)}{\sqrt{n}\,\sqrt{n-2+t_{\alpha/2}^{2}}}</math>
where <math>t_{\alpha/2}</math> is the critical value from the Student {{mvar|t}} distribution with ''n''−2 degrees of freedom and ''n'' is the sample size.

To determine if a value is an outlier, calculate <math>\delta = |X - \bar{X}|/s</math>, where <math>\bar{X}</math> is the sample average and ''s'' is the sample standard deviation. If ''δ'' > Rejection Region, the data point is an outlier; if ''δ'' ≤ Rejection Region, it is not. The modified Thompson Tau test finds one outlier at a time (the point with the largest ''δ'' is removed if it is an outlier): when a data point is found to be an outlier, it is removed from the data set and the test is applied again with a new average and rejection region. This process continues until no outliers remain in the data set, as in the sketch below.
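A minimal sketch of this iterative procedure, assuming SciPy is available for the Student ''t'' critical value (the data set and significance level ''α'' are illustrative):

<syntaxhighlight lang="python">
from statistics import mean, stdev
from scipy import stats  # assumed available for the t critical value

def modified_thompson_tau(data, alpha=0.05):
    """Repeatedly remove the single most deviant point while its
    delta exceeds the tau rejection region described above."""
    data = list(data)
    outliers = []
    while len(data) > 2:                      # need n - 2 >= 1 degrees of freedom
        n = len(data)
        t = stats.t.ppf(1 - alpha / 2, n - 2)
        tau = t * (n - 1) / (n ** 0.5 * (n - 2 + t ** 2) ** 0.5)
        m, s = mean(data), stdev(data)
        candidate = max(data, key=lambda x: abs(x - m))
        if abs(candidate - m) / s > tau:
            outliers.append(candidate)
            data.remove(candidate)            # recompute average and region next pass
        else:
            break                             # no remaining outliers
    return outliers

print(modified_thompson_tau([48.9, 49.2, 49.2, 49.3, 49.3, 49.8, 49.9, 55.1]))
# [55.1]
</syntaxhighlight>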
Some work has also examined outliers for nominal (or categorical) data. In the context of a set of examples (or instances) in a data set, instance hardness measures the probability that an instance will be misclassified (<math>1-p(y|x)</math>, where {{mvar|y}} is the assigned class label and {{mvar|x}} represents the input attribute values for an instance in the training set {{mvar|t}}).<ref>Smith, M. R.; Martinez, T.; Giraud-Carrier, C. (2014). "[https://link.springer.com/article/10.1007%2Fs10994-013-5422-z An Instance Level Analysis of Data Complexity]". ''Machine Learning'', 95(2): 225–256.</ref> Ideally, instance hardness would be calculated by summing over the set of all possible hypotheses {{mvar|H}}:
:<math>\begin{align}IH(\langle x, y\rangle) &= \sum_H (1 - p(y, x, h))\,p(h|t)\\ &= \sum_H p(h|t) - p(y, x, h)\,p(h|t)\\ &= 1 - \sum_H p(y, x, h)\,p(h|t).\end{align}</math>
In practice, this formulation is infeasible: {{mvar|H}} is potentially infinite, and <math>p(h|t)</math> is unknown for many algorithms. Thus, instance hardness can be approximated using a diverse subset <math>L \subset H</math>:
:<math>IH_L (\langle x,y\rangle) = 1 - \frac{1}{|L|} \sum_{j=1}^{|L|} p(y|x, g_j(t, \alpha))</math>
where <math>g_j(t, \alpha)</math> is the hypothesis induced by learning algorithm <math>g_j</math> trained on training set {{mvar|t}} with hyperparameters <math>\alpha</math>. Instance hardness provides a continuous value for determining whether an instance is an outlier.
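A minimal sketch of the <math>IH_L</math> approximation, assuming scikit-learn and using three off-the-shelf learners as the diverse subset {{mvar|L}}; for brevity it estimates <math>p(y|x)</math> by resubstitution on the training set, whereas a more faithful estimate would use cross-validation:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def instance_hardness(X, y):
    """Approximate IH_L: average 1 - p(y|x) over a small, diverse
    set of learners, each trained on the full training set."""
    learners = [GaussianNB(),
                KNeighborsClassifier(n_neighbors=3),
                DecisionTreeClassifier(max_depth=3)]
    p_true = []
    for clf in learners:
        clf.fit(X, y)
        proba = clf.predict_proba(X)               # p(class | x) per instance
        cols = np.searchsorted(clf.classes_, y)    # column of the true label
        p_true.append(proba[np.arange(len(y)), cols])
    return 1.0 - np.mean(p_true, axis=0)           # hardness in [0, 1]

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [0.15]])
y = np.array([0, 0, 0, 1, 1, 1])  # the last instance looks mislabeled
print(instance_hardness(X, y).round(2))  # last value is the largest
</syntaxhighlight>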