== Definitions and detection ==
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.<ref name="ZimekFilzmoser2018">{{cite journal|last1=Zimek|first1=Arthur|last2=Filzmoser|first2=Peter|title=There and back again: Outlier detection between statistical reasoning and data mining algorithms|journal=Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery|volume=8|issue=6|year=2018|pages=e1280|issn=1942-4787|doi=10.1002/widm.1280|s2cid=53305944|url=https://findresearcher.sdu.dk:8443/ws/files/153197807/There_and_Back_Again.pdf|access-date=2019-12-11|archive-date=2021-11-14|archive-url=https://web.archive.org/web/20211114121638/https://findresearcher.sdu.dk:8443/ws/files/153197807/There_and_Back_Again.pdf|url-status=dead}}</ref> There are various methods of outlier detection, some of which are treated as synonymous with novelty detection.<ref>Pimentel, M. A.; Clifton, D. A.; Clifton, L.; Tarassenko, L. (2014). "A review of novelty detection". ''Signal Processing'', 99, 215–249.</ref><ref>{{citation |last1=Rousseeuw |first1=P. |author1-link=Peter Rousseeuw |last2=Leroy |first2=A. |year=1996 |title=Robust Regression and Outlier Detection |publisher=John Wiley & Sons |edition=3rd |title-link=Robust Regression and Outlier Detection}}</ref><ref>{{citation |first1=Victoria J. |last1=Hodge |first2=Jim |last2=Austin |title=A Survey of Outlier Detection Methodologies |journal=Artificial Intelligence Review |volume=22 |issue=2 |pages=85–126 |doi=10.1023/B:AIRE.0000045502.10941.a9 |year=2004 |citeseerx=10.1.1.109.1943 |s2cid=3330313}}</ref><ref>{{Citation |last1=Barnett |first1=Vic |last2=Lewis |first2=Toby |year=1994 |orig-year=1978 |title=Outliers in Statistical Data |edition=3 |publisher=Wiley |isbn=978-0-471-93094-5}}</ref><ref name="subspace" /> Some are graphical, such as [[normal probability plot]]s; others are model-based. [[Box plot]]s are a hybrid.

Model-based methods which are commonly used for identification assume that the data are from a normal distribution, and identify observations which are deemed "unlikely" based on mean and standard deviation (a minimal sketch of this shared pattern follows the list):
* [[Chauvenet's criterion]]
* [[Grubbs's test for outliers]]
* [[Dixon's Q test|Dixon's ''Q'' test]]
* [[ASTM]] E178: Standard Practice for Dealing With Outlying Observations<ref>[https://www.nrc.gov/docs/ML1023/ML102371244.pdf E178: Standard Practice for Dealing With Outlying Observations]</ref>
* [[Mahalanobis distance]] and [[leverage (statistics)|leverage]] are often used to detect outliers, especially in the development of linear regression models.
* Subspace- and correlation-based techniques for high-dimensional numerical data<ref name="subspace">{{cite journal |last1=Zimek |first1=A. |last2=Schubert |first2=E. |last3=Kriegel |first3=H.-P. |author-link3=Hans-Peter Kriegel |title=A survey on unsupervised outlier detection in high-dimensional numerical data |doi=10.1002/sam.11161 |journal=Statistical Analysis and Data Mining |volume=5 |issue=5 |pages=363–387 |year=2012 |s2cid=6724536}}</ref>
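These tests all standardize an observation's deviation from the sample mean by the sample standard deviation and compare it against a cutoff; they differ in how the cutoff is chosen. A minimal sketch of that shared pattern in Python, assuming a fixed illustrative threshold <code>z_max</code> rather than the sample-size-dependent cutoff of any particular named test:

<syntaxhighlight lang="python">
from statistics import mean, stdev

def flag_by_z_score(data, z_max=2.5):
    """Flag observations more than z_max sample standard deviations
    from the sample mean. This is the generic pattern behind tests
    such as Chauvenet's criterion and Grubbs's test, each of which
    sets its threshold differently (typically as a function of n)."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > z_max * s]

# Note that the outlier itself inflates the standard deviation,
# which is one reason formal tests adjust the cutoff for sample size.
print(flag_by_z_score([2.1, 2.3, 1.9, 2.2, 2.0, 2.4, 1.8, 2.2, 2.1, 9.7]))  # [9.7]
</syntaxhighlight>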
===Peirce's criterion===
{{main|Peirce's criterion}}
<blockquote>It is proposed to determine in a series of <math>m</math> observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as <math>n</math> such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations. (Quoted in the editorial note on page 516 to Peirce (1982 edition) from ''A Manual of Astronomy'' 2:558 by Chauvenet.)<ref>[[Benjamin Peirce]], [http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..161P;data_type=PDF_HIGH "Criterion for the Rejection of Doubtful Observations"], ''Astronomical Journal'' II 45 (1852) and [http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..176P;data_type=PDF_HIGH Errata to the original paper].</ref><ref>{{cite journal |title=On Peirce's criterion |author-link=Benjamin Peirce |first=Benjamin |last=Peirce |journal=Proceedings of the American Academy of Arts and Sciences |volume=13 |date=May 1877 – May 1878 |pages=348–351 |jstor=25138498 |doi=10.2307/25138498}}</ref><ref>{{cite journal |first=Charles Sanders |last=Peirce |author-link=Charles Sanders Peirce |title=Appendix No. 21. On the Theory of Errors of Observation |journal=Report of the Superintendent of the United States Coast Survey Showing the Progress of the Survey During the Year 1870 |orig-year=1870 |year=1873 |pages=200–224}}. NOAA [http://docs.lib.noaa.gov/rescue/cgs/001_pdf/CSC-0019.PDF#page=215 PDF Eprint] (goes to Report p. 200, PDF's p. 215).</ref><ref>{{cite book |first=Charles Sanders |last=Peirce |author-link=Charles Sanders Peirce |contribution=On the Theory of Errors of Observation |title=Writings of Charles S. Peirce: A Chronological Edition |volume=3, 1872–1878 |editor=Kloesel, Christian J. W. |display-editors=etal |publisher=Indiana University Press |location=Bloomington, Indiana |orig-year=1982 |year=1986 <!-- copyright=1986, but publication is listed as 1982 --> |pages=[https://archive.org/details/writingsofcharle0002peir/page/140 140–160] |isbn=978-0-253-37201-7 |url=https://archive.org/details/writingsofcharle0002peir/page/140}} – Appendix 21, according to the editorial note on page 515</ref></blockquote>

===Tukey's fences===
Other methods flag observations based on measures such as the [[interquartile range]]. For example, if <math>Q_1</math> and <math>Q_3</math> are the lower and upper [[quartile]]s respectively, then one could define an outlier to be any observation outside the range:
:<math>\big[ Q_1 - k (Q_3 - Q_1), \; Q_3 + k (Q_3 - Q_1) \big]</math>
for some nonnegative constant <math>k</math>. [[John Tukey]] proposed this test, where <math>k=1.5</math> indicates an "outlier", and <math>k=3</math> indicates data that is "far out".<ref>{{cite book |last=Tukey |first=John W |title=Exploratory Data Analysis |year=1977 |publisher=Addison-Wesley |isbn=978-0-201-07616-5 |oclc=3058187 |url=https://archive.org/details/exploratorydataa00tuke_0}}</ref>
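A minimal sketch of Tukey's fences; the data set is illustrative, and real implementations differ slightly in how they estimate the quartiles:

<syntaxhighlight lang="python">
def tukey_fences(data, k=1.5):
    """Return the (lower, upper) Tukey fences Q1 - k*IQR and Q3 + k*IQR.
    k=1.5 marks "outliers"; k=3 marks "far out" observations."""
    xs = sorted(data)

    def quantile(q):
        # Linear interpolation between order statistics; quartile
        # conventions vary slightly between implementations.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [52, 56, 53, 57, 51, 59, 54, 110]
low, high = tukey_fences(data)
print([x for x in data if x < low or x > high])  # [110]
</syntaxhighlight>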
=== In anomaly detection ===
{{main|Anomaly detection}}
In various domains such as, but not limited to, [[statistics]], [[signal processing]], [[finance]], [[econometrics]], [[manufacturing]], [[Network science|networking]] and [[data mining]], the task of ''anomaly detection'' may take other approaches. Some of these may be distance-based<ref>{{Cite journal |doi=10.1007/s007780050006 |title=Distance-based outliers: Algorithms and applications |journal=The VLDB Journal |volume=8 |issue=3–4 |pages=237 |year=2000 |last1=Knorr |first1=E. M. |last2=Ng |first2=R. T. |last3=Tucakov |first3=V. |citeseerx=10.1.1.43.1842 |s2cid=11707259}}</ref><ref>{{Cite conference |doi=10.1145/342009.335437 |title=Efficient algorithms for mining outliers from large data sets |conference=Proceedings of the 2000 ACM SIGMOD international conference on Management of data - SIGMOD '00 |pages=427 |year=2000 |last1=Ramaswamy |first1=S. |last2=Rastogi |first2=R. |last3=Shim |first3=K. |isbn=1581132174}}</ref> and density-based, such as the [[Local Outlier Factor]] (LOF).<ref>{{Cite conference |doi=10.1145/335191.335388 |title=LOF: Identifying Density-based Local Outliers |year=2000 |last1=Breunig |first1=M. M. |last2=Kriegel |first2=H.-P. |author-link2=Hans-Peter Kriegel |last3=Ng |first3=R. T. |last4=Sander |first4=J. |work=Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data |series=[[SIGMOD]] |isbn=1-58113-217-4 |pages=93–104 |url=http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf}}</ref> Some approaches may use the distance to the [[k-nearest neighbor]]s to label observations as outliers or non-outliers.<ref>{{Cite journal |last1=Schubert |first1=E. |last2=Zimek |first2=A. |last3=Kriegel |first3=H.-P. |author-link3=Hans-Peter Kriegel |doi=10.1007/s10618-012-0300-z |title=Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection |journal=Data Mining and Knowledge Discovery |volume=28 |pages=190–237 |year=2012 |s2cid=19036098}}</ref>

===Modified Thompson Tau test===
{{see also|Studentized residual#Distribution}}
The modified Thompson Tau test is a method used to determine if an outlier exists in a data set.<ref>{{Cite web |last=Wheeler |first=Donald J. |date=11 January 2021 |title=Some Outlier Tests: Part 2 |url=https://www.qualitydigest.com/inside/statistics-column/some-outlier-tests-part-2-011121.html |access-date=2025-02-09 |website=Quality Digest |language=en}}</ref> The strength of this method lies in the fact that it takes into account a data set's standard deviation and average, and provides a statistically determined rejection zone, thus giving an objective method for deciding whether a data point is an outlier.{{Citation needed|reason=Although intuitively appealing, this method appears to be unpublished (it is ''not'' described in Thompson (1985)) so one should use it with caution.|date=October 2016}}<ref>Thompson, R. (1985). "[https://www.jstor.org/stable/2345543?seq=1#page_scan_tab_contents A Note on Restricted Maximum Likelihood Estimation with an Alternative Outlier Model]". ''Journal of the Royal Statistical Society''. Series B (Methodological), Vol. 47, No. 1, pp. 53–55.</ref>

How it works: first, a data set's average is determined. Next, the absolute deviation between each data point and the average is determined. Third, a rejection region is determined using the formula:
:<math>\text{Rejection Region} = \frac{t_{\alpha/2}\,(n-1)}{\sqrt{n}\,\sqrt{n-2+t_{\alpha/2}^{2}}}</math>
where <math>t_{\alpha/2}</math> is the critical value from the Student {{mvar|t}} distribution with ''n''−2 degrees of freedom and ''n'' is the sample size.

To determine if a value is an outlier, calculate <math>\delta = |X - \bar{X}|/s</math>, where <math>\bar{X}</math> is the sample average and ''s'' is the sample standard deviation. If ''δ'' > Rejection Region, the data point is an outlier; if ''δ'' ≤ Rejection Region, it is not. The modified Thompson Tau test finds one outlier at a time (the point with the largest ''δ'' is removed if it is an outlier): when a data point is found to be an outlier, it is removed from the data set and the test is applied again with a new average and rejection region. This process continues until no outliers remain in the data set, as in the sketch below.
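A minimal sketch of this iterative procedure, assuming SciPy is available for the Student ''t'' critical value (the data set and significance level ''α'' are illustrative):

<syntaxhighlight lang="python">
from statistics import mean, stdev
from scipy import stats  # assumed available for the t critical value

def modified_thompson_tau(data, alpha=0.05):
    """Repeatedly remove the single most deviant point while its
    delta exceeds the tau rejection region described above."""
    data = list(data)
    outliers = []
    while len(data) > 2:                      # need n - 2 >= 1 degrees of freedom
        n = len(data)
        t = stats.t.ppf(1 - alpha / 2, n - 2)
        tau = t * (n - 1) / (n ** 0.5 * (n - 2 + t ** 2) ** 0.5)
        m, s = mean(data), stdev(data)
        candidate = max(data, key=lambda x: abs(x - m))
        if abs(candidate - m) / s > tau:
            outliers.append(candidate)
            data.remove(candidate)            # recompute average and region next pass
        else:
            break                             # no remaining outliers
    return outliers

print(modified_thompson_tau([48.9, 49.2, 49.2, 49.3, 49.3, 49.8, 49.9, 55.1]))
# [55.1]
</syntaxhighlight>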
Some work has also examined outliers for nominal (or categorical) data. In the context of a set of examples (or instances) in a data set, instance hardness measures the probability that an instance will be misclassified (<math>1-p(y|x)</math>, where {{mvar|y}} is the assigned class label and {{mvar|x}} represents the input attribute values for an instance in the training set {{mvar|t}}).<ref>Smith, M. R.; Martinez, T.; Giraud-Carrier, C. (2014). "[https://link.springer.com/article/10.1007%2Fs10994-013-5422-z An Instance Level Analysis of Data Complexity]". ''Machine Learning'', 95(2): 225–256.</ref> Ideally, instance hardness would be calculated by summing over the set of all possible hypotheses {{mvar|H}}:
:<math>\begin{align}IH(\langle x, y\rangle) &= \sum_H (1 - p(y, x, h))\,p(h|t)\\ &= \sum_H p(h|t) - p(y, x, h)\,p(h|t)\\ &= 1 - \sum_H p(y, x, h)\,p(h|t).\end{align}</math>
In practice, this formulation is infeasible: {{mvar|H}} is potentially infinite, and <math>p(h|t)</math> is unknown for many algorithms. Thus, instance hardness can be approximated using a diverse subset <math>L \subset H</math>:
:<math>IH_L (\langle x,y\rangle) = 1 - \frac{1}{|L|} \sum_{j=1}^{|L|} p(y|x, g_j(t, \alpha))</math>
where <math>g_j(t, \alpha)</math> is the hypothesis induced by learning algorithm <math>g_j</math> trained on training set {{mvar|t}} with hyperparameters <math>\alpha</math>. Instance hardness provides a continuous value for determining whether an instance is an outlier.
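A minimal sketch of the <math>IH_L</math> approximation, assuming scikit-learn and using three off-the-shelf learners as the diverse subset {{mvar|L}}; for brevity it estimates <math>p(y|x)</math> by resubstitution on the training set, whereas a more faithful estimate would use cross-validation:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def instance_hardness(X, y):
    """Approximate IH_L: average 1 - p(y|x) over a small, diverse
    set of learners, each trained on the full training set."""
    learners = [GaussianNB(),
                KNeighborsClassifier(n_neighbors=3),
                DecisionTreeClassifier(max_depth=3)]
    p_true = []
    for clf in learners:
        clf.fit(X, y)
        proba = clf.predict_proba(X)               # p(class | x) per instance
        cols = np.searchsorted(clf.classes_, y)    # column of the true label
        p_true.append(proba[np.arange(len(y)), cols])
    return 1.0 - np.mean(p_true, axis=0)           # hardness in [0, 1]

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [0.15]])
y = np.array([0, 0, 0, 1, 1, 1])  # the last instance looks mislabeled
print(instance_hardness(X, y).round(2))  # last value is the largest
</syntaxhighlight>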