==Approximate quantiles from a stream==
Computing approximate quantiles from data arriving from a stream can be done efficiently using compressed data structures. The most popular methods are t-digest<ref name="Dunning2019">{{cite arXiv |author=Dunning, Ted |author2=Ertl, Otmar |title=Computing Extremely Accurate Quantiles Using t-Digests |date=February 2019 |eprint=1902.04023 |class=stat.CO}}</ref> and KLL.<ref name="Karnin2016">{{cite arXiv |author1=Zohar Karnin |author2=Kevin Lang |author3=Edo Liberty |title=Optimal Quantile Approximation in Streams |date=2016 |class=cs.DS |eprint=1603.05346}}</ref> These methods read a stream of values continuously and can, at any time, be queried about the approximate value of a specified quantile.

Both algorithms are based on a similar idea: compressing the stream of values by summarizing identical or similar values with a weight. If the stream consists of 100 repetitions of v1 and 100 repetitions of v2, there is no need to keep a sorted list of 200 elements; it is enough to keep two elements and two counts to recover the quantiles. With more values, these algorithms maintain a trade-off between the number of unique values stored and the precision of the resulting quantiles. Some values may be discarded from the stream and contribute only to the weight of a nearby value, without changing the quantile results too much. The t-digest maintains a [[data structure]] of bounded size using an approach motivated by ''k''-means clustering to group similar values. The KLL algorithm uses a more sophisticated "compactor" method that leads to better control of the error bounds, at the cost of requiring an unbounded size if errors must be bounded relative to {{mvar|p}}.

Both methods belong to the family of ''data sketches'', a subset of [[Streaming algorithm|streaming algorithms]] with useful properties: t-digest or KLL sketches can be combined. Computing the sketch for a very large vector of values can therefore be split into trivially parallel processes, where sketches are computed for partitions of the vector in parallel and later merged. A simplified illustration of this weighted-summary idea is sketched below.
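The following is a minimal sketch of the weighted-summary idea, not an implementation of t-digest or KLL: identical values are merged into (value, weight) pairs, sketches produced in parallel can be merged, and quantiles are read off the weighted cumulative counts. Unlike the real algorithms, it does not merge merely similar values, so its size is bounded only by the number of distinct values.

<syntaxhighlight lang="python">
import bisect


class WeightedSummary:
    """Toy quantile summary: distinct values stored with weights (counts)."""

    def __init__(self):
        self.values = []   # sorted distinct values
        self.weights = []  # weight (count) attached to each value

    def add(self, value, weight=1):
        i = bisect.bisect_left(self.values, value)
        if i < len(self.values) and self.values[i] == value:
            self.weights[i] += weight        # merge identical values
        else:
            self.values.insert(i, value)     # keep the list sorted
            self.weights.insert(i, weight)

    def merge(self, other):
        # Sketches computed on partitions of the data can be combined.
        for v, w in zip(other.values, other.weights):
            self.add(v, w)

    def quantile(self, p):
        # Return the smallest value whose cumulative weight reaches p.
        total = sum(self.weights)
        cumulative = 0
        for v, w in zip(self.values, self.weights):
            cumulative += w
            if cumulative >= p * total:
                return v
        return self.values[-1]


# 100 repetitions of 1.0 and 100 of 2.0 are stored as just two
# (value, weight) pairs, yet quantiles can still be recovered.
s = WeightedSummary()
for _ in range(100):
    s.add(1.0)
for _ in range(100):
    s.add(2.0)
print(s.quantile(0.25), s.quantile(0.75))  # 1.0 2.0
</syntaxhighlight>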
The algorithms described so far directly approximate the empirical quantiles without any particular assumptions on the data; in essence, the data are simply numbers or, more generally, a set of items that can be ordered. These are computer-science-derived methods. Another class of algorithms exists which assumes that the data are realizations of a random process. These are statistics-derived methods, in particular sequential nonparametric estimation algorithms. There are a number of such algorithms, such as those based on stochastic approximation<ref name="tierney1983">{{cite journal |last1=Tierney |first1=Luke |title=A space-efficient recursive procedure for estimating a quantile of an unknown distribution |journal=SIAM Journal on Scientific and Statistical Computing |date=1983 |volume=4 |issue=4 |pages=706–711 |doi=10.1137/0904048 |url=https://doi.org/10.1137/0904048}}</ref><ref name="chen2000">{{cite book |last1=Chen |first1=Fei |last2=Lambert |first2=Diane |last3=Pinheiro |first3=Jose |chapter=Incremental quantile estimation for massive tracking |title=Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining |date=2000 |pages=516–522 |doi=10.1145/347090.347195 |isbn=1-58113-233-6 |chapter-url=https://doi.org/10.1145/347090.347195}}</ref> or Hermite series estimators.<ref name="stephanou2017">{{cite journal |last1=Stephanou |first1=Michael |last2=Varughese |first2=Melvin |last3=Macdonald |first3=Iain |title=Sequential quantiles via Hermite series density estimation |journal=Electronic Journal of Statistics |date=2017 |volume=11 |issue=1 |pages=570–607 |doi=10.1214/17-EJS1245 |url=https://doi.org/10.1214/17-EJS1245 |arxiv=1507.05073}}</ref> These statistics-based algorithms typically have constant update time and space complexity, but they provide different error bound guarantees than the computer-science-derived methods and make more assumptions. The statistics-based algorithms do, however, present certain advantages, particularly in the non-stationary streaming setting, i.e. with time-varying data. The algorithms of both classes, along with their respective advantages and disadvantages, have recently been surveyed.<ref name="StephanouHermiter2022">{{cite journal |last1=Stephanou |first1=M. |last2=Varughese |first2=M. |title=Hermiter: R package for sequential nonparametric estimation |journal=Computational Statistics |year=2023 |volume=39 |issue=3 |pages=1127–1163 |doi=10.1007/s00180-023-01382-0 |arxiv=2111.14091 |s2cid=244715035}}</ref>
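As an illustration of the statistics-derived approach, the following minimal sketch implements a Robbins–Monro-type stochastic approximation recursion for a single quantile. The step-size sequence c/n and the constant c are illustrative choices, not those prescribed by the cited references, and the simple gain sequence used here converges more slowly than tuned variants.

<syntaxhighlight lang="python">
import random


def streaming_quantile(stream, p, c=1.0):
    """Recursively estimate the p-quantile of a stream of observations."""
    q = None
    for n, x in enumerate(stream, start=1):
        if q is None:
            q = x                              # initialise with the first value
            continue
        step = c / n                           # decreasing step size
        q += step * (p - (1.0 if x <= q else 0.0))
    return q


# Estimate the 0.9 quantile of a standard normal stream; the true value
# is about 1.2816, and the estimate approaches it for long streams.
random.seed(0)
stream = (random.gauss(0.0, 1.0) for _ in range(100_000))
print(streaming_quantile(stream, 0.9))
</syntaxhighlight>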