Zipf's law
===Word frequencies in natural languages===
[[File:Zipf 30wiki en labels.png|thumb|Zipf's law plot for the first 10 million words in 30 Wikipedias (as of October 2015) in a [[log-log]] scale]]

In many texts in human languages, word frequencies approximately follow a Zipf distribution with exponent {{mvar|s}} close to 1; that is, the most common word occurs about {{mvar|n}} times as often as the {{mvar|n}}-th most common one.

The actual rank-frequency plot of a natural language text deviates to some extent from the ideal Zipf distribution, especially at the two ends of the range. The deviations may depend on the language, on the topic of the text, on the author, on whether the text was translated from another language, and on the spelling rules used.{{citation needed|date=May 2023}} Some deviation is inevitable because of [[sampling error]].

At the low-frequency end, where the rank approaches {{mvar|N}}, the plot takes a staircase shape, because each word can occur only an integer number of times.
{{clear}}
<gallery mode="packed" heights="200px" caption="Zipf's law plots for several languages">
Zipf-euro-4 German, Russian, French, Italian, Medieval English.svg|[[German language|German]] (1669), [[Russian language|Russian]] (1972), [[French language|French]] (1865), [[Italian language|Italian]] (1840), and Medieval English (1460)
Zipf-semi-1 Arabic, Geez, Hebraic.svg|[[Ge'ez language|Ge'ez]] (14th century), [[Arabic language|Arabic]] (7th century), [[Hebrew language|Hebrew]] (500–800), all with vowels
Zipf-asia-1 Chinese, Tibetan, Vietnamese.svg|[[Lhasa Tibetan]], [[Chinese language|Chinese]], [[Vietnamese language|Vietnamese]], all with separated syllables
Zipf-heot-0 Hebrew - Books of the Torah.svg|First five books of the [[Old Testament]] (the [[Torah]]) in Hebrew, with vowels
Zipf-laot-0 Vulgate Pentateuch books.svg|First five books of the [[Old Testament]] (the [[Pentateuch]]) in the Latin [[Vulgate]] version
Zipf-lant-0 Vulgate Gospels.svg|First four books of the [[New Testament]] (the [[Gospels]]) in the Latin [[Vulgate]] version
</gallery>
[[File:Wikipedia-n-zipf.png|thumb|upright=1.1|A log-log plot of word frequency in the English Wikipedia (27 November 2006). Zipf's law corresponds to the middle linear portion of the curve, roughly following the green {{nobr|(<math display="inline" alt="inverse of x">\frac{1}{x}</math>) line,}} while the early part is closer to the magenta {{nobr|(<math display="inline" alt="inverse of the square root of x">\frac{1}{ \sqrt{x} }</math>) line}} and the later part is closer to the cyan {{nobr|(<math display="inline" alt="inverse of squared x">\frac{1}{ x^2 }</math>) line.}} <!--These lines correspond to three distinct parameterizations of the Zipf–Mandelbrot distribution, overall a [[broken power law]] with three segments: a head, middle, and tail.{{citation needed|date=November 2024}}--> Other descriptions highlight two segments or "regimes" instead.<ref name=cancho2001/><ref name=dm2002/>]]

In some [[Romance languages]], the frequencies of the dozen or so most frequent words deviate significantly from the ideal Zipf distribution, because those words include articles inflected for [[grammatical gender]] and [[grammatical number|number]].{{citation needed|date=May 2023}}

In many East Asian languages, such as [[Chinese language|Chinese]], [[Lhasa Tibetan|Tibetan]], and [[Vietnamese language|Vietnamese]], each [[morpheme]] (word or word piece) consists of a single [[syllable]]; an English word is often translated as a compound of two such syllables. The rank-frequency table for those morphemes deviates significantly from the ideal Zipf law, at both ends of the range.{{citation needed|date=May 2023}}

Even in English, the deviations from the ideal Zipf's law become more apparent as one examines large collections of texts. Analysis of a corpus of 30,000 English texts showed that only about 15% of the texts in it have a good fit to Zipf's law.
Slight changes in the definition of Zipf's law can raise this percentage to nearly 50%.<ref name=more2016/>

In these cases, the observed frequency-rank relation can be modeled more accurately by separate Zipf–Mandelbrot distributions for different subsets or subtypes of words. This is the case for the frequency-rank plot of the first 10 million words of the English Wikipedia. In particular, the frequencies of the closed class of [[function word]]s in English are better described with {{mvar|s}} lower than 1, while open-ended vocabulary growth with document size and corpus size requires {{mvar|s}} greater than 1 for convergence of the [[harmonic series (mathematics)|generalized harmonic series]].<ref name=Powers1998/>

[[File:Zipf-code-1 English plain, book-coded, Vigenere coded.svg|thumb|left|Wells's ''The War of the Worlds'' in plain text, in a [[book code]], and in a [[Vigenère cipher]]]]

When a text is encrypted in such a way that every occurrence of each distinct plaintext word is always mapped to the same encrypted word (as in the case of simple [[substitution cipher]]s, like the [[Caesar cipher]]s, or simple [[codebook]] ciphers), the frequency-rank distribution is not affected. On the other hand, if separate occurrences of the same word may be mapped to two or more different words (as happens with the [[Vigenère cipher]]), the frequency-rank distribution will typically have a flat part at the high-frequency end.{{citation needed|date=May 2023}}

====Applications====
Zipf's law has been used for extraction of parallel fragments of texts out of comparable corpora.<ref name=moha2016/> [[Laurance Doyle]] and others have suggested the application of Zipf's law for detection of [[alien language]] in the [[search for extraterrestrial intelligence]].<ref name=doyle20162>{{cite journal |last=Doyle |first=L.R. |author-link=Laurance Doyle |date=2016-11-18 |df=dmy-all |title=Why alien language would stand out among all the noise of the universe |journal=[[Nautilus Quarterly]] |url=http://cosmos.nautil.us/feature/54/listening-for-extraterrestrial-blah-blah |url-status=dead |lang=en |archive-url=https://web.archive.org/web/20200729120031/http://cosmos.nautil.us/feature/54/listening-for-extraterrestrial-blah-blah |archive-date=2020-07-29 |access-date=2020-08-30}}</ref><ref name=kersh20212>{{cite book |last=Kershenbaum |first=Arik |author-link=Arik Kershenbaum |date=2021-03-16 |df=dmy-all |title=The Zoologist's Guide to the Galaxy: What animals on Earth reveal about aliens – and ourselves |title-link=The Zoologist's Guide to the Galaxy |publisher=Penguin |isbn=978-1-9848-8197-7 |pages=251–256 |language=en |oclc=1242873084}}</ref>

The frequency-rank word distribution is often characteristic of the author and changes little over time. This feature has been used in the analysis of texts for authorship attribution.<ref name=droo2016/><ref name=droo2019/>

The word-like sign groups of the 15th-century codex [[Voynich manuscript|Voynich Manuscript]] have been found to satisfy Zipf's law, suggesting that the text is most likely not a hoax but rather written in an obscure language or cipher.<ref name=boyle2022/><ref name=mont2013/>
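
The rank-frequency analysis that underlies the observations in this section can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the toy corpus is hypothetical (it is built so that the word at rank {{mvar|r}} occurs about 120/{{mvar|r}} times), the helper names <code>rank_frequencies</code>, <code>estimate_exponent</code> and <code>caesar</code> are invented for this sketch, and the exponent {{mvar|s}} is estimated by a simple least-squares fit on the log-log plot, which is a crude estimator rather than a method taken from the cited sources. It also checks the point made above that a cipher mapping every occurrence of a word to the same coded word leaves the frequency-rank distribution unchanged.

```python
import math
from collections import Counter


def rank_frequencies(words):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)


def estimate_exponent(freqs):
    """Estimate s as the negated least-squares slope of log(freq) vs log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


def caesar(word, shift=3):
    """Shift each lowercase ASCII letter; other characters are left unchanged."""
    return "".join(
        chr((ord(c) - 97 + shift) % 26 + 97) if "a" <= c <= "z" else c
        for c in word
    )


# Hypothetical toy corpus: the word at rank r occurs round(120 / r) times.
vocabulary = ["the", "of", "and", "to", "in", "is", "that", "for"]
words = [w for r, w in enumerate(vocabulary, start=1)
         for _ in range(round(120 / r))]

plain = rank_frequencies(words)
coded = rank_frequencies([caesar(w) for w in words])  # same word -> same code
s_hat = estimate_exponent(plain)

print("rank-frequency table:", plain)
print("unchanged by the cipher:", plain == coded)
print("estimated exponent s:", round(s_hat, 3))
```

On real text the fit is usually restricted to the middle range of ranks, since both ends of the plot deviate from the ideal law as described above.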