==== Statistical patterns ====
The text consists of over 170,000 characters,<ref name=Schmeh /> with spaces dividing the text into about 35,000 groups of varying length, usually referred to as "words" or "word tokens" (37,919 by one count); 8,114 of those words [[Hapax legomenon|are considered unique]] "word types".<ref>{{cite web |last1=Reddy |first1=Sravana |last2=Knight |first2=Kevin |year=2011 |title=What we know about the Voynich manuscript |pages=1–9 |url=http://www.isi.edu/natural-language/people/voynich-11.pdf |access-date=11 June 2016 |website=www.isi.edu |archive-url=https://web.archive.org/web/20110826041911/http://www.isi.edu/natural-language/people/voynich-11.pdf |archive-date=26 August 2011}}</ref> The structure of these words seems to follow [[phonological]] or [[orthography|orthographic]] laws of some sort; for example, certain characters must appear in each word (like English [[vowel]]s), some characters never follow others, and some may be doubled or tripled while others may not. The distribution of letters within words is also rather peculiar: some characters occur only at the beginning of a word, some only at the end (like Greek [[ς]]), and some always in the middle section.<ref>{{cite web |last=Zandbergen |first=René |title=Analysis Section ( 2/5 ) – Character statistics |url=http://www.voynich.nu/a2_char.html |website=Voynich.nu |access-date=31 March 2018|archive-date=17 August 2019|archive-url=https://web.archive.org/web/20190817033142/http://www.voynich.nu/a2_char.html |url-status=live}}</ref>

Many researchers have commented upon the highly regular structure of the words.<ref>{{cite web |last=Zandbergen |first=René |date=26 December 2015 |title=Analysis Section ( 3/5 ) – Word structure |url=http://www.voynich.nu/a3_para.html |website=Voynich.nu |access-date=8 June 2016|archive-date=2 June 2016|archive-url=https://web.archive.org/web/20160602032135/http://voynich.nu/a3_para.html |url-status=live}}</ref> Professor Gonzalo Rubio, an expert in ancient languages at [[Pennsylvania State University]], stated:

{{blockquote|The things we know as {{em|grammatical markers}} – things that occur commonly at the beginning or end of words, such as 's' or 'd' in our language, and that are used to express grammar, never appear in the middle of 'words' in the Voynich manuscript. That's unheard of for any Indo-European, Hungarian, or Finnish language.<ref name=Telegraph>{{cite web |last=Day |first=Michael |date=24 May 2011 |title=The Voynich Manuscript: will we ever be able to read this book? |url=https://www.telegraph.co.uk/news/science/science-news/8532458/The-Voynich-Manuscript-will-we-ever-be-able-to-read-this-book.html |archive-url=https://ghostarchive.org/archive/20220112/https://www.telegraph.co.uk/news/science/science-news/8532458/The-Voynich-Manuscript-will-we-ever-be-able-to-read-this-book.html |archive-date=12 January 2022 |url-access=subscription |url-status=live |website=[[The Daily Telegraph|The Telegraph]] |access-date=8 June 2016}}{{cbignore}}</ref>}}

Stephan Vonfelt studied statistical properties of the distribution of letters and their correlations (properties which can be vaguely characterised as rhythmic resonance, alliteration, or assonance) and found that in that respect Voynichese is more similar to the [[Mandarin Chinese]] {{transliteration|zh|[[pinyin]]}} text of the ''[[Records of the Grand Historian]]'' than to the text of works from European languages, although the numerical differences between Voynichese and Mandarin Chinese pinyin appear larger than those between Mandarin Chinese pinyin and European languages.<ref>{{cite web |last=Vonfelt |first=S. |year=2014 |title=The strange resonances of the Voynich manuscript |website=[[Free (ISP)]] Graphométrie |url=http://graphometrie.free.fr/publications/Voynich_en.pdf |archive-url=https://web.archive.org/web/20180928122339/http://graphometrie.free.fr/publications/Voynich_en.pdf |archive-date=28 September 2018}}</ref>{{better source needed|date=November 2019|reason=Has anyone else done a similar analysis and reached a similar result?}}

Practically no words have fewer than two letters or more than ten.<ref name="Schmeh">{{cite magazine |last=Schmeh |first=Klaus |date=January–February 2011 |title=The Voynich manuscript: The book nobody can read |magazine=[[Skeptical Inquirer]] |volume=35 |issue=1 |access-date=8 June 2016 |url=http://www.csicop.org/si/show/the_voynich_manuscript_the_book_nobody_can_read |archive-date=16 September 2018|archive-url=https://web.archive.org/web/20180916085805/https://www.csicop.org/si/show/the_voynich_manuscript_the_book_nobody_can_read |url-status=live}}</ref> Some words occur in only certain sections, or in only a few pages; others occur throughout the manuscript. Few repetitions occur among the thousand or so labels attached to the illustrations. There are instances where the same common word appears up to five times in a row<ref name="Schmeh" /> (see [[Zipf's law]]). Words that differ by only one letter also repeat with unusual frequency, causing single-substitution alphabet decipherings to yield babble-like text. In 1962, [[cryptanalyst]] [[Elizebeth Friedman]] described such statistical analyses as "doomed to utter frustration".{{refn|
{{cite news |last=Friedman |first=Elizebeth |author-link=Elizebeth Friedman |date=5 August 1962 |title=The most mysterious ms. – still an enigma |newspaper=[[Washington Post]] |pages=E1, E5 |url=http://www.mgh-bibliothek.de//dokumente/b/b043178.pdf}} – quoted by D'Imperio (1978)<ref name=D-Imperio-1978 />{{rp|page=27 (§&#x202f;4.4)}}
}}

In 2014, a team led by Diego Amancio of the [[University of São Paulo]] published a study using statistical methods to analyse the relationships of the words in the text. Instead of trying to find the meaning, Amancio's team looked for connections and clusters of words. By measuring the frequency and intermittence of words, Amancio claimed to identify the text's [[keyword (linguistics)|keywords]] and produced three-dimensional models of the text's structure and word frequencies. The team concluded that, in 90% of cases, the Voynich systems are similar to those of other known books, indicating that the text is in an actual language, not random [[gibberish]].<ref name=Amancio-etal-2013 />

{{blockquote|The use of the framework was exemplified with the analysis of the Voynich manuscript, with the final conclusion that it differs from a random sequence of words, being compatible with natural languages. Even though our approach is not aimed at deciphering Voynich, it was capable of providing keywords that could be helpful for decipherers in the future.<ref name=Amancio-etal-2013 />}}

Linguists [[Claire Bowern]] and Luke Lindemann have applied statistical methods to the Voynich manuscript, comparing it to other languages and encodings of languages, and have found both similarities and differences in statistical properties. The predictability of character sequences in a language can be measured with a metric called h2, or second-order conditional entropy. Natural languages tend to have an h2 between 3 and 4, but Voynichese has much more predictable character sequences, and an h2 around 2. However, at higher levels of organisation, the Voynich manuscript displays properties similar to those of natural languages. Based on this, Bowern dismisses theories that the manuscript is gibberish.<ref name="Miller">{{cite journal |last1=Miller |first1=Greg |title=Can statistics help crack the mysterious Voynich manuscript? |journal=[[Knowable Magazine]] |date=20 August 2021 |doi=10.1146/knowable-081921-1 |url=https://knowablemagazine.org/article/society/2021/can-statistics-help-crack-mysterious-voynich-manuscript |access-date=31 August 2021 |doi-access=free |archive-date=29 November 2021 |archive-url=https://web.archive.org/web/20211129071518/https://knowablemagazine.org/article/society/2021/can-statistics-help-crack-mysterious-voynich-manuscript |url-status=live |url-access=subscription}}</ref> The manuscript is likely to be an encoded natural language or a constructed language. Bowern also concludes that the statistical properties of the Voynich manuscript are not consistent with the use of a [[substitution cipher]] or [[polyalphabetic cipher]].<ref name="Bowern">{{cite journal |last1=Bowern |first1=Claire L. |last2=Lindemann |first2=Luke |title=The Linguistics of the Voynich Manuscript |journal=Annual Review of Linguistics |date=14 January 2021 |volume=7 |issue=1 |pages=285–308 |doi=10.1146/annurev-linguistics-011619-030613 |s2cid=228894621 |url=https://www.annualreviews.org/doi/10.1146/annurev-linguistics-011619-030613 |access-date=30 August 2021 |archive-date=2 September 2021 |archive-url=https://web.archive.org/web/20210902103307/https://www.annualreviews.org/doi/10.1146/annurev-linguistics-011619-030613 |url-status=live}}</ref>
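
As an illustration of the h2 metric mentioned above, the following minimal sketch estimates the entropy, in bits, of a character given the character immediately before it. It assumes that h2 is defined in this way and that the text has already been transliterated into a plain string; the function name <code>conditional_entropy_h2</code> is hypothetical and is not taken from the cited studies.

```python
import math
from collections import Counter


def conditional_entropy_h2(text: str) -> float:
    """Estimate H(next char | previous char) in bits from bigram counts.

    Assumption: h2 is the entropy of a character conditioned on the one
    character preceding it. The transliteration alphabet and the handling
    of spaces are left to the caller.
    """
    if len(text) < 2:
        return 0.0

    # Count every adjacent character pair, and every character that
    # appears as the first member of a pair.
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total_pairs = sum(pairs.values())

    entropy = 0.0
    for (first, _second), count in pairs.items():
        p_pair = count / total_pairs      # joint probability of the pair
        p_cond = count / firsts[first]    # P(second | first)
        entropy -= p_pair * math.log2(p_cond)
    return entropy
```

A fully predictable string such as <code>"abababab"</code> gives a value of 0, while <code>"aab"</code> gives 1 bit because "a" is followed equally often by "a" and "b". Lower values correspond to more predictable character sequences, which is the sense in which Voynichese is described as having a low h2.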

As noted in Bowern's review, multiple scribes or "hands" may have written the manuscript, possibly using two methods of encoding at least one natural language.<ref name="Bowern" /><ref name="Currier1976">{{cite web |last1=Currier |first1=PH |last2=Zandbergen |first2=R |title=Papers on the Voynich manuscript. The Voynich Manuscript |url=http://www.voynich.nu/extra/curr_main.html |access-date=30 August 2021 |archive-date=13 May 2021 |archive-url=https://web.archive.org/web/20210513035016/http://www.voynich.nu/extra/curr_main.html |url-status=live}}</ref><ref name="Davis2020">{{cite journal |last1=Davis |first1=Lisa Fagin |title=How many glyphs and how many scribes? Digital paleography and the Voynich Manuscript |journal=Manuscr. Stud. |date=2020 |volume=5 |pages=164–80 |doi=10.1353/mns.2020.0011 |s2cid=218957807 |url=https://repository.upenn.edu/cgi/viewcontent.cgi?article=1082&context=mss_sims |access-date=30 August 2021 |archive-date=26 September 2021 |archive-url=https://web.archive.org/web/20210926115210/https://repository.upenn.edu/cgi/viewcontent.cgi?article=1082&context=mss_sims |url-status=live}}</ref><ref name="Reddy2011">{{cite book |last1=Reddy |chapter=What we know about the Voynich manuscript |first1=Sravana |last2=Knight |first2=Kevin |title=Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities |date=2011 |publisher=Assoc. Comput. Linguist. |location=Stroudsburg, PA |pages=78–86}}</ref> The "language" Voynich A appears in the herbal and pharmaceutical parts of the manuscript. The "language" known as Voynich B appears in the [[balneological]] section, some parts of the medicinal and herbal sections, and the astrological section. The most common vocabulary items of Voynich A and Voynich B are substantially different. [[Topic modeling]] of the manuscript suggests that pages identified as written by a particular scribe may relate to a different topic.<ref name="Bowern" />

In terms of [[Morphology (linguistics)|morphology]], if visual spaces in the manuscript are assumed to indicate word breaks, there are consistent patterns that suggest a three-part word structure of prefix, root or midfix, and suffix. Certain characters and character combinations are more likely to appear in particular fields. There are minor variations between Voynich A and Voynich B. The predictability of certain letters in a relatively small number of combinations in certain parts of words appears to explain the low entropy (h2) of Voynichese. In the absence of obvious punctuation, some variants of the same word appear to be specific to typographical positions, such as the beginning of a paragraph, line, or sentence.<ref name="Bowern" />

The Voynich word frequencies of both variants appear to conform to a [[Zipfian distribution]], supporting the idea that the text has linguistic meaning. This has implications for the encoding methods most likely to have been used, since some forms of encoding interfere with the Zipfian distribution. The proportional frequency of the ten most common words is similar to that of the Semitic, Iranian, and Germanic languages. Another measure of morphological complexity, the Moving-Average Type–Token Ratio (MATTR) index, is similar to that of Iranian, Germanic, and Romance languages.<ref name="Bowern" />
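
The two word-level measures named above can be sketched as follows. This is an illustrative sketch only: it assumes the text is already split into a list of word tokens, takes "proportional frequency of the ten most common words" to mean their combined share of all tokens, and uses an arbitrary default window of 500 tokens for MATTR. The function names and the window size are assumptions, not details taken from the cited review.

```python
from collections import Counter


def top_k_share(tokens, k=10):
    """Share of all tokens accounted for by the k most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(n for _word, n in counts.most_common(k)) / len(tokens)


def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio.

    Slides a fixed-size window over the tokens one step at a time and
    averages the type/token ratio of every window. The window size here
    is an assumed default, not a value from the source.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain ratio.
        return len(set(tokens)) / len(tokens)

    counts = Counter(tokens[:window])
    type_total = len(counts)  # number of types in the first window
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        type_total += len(counts)

    n_windows = len(tokens) - window + 1
    return type_total / (window * n_windows)
```

Because every window has the same length, MATTR is less sensitive to overall text length than a plain type–token ratio, which makes it more suitable for comparing texts of different sizes.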