==Corpus Development in Marathi==

A [[Text corpus|text corpus]] and the methods of [[Corpus linguistics|corpus linguistics]] make it possible to study systematically how texts, sentences, or words from written or spoken language have been used and how they have changed over time. [https://archive.org/details/LSIV0-V11/LSI-V7/ Volume VII: 'Indo-Aryan Languages (Southern Group)'] of the '[[Linguistic Survey of India]]' by [[George Abraham Grierson]] describes the first systematic and structured attempt to document Marathi language data.

'''Corpora in Marathi'''

Several attempts have been made to create a [[Corpus of Marathi|corpus of Marathi]]. One of the first efforts to build a corpus from Indian text was the Kolhapur Corpus of Indian English<ref>{{Cite web |last=Shastri |first=S.V. |date=1986 |title=The Kolhapur Corpus of Indian English |url=http://korpus.uib.no/icame/manuals/KOLHAPUR/INDEX.HTM#materia.}}</ref> (Shastri, 1986). Although it was developed at a university in [[Maharashtra]], it covers Indian English rather than Marathi. The [[IIT Bombay]] WordNet<ref>{{Cite book |last=Bhattacharyya |first=Pushpak |title=IndoWordNet. (in Proceedings of the Seventh International Conference on Language Resources and Evaluation LREC'10) |publisher=European Language Resources Association (ELRA). |year=2010 |isbn=978-2-9517408-6-0 |location=Valletta, Malta |pages=3785–3792 |language=English}}</ref> (IndoWordNet; Bhattacharyya, 2010) project for Indian languages includes Marathi, but a WordNet does not provide word counts that could be used for further data analysis. The raw-text corpus of Marathi<ref>{{Cite web |title=A Gold Standard Marathi Raw Text Corpus |url=https://data.ldcil.org/a-gold-standard-marathi-raw-text-corpus |access-date=9 September 2023 |website=data.ldcil.org}}</ref> (Ramamoorthy et al., 2019a), developed at the [[Central Institute of Indian Languages]], [[Mysore]], is based on pages sampled from selected books. A corpus-based linguistic study at the University of Mumbai explores language contact between English and Marathi by compiling and analysing an overarching corpus of English loan-words used in Marathi between 2001 and 2020. The study also investigates the attitudes of Marathi speakers towards English loan-words in contemporary Marathi, in an attempt to understand their motivations for borrowing English words (Doibale, 2022).<ref>{{Cite journal |last=Doibale |first=Kranti |date=2022 |title=A Corpus-Based Linguistic Study of English Loan-Words in Contemporary Marathi |journal=University |url=http://hdl.handle.net/10603/487393 |access-date=9 September 2023 |hdl=10603/487393}}</ref>

Belhekar and Bhargava (2023)<ref name=":0">{{Cite journal |last1=Belhekar |first1=Vivek |last2=Bhargava |first2=Radhika |date=December 2023 |title=Development of word count data corpus for Hindi and Marathi literature |url=https://linkinghub.elsevier.com/retrieve/pii/S2666799123000308 |journal=Applied Corpus Linguistics |language=en |volume=3 |issue=3 |pages=100070 |doi=10.1016/j.acorp.2023.100070|s2cid=261150616 }}</ref> at the [[University of Mumbai]] produced the first collection of Marathi word counts (Marathi WordCorp). They applied the [[Bag-of-words model|bag-of-words]] (BoW) model to more than 700 complete works of literature to build the 1-gram (single-word) Marathi WordCorp.
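
As an illustration of what a 1-gram bag-of-words count involves, the following minimal Python sketch counts single-word tokens in a few Marathi sentences. It is not the procedure used for Marathi WordCorp: the function name <code>unigram_counts</code>, the regular-expression tokeniser, and the sample sentences are assumptions made for this example only.

```python
import re
from collections import Counter

# Assumption: a token is an unbroken run of Devanagari characters (U+0900-U+097F),
# excluding the danda (U+0964) and double danda (U+0965) punctuation marks.
TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def unigram_counts(texts):
    """Return a Counter of single-word (1-gram) frequencies over all texts."""
    counts = Counter()
    for text in texts:
        # Bag-of-words: word order is discarded, only occurrences are counted.
        counts.update(TOKEN.findall(text))
    return counts


# Hypothetical sample sentences, not taken from any corpus.
sample = ["मी मराठी बोलतो।", "मराठी भाषा सुंदर आहे", "मी मराठी शिकतो"]
counts = unigram_counts(sample)

for word, freq in counts.most_common(3):
    print(word, freq)
```

A real pipeline would also need to handle OCR noise, spelling variants, and characters outside the Devanagari block, none of which this sketch addresses.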

The [https://books.google.com/ngrams/ Google Books Ngram Viewer] (Michel et al., 2011)<ref>{{Cite journal |last1=Michel |first1=Jean-Baptiste |last2=Shen |first2=Yuan Kui |last3=Aiden |first3=Aviva Presser |last4=Veres |first4=Adrian |last5=Gray |first5=Matthew K. |last6=((The Google Books Team)) |last7=Pickett |first7=Joseph P. |last8=Hoiberg |first8=Dale |last9=Clancy |first9=Dan |last10=Norvig |first10=Peter |last11=Orwant |first11=Jon |last12=Pinker |first12=Steven |last13=Nowak |first13=Martin A. |last14=Aiden |first14=Erez Lieberman |date=14 January 2011 |title=Quantitative Analysis of Culture Using Millions of Digitized Books |journal=Science |language=en |volume=331 |issue=6014 |pages=176–182 |doi=10.1126/science.1199644 |pmid=21163965 |pmc=3279742 |bibcode=2011Sci...331..176M |issn=0036-8075}}</ref> is a tool that shows how the frequency of n-grams has changed over a chosen period. It has no database for Indian languages. The Indian Languages Word Corpus<ref>{{Cite web |url=https://indianlangwordcorp.shinyapps.io/ILWC/ |title=Indian Languages Word Corpus |access-date=28 September 2023 |website=indianlangwordcorp.shinyapps.io}}</ref> ([https://indianlangwordcorp.shinyapps.io/ILWC/ ILWC]) WebApp, developed by Belhekar and Bhargava,<ref name=":0" /> shows how often words were used in each decade, from before 1920 to 2020. A limitation of this approach is that it provides only raw OCR data, so researchers themselves have to "combine and collapse frequencies of correctly and incorrectly recognised words" (p.&nbsp;2).<ref name=":0" />
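
The combining and collapsing step mentioned in the quotation can be sketched as follows. This is a hedged illustration only: the row layout <code>(word, decade, count)</code>, the function <code>collapse_variants</code>, the variant map, and all counts are hypothetical and do not reflect the actual ILWC data format.

```python
from collections import defaultdict


def collapse_variants(rows, variant_map):
    """Sum per-decade counts after mapping each variant to its canonical form.

    rows        -- iterable of (word, decade, count) tuples (assumed layout)
    variant_map -- dict mapping a misrecognised form to the intended word
    """
    totals = defaultdict(int)
    for word, decade, count in rows:
        canonical = variant_map.get(word, word)  # unknown words map to themselves
        totals[(canonical, decade)] += count
    return dict(totals)


# Hypothetical data: the second row stands in for an OCR misreading of the first word.
rows = [
    ("मराठी", "1950s", 40),
    ("मराठो", "1950s", 3),
    ("मराठी", "1960s", 55),
]
variant_map = {"मराठो": "मराठी"}

collapsed = collapse_variants(rows, variant_map)
for (word, decade), count in sorted(collapsed.items()):
    print(word, decade, count)
```

The variant map has to be built by the researcher, which is the manual effort the limitation refers to.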

'''Statistical Models for Marathi Corpora'''

Statistical models have been evaluated on Marathi corpora and text collections. For Marathi WordCorp, the fit to Zipf's law has a reported y-intercept of 12.49 and a coefficient of 0.89, which indicates that Zipf's law holds for the Marathi language.<ref name=":0" /> The coefficients are also taken to show that the number of words and texts in the corpus is sufficient. For Heaps' law, the reported intercept for the Marathi word corpus is 2.48 and the coefficient is 0.73.<ref name=":0" /> These values indicate that Marathi writing contains more unique words than would be expected. Possible reasons include the number of letters in the script (36 consonant letters and 16 initial-vowel letters, with each consonant taking 14 forms when paired with vowels), the orthographic features of the Devanagari script (for example, the same word can be written in different ways), the use of consonant clusters (jodakshar), and the number of suffixes a word can take.
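
Intercepts and coefficients of this kind are commonly obtained by fitting a straight line to log-transformed data: log frequency against log rank for Zipf's law, and log vocabulary size against log number of tokens for Heaps' law. The Python sketch below shows one such fit using ordinary least squares. It is an assumption-laden illustration, not the cited study's procedure: the logarithm base (natural log here), the sign convention for the coefficient, the function names, and the synthetic data are all chosen for this example.

```python
import math


def fit_loglog(xs, ys):
    """Least-squares fit of log(y) = intercept + slope * log(x); returns (intercept, slope)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def zipf_fit(frequencies):
    """Rank frequencies from most to least frequent and fit log(freq) on log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    return fit_loglog(range(1, len(ranked) + 1), ranked)


def heaps_curve(tokens):
    """Return (tokens seen so far, distinct words so far) after each token."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((i, len(seen)))
    return curve


# Synthetic frequencies that follow f = 1000 / rank exactly (not real corpus data).
synthetic = [1000.0 / r for r in range(1, 51)]
zipf_intercept, zipf_slope = zipf_fit(synthetic)

# Hypothetical token stream for the Heaps' law curve.
tokens = "मी मराठी बोलतो मी मराठी शिकतो".split()
curve = heaps_curve(tokens)
heaps_intercept, heaps_slope = fit_loglog([n for n, _ in curve], [v for _, v in curve])

print(zipf_intercept, zipf_slope)
print(heaps_intercept, heaps_slope)
```

With this convention the Zipf slope is negative, so a coefficient reported as a positive number would correspond to its magnitude. A Heaps' law slope between 0 and 1 means the vocabulary keeps growing, but more slowly than the text itself.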