Marathi language
==Corpus Development in Marathi==
[[Text corpus|Text corpora]] and [[corpus linguistics]] show, in a systematic way, how texts, sentences, or words from written or spoken language have been used and how they have changed over time. The [https://archive.org/details/LSIV0-V11/LSI-V7/ Volume VII: 'Indo-Aryan Languages (Southern Group)'] of the '[[Linguistic Survey of India]]' by [[George Abraham Grierson]] describes the first systematic and structured attempt to document Marathi language data.

'''Corpora in Marathi'''

Several attempts have been made to create a [[Corpus of Marathi|corpus of Marathi]]. One of the first efforts to build a corpus from Indian text was the Kolhapur Corpus of Indian English<ref>{{Cite web |last=Shastri |first=S.V. |date=1986 |title=The Kolhapur Corpus of Indian English |url=http://korpus.uib.no/icame/manuals/KOLHAPUR/INDEX.HTM#materia.}}</ref> (Shastri, 1986). Although this corpus was developed at a university in [[Maharashtra]], it covers Indian English rather than Marathi. The [[IIT Bombay]] WordNet<ref>{{Cite book |last=Bhattacharyya |first=Pushpak |title=IndoWordNet. (in Proceedings of the Seventh International Conference on Language Resources and Evaluation LREC'10) |publisher=European Language Resources Association (ELRA). |year=2010 |isbn=978-2-9517408-6-0 |location=Valletta, Malta |pages=3785–3792 |language=English}}</ref> (IndoWordNet; Bhattacharyya, 2010) project for Indian languages includes Marathi. However, WordNet does not provide word counts that could support further data analysis. The raw-text corpus of Marathi<ref>{{Cite web |title=A Gold Standard Marathi Raw Text Corpus |url=https://data.ldcil.org/a-gold-standard-marathi-raw-text-corpus |access-date=9 September 2023 |website=data.ldcil.org}}</ref> (Ramamoorthy et al., 2019a) is based on pages sampled from selected books. This work was carried out at the [[Central Institute of Indian Languages]], [[Mysore]].
A corpus-based linguistic study at the University of Mumbai explores language contact between English and Marathi by compiling and analysing a corpus of English loan-words attested in Marathi between 2001 and 2020. The study also investigates the attitudes of Marathi speakers towards English loan-words in contemporary Marathi, in order to understand their motivations for borrowing English words (Doibale, 2022).<ref>{{Cite journal |last=Doibale |first=Kranti |date=2022 |title=A Corpus-Based Linguistic Study of English Loan-Words in Contemporary Marathi |journal=University |url=http://hdl.handle.net/10603/487393 |access-date=9 September 2023 |hdl=10603/487393}}</ref>

The work at the [[University of Mumbai]] by Belhekar and Bhargava (2023)<ref name=":0">{{Cite journal |last1=Belhekar |first1=Vivek |last2=Bhargava |first2=Radhika |date=December 2023 |title=Development of word count data corpus for Hindi and Marathi literature |url=https://linkinghub.elsevier.com/retrieve/pii/S2666799123000308 |journal=Applied Corpus Linguistics |language=en |volume=3 |issue=3 |pages=100070 |doi=10.1016/j.acorp.2023.100070|s2cid=261150616 }}</ref> provided the first Marathi word-count collection (Marathi WordCorp). The [[Bag-of-words model|bag-of-words]] (BoW) model was used to build the 1-gram (single-word) Marathi WordCorp from more than 700 complete works of literature.

The [https://books.google.com/ngrams/ Google Books Ngram Viewer] (Michel et al., 2011)<ref>{{Cite journal |last1=Michel |first1=Jean-Baptiste |last2=Shen |first2=Yuan Kui |last3=Aiden |first3=Aviva Presser |last4=Veres |first4=Adrian |last5=Gray |first5=Matthew K. |last6=((The Google Books Team)) |last7=Pickett |first7=Joseph P. |last8=Hoiberg |first8=Dale |last9=Clancy |first9=Dan |last10=Norvig |first10=Peter |last11=Orwant |first11=Jon |last12=Pinker |first12=Steven |last13=Nowak |first13=Martin A. |last14=Aiden |first14=Erez Lieberman |date=14 January 2011 |title=Quantitative Analysis of Culture Using Millions of Digitized Books |journal=Science |language=en |volume=331 |issue=6014 |pages=176–182 |doi=10.1126/science.1199644 |pmid=21163965 |pmc=3279742 |bibcode=2011Sci...331..176M |issn=0036-8075}}</ref> is a relatively new tool that shows how the frequency of n-grams has changed over a chosen period. The Google Books Ngram Viewer has no database for Indian languages. The Indian Languages Word Corpus<ref>{{Cite web |url=https://indianlangwordcorp.shinyapps.io/ILWC/ |title=Indian Languages Word Corpus |access-date=28 September 2023 |website=indianlangwordcorp.shinyapps.io}}</ref> ([https://indianlangwordcorp.shinyapps.io/ILWC/ ILWC]) WebApp, developed by Belhekar and Bhargava,<ref name=":0" /> shows how often words are used in each decade, from before 1920 to 2020. A limitation of this method is that it provides only the raw OCR data, leaving researchers to "combine and collapse frequencies of correctly and incorrectly recognised words" (p. 2).<ref name=":0" />

'''Statistical Models for Marathi Corpora'''

Statistical models have been evaluated for Marathi corpora and text collections. For the Marathi corpus (Marathi WordCorp), the reported y-intercept of the Zipf's law fit is 12.49 and the coefficient is 0.89; these values indicate that Zipf's law applies to the Marathi language.<ref name=":0" /> The coefficients also suggest that the number of words and texts used in the corpus is sufficient. The Heaps' law intercept for the Marathi word corpus is 2.48, and the coefficient is 0.73.<ref name=":0" /> The coefficient values indicate that Marathi writings contain more unique words than would be expected.
The higher number of unique words could be due to several factors: the number of letters in the script (36 consonant letters and 16 initial-vowel letters, with each consonant taking 14 forms when combined with vowels), the orthographic features of the Devanagari script (for example, the same word can be written in different ways), the use of consonant clusters (jodakshar), and the number of suffixes a word can take.
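
As an illustration of the quantities discussed above, the following is a minimal sketch of how 1-gram bag-of-words counts can be built and how Zipf's law and Heaps' law coefficients can be estimated by a straight-line fit on log-transformed data. It is not the procedure used by Belhekar and Bhargava: the whitespace tokenisation, the natural logarithm, the ordinary least squares fit, and the function names (<code>unigram_counts</code>, <code>fit_line</code>, <code>zipf_fit</code>, <code>heaps_fit</code>) are all assumptions made for this example, and the sentences in the demo are toy data.

```python
import math
from collections import Counter


def unigram_counts(texts):
    """Bag-of-words 1-gram counts: split on whitespace, ignore word order."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Needs at least two distinct x values, otherwise the slope is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def zipf_fit(counts):
    """Fit log(frequency) = intercept - coefficient * log(rank)."""
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    intercept, slope = fit_line(xs, ys)
    return intercept, -slope  # report the coefficient as a positive number


def heaps_fit(tokens):
    """Fit log(vocabulary size) = intercept + coefficient * log(tokens read)."""
    seen = set()
    xs, ys = [], []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        xs.append(math.log(n))
        ys.append(math.log(len(seen)))
    return fit_line(xs, ys)


if __name__ == "__main__":
    # Toy sentences for illustration only; a real corpus would be far larger.
    texts = ["मी शाळेत जातो", "मी घरी जातो", "ती शाळेत जाते"]
    counts = unigram_counts(texts)
    print(counts.most_common(3))
    print("Zipf (intercept, coefficient):", zipf_fit(counts))
    tokens = [tok for text in texts for tok in text.split()]
    print("Heaps (intercept, coefficient):", heaps_fit(tokens))
```

Under these assumptions, a Heaps' law coefficient closer to 1 means that new unique words keep appearing as more text is read. Because the source does not state the logarithm base or the fitting method, values from this sketch are not directly comparable with the reported intercepts and coefficients.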