
==Natural language processing for Marathi==

Recent research has focused on developing [[natural language processing]] tools for Marathi, and several [[text corpora]] have been published. L3CubeMahaSent<ref>{{cite conference |first1=Atharva |last1=Kulkarni |first2=Meet|last2=Mandhane|first3=Manali|last3=Likhitkar|first4=Gayatri|last4=Kshirsagar|first5=Raviraj|last5=Joshi|title=L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset |conference=Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis |pages=213–220 |date=2021 |location=Online |url=https://aclanthology.org/2021.wassa-1.23.pdf}}</ref> is the first major publicly available Marathi dataset for [[sentiment analysis]]. It contains about 16,000 distinct tweets, each classified as positive, negative, or neutral. L3Cube-MahaNER<ref>{{cite arXiv | last1       =Patil | first1      =Parth | last2       =Ranade | first2      =Aparna | last3       =Sabane | first3      =Maithili | last4       =Litake | first4      =Onkar| last5       =Joshi | first5      =Raviraj | eprint     =2204.06029 | title      =L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models | class      =cs.CL | date       =12 April 2022 }}</ref> is a dataset for [[named-entity recognition]] consisting of 25,000 manually tagged sentences annotated with eight entity classes. There are at least two publicly available datasets for [[hate speech]] detection in Marathi: L3Cube-MahaHate<ref>{{cite arXiv | last1       =Velankar | first1      =Abhishek | last2       =Patil | first2      =Hrushikesh | last3       =Gore | first3      =Amol | last4       =Salunke | first4      =Shubham| last5       =Joshi | first5      =Raviraj | eprint     =2203.13778 | title      =L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models | class      =cs.CL | date       =22 May 2022 }}</ref> and HASOC2021.<ref>{{cite conference |first1=Sandip |last1=Modha |first2=Thomas|last2=Mandl|first3=Gautam Kishore|last3=Shahi|first4=Hiren|last4=Madhu|first5=Shrey|last5=Satapara|first6=Tharindu |last6=Ranasinghe|first7=Marcos|last7=Zampieri|title=Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech |conference=Forum for Information Retrieval Evaluation |pages=1–3 |date=2021 |location=Online |doi=10.1145/3503162.3503176 |url=https://dl.acm.org/doi/pdf/10.1145/3503162.3503176|hdl=2436/624705|hdl-access=free}}</ref>

The HASOC2021 dataset was created for a [[machine learning]] competition on identifying hate, offensive, and profane content in Marathi, co-located with the Forum for Information Retrieval Evaluation (FIRE 2021). Participants submitted 25 solutions based on [[supervised learning]]. The winning teams<ref>{{cite conference |first1=Mayuresh |last1=Nene|first2=Kai|last2=North|first3=Tharindu|last3=Ranasinghe|first4=Marcos|last4=Zampieri| title=Transformer Models for Offensive Language Identification in Marathi |conference=Forum for Information Retrieval Evaluation (Working Notes) (FIRE) |pages=272–281 |date=2021 |location=Online}}</ref><ref>{{cite conference |first1=Anna |last1=Glazkova |first2=Michael|last2=Kadantsev|first3=Maksim|last3=Glazkov| title=Fine-tuning of Pre-trained Transformers for Hate, Offensive, and Profane Content Detection in English and Marathi |conference=Forum for Information Retrieval Evaluation (Working Notes) (FIRE) |pages=52–62 |date=2021 |location=Online |arxiv=2110.12687 }}</ref> used pre-trained [[language models]], namely XLM-RoBERTa and Language-agnostic [[BERT (language model)|BERT]] Sentence Embedding (LaBSE), fine-tuned on the HASOC2021 dataset provided by the organisers. Participants also experimented with adding data in other languages during fine-tuning.