
== Applications ==

Corpora are the main knowledge base in [[corpus linguistics]]. Other notable areas of application include:

* [[Language technology]], [[natural language processing]], [[computational linguistics]]
** The analysis and processing of various types of corpora are the subject of much work in [[computational linguistics]], [[speech recognition]] and [[machine translation]], where corpora are often used to create [[hidden Markov model]]s for part-of-speech tagging and other purposes. Corpora and the [[frequency list]]s derived from them are useful for [[language teaching]]. Corpora can also serve as a type of [[foreign language writing aid]]: through exposure to authentic texts, non-native language users acquire contextualised grammatical knowledge that helps them grasp how sentences are formed in the target language and so write more effectively.<ref name="Yoon">Yoon, H., & Hirvela, A. (2004). [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1073.2322&rep=rep1&type=pdf ESL Student Attitudes toward Corpus Use in L2 Writing]. ''Journal of Second Language Writing, 13''(4), 257–283. Retrieved 21 March 2012.</ref>
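
As an illustration of the two uses mentioned above, the following minimal sketch derives a frequency list from a small tagged corpus and estimates hidden Markov model parameters by relative frequency. The toy corpus, the tag names and the function names (<code>frequency_list</code>, <code>estimate_hmm</code>) are assumptions made for this example; practical taggers train on far larger corpora and apply smoothing, which is omitted here.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data): each sentence is a list of (word, tag) pairs.
TAGGED_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def frequency_list(corpus):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter(word for sentence in corpus for word, _ in sentence)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def estimate_hmm(corpus):
    """Estimate HMM transition and emission probabilities by relative frequency."""
    transitions = defaultdict(Counter)  # previous tag -> counts of following tags
    emissions = defaultdict(Counter)    # tag -> counts of words emitted with that tag
    for sentence in corpus:
        previous = "<s>"  # sentence-start marker (an assumption of this sketch)
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            previous = tag

    def normalise(table):
        # Turn each row of raw counts into a probability distribution.
        return {
            key: {item: n / sum(counter.values()) for item, n in counter.items()}
            for key, counter in table.items()
        }

    return normalise(transitions), normalise(emissions)


freq = frequency_list(TAGGED_CORPUS)
trans, emit = estimate_hmm(TAGGED_CORPUS)
```

Here <code>trans</code> holds the estimated probability of each tag given the preceding tag, and <code>emit</code> the estimated probability of each word given its tag; a decoder such as the Viterbi algorithm would combine the two to tag unseen text.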

* [[Machine translation]]
** Multilingual corpora that have been specially formatted for side-by-side comparison are called ''aligned parallel corpora''. There are two main types of [[parallel corpora]] that contain texts in two languages. In a ''translation corpus'', the texts in one language are translations of texts in the other language. In a ''comparable corpus'', the texts are of the same kind and cover the same content, but they are not translations of each other.<ref>{{cite book | last1 = Wołk | first1 = K. | last2 = Marasek | first2 = K. | title = New Perspectives in Information Systems and Technologies, Volume 1 | chapter = Real-Time Statistical Speech Translation | series = Advances in Intelligent Systems and Computing | date = 7 April 2014 | publisher = Springer | volume = 275 | pages = 107–114 | doi = 10.1007/978-3-319-05951-8_11 | arxiv = 1509.09090 | issn = 2194-5357 | isbn = 978-3-319-05950-1| s2cid = 15361632}}</ref> Before a parallel text can be analysed, it must be aligned, that is, the equivalent text segments (phrases or sentences) in each language must be identified. [[Machine translation]] algorithms for translating between two languages are often trained using parallel fragments comprising a first-language corpus and a second-language corpus that is an element-for-element translation of the first.<ref>{{cite conference |last1=Wolk |first1=Krzysztof |last2=Marasek |first2=Krzysztof |editor1-last=Král |editor1-first=Pavel |editor2-last=Matoušek |editor2-first=Václav |arxiv=1509.08639 |contribution=Tuned and GPU-accelerated parallel data mining from comparable corpora |doi=10.1007/978-3-319-24033-6_4 |pages=32–40 |publisher=Springer |series=Lecture Notes in Computer Science |title=Text, Speech, and Dialogue – 18th International Conference, TSD 2015, Plzeň, Czech Republic, September 14–17, 2015, Proceedings |volume=9302 |year=2015|isbn=978-3-319-24032-9 }}</ref>
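
The alignment step described above can be sketched with a deliberately simplified heuristic that uses only sentence length in characters. It is inspired by length-based aligners but is not any specific published algorithm. The function name <code>align</code>, the cost function and the example sentences are assumptions made for illustration.

```python
def align(source, target):
    """Align two lists of sentences using character length as the only cue.

    Allowed steps are 1-1, 2-1 and 1-2 (two sentences merged on one side).
    Returns a list of (source_sentences, target_sentences) pairs.
    """
    inf = float("inf")
    steps = [(1, 1), (2, 1), (1, 2)]
    n, m = len(source), len(target)
    # cost[i][j] is the cheapest alignment of source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in steps:
                if i + di > n or j + dj > m:
                    continue
                src_len = sum(len(s) for s in source[i:i + di])
                tgt_len = sum(len(t) for t in target[j:j + dj])
                # Penalise length mismatch, plus a small fixed penalty for merges.
                merge_penalty = 0 if (di, dj) == (1, 1) else 1
                new_cost = cost[i][j] + abs(src_len - tgt_len) + merge_penalty
                if new_cost < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = new_cost
                    back[i + di][j + dj] = (di, dj)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed steps")
    # Follow the back-pointers from the end to recover the aligned segments.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((source[i - di:i], target[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs


# Hypothetical English-German example in which two source sentences
# correspond to a single target sentence.
SOURCE = ["The house is small.", "It is old.", "We like it a lot."]
TARGET = ["Das Haus ist klein.", "Es ist alt, und wir mögen es sehr."]
pairs = align(SOURCE, TARGET)
```

With this input, the first sentences are paired one-to-one and the remaining two English sentences are merged against the single German sentence. Real aligners typically add lexical cues and a statistical model of length ratios, which this sketch leaves out.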

* [[Philology|Philologies]]
** Text corpora are also used in the study of [[historical document]]s, for example in attempts to [[decipherment|decipher]] ancient scripts, or in [[Biblical scholarship]]. Some archaeological corpora cover so short a period that they provide a snapshot in time. One of the shortest in time span may be the [[Amarna letters]] ([[1350 BC]]), which cover 15–30 years. The ''corpus'' of an ancient city (for example, the "[[Kültepe]] Texts" of Turkey) may comprise a series of corpora, determined by the dates of their find sites.