Machine translation: Difference between revisions

From Niidae Wiki
imported>Prototyperspective
m ce
imported>MrOllie
Reverted 1 edit by Icaspspeech (talk): Apparent COI / promo
 
Line 1: Line 1:
[[file:Qxz-ad104.gif|center]]
{{short description|Computerized translation between natural languages}}
{{historical}}
{{distinguish|Computer-assisted translation|Interactive machine translation|Translator (computing)}}
The purpose of the '''Wiki(pedia) Machine Translation Project''' is to develop ideas, methods and tools that can help translate Wikipedia articles (and [[Wikimedia]] pages) from one language to another, particularly out of English and into languages with small numbers of fluent speakers.
{{Use dmy dates|date=July 2014}}
[[File:WordLensDemo5Feb2012.jpg|thumb|upright=1.3|A mobile phone app translating Spanish text into English]]
{{Translation sidebar}}
'''Machine translation''' is the use of computational techniques to [[translation|translate]] text or speech from one [[language]] to another, including the contextual, idiomatic and pragmatic nuances of both languages.


Remember to read the current [[talk:Wikipedia Machine Translation Project|talk page]] and particularly what is stated in the Wikipedia Translation page:
Early approaches were mostly [[Rule-based machine translation|rule-based]] or [[Statistical machine translation|statistical]]. These methods have since been superseded by [[neural machine translation]]<ref>{{Cite web |date=3 October 2016 |title=Google Translate Gets a Deep-Learning Upgrade |url=https://spectrum.ieee.org/google-translate-gets-a-deep-learning-upgrade |access-date=2024-07-07 |website=IEEE Spectrum |language=en}}</ref> and [[Large language model|large language models]].<ref>{{Cite web |date=2024-02-23 |title=Google Translate vs. ChatGPT: Which One Is the Best Language Translator? |url=https://uk.pcmag.com/ai/151950/google-translate-vs-chatgpt-which-one-is-the-best-language-translator |access-date=2024-07-07 |website=PCMag UK |language=en-gb}}</ref>


[[Wikipedia]] is a [[Wikipedia:Multilingual coordination|multilingual project]]. Articles on the same subject in different languages can be edited independently; they do not have to be translations of one another or correspond closely in form, style or content.  Still, translation is often useful to spread information between articles in different languages.
==History==
{{Main|History of machine translation}}


Translation takes work.  Machine translation, especially between unrelated languages (e.g. English and Japanese), as of 2013 usually produces very low quality results. Wikipedia consensus is that '''an unedited machine translation, left as a Wikipedia article, is worse than nothing.''' (see for example [//en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_44#Other_Wikis here]). The translation templates have links to machine translations built in automatically, so all readers should be able to access machine translations easily.
===Origins===
The origins of machine translation can be traced back to the work of [[Al-Kindi]], a ninth-century Arabic [[cryptographer]] who developed techniques for systemic language translation, including [[cryptanalysis]], [[frequency analysis]], and [[probability]] and [[statistics]], which are used in modern machine translation.<ref>{{Cite web |last=DuPont |first=Quinn |date=January 2018 |title=The Cryptological Origins of Machine Translation: From al-Kindi to Weaver |url=http://amodern.net/article/cryptological-origins-machine-translation/ |url-status=dead |archive-url=https://web.archive.org/web/20190814061915/http://amodern.net/article/cryptological-origins-machine-translation/ |archive-date=14 August 2019 |access-date=2 September 2019 |website=Amodern}}</ref> The idea of machine translation later appeared in the 17th century. In 1629, [[René Descartes]] proposed a universal language, with equivalent ideas in different tongues sharing one symbol.<ref>{{Cite book |last=Knowlson |first=James |title=Universal Language Schemes in England and France, 1600-1800 |date=1975 |publisher=University of Toronto Press |isbn=0-8020-5296-7 |location=Toronto}}</ref>


Remember that if the idea were simply to run a Wikipedia article through a fully automatic machine translation system (such as Google Translate), there would be no point in adding the results to the "foreign" Wikipedia: a user could just feed the system the desired URL.
The idea of using digital computers for translation of natural languages was proposed as early as 1947 by England's [[Andrew Donald Booth|A. D. Booth]]<ref>{{Cite book |last=Booth |first=Andrew D. |url=https://archive.org/details/sim_computers-and-people_1953-05_2_4/page/n8/ |title=Computers and Automation 1953-05: Vol 2 Iss 4 |date=1953-05-01 |publisher=Berkeley Enterprises |pages=6 |language=en |chapter=MECHANICAL TRANSLATION}}</ref> and [[Warren Weaver]] at the [[Rockefeller Foundation]] in the same year. "The memorandum written by [[Warren Weaver]] in 1949 is perhaps the single most influential publication in the earliest days of machine translation."<ref>{{cite book
  |url=https://pdfs.semanticscholar.org/eaa9/ccf94b4d129c26faf45a1353ffcbbe9d4fda.pdf
  |archive-url=https://web.archive.org/web/20200228015454/https://pdfs.semanticscholar.org/eaa9/ccf94b4d129c26faf45a1353ffcbbe9d4fda.pdf
  |url-status=dead
  |archive-date=2020-02-28
  |chapter=Warren Weaver and the launching of MT |via=[[Semantic Scholar]]  |author=J. Hutchins|title=Early Years in Machine Translation |series=Studies in the History of the Language Sciences |year=2000 |volume=97 |page=17 |doi=10.1075/sihols.97.05hut |isbn=978-90-272-4586-1 |s2cid=163460375 }}</ref><ref>{{cite web
|url=https://www.britannica.com/biography/Warren-Weaver
|title=Warren Weaver, American mathematician
|date=July 13, 2020
|access-date=7 August 2020
|archive-date=6 March 2021
|archive-url=https://web.archive.org/web/20210306061225/https://www.britannica.com/biography/Warren-Weaver
|url-status=live
}}</ref> Others followed. A demonstration was made in 1954 on the [[APEXC]] machine at [[Birkbeck, University of London|Birkbeck College]] ([[University of London]]) of a rudimentary translation of English into French. Several papers on the topic were published at the time, and even articles in popular journals (for example an article by Cleave and Zacharov in the September 1955 issue of ''[[Wireless World]]'').  A similar application, also pioneered at Birkbeck College at the time, was reading and composing [[Braille]] texts by computer.


== Motivation ==
===1950s===
The first researcher in the field, [[Yehoshua Bar-Hillel]], began his research at MIT (1951). A [[Georgetown University]] MT research team, led by Professor Michael Zarechnak, followed (1951) with a public demonstration of its [[Georgetown-IBM experiment]] system in 1954. MT research programs popped up in Japan<ref>{{cite book|last1=上野|first1=俊夫|title=パーソナルコンピュータによる機械翻訳プログラムの制作|date=1986-08-13|publisher=(株)ラッセル社|location=Tokyo|isbn=494762700X|page=16|language=ja|quote=わが国では1956年、当時の電気試験所が英和翻訳専用機「ヤマト」を実験している。この機械は1962年頃には中学1年の教科書で90点以上の能力に達したと報告されている。(translation (assisted by [[Google Translate]]): In Japan in 1956, the then [[w:jp:電気試験所|Electrotechnical Laboratory]] tested ''Yamato'', a dedicated English-to-Japanese translation machine. By around 1962 the machine was reported to have reached a score of over 90 points on a first-year junior high school textbook.)}}</ref><ref>{{Cite web | url=http://museum.ipsj.or.jp/computer/dawn/0027.html | title=機械翻訳専用機「やまと」-コンピュータ博物館 | access-date=4 April 2017 | archive-date=19 October 2016 | archive-url=https://web.archive.org/web/20161019171540/http://museum.ipsj.or.jp/computer/dawn/0027.html | url-status=live }}</ref> and Russia (1955), and the first MT conference was held in London (1956).<ref name="Nye">{{cite journal|last1=Nye|first1=Mary Jo|title=Speaking in Tongues: Science's centuries-long hunt for a common language|journal=Distillations|date=2016|volume=2|issue=1|pages=40–43|url=https://www.sciencehistory.org/distillations/magazine/speaking-in-tongues|access-date=20 March 2018|archive-date=3 August 2020|archive-url=https://web.archive.org/web/20200803130801/https://www.sciencehistory.org/distillations/magazine/speaking-in-tongues|url-status=live}}</ref><ref name="Babel">{{cite book|last1=Gordin|first1=Michael D.|title=Scientific Babel: How Science Was Done Before and After Global English|date=2015|publisher=University of Chicago Press|location=Chicago, Illinois|isbn=9780226000299}}</ref>


Small languages can't produce articles as fast as Wikimedia projects in languages such as English, Japanese, German or Spanish, because the number of wikipedians is too low and some prefer to contribute to bigger projects. One potential solution for this problem, in discussion since 2002, is the translation of Wikimedia projects. As some languages will not have enough translators, Machine Translation can improve the productivity of the community. This sort of automatic translation would be a first step for manual translations to be added and corrected later, while the local communities develop.
[[David G. Hays]] "wrote about computer-assisted language processing as early as 1957" and "was project leader on computational linguistics at [[RAND Corporation|Rand]] from 1955 to 1968."<ref>{{cite news
|newspaper=[[The New York Times]]
|quote=wrote about computer-assisted language processing as early as 1957.. was project leader on computational linguistics at Rand from 1955 to 1968.
|url=https://www.nytimes.com/1995/07/28/obituaries/david-g-hays-66-a-developer-of-language-study-by-computer.html
|title=David G. Hays, 66, a Developer Of Language Study by Computer
|author=Wolfgang Saxon
|date=July 28, 1995
|access-date=7 August 2020
|archive-date=7 February 2020
|archive-url=https://web.archive.org/web/20200207035914/https://www.nytimes.com/1995/07/28/obituaries/david-g-hays-66-a-developer-of-language-study-by-computer.html
|url-status=live
}}</ref>


A second, but very important, motivation is the development of free tools for Computational Linguistics and Natural Language Processing. These fields are very important, but resources for small languages are usually nonexistent, low-quality, expensive and/or restricted in their usage. Even for "big" languages such as English, free resources are still scarce. We could develop...
===1960–1975===
Researchers continued to join the field as the Association for Machine Translation and Computational Linguistics was formed in the U.S. (1962) and the National Academy of Sciences formed the Automatic Language Processing Advisory Committee (ALPAC) to study MT (1964). Real progress was much slower, however, and after the [[ALPAC|ALPAC report]] (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced.<ref name="ueno">{{cite book
|last1=上野 |first1=俊夫 |title=パーソナルコンピュータによる機械翻訳プログラムの制作 |date=1986-08-13 |publisher=(株)ラッセル社|isbn=494762700X|page=16|location=Tokyo|language=ja}}</ref> According to a 1972 report by the Director of Defense Research and Engineering (DDR&E), the feasibility of large-scale MT was reestablished by the success of the Logos MT system in translating military manuals into Vietnamese during that conflict.


== Approaches ==
The French Textile Institute also used MT to translate abstracts from and into French, English, German and Spanish (1970); Brigham Young University started a project to translate Mormon texts by automated translation (1971).


=== Interlingua approach ===
===1975 and beyond===
A different but related approach would be translating articles into a machine translation interlingua like [[w:Universal Networking Language|UNL]], then writing software modules to translate automatically from that interlingua into each target language. The initial translation could be created fully by hand, or machine translated with humans verifying the accuracy of the translation and choosing between multiple alternatives. This only saves work with respect to direct translation if there are several target languages whose modules are well enough developed, but those modules are much easier to write, and can be expected to be more accurate, than full real-language to real-language automatic translation systems.
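The trade-off described above can be illustrated with a small counting sketch in Python. The function names and the counting model are assumptions made for illustration only: direct translation is taken to need one system per ordered language pair, while an interlingua is taken to need one analyser (into the interlingua) and one generator (out of it) per language.

```python
def direct_systems(n_languages: int) -> int:
    """One-directional systems needed to connect every ordered language pair."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages: int) -> int:
    """One analyser plus one generator per language (hypothetical cost model)."""
    return 2 * n_languages


# Compare the two counts as the number of languages grows.
for n in (2, 3, 5, 10):
    print(n, direct_systems(n), interlingua_modules(n))
```

Under this simplified model the interlingua only pays off once more than three languages are involved, which matches the remark that the approach saves work only when there are several target languages.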
[[SYSTRAN]], which "pioneered the field under contracts from the U.S. government"<ref name="MT1998.EmptyAtlantic">{{Cite magazine |last=Budiansky |first=Stephen  |date=December 1998 |title=Lost in Translation |magazine=[[Atlantic Magazine]] |pages=81–84}}</ref> in the 1960s, was used by Xerox to translate technical manuals (1978). Beginning in the late 1980s, as [[computation]]al power increased and became less expensive, more interest was shown in [[statistical machine translation|statistical models for machine translation]]. MT became more popular after the advent of computers.<ref>{{Cite book|title=Conceptual Information Processing|last=Schank|first=Roger C.|date=2014|publisher=Elsevier|isbn=9781483258799|location=New York|pages=5}}</ref> SYSTRAN's first implementation was put into service in 1988 by Minitel, the online service of the [[La Poste (France)|French Postal Service]].<ref>{{Cite book|title=Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas, AMTA'98, Langhorne, PA, USA, October 28–31, 1998 Proceedings|last1=Farwell|first1=David|last2=Gerber|first2=Laurie|last3=Hovy|first3=Eduard|date=2003-06-29|publisher=Springer|isbn=3540652590|location=Berlin|pages=276}}</ref> Various computer-based translation companies were also launched, including Trados (1984), which was the first to develop and market Translation Memory technology (1989), though this is not the same as MT. The first commercial MT system for Russian / English / German-Ukrainian was developed at Kharkov State University (1991).


=== Translating between closely related languages ===
By 1998, "for as little as $29.95" one could "buy a program for translating in one direction between English and a major European language of your choice" to run on a PC.<ref name=MT1998.EmptyAtlantic/>


I would imagine it would be an easier task to translate between similar languages than non-similar ones. For example, we have Wikipedias in Catalan and Spanish and Macedonian and Bulgarian, perhaps even Dutch and Afrikaans (some more studies would have to be done to evaluate which would be most appropriate). There is some free software being produced in Spain called [[:en:Apertium]] that might be useful here.
MT on the web started with SYSTRAN offering free translation of small texts (1996) and then providing this via [[Babel Fish (website)|AltaVista Babelfish]],<ref name=MT1998.EmptyAtlantic/> which racked up 500,000 requests a day (1997).<ref>{{Cite web |url=https://digital.com/about/babel-fish/ |title=Babel Fish: What Happened To The Original Translation Application?: We Investigate |last1=Barron |first1=Brenda |date=November 18, 2019 |website=Digital.com |language=en-US |access-date=2019-11-22 |archive-date=20 November 2019 |archive-url=https://web.archive.org/web/20191120032732/https://digital.com/about/babel-fish/ |url-status=live }}</ref> The second free translation service on the web was [[Lernout & Hauspie]]'s GlobaLink.<ref name=MT1998.EmptyAtlantic/> ''Atlantic Magazine'' wrote in 1998 that "Systran's Babelfish and GlobaLink's Comprende" handled "Don't bank on it" with a "competent performance."<ref>and gave other examples too</ref>


=== Suggested statistical approach ===
[[Franz Josef Och]] (the future head of Translation Development at Google) won DARPA's speed MT competition (2003).<ref>{{Cite book
|title=Routledge Encyclopedia of Translation Technology |last=Chan |first=Sin-Wai
|date=2015 |publisher=Routledge |isbn=9780415524841 |location=Oxon |pages=385}}</ref> More innovations during this time included MOSES, the open-source statistical MT engine (2007), a text/SMS translation service for mobiles in Japan (2008), and a mobile phone with built-in speech-to-speech translation functionality for English, Japanese and Chinese (2009). In 2012, Google announced that [[Google Translate]] translates roughly enough text to fill 1 million books in one day.


* In a first step, a series of language corpora and statistical models are generated from the various Wikipedias. The results are particularly interesting because, besides the extraction of text needed for the project, they also allow us to make public under a permissive license a kind of data generally unavailable for smaller languages, or at most only available after expensive purchases. Many of these have already been released and are hosted at SourceForge:
==Approaches==
{{See also|Hybrid machine translation|Example-based machine translation}}


Before the advent of [[deep learning]] methods, statistical methods required a lot of rules accompanied by [[morphology (linguistics)|morphological]], [[syntax|syntactic]], and [[semantics|semantic]] annotations.

{| class="wikitable"
|-
!  Language code
!  Dump date
!  Native name
!  English name
!  Raw corpus
!  Clean corpus
!  Reduced corpus
!  Language model
|-
|  [//af.wikipedia.org/ af]
|  2010-03-10
|  afrikaans
|  [[w:Afrikaans|Afrikaans]]
|  [https://sourceforge.net/projects/hermes/files/corpus/af.raw.gz/download link] (7.8 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/af.clean.gz/download link] (7.7 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/af.plm.gz/download link] (3.1 MB)
|-
|  [//ca.wikipedia.org/ ca]
|  2010-02-19
|  català
|  [[w:Catalan language|Catalan]]
|  [https://sourceforge.net/projects/hermes/files/corpus/ca.raw.gz/download link] (86.6 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/ca.clean.gz/download link] (85.0 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/ca.plm.gz/download link] (7.1 MB)
|-
|  [//en.wikipedia.org/ en]
|  2010-01-30
|  English
|  [[w:English language|English]]
|  [https://sourceforge.net/projects/hermes/files/corpus/en.raw.gz/download link] (1.6 GB)
|  [https://sourceforge.net/projects/hermes/files/corpus/en.clean.gz/download link] (1.5 GB)
| [https://sourceforge.net/projects/hermes/files/corpus/en.rclean.gz/download link] (266.9 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/en.plm.gz/download link] (18.4 MB)
|-
|  [//eo.wikipedia.org/ eo]
|  2010-03-11
|  esperanto
|  [[w:Esperanto|Esperanto]]
|  [https://sourceforge.net/projects/hermes/files/corpus/eo.raw.gz/download link] (33.0 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/eo.clean.gz/download link] (32.6 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/eo.plm.gz/download link] (6.2 MB)
|-
|  [//eu.wikipedia.org/ eu]
|  2010-02-22
|  euskara
|  [[w:Basque language|Basque]]
|  [https://sourceforge.net/projects/hermes/files/corpus/eu.raw.gz/download link] (12.9 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/eu.clean.gz/download link] (12.8 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/eu.plm.gz/download link] (4.0 MB)
|-
|  [//gl.wikipedia.org/ gl]
|  2010-02-20
|  galego
|  [[w:Galician language|Galician]]
|  [https://sourceforge.net/projects/hermes/files/corpus/gl.raw.gz/download link] (25.2 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/gl.clean.gz/download link] (24.7 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/gl.plm.gz/download link] (3.8 MB)
|-
|  [//is.wikipedia.org/ is]
|  2010-03-11
|  íslenska
|  [[w:Icelandic language|Icelandic]]
|  [https://sourceforge.net/projects/hermes/files/corpus/is.raw.gz/download link] (6.8 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/is.clean.gz/download link] (6.7 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/is.plm.gz/download link] (3.8 MB)
|-
|  [//it.wikipedia.org/ it]
|  2010-02-18
|  italiano
|  [[w:Italian language|Italian]]
|  [https://sourceforge.net/projects/hermes/files/corpus/it.raw.gz/download link] (329.9 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/it.clean.gz/download link] (323.3 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/it.rclean.gz/download link] (305.7 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/it.plm.gz/download link] (15.7 MB)
|-
|  [//nap.wikipedia.org/ nap]
|  2010-02-21
|  napulitano
|  [[w:Neapolitan language|Neapolitan]]
|  [https://sourceforge.net/projects/hermes/files/corpus/nap.raw.gz/download link] (614.3 KB)
|  [https://sourceforge.net/projects/hermes/files/corpus/nap.clean.gz/download link] (580.2 KB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/nap.plm.gz/download link] (1.4 MB)
|-
|  [//pms.wikipedia.org/ pms]
|  2010-03-07
|  piemontèis
|  [[w:Piedmontese language|Piedmontese]]
|  [https://sourceforge.net/projects/hermes/files/corpus/pms.raw.gz/download link] (1.7 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/pms.clean.gz/download link] (1.6 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/pms.plm.gz/download link] (2.4 MB)
|-
|  [//pt.wikipedia.org/ pt]
|  2010-03-08
|  português
|  [[w:Portuguese language|Portuguese]]
|  [https://sourceforge.net/projects/hermes/files/corpus/pt.raw.gz/download link] (185.0 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/pt.clean.gz/download link] (180.9 MB)
|
|  (not yet)
|-
|  [//qu.wikipedia.org/ qu]
|  2010-02-25
|  runa simi
|  [[w:Quechua language|Quechua]]
|  [https://sourceforge.net/projects/hermes/files/corpus/qu.raw.gz/download link] (685.5 KB)
|  [https://sourceforge.net/projects/hermes/files/corpus/qu.clean.gz/download link] (639.6 KB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/qu.plm.gz/download link] (1.6 MB)
|-
|  [//sl.wikipedia.org/ sl]
|  2010-02-25
|  slovenščina
|  [[w:Slovenian language|Slovenian]]
|  [https://sourceforge.net/projects/hermes/files/corpus/sl.raw.gz/download link] (22.1 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/sl.clean.gz/download link] (21.8 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/sl.plm.gz/download link] (4.6 MB)
|-
|  [//sw.wikipedia.org/ sw]
|  2010-02-24
|  kiswahili
|  [[w:Swahili language|Swahili]]
|  [https://sourceforge.net/projects/hermes/files/corpus/sw.raw.gz/download link] (2.9 MB)
|  [https://sourceforge.net/projects/hermes/files/corpus/sw.clean.gz/download link] (2.8 MB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/sw.plm.gz/download link] (3.5 MB)
|-
|  [//yo.wikipedia.org/ yo]
|  2010-02-25
|  yorùbá
|  [[w:Yoruba language|Yoruba]]
|  [https://sourceforge.net/projects/hermes/files/corpus/yo.raw.gz/download link] (433.6 KB)
|  [https://sourceforge.net/projects/hermes/files/corpus/yo.clean.gz/download link] (375.5 KB)
|
|  [https://sourceforge.net/projects/hermes/files/corpus/yo.plm.gz/download link] (1.1 MB)
|}


(for more information on these data, see my talk page at the English Wikipedia [[User:Tresoldi|Tresoldi]] 16:22, 13 March 2010 (UTC))
===Rule-based===
{{Main|Rule-based machine translation}}


* As most of the minor Wikipedias would likely be populated, at least at first, by articles in the English one, the first pairs to be developed would be the English/Foreign language ones. Thus, for each language an initial list of random sentences is drawn from the English corpus (as above) and, if such a system is available, translated with some existing machine translation software (such as Google Translate or Apertium). Hopefully, wikipedians will start revising the translations collaboratively, just like the normal Wikipedia articles.
The rule-based machine translation approach was used mostly in the creation of [[dictionaries]] and grammar programs. Its biggest downfall was that everything had to be made explicit: orthographical variation and erroneous input must be made part of the source language analyser in order to cope with it, and lexical selection rules must be written for all instances of ambiguity.


* After the small list of random sentences is adequately translated/revised, a number of other sentences will be selected to be gradually added to the developing parallel corpus. By using statistics collected with the language models built above, particularly with the English one, two different approaches will be followed:
====Transfer-based machine translation====
** The first is to gradually cover all common n-grams of all orders (descending from 5-grams), so that the most common structures in the English wikipedia will be covered by the corpus (in other words, more pages should be translated with fewer problems -- think about very similar pages such as the ones about towns and cities, short biographies, etc.).
{{Main|Transfer-based machine translation}}
** The second will be to gradually cover the n-grams found in the parallel corpus, in order to cover the different contexts of both those n-grams and particularly of the contexts of them in the already included sentences.


* After the translation of about 1,000 sentences, an actual system will start to be built weekly. Everyone with some experience in statistical machine translation will agree that 1,000 sentences is a ridiculously low number for statistical translation, but the idea is to have a baseline set up and gradually increment it. Besides that, while the adding of sentences as described above goes on, from time to time the system would be used to translate some of the top articles of the English Wikipedia not covered by the foreign one; the output would be corrected (with a lot of pain at first) and added back to the corpus. After some time, we should finally be able to [[w:Eat your own dog food|eat our own dog food]], i.e., to do the first raw translation with our statistical systems, no longer relying on non-free systems (however, the usage of free software like Apertium for related languages is likely to still be a better alternative in the foreseeable future).
Transfer-based machine translation was similar to [[interlingual machine translation]] in that it created a translation from an intermediate representation that simulated the meaning of the original sentence. Unlike interlingual MT, it depended partially on the language pair involved in the translation.


* The corpora will be gradually tagged with part-of-speech information, lemmas and eventually syntactic information.
====Interlingual====
{{Main|Interlingual machine translation}}


* There will be a gradual integration with the Wiktionary and Wikipedia's intralinguistic links, covering not only the basic lemmas but, hopefully, the most common inflected forms for each language -- the results could then be contributed back to Wiktionary with carefully set bots.
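The sentence-selection step in the list above (covering the most common n-grams first) could be sketched roughly as follows. This is a simplified illustration in Python, not the project's actual tooling: it works with a single n-gram order instead of descending from 5-grams, uses whitespace tokenisation, and the function names <code>ngrams</code> and <code>select_sentences</code> are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_sentences(sentences, n=2, budget=2):
    """Greedily pick sentences that cover the most frequent, not yet covered n-grams."""
    tokenized = [s.lower().split() for s in sentences]
    # Corpus-wide n-gram frequencies stand in for the language-model statistics.
    counts = Counter(g for toks in tokenized for g in ngrams(toks, n))
    covered, chosen = set(), []

    def gain(i):
        # Total frequency of the n-grams this sentence would newly cover.
        return sum(counts[g] for g in set(ngrams(tokenized[i], n)) - covered)

    for _ in range(min(budget, len(sentences))):
        candidates = [i for i in range(len(sentences)) if i not in chosen]
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # nothing new left to cover
        chosen.append(best)
        covered.update(ngrams(tokenized[best], n))
    return [sentences[i] for i in chosen]


corpus = [
    "the town is in france",
    "the town is in spain",
    "he was born in france",
    "cats sleep",
]
print(select_sentences(corpus, n=2, budget=2))
```

With this toy corpus, the second sentence chosen is the short biography line and not the near-duplicate town sentence, because the latter adds only one new bigram once the first town sentence is covered.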
Interlingual machine translation was one instance of rule-based machine-translation approaches.  In this approach, the source language, i.e. the text to be translated, was transformed into an interlingual language, i.e. a "language neutral" representation that is independent of any language. The target language was then generated out of the [[interlinguistics|interlingua]]. The only interlingual machine translation system that was made operational at the commercial level was the KANT system (Nyberg and Mitamura, 1992), which was designed to translate Caterpillar Technical English (CTE) into other languages.


== Evaluating with Wikipedias ==
====Dictionary-based====
{{Main|Dictionary-based machine translation}}


''Originally found in the [http://wiki.apertium.org/wiki/Evaluating_with_Wikipedia Apertium Wiki].''
Dictionary-based machine translation used a method based on [[dictionary]] entries: words were translated one by one, as a dictionary would translate them.
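A minimal sketch of word-for-word dictionary lookup, in Python, shows both the idea and its main limitation. The toy English-to-Spanish dictionary and the function name <code>translate_word_for_word</code> are invented for illustration; real dictionary-based systems typically also handle inflection and multi-word entries.

```python
def translate_word_for_word(sentence, dictionary):
    """Replace each word by its dictionary entry; unknown words pass through unchanged."""
    translated = []
    for word in sentence.lower().split():
        translated.append(dictionary.get(word, word))
    return " ".join(translated)


# Hypothetical toy dictionary (English to Spanish).
toy_dictionary = {"the": "el", "black": "negro", "cat": "gato", "sleeps": "duerme"}

print(translate_word_for_word("The black cat sleeps", toy_dictionary))
# prints: el negro gato duerme
```

The output keeps English word order ("el negro gato duerme"), whereas Spanish would place the adjective after the noun ("el gato negro duerme"). Reordering of this kind is exactly what a pure dictionary lookup cannot provide.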


One of the ways of improving an MT system, and at the same time improving and adding content in Wikipedias, is to use Wikipedias as a test bed. You can translate text from one Wikipedia to another, then either [[w:en:Postediting#Postediting_and_machine_translation|post-edit]] it yourself or wait for, or ask, other people to post-edit the text. One of the nice things is that MediaWiki (the software Wikipedia is based on) allows you to view diffs between the versions (see the 'history' tab).
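Outside the wiki interface, the same kind of comparison between raw machine output and its post-edited version can be approximated with a word-level diff. The sketch below uses Python's standard <code>difflib</code> module; the function name <code>postedit_changes</code> and the example sentences are assumptions for illustration, and this is not how MediaWiki computes its diffs.

```python
import difflib


def postedit_changes(mt_output, post_edited):
    """List the word-level edits that turn the MT output into the post-edited text."""
    a, b = mt_output.split(), post_edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of 'replace', 'delete' or 'insert'
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


print(postedit_changes("he go to school", "he goes to school"))
# prints: [('replace', 'go', 'goes')]
```

Collecting such edits over many post-edited articles gives a rough picture of which errors the translation system makes most often.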
===Statistical===
{{main|Statistical machine translation}}
Statistical machine translation tried to generate translations using [[statistical methods]] based on bilingual text corpora, such as the [[Hansard#Translation|Canadian Hansard]] corpus, the English-French record of the Canadian parliament and [[Europarl corpus|EUROPARL]], the record of the [[European Parliament]]. Where such corpora were available, good results were achieved translating similar texts, but such corpora were rare for many language pairs. The first statistical machine translation software was [[CANDIDE]] from [[IBM]]. In 2005, Google improved its internal translation capabilities by using approximately 200 billion words from United Nations materials to train their system; translation accuracy improved.<ref>{{cite web |url=http://blog.outer-court.com/archive/2005-05-22-n83.html |title=Google Translator: The Universal Language |publisher=Blog.outer-court.com |date=25 January 2007 |access-date=2012-06-12 |archive-date=20 November 2008 |archive-url=https://web.archive.org/web/20081120030225/http://blog.outer-court.com/archive/2005-05-22-n83.html |url-status=live }}</ref>


This strategy is beneficial both to Wikipedia and to any machine translation system, such as Apertium or a statistical one based on Moses. Wikipedia gets new articles in languages which might not otherwise have them, and the machine translation system gets information on how we can improve the software. It is important to note that Wikipedia is a community effort, and that people can rightly be concerned about machine translation. To get an idea of this, put yourself in the place of people having to fix a lot of "hit and run" SYSTRAN (a.k.a. BabelFish) or Google Translate translations, with little time and not much patience.
SMT's main drawbacks included its dependence on huge amounts of parallel text, its problems with morphology-rich languages (especially with translating ''into'' such languages), and its inability to correct singleton errors.


===Guidelines===
Some work has been done in the utilization of multiparallel [[text corpus|corpora]], that is a body of text that has been translated into 3 or more languages. Using these methods, a text that has been translated into 2 or more languages may be utilized in combination to provide a more accurate translation into a third language compared with if just one of those source languages were used alone.<ref>{{Cite conference |last=Schwartz |first=Lane |date=2008 |title=Multi-Source Translation Methods |url=https://dowobeha.github.io/papers/amta08.pdf |conference=Paper presented at the 8th Biennial Conference of the Association for Machine Translation in the Americas |archive-url=https://web.archive.org/web/20160629171944/http://dowobeha.github.io/papers/amta08.pdf |archive-date=29 June 2016 |access-date=3 November 2017 |url-status=live}}</ref><ref>{{Cite conference |last1=Cohn |first1=Trevor |last2=Lapata |first2=Mirella |date=2007 |title=Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora |url=http://homepages.inf.ed.ac.uk/mlap/Papers/acl07.pdf |conference=Paper presented at the 45th Annual Meeting of the Association for Computational Linguistics, June 23–30, 2007, Prague, Czech Republic |archive-url=https://web.archive.org/web/20151010171334/http://homepages.inf.ed.ac.uk/mlap/Papers/acl07.pdf |archive-date=10 October 2015 |access-date=3 February 2015 |url-status=live}}</ref><ref>{{Cite journal |last1=Nakov |first1=Preslav |last2=Ng |first2=Hwee Tou |date=2012 |title=Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages |url=https://jair.org/index.php/jair/article/view/10764 |journal=Journal of Artificial Intelligence Research |volume=44 |pages=179–222 |arxiv=1401.6876 |doi=10.1613/jair.3540 |doi-access=free}}</ref>
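One way of using a third language, often called triangulation or pivoting, can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the method of any particular cited paper: each word-level translation table maps a source item to candidate translations with probabilities, and the source-to-target probability is obtained by summing over pivot items, p(t|s) = Σ_p p(p|s) · p(t|p). The tables and the function name <code>triangulate</code> are invented toy data.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two translation tables through a pivot language by summing over pivot items."""
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in table.items()}


# Hypothetical toy tables: Spanish -> English (pivot) and English -> French.
es_en = {"gato": {"cat": 1.0}, "banco": {"bank": 0.5, "bench": 0.5}}
en_fr = {"cat": {"chat": 1.0}, "bank": {"banque": 1.0}, "bench": {"banc": 1.0}}

print(triangulate(es_en, en_fr))
```

In this toy example the ambiguity of Spanish "banco" is carried through the pivot, giving French "banque" and "banc" equal weight.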


*Don't just start translating texts and waiting for people to fix them. The first thing you should do is create an account on the Wikipedia, and then find the "Community notice board". Ask there how regular contributors would feel about you using the Wikipedia for tests. The community notice board should be linked from the front page. It might be called something like "La tavèrna" in Occitan, or "Geselshoekie" in Afrikaans. When you are asking them, make the following clear:
=== Neural MT ===
{{Main|Neural machine translation}}


:* This is free software / open source machine translation.
A [[deep learning]]-based approach to MT, [[neural machine translation]] has made rapid progress in recent years. However, the current consensus is that the so-called human parity achieved is not real, being based wholly on limited domains, language pairs, and certain test benchmarks<ref>Antonio Toral, Sheila Castilho, Ke Hu, and Andy
:* You would like to help the community and are doing these translations both to help their Wikipedia expand the range of articles, and to improve the translation software.
Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. CoRR, abs/1808.10432.</ref> i.e., the evaluations behind it lack statistical power.<ref>{{Cite arXiv |eprint=1906.09833 |last1=Graham |first1=Yvette |last2=Haddow |first2=Barry |title=Translationese in Machine Translation Evaluation |date=2019 |last3=Koehn |first3=Philipp|class=cs.CL }}</ref>
:* The translations will be added only with the consent of the community, you do not intend to flood them with poorly translated articles.
:* The translations will be added by a '''human''' not by a bot.
:* Ask them if there are any subjects that they prefer you would cover, perhaps they have a page of "requested translations".
:* One way of looking at it might be as a non-native speaker of the language trying to learn the language. Point out that the initial translation will be done by machine, then you will try and fix the translation, but anything that you don't fix you would be grateful for other people to fix.


An example of the kind of conversation you might have is found [//af.wikipedia.org/wiki/Wikipedia:Geselshoekie/MT here].
Translations by neural MT tools like [[DeepL Translator]], which is thought to usually deliver the best machine translation results as of 2022, typically still need post-editing by a human.<ref>{{cite journal |last1=Katsnelson |first1=Alla |title=Poor English skills? New AIs help researchers to write better |journal=Nature |pages=208–209 |language=en |doi=10.1038/d41586-022-02767-9 |date=29 August 2022|volume=609 |issue=7925 |pmid=36038730 |bibcode=2022Natur.609..208K |s2cid=251931306 |doi-access=free }}</ref><ref>{{cite web |last1=Korab |first1=Petr |title=DeepL: An Exceptionally Magnificent Language Translator |url=https://towardsdatascience.com/deepl-an-exceptionally-magnificent-language-translator-78e86d8062d3 |website=Medium |access-date=9 January 2023 |language=en |date=18 February 2022}}</ref><ref>{{cite news |title=DeepL outperforms Google Translate – DW – 12/05/2018 |url=https://www.dw.com/en/deepl-cologne-based-startup-outperforms-google-translate/a-46581948 |access-date=9 January 2023 |work=Deutsche Welle |language=en}}</ref>


===How to translate===
Instead of training specialized translation models on parallel datasets, one can also [[Prompt engineering|directly prompt]] generative [[large language model]]s like [[Generative pre-trained transformer|GPT]] to translate a text.<ref name="Hendy2023">{{cite arXiv |last1=Hendy |first1=Amr |last2=Abdelrehim |first2=Mohamed |last3=Sharaf |first3=Amr |last4=Raunak |first4=Vikas |last5=Gabr |first5=Mohamed |last6=Matsushita |first6=Hitokazu |last7=Kim |first7=Young Jin |last8=Afify |first8=Mohamed |last9=Awadalla |first9=Hany |title=How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation |date=2023-02-18 |eprint=2302.09210 |class=cs.CL}}</ref><ref>{{cite news |last1=Fadelli |first1=Ingrid |title=Study assesses the quality of AI literary translations by comparing them with human translations |url=https://techxplore.com/news/2022-11-quality-ai-literary-human.html |access-date=18 December 2022 |work=techxplore.com |language=en}}</ref><ref name="arxiv221014250">{{Cite arXiv|last1=Thai |first1=Katherine |last2=Karpinska |first2=Marzena |last3=Krishna |first3=Kalpesh |last4=Ray |first4=Bill |last5=Inghilleri |first5=Moira |last6=Wieting |first6=John |last7=Iyyer |first7=Mohit |title=Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature |date=25 October 2022|class=cs.CL |eprint=2210.14250 }}</ref> This approach is considered promising,<ref name="WMT2023">{{cite conference |last1=Kocmi |first1=Tom |last2=Avramidis |first2=Eleftherios |last3=Bawden |first3=Rachel |last4=Bojar |first4=Ondřej |last5=Dvorkovich |first5=Anton |last6=Federmann |first6=Christian |last7=Fishel |first7=Mark |last8=Freitag |first8=Markus |last9=Gowda |first9=Thamme |last10=Grundkiewicz |first10=Roman |last11=Haddow |first11=Barry |last12=Koehn |first12=Philipp |last13=Marie |first13=Benjamin |last14=Monz |first14=Christof |last15=Morishita |first15=Makoto |date=2023 |editor-last=Koehn |editor-first=Philipp |editor2-last=Haddow 
|editor2-first=Barry |editor3-last=Kocmi |editor3-first=Tom |editor4-last=Monz |editor4-first=Christof |title=Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet |url=https://aclanthology.org/2023.wmt-1.1 |journal=Proceedings of the Eighth Conference on Machine Translation |location=Singapore |publisher=Association for Computational Linguistics |pages=1–42 |doi=10.18653/v1/2023.wmt-1.1|doi-access=free }}</ref> but is still more resource-intensive than specialized translation models.


In order to be more useful, when you create the page, first paste in the unedited machine translation output. Save the page with an ''edit summary'' saying that you're still working on it. Then proceed to post-edit the output. After you've finished, save the page again. If you go to the history tab at the top of the page and do "Compare selected versions" you will see the differences (diff) between the machine translation and the post-edited output. This gives a good indication of how good the original Apertium output was.
==Issues==
: It's also helpful if you first paste the ''input''. Then you can compare 1. input, 2. MT output, 3. post-edit (keeping the input text in the article history might be useful if you want to compare old MT-output with a newer version of the machine translator)
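The comparison between raw machine translation output and the post-edited text can also be approximated offline. The following is a minimal sketch using Python's standard <code>difflib</code> module; the function names and sample sentences are illustrative assumptions, not part of any Wikimedia or Apertium tooling.

```python
import difflib


def post_edit_similarity(mt_output: str, post_edited: str) -> float:
    """Word-level similarity (0.0 to 1.0) between MT output and its post-edit."""
    matcher = difflib.SequenceMatcher(a=mt_output.split(), b=post_edited.split())
    return matcher.ratio()


def changed_words(mt_output: str, post_edited: str) -> list[str]:
    """Words removed ('- word') or added ('+ word') during post-editing."""
    diff = difflib.ndiff(mt_output.split(), post_edited.split())
    return [entry for entry in diff if entry[0] in "+-"]


if __name__ == "__main__":
    mt = "the cat sit on mat"              # hypothetical raw MT output
    edited = "the cat sits on the mat"     # hypothetical post-edited text
    print(round(post_edit_similarity(mt, edited), 3))
    print(changed_words(mt, edited))
```

A ratio close to 1.0 suggests that little post-editing was needed; this is only a rough indicator and does not replace reading the diff.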
[[File:Stir Fried Wikipedia.jpg|thumb|right|250px|Machine translation could produce some non-understandable phrases, such as "{{lang|zh|鸡枞}}" (''[[Macrolepiota albuminosa]]'') being rendered as "wikipedia".]]
[[File:Machine translation in Bali.jpg|thumb|right|250px|Broken Chinese "{{lang|zh|沒有進入}}" from machine translation in [[Bali, Indonesia]]. The broken Chinese sentence sounds like "there does not exist an entry" or "have not entered yet".]]
Studies using human evaluation (e.g. by professional literary translators or human readers) have [[problem-solving|systematically identified various issues]] with the latest advanced MT outputs.<ref name="arxiv221014250"/> Common issues include the translation of ambiguous parts whose correct translation requires common sense-like semantic language processing or context.<ref name="arxiv221014250"/> There can also be errors in the source texts and a lack of high-quality training data, and the severity and frequency of several types of problems may not be reduced by the techniques used to date, so some level of active human participation remains necessary.


== Existing free software ==
===Disambiguation===
* [[w:Apertium|Apertium]]
{{Main|Word-sense disambiguation|Syntactic disambiguation}}
** Apertium is an open-source platform (engine, tools) to build machine translation systems, mainly between related languages. Code, documentation, language pairs available from the [http://apertium.org Apertium website].
Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by [[Yehoshua Bar-Hillel]].<ref>[http://ourworld.compuserve.com/homepages/WJHutchins/Miles-6.htm Milestones in machine translation – No.6: Bar-Hillel and the nonfeasibility of FAHQT] {{webarchive|url=https://web.archive.org/web/20070312062051/http://ourworld.compuserve.com/homepages/WJHutchins/Miles-6.htm |date=12 March 2007 }} by John Hutchins</ref> He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word.<ref>Bar-Hillel (1960), "Automatic Translation of Languages". Available online at http://www.mt-archive.info/Bar-Hillel-1960.pdf {{Webarchive|url=https://web.archive.org/web/20110928112348/http://www.mt-archive.info/Bar-Hillel-1960.pdf |date=28 September 2011 }}</ref> Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches.
** See Niklas Laxström, ''[http://laxstrom.name/blag/2013/05/22/on-course-to-machine-translation/ On course to machine translation]'', May 2013.
* [http://lingwarium.org/ Ariane]
** Ariane (Ariane-H / Heloise version) is an online environment for developing machine translation systems. It is fully compatible with the original Ariane-G5 from the [https://fr.wikipedia.org/wiki/Groupe_d'%C3%A9tude_pour_la_traduction_automatique_et_le_traitement_automatis%C3%A9_des_langues_et_de_la_parole GETA] research group of Grenoble University (France).
* Moses (Moses is licensed under the LGPL.)
** "Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus)." (from the [http://www.statmt.org/moses/ Moses website])
* <s>[http://greenman.co.za/translate Wikipedia translation]</s> -- defunct
** Tool designed to help populate smaller Wikipedias, for example translating country templates quickly


== Attempts ==
Shallow approaches assume no knowledge of the text. They simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.<ref>{{Cite book|title=Hybrid approaches to machine translation|others=Costa-jussà, Marta R., Rapp, Reinhard, Lambert, Patrik, Eberle, Kurt, Banchs, Rafael E., Babych, Bogdan|date=21 July 2016|isbn=9783319213101|location=Switzerland|oclc=953581497}}</ref>
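A shallow approach of the kind described above can be sketched as follows. This is a toy illustration only: the sense labels, the training sentences and the scoring rule (summing co-occurrence counts of the surrounding words) are assumptions made for the example and do not reproduce any particular published system.

```python
from collections import Counter

# Hypothetical labelled contexts for two senses of the ambiguous word "bank".
TRAINING = {
    "bank/finance": [
        "deposit money in the bank account",
        "the bank approved the loan",
    ],
    "bank/river": [
        "fishing on the river bank",
        "the bank of the stream was muddy",
    ],
}


def build_profiles(training):
    """Count how often each word appears in the contexts of each sense."""
    return {
        sense: Counter(word for sentence in sentences for word in sentence.lower().split())
        for sense, sentences in training.items()
    }


def disambiguate(sentence, target, profiles):
    """Pick the sense whose training contexts share the most words with the sentence."""
    context = [word for word in sentence.lower().split() if word != target]
    scores = {
        sense: sum(profile[word] for word in context)
        for sense, profile in profiles.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    profiles = build_profiles(TRAINING)
    print(disambiguate("she opened a bank account for the loan", "bank", profiles))
    print(disambiguate("they walked along the muddy river bank", "bank", profiles))
```

The method uses no understanding of the text, only word statistics; a realistic system would need far more data and would typically discount very common words.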


Several projects were or have been started to use computer assisted translation on Wikimedia projects. An incomplete list follows, for projects conducted ''on'' the Wikimedia projects themselves.
[[Claude Piron]], a long-time translator for the United Nations and the [[World Health Organization]], wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve [[ambiguity|ambiguities]] in the [[source text]], which the [[grammatical]] and [[Lexical (semiotics)|lexical]] exigencies of the [[Translation|target language]] require to be resolved:


*[[w:Google Translator Toolkit|Google Translator Toolkit]]
{{Blockquote|Why does a translator need a whole workday to translate five pages, and not an hour or two? ..... About 90% of an average text corresponds to these simple conditions.  But unfortunately, there's the other 10%. It's that part that requires six [more] hours of work. There are ambiguities one has to resolve. For instance, the author of the source text, an Australian physician, cited the example of an epidemic which was declared during World War II in a "Japanese prisoners of war camp". Was he talking about an American camp with Japanese prisoners or a Japanese camp with American prisoners?  The English has two senses. It's necessary therefore to do research, maybe to the extent of a phone call to Australia.<ref name="piron">[[Claude Piron]], ''Le défi des langues'' (The Language Challenge), Paris, L'Harmattan, 1994. <!-- GFDL translation by Jim Henry --></ref>
** [foundation-l] [http://thread.gmane.org/gmane.org.wikimedia.foundation/39477 Google Translate now assists with human translations of Wikipedia articles] (June 2009)
}}
** [[w:en:Wikipedia:Wikipedia Signpost/2015-06-24/Special report|Some numbers on the Telugu Wikipedia edition]] (2009–2011)
** [https://www.blog.google/products/search/expanding-knowledge-access-wikimedia-foundation/ Google's 2019 announcement] on [[Supporting Indian Language Wikipedias Program|Project Tiger / Growing Local Language Content on Wikipedia]]
* [[w:en:User:Endo999/GoogleTrans#Integration_With_Wikipedia_Beta_Translation_System:_Now_In_Production_Version_Of_Gadget | GoogleTrans Gadget]] (integrates with Content Management System as of October 2015)
*Microsoft's [[mw:Extension:WikiBhasha|WikiBhasha]]
**[[wmfblog:2010/10/18/wikibhasha/|WikiBhasha on Wikimedia blog]] (2010)
**Some discussions in Arab and Indian lists in winter 2010-2011 [http://thread.gmane.org/gmane.science.linguistics.wikipedia.arabic/816] [http://thread.gmane.org/gmane.org.wikimedia.india/1944] [http://thread.gmane.org/gmane.org.wikimedia.india/2689/] [http://thread.gmane.org/gmane.org.wikimedia.india/2966]
**Preceded by [http://amta2010.amtaweb.org/AMTA/papers/7-01-04-KumaranEtal.pdf WikiBABEL: A Wiki - style Platform for Creation of Parallel Data] (2009)
* [[w:en:User:Ebraminio/ArticleTranslator.js]] (2011)
* [[w:eu:Wikiproiektu:OpenMT2 eta Euskal Wikipedia]] (2011–2012) (only 100 articles; [[doi:10.1007/978-3-642-35085-6_4]])
* [http://www.casmacat.eu Casmacat] (Cognitive Analysis and Statistical Methods for Advanced Computer Aided Translation) (2012–2013) cf. [http://ufal.mff.cuni.cz/pbml/100/art-alabau-et-al.pdf one paper], [http://gv.casmacat.eu:9360/translate/wikipedia/en-el/10-ftwu7vcc#197 an interface example]
* [[:w:eo:Projekto:WikiTrans|WikiTrans]] (2013)
* [[w:cs:Wikipedie:WikiProjekt Česko-slovenská Wikipedie/Nástroje|Česko-slovenská Wikipedie]] (2013)
* [[Grants:PEG/Kruusamägi/Minority Translate|Minority Translate]] (2014)
* [[Grants:IdeaLab/External Translate|External Translate]] (2014)
* [[w:ca:Usuari:Leptictidium/Sistema de traducció]] (2014?)
* [[mw:Content translation]] (2014; using Apertium as of July 2014)
** [[mailarchive:wikimedia-l/2017-May/087430.html|[Wikimedia-l] machine translation]] (2017), discussion on Yandex MT
In other cases they were used outside, or not programmatically:
* (maybe) [[Tell us about Vietnamese Wikipedia#Bot creation of articles|Vietnamese Wikipedians editors]]
* Some non-Wikimedia wikis translating Wikimedia projects' content


== Resources ==
The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of [[AI]] than has yet been attained.  A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would have a reasonable chance of guessing wrong fairly often.  A shallow approach that involves "ask the user about each ambiguity" would, by Piron's estimate, only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.


=== General ===
===Non-standard speech===
One of the major pitfalls of MT is its inability to translate non-standard language with the same accuracy as standard language. Heuristic or statistics-based MT takes input from various sources in the standard form of a language. Rule-based translation, by nature, does not include common non-standard usages. This causes errors in translation from a vernacular source or into colloquial language. Limitations on translation from casual speech present issues in the use of machine translation in mobile devices.


* [http://translate.google.com Google Translate] - Online gratis (statistical) machine translator.
===Named entities===
* [http://www.microsofttranslator.com/ Bing Translator] - Online gratis (statistical) machine translator.
{{main|Named entity}}
* [http://gramtrans.com/ GramTrans] - Online gratis (rule-based) machine translator, mostly covering Scandinavian languages.
In [[information extraction]], named entities, in a narrow sense, refer to concrete or abstract entities in the real world such as people, organizations, companies, and places that have a proper name: George Washington, Chicago, Microsoft. The term also covers expressions of time, space and quantity such as 1 July 2011 and $500.
* [http://www.vertaalmachine.biz VertaalMachine] - Online translator that covers over 80 languages.
* [http://www.online-translator.com/ Promt] - Online gratis (rule-based) machine translator.
* [http://www.worldlingo.com WorldLingo] - Online translator, gratis for up to 500 words.
* [http://www.apertium.org/ Apertium.org] – Online free & open source (rule-based) machine translator.
* [http://trans121.com/ Translate and Back] - Online gratis Google based translation, which enables checking correctness by back translation.
* [http://www.okchakko.com Okchakko] - Online gratis (rule-based) translator : French/Italian to Corsican


=== Dictionaries ===
In the sentence "Smith is the president of Fabrionix" both ''Smith'' and ''Fabrionix'' are named entities, and can be further qualified via first name or other information; "president" is not, since Smith could have earlier held another position at Fabrionix, e.g. Vice President.
The term [[rigid designator]] is what defines these usages for analysis in statistical machine translation.


* [[w:Wiktionary|Wiktionary]] (and [[OmegaWiki]]?)
Named entities must first be identified in the text; if not, they may be erroneously translated as common nouns, which would most likely not affect the [[Bilingual evaluation understudy|BLEU]] rating of the translation but would change the text's human readability.<ref>{{Cite conference |last1=Babych |first1=Bogdan |last2=Hartley |first2=Anthony |date=2003 |title=Improving Machine Translation Quality with Automatic Named Entity Recognition |url=http://www.cl.cam.ac.uk/~ar283/eacl03/workshops03/W03-w1_eacl03babych.local.pdf |conference=Paper presented at the 7th International EAMT Workshop on MT and Other Language Technology Tools... |archive-url=https://web.archive.org/web/20060514031411/http://www.cl.cam.ac.uk/~ar283/eacl03/workshops03/W03-w1_eacl03babych.local.pdf |archive-date=14 May 2006 |access-date=4 November 2013 |url-status=dead}}</ref> They may be omitted from the output translation, which would also have implications for the text's readability and message.
** See also M Matuschek, CM Meyer, I Gurevych: "Multilingual Knowledge in Aligned Wiktionary and OmegaWiki for Translation Applications" [http://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/publikationen/2013/tc3-article-MiM-ChM-final.pdf], summarised in [[Research:Newsletter/2013/July]].


=== Corpora ===
[[Transliteration]] includes finding the letters in the target language that most closely correspond to the name in the source language. This, however, has been cited as sometimes worsening the quality of translation.<ref>Hermajakob, U., Knight, K., & Hal, D. (2008). [http://www.aclweb.org/old_anthology/P/P08/P08-1.pdf#page=433 Name Translation in Statistical Machine Translation Learning When to Transliterate] {{Webarchive|url=https://web.archive.org/web/20180104073326/http://www.aclweb.org/old_anthology/P/P08/P08-1.pdf#page=433 |date=4 January 2018 }}. Association for Computational Linguistics. 389–397.</ref> For "Southern California" the first word should be translated directly, while the second word should be transliterated. Machines often transliterate both because they treat them as one entity. Words like these are hard for machine translators, even those with a transliteration component, to process.
* [http://www.statmt.org/europarl/ Europarl] - EU12 languages up to 44 million words per language (to be used only with English as source or target language, as many of the non-English sentences are translations of translations).
* [http://langtech.jrc.it/JRC-Acquis.html JRC-Acquis] - EU22 languages.
* [http://www.statmt.org Southeast European Times] - English, Turkish, Bulgarian, Macedonian, Serbo-Croatian, Albanian, Greek, Romanian (approx. 200,000 aligned sentences, 4--5 million words).
* [http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt South African Government Services] - English and Afrikaans (approx. 2,500 aligned sentences, 49,375 words).
* [http://nl.ijs.si/elan/ IJS-ELAN] - English-Slovenian.
* [http://urd.let.rug.nl/tiedeman/OPUS/index.php Open Source multilingual corpora] - Despite the name, some resources might not be eligible for Wikipedia given their license.
* [http://www.open-tran.eu OpenTran] - single point of access to translations of open-source software in many languages (downloadable as SQLite databases).
* [http://tatoeba.org/ Tatoeba Project] - Database of example sentences translated into several languages.
* [http://aclweb.org/aclwiki/index.php?title=List_of_resources_by_language ACL Wiki - List of resources by language] (many corpus links etc. here)
* [http://fraze.it FrazeIt] - A search engine for sentences and phrases. Supports six languages, filtered by form, zone, context, and more.


== Bibliography ==
Use of a "do-not-translate" list, which has the same end goal (transliteration as opposed to translation), still relies on correct identification of named entities.<ref name="singla">{{Citation |last1=Neeraj Agrawal |title=Using Named Entity Recognition to improve Machine Translation |url=http://nlp.stanford.edu/courses/cs224n/2010/reports/singla-nirajuec.pdf |archive-url=https://web.archive.org/web/20130521075940/http://nlp.stanford.edu/courses/cs224n/2010/reports/singla-nirajuec.pdf |access-date=4 November 2013 |archive-date=21 May 2013 |last2=Ankush Singla |mode=cs1 |url-status=live}}</ref>
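One common way to apply a "do-not-translate" list is to mask the listed terms with placeholders before translation and restore them afterwards. The sketch below assumes this masking technique; <code>toy_translate</code> and its tiny English-to-Spanish lexicon are hypothetical stand-ins for a real MT engine, and the term list is assumed to come from a separate named-entity recognition step.

```python
import re


def protect(text, do_not_translate):
    """Replace each listed term with a placeholder; return masked text and the mapping."""
    mapping = {}
    # Longer terms first, so a short term never splits a longer one.
    for index, term in enumerate(sorted(do_not_translate, key=len, reverse=True)):
        placeholder = f"__DNT{index}__"
        pattern = re.compile(rf"\b{re.escape(term)}\b")
        if pattern.search(text):
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = term
    return text, mapping


def restore(text, mapping):
    """Put the original terms back in place of their placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


def toy_translate(text):
    """Hypothetical word-for-word translator that wrongly treats 'Smith' as a common noun."""
    lexicon = {"smith": "herrero", "is": "es", "the": "el",
               "president": "presidente", "of": "de"}
    return " ".join(lexicon.get(word.lower(), word) for word in text.split())


if __name__ == "__main__":
    sentence = "Smith is the president of Fabrionix"
    print(toy_translate(sentence))  # the name is translated as a common noun
    masked, mapping = protect(sentence, {"Smith", "Fabrionix"})
    print(restore(toy_translate(masked), mapping))  # the names survive
```

The result is only as good as the list: a name that was never identified is still passed to the translator unprotected.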


* [[doi:10.1007/978-3-642-35085-6_3|Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach]] (also mentions some of the attempts above)
A third approach is a class-based model. Named entities are replaced with a token to represent their "class"; "Ted" and "Erica" would both be replaced with a "person" class token. Then the statistical distribution and use of person names in general can be analyzed instead of looking at the distributions of "Ted" and "Erica" individually, so that the probability of a given name in a specific language will not affect the assigned probability of a translation. A study by Stanford on improving this area of translation gives the example that different probabilities will be assigned to "David is going for a walk" and "Ankit is going for a walk" for English as a target language, due to the different number of occurrences of each name in the training data. A frustrating outcome of the same study by Stanford (and of other attempts to improve named entity translation) is that including methods for named entity translation often decreases the [[Bilingual evaluation understudy|BLEU]] scores for translation.<ref name="singla" />
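The effect of class tokens can be illustrated with a toy unigram model. In this sketch the name list, the <code>&lt;PERSON&gt;</code> token and the four-sentence corpus are assumptions chosen for illustration; they are not taken from the Stanford study.

```python
from collections import Counter

# Hypothetical list of person names; a real system would use a named-entity recognizer.
PERSON_NAMES = {"David", "Ankit", "Ted", "Erica"}


def to_classes(sentence):
    """Replace every known person name with a shared class token."""
    return " ".join("<PERSON>" if word in PERSON_NAMES else word
                    for word in sentence.split())


def unigram_model(corpus):
    """Return a function giving the relative frequency of a word in the corpus."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    total = sum(counts.values())
    return lambda word: counts[word] / total


CORPUS = [
    "David is going for a walk",
    "David is reading",
    "David is here",
    "Ankit is going for a walk",
]

if __name__ == "__main__":
    raw = unigram_model(CORPUS)
    by_class = unigram_model([to_classes(sentence) for sentence in CORPUS])
    # Without classes the two names get different probabilities.
    print(raw("David"), raw("Ankit"))
    # With classes both sentences map to the same token sequence.
    print(to_classes("David is going for a walk"))
    print(by_class("<PERSON>"))
```

Because both names collapse into one token, the frequency of an individual name in the training data no longer influences the score of the sentence.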


== See also ==
==Applications==
While no system provides the ideal of fully automatic high-quality machine translation of unrestricted text, many fully automated systems produce reasonable output.<ref>{{Cite book |url=http://www.benjamins.com/cgi-bin/t_bookview.cgi?bookid=BTL%2014 |title=Melby, Alan. The Possibility of Language (Amsterdam:Benjamins, 1995, 27–41) |publisher=Benjamins.com |year=1995 |isbn=9789027216144 |access-date=2012-06-12 |archive-url=https://web.archive.org/web/20110525234319/http://www.benjamins.com/cgi-bin/t_bookview.cgi?bookid=BTL%2014 |archive-date=25 May 2011 |url-status=live}}</ref><ref>{{Cite web |last=Wooten |first=Adam |date=14 February 2006 |title=A Simple Model Outlining Translation Technology |url=http://tandibusiness.blogspot.com/2006/02/simple-model-outlining-translation.html |archive-url=https://archive.today/20120716095630/http://tandibusiness.blogspot.de/2006/02/simple-model-outlining-translation.html |archive-date=16 July 2012 |access-date=2012-06-12 |website=T&I Business}}</ref><ref>{{cite web |url=http://www.mt-archive.info/Bar-Hillel-1960-App3.pdf |title=Appendix III of 'The present status of automatic translation of languages', Advances in Computers, vol.1 (1960), p.158-163. Reprinted in Y.Bar-Hillel: Language and information (Reading, Mass.: Addison-Wesley, 1964), p.174-179. 
|access-date=2012-06-12 |archive-date=28 September 2018 |archive-url=https://web.archive.org/web/20180928203641/http://www.mt-archive.info/Bar-Hillel-1960-App3.pdf |url-status=dead }}</ref> The quality of machine translation is substantially improved if the domain is restricted and controlled.<ref>{{cite web |url=http://tauyou.com/blog/?p=47 |title=Human quality machine translation solution by Ta with you |language=es |publisher=Tauyou.com |date=15 April 2009 |access-date=2012-06-12 |archive-date=22 September 2009 |archive-url=https://web.archive.org/web/20090922094140/http://tauyou.com/blog/?p=47 |url-status=live }}</ref> This enables using machine translation as a tool to speed up and simplify translations, as well as producing flawed but useful low-cost or ad-hoc translations.


* [[Machine translation for small Wikipedias]] (old page)
===Travel===
*Translation memories
Machine translation applications have also been released for most mobile devices, including mobile telephones, pocket PCs and PDAs. Due to their portability, such instruments have come to be designated as [[mobile translation]] tools. They enable mobile business networking between partners speaking different languages, and facilitate both foreign language learning and unaccompanied travel to foreign countries without a human translator as intermediary.
** [[strategy:Proposal:Free Translation Memory]] ([[strategy:List of things that need to be free]]) (2010)
** [foundation-l] [http://thread.gmane.org/gmane.org.wikimedia.foundation/47253 Free translation memory], [http://thread.gmane.org/gmane.org.wikimedia.foundation/47138 Push translations], [http://thread.gmane.org/gmane.org.wikimedia.foundation/47154 Is Google translation good for Wikipedias?] (July 2010)
* [[Meta:Translate extension|Translate extension]] [//translatewiki.net/w/api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=january&format=jsonfm]
** [[mw:Help:Extension:Translate/Translation memories#TTMServer_API]]
** [http://laxstrom.name/blag/2012/09/07/translation-memory-all-wikimedia-wikis/ Efficient translation: Translation memory enabled on all Wikimedia wikis]
** [[translatewiki:Translate Roll]]
* [http://thread.gmane.org/gmane.org.wikimedia.foundation/65605 The case for supporting open source machine translation] (April 2013){{Dead link}}
* [[wm2013:Submissions/Supporting translation of Wikipedia content]] (WM 2013)
* [[Collaborative Machine Translation for Wikipedia]] (July 2013)


=== Generic English Wikipedia articles ===
For example, the Google Translate app allows foreigners to quickly translate text in their surroundings via [[augmented reality]], using the smartphone camera to overlay the translated text onto the original text.<ref>{{cite news |title=Google Translate Adds 20 Languages To Augmented Reality App |url=https://www.popsci.com/google-translate-adds-augmented-reality-translation-app/ |access-date=9 January 2023 |work=Popular Science |date=30 July 2015}}</ref> It can also [[speech recognition|recognize speech]] and then translate it.<ref>{{cite news |last1=Whitney |first1=Lance |title=Google Translate app update said to make speech-to-text even easier |url=https://www.cnet.com/tech/services-and-software/google-translation-app-may-better-recognize-certain-languages/ |access-date=9 January 2023 |work=CNET |language=en}}</ref>


* [[w:Machine Translation|Machine Translation]]
===Public administration===
** [[w:Comparison of machine translation applications|Comparison of machine translation applications]]
Despite their inherent limitations, MT programs are used around the world. Probably the largest institutional user is the [[European Commission]]. In 2012, aiming to replace its rule-based MT with the newer, statistics-based MT@EC, the European Commission contributed 3.072 million euros (via its ISA programme).<ref>{{cite web|url=http://ec.europa.eu/isa/actions/02-interoperability-architecture/2-8action_en.htm|title=Machine Translation Service|date=5 August 2011|access-date=13 September 2013|archive-date=8 September 2013|archive-url=https://web.archive.org/web/20130908232212/http://ec.europa.eu/isa/actions/02-interoperability-architecture/2-8action_en.htm|url-status=live}}</ref>
** [[w:Rule-based machine translation|Rule-based machine translation]]
** [[w:Transfer-based machine translation|Transfer-based machine translation]]
** [[w:Interlingual machine translation|Interlingual machine translation]]
** [[w:Dictionary-based machine translation|Dictionary-based machine translation]]
** [[w:Example-based machine translation|Example-based machine translation]]
** [[w:Statistical machine translation|Statistical machine translation]] and [[w:Moses (machine translation)|Moses]]


[[Category:Proposed projects - language]]
===Wikipedia===
[[Category:Machine translation]]
Machine translation has also been used for translating [[Wikipedia]] articles and could play a larger role in creating, updating, expanding, and generally improving articles in the future, especially as the MT capabilities may improve. There is a "content translation tool" which allows editors to more easily translate articles across several select languages.<ref>{{cite news |last1=Wilson |first1=Kyle |title=Wikipedia has a Google Translate problem |url=https://www.theverge.com/2019/5/8/18526739/wikipedia-translation-tool-machine-learning-ai-english |access-date=9 January 2023 |work=The Verge |date=8 May 2019}}</ref><ref>{{cite news |title=Wikipedia taps Google to help editors translate articles |url=https://venturebeat.com/ai/wikipedia-taps-google-to-help-editors-translate-articles/ |access-date=9 January 2023 |work=VentureBeat |date=9 January 2019}}</ref><ref>{{cite web |title=Content translation tool helps create over half a million Wikipedia articles |url=https://wikimediafoundation.org/news/2019/09/23/content-translation-tool-helps-create-over-half-a-million-wikipedia-articles/ |website=Wikimedia Foundation |access-date=10 January 2023 |date=23 September 2019}}</ref> English-language articles are thought to usually be more comprehensive and less biased than their non-translated equivalents in other languages.<ref>{{cite web |last1=Magazine |first1=Undark |title=Wikipedia Has a Language Problem. Here's How To Fix It. 
|url=https://undark.org/2021/08/12/wikipedia-has-a-language-problem-heres-how-to-fix-it/ |website=Undark Magazine |access-date=9 January 2023 |date=12 August 2021}}</ref> As of 2022, [[English Wikipedia]] has over 6.5 million articles while, for example, the [[German Wikipedia|German]] and [[Swedish Wikipedia]]s each only have over 2.5 million articles,<ref>{{cite web |title=List of Wikipedias - Meta |url=https://meta.wikimedia.org/wiki/List_of_Wikipedias |website=meta.wikimedia.org |access-date=9 January 2023 |language=en}}</ref> each often far less comprehensive.
 
===Surveillance and military===
Following terrorist attacks in Western countries, including [[9-11]], the U.S. and its allies have been most interested in developing [[Arabic machine translation]] programs, but also in translating [[Pashto language|Pashto]] and [[Dari (Eastern Persian)|Dari]] languages.{{Citation needed|date=February 2007}} Within these languages, the focus is on key phrases and quick communication between military members and civilians through the use of mobile phone apps.<ref>{{cite journal |last=Gallafent |first=Alex |title=Machine Translation for the Military |journal=PRI's the World |date=26 Apr 2011 |access-date=17 Sep 2013 |url=http://www.theworld.org/2011/04/machine-translation-military/ |archive-date=9 May 2013 |archive-url=https://web.archive.org/web/20130509171415/http://www.theworld.org/2011/04/machine-translation-military/ |url-status=live }}</ref> The Information Processing Technology Office in [[DARPA]] hosted programs like [[DARPA TIDES program|TIDES]] and [[Babylon translator]]. The US Air Force has awarded a $1 million contract to develop a language translation technology.<ref>{{cite web |last=Jackson |first=William |url=http://gcn.com/articles/2003/09/09/air-force-wants-to-build-a-universal-translator.aspx |title=GCN – Air force wants to build a universal translator |publisher=Gcn.com |date=9 September 2003 |access-date=2012-06-12 |archive-date=16 June 2011 |archive-url=https://web.archive.org/web/20110616052943/http://gcn.com/articles/2003/09/09/air-force-wants-to-build-a-universal-translator.aspx |url-status=live }}</ref>
 
===Social media===
The notable rise of [[social networking]] on the web in recent years has created yet another niche for the application of machine translation software – in utilities such as [[Facebook]], or [[instant messaging]] clients such as [[Skype]], [[Google Talk]], [[MSN Messenger]], etc. – allowing users speaking different languages to communicate with each other.
 
==== Online games ====
[[Lineage W]] gained popularity in Japan because of its machine translation features allowing players from different countries to communicate.<ref>{{Cite web |last=Young-sil |first=Yoon |date=2023-06-26 |title=Korean Games Growing in Popularity in Tough Japanese Game Market |url=http://www.businesskorea.co.kr/news/articleView.html?idxno=117105 |access-date=2023-08-08 |website=BusinessKorea |language=}}</ref>
 
===Medicine===
Despite being labelled as an unworthy competitor to human translation in 1966 by the Automatic Language Processing Advisory Committee put together by the United States government,<ref>{{Cite report |url=http://www.nap.edu/html/alpac_lm/ARC000005.pdf |title=Language and Machines: Computers in Translation and Linguistics |last=Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Academy of Sciences, National Research Council |date=1966 |publisher=National Research Council, National Academy of Sciences |location=Washington, D. C. |access-date=21 October 2013 |archive-url=https://web.archive.org/web/20131021044934/http://www.nap.edu/html/alpac_lm/ARC000005.pdf |archive-date=21 October 2013 |url-status=live}}</ref> the quality of machine translation has now improved to such a level that its application in online collaboration and in the medical field is being investigated. The application of this technology in medical settings where human translators are absent is another topic of research, but difficulties arise due to the importance of accurate translations in medical diagnoses.<ref>{{cite journal|url=http://www.cfp.ca/content/59/4/382.full|title=Using machine translation in clinical practice|journal=Canadian Family Physician|date=April 2013|volume=59|issue=4|pages=382–383|access-date=21 October 2013|archive-date=4 May 2013|archive-url=https://web.archive.org/web/20130504040732/http://www.cfp.ca/content/59/4/382.full|url-status=live|last1=Randhawa|first1=Gurdeeshpal|last2=Ferreyra|first2=Mariella|last3=Ahmed|first3=Rukhsana|last4=Ezzat|first4=Omar|last5=Pottie|first5=Kevin|pmid=23585608|pmc=3625087}}</ref>
 
Researchers caution that the use of machine translation in medicine could risk mistranslations that can be dangerous in critical situations.<ref name=":02">{{Cite journal |last1=Vieira |first1=Lucas Nunes |last2=O’Hagan |first2=Minako |last3=O’Sullivan |first3=Carol |date=2021-08-18 |title=Understanding the societal impacts of machine translation: a critical review of the literature on medical and legal use cases |journal=Information, Communication & Society |language=en |volume=24 |issue=11 |pages=1515–1532 |doi=10.1080/1369118X.2020.1776370 |s2cid=225694304 |issn=1369-118X|doi-access=free |hdl=1983/29727bd1-a1ae-4600-9e8e-018f11ec75fb |hdl-access=free }}</ref><ref>{{Cite journal |last1=Khoong |first1=Elaine C. |last2=Steinbrook |first2=Eric |last3=Brown |first3=Cortlyn |last4=Fernandez |first4=Alicia |date=2019-04-01 |title=Assessing the Use of Google Translate for Spanish and Chinese Translations of Emergency Department Discharge Instructions |journal=JAMA Internal Medicine |language=en |volume=179 |issue=4 |pages=580–582 |doi=10.1001/jamainternmed.2018.7653 |issn=2168-6106 |pmc=6450297 |pmid=30801626}}</ref> Machine translation can make it easier for doctors to communicate with their patients in day-to-day activities, but researchers recommend using it only when there is no alternative, and having translated medical texts reviewed by human translators for accuracy.<ref>{{Cite journal |last=Piccoli |first=Vanessa |date=2022-07-05 |title=Plurilingualism, multimodality and machine translation in medical consultations: A case study |url=http://www.jbe-platform.com/content/journals/10.1075/tis.21012.pic |journal=Translation and Interpreting Studies |language=en |volume=17 |issue=1 |pages=42–65 |doi=10.1075/tis.21012.pic |s2cid=246780731 |issn=1932-2798}}</ref><ref>{{Cite journal |last1=Herrera-Espejel |first1=Paula Sofia |last2=Rach |first2=Stefan |date=2023-11-20 |title=The Use of Machine Translation for Outreach and Health Communication in Epidemiology and Public Health: Scoping Review |journal=JMIR Public Health and Surveillance |language=en |volume=9 |pages=e50814 |doi=10.2196/50814 |pmid=37983078 |issn=2369-2960 |doi-access=free |pmc=10696499 }}</ref>
 
=== Law ===
[[Legal English|Legal language]] poses a significant challenge to machine translation tools due to its precise nature and atypical use of normal words. For this reason, specialized algorithms have been developed for use in legal contexts.<ref name=":1">{{Cite web |last=legalj |date=2023-01-02 |title=Man v. Machine: Social and Legal Implications of Machine Translation |url=https://legaljournal.princeton.edu/man-v-machine-social-and-legal-implications-of-machine-translation/ |access-date=2023-12-04 |website=Princeton Legal Journal |language=en-US}}</ref> Due to the risk of mistranslations arising from machine translators, researchers recommend that machine translations should be reviewed by human translators for accuracy, and some courts prohibit its use in [[Legal proceeding|formal proceedings]].<ref>{{Cite journal |last=Chavez |first=Edward L. |date=2008 |title=New Mexico's Success with Non-English Speaking Jurors |url=https://heinonline.org/HOL/Page?handle=hein.journals/jrlci1&id=307&div=&collection= |journal=Journal of Court Innovation |volume=1 |pages=303}}</ref>
 
The use of machine translation in law has raised concerns about translation errors and [[client confidentiality]]. Lawyers who use free translation tools such as Google Translate may accidentally violate client confidentiality by exposing private information to the providers of the translation tools.<ref name=":1" /> In addition, there have been arguments that consent for a police search that is obtained with machine translation is invalid, with different courts issuing different verdicts over whether or not these arguments are valid.<ref name=":02"/>
 
=== Ancient languages ===
The advancements in [[convolutional neural network]]s in recent years and in low resource machine translation (when only a very limited amount of data and examples are available for training) enabled machine translation for ancient languages, such as [[Akkadian language|Akkadian]] and its dialects Babylonian and Assyrian.<ref>{{Cite journal |last1=Gutherz |first1=Gai |last2=Gordin |first2=Shai |last3=Sáenz |first3=Luis |last4=Levy |first4=Omer |last5=Berant |first5=Jonathan |date=2023-05-02 |editor-last=Kearns |editor-first=Michael |title=Translating Akkadian to English with neural machine translation |url=https://academic.oup.com/pnasnexus/article/doi/10.1093/pnasnexus/pgad096/7147349 |journal=PNAS Nexus |language=en |volume=2 |issue=5 |pages=pgad096 |doi=10.1093/pnasnexus/pgad096 |issn=2752-6542 |pmc=10153418 |pmid=37143863}}</ref>
 
==Evaluation==
{{Main|Evaluation of machine translation}}
There are many factors that affect how machine translation systems are evaluated. These factors include the intended use of the translation, the nature of the machine translation software, and the nature of the translation process.
 
Different programs may work well for different purposes. For example, [[statistical machine translation]] (SMT) typically outperforms [[example-based machine translation]] (EBMT), but researchers found that when evaluating English to French translation, EBMT performs better.<ref name="Way 295–309">{{cite journal|last=Way|first=Andy|author2=Nano Gough|title=Comparing Example-Based and Statistical Machine Translation|journal=Natural Language Engineering|date=20 September 2005|volume=11|issue=3|pages=295–309|doi=10.1017/S1351324905003888|doi-broken-date=1 November 2024 |s2cid=3242163}}</ref> The same concept applies for technical documents, which can be more easily translated by SMT because of their formal language.
 
In certain applications, however, e.g., product descriptions written in a [[controlled language]], a [[dictionary-based machine translation|dictionary-based machine-translation]] system has produced satisfactory translations that require no human intervention save for quality inspection.<ref>Muegge (2006), "[http://www.mt-archive.info/Aslib-2006-Muegge.pdf Fully Automatic High Quality Machine Translation of Restricted Text: A Case Study] {{Webarchive|url=https://web.archive.org/web/20111017043848/http://www.mt-archive.info/Aslib-2006-Muegge.pdf |date=17 October 2011 }}," in ''Translating and the computer 28. Proceedings of the twenty-eighth international conference on translating and the computer, 16–17 November 2006, London'', London: Aslib. {{ISBN|978-0-85142-483-5}}.</ref>
 
There are various means for evaluating the output quality of machine translation systems. The oldest is the use of human judges<ref>{{cite web |url=http://www.morphologic.hu/public/mt/2008/compare12.htm |title=Comparison of MT systems by human evaluation, May 2008 |publisher=Morphologic.hu |access-date=2012-06-12 |archive-url=https://web.archive.org/web/20120419072313/http://www.morphologic.hu/public/mt/2008/compare12.htm |archive-date=19 April 2012 |url-status=dead |df=dmy-all }}</ref> to assess a translation's quality. Even though human evaluation is time-consuming, it is still the most reliable method to compare different systems such as rule-based and statistical systems.<ref>Anderson, D.D. (1995). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.961.5377&rep=rep1&type=pdf Machine translation as a tool in second language learning] {{Webarchive|url=https://web.archive.org/web/20180104073518/http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.961.5377&rep=rep1&type=pdf |date=4 January 2018 }}. CALICO Journal. 13(1). 68–96.</ref> [[Automate]]d means of evaluation include [[Bilingual evaluation understudy|BLEU]], [[NIST (metric)|NIST]], [[METEOR]], and [[LEPOR]].<ref>Han et al. (2012), "[http://repository.umac.mo/jspui/bitstream/10692/1747/1/10205_0_%5B2012-12-08~15%5D%20C.%20%28COLING2012%29%20LEPOR.pdf LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors] {{Webarchive|url=https://web.archive.org/web/20180104073506/http://repository.umac.mo/jspui/bitstream/10692/1747/1/10205_0_%5B2012-12-08~15%5D%20C.%20%28COLING2012%29%20LEPOR.pdf |date=4 January 2018 }}," in ''Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450'', Mumbai, India.</ref>
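
As a rough illustration of how an automated metric such as BLEU compares system output against a reference translation, the following Python sketch computes a simplified, single-reference, sentence-level score from clipped n-gram precision and a brevity penalty. The function names, the whitespace tokenization, the bigram limit and the absence of smoothing are all assumptions made for this example; it is not the reference BLEU implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference, sentence-level BLEU (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # this sketch applies no smoothing
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to its reference scores 1.0, while a candidate sharing no words with the reference scores 0.0. Production evaluations typically use several references, longer n-grams and corpus-level statistics.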
 
Relying exclusively on unedited machine translation ignores the fact that communication in [[natural language|human language]] is context-embedded and that it takes a person to comprehend the [[context (language use)|context]] of the original text with a reasonable degree of probability. It is certainly true that even purely human-generated translations are prone to error. Therefore, to ensure that a machine-generated translation will be useful to a human being and that publishable-quality translation is achieved, such translations must be reviewed and edited by a human.<ref>J.M. Cohen observes (p.14): "Scientific translation is the aim of an age that would reduce all activities to [[Technology|techniques]]. It is impossible however to imagine a literary-translation machine less complex than the human brain itself, with all its knowledge, reading, and discrimination."</ref> The late [[Claude Piron]] wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve [[ambiguity|ambiguities]] in the [[source text]], which the [[grammatical]] and [[Lexical (semiotics)|lexical]] exigencies of the target language require to be resolved. Such research is a necessary prelude to the pre-editing necessary in order to provide input for machine-translation software such that the output will not be [[garbage in garbage out|meaningless]].<ref name="NIST">See the [https://www.nist.gov/speech/tests/mt/ annually performed NIST tests since 2001] {{Webarchive|url=https://web.archive.org/web/20090322202656/http://nist.gov/speech/tests/mt/ |date=22 March 2009 }} and [[Bilingual Evaluation Understudy]]</ref>
 
In addition to disambiguation problems, decreased accuracy can occur due to varying levels of training data for machine translation programs. Both example-based and statistical machine translation rely on a vast array of real example sentences as a base for translation, and when too many or too few sentences are analyzed, accuracy is jeopardized. Researchers found that when a program is trained on 203,529 sentence pairings, accuracy actually decreases.<ref name="Way 295–309"/> The optimal level of training data seems to be just over 100,000 sentences, possibly because as training data increases, the number of possible sentences increases, making it harder to find an exact translation match.
 
Flaws in machine translation have been noted for [[Humour in translation|their entertainment value]]. Two videos uploaded to [[YouTube]] in April 2017 involve two Japanese [[hiragana]] characters えぐ (''[[E (kana)|e]]'' and ''[[Ku (kana)|gu]]'') being repeatedly pasted into Google Translate, with the resulting translations quickly degrading into nonsensical phrases such as "DECEARING EGG" and "Deep-sea squeeze trees", which are then read in increasingly absurd voices;<ref>{{Cite web|url=https://www.businessinsider.com/google-translate-fails-2017-11|title=4 times Google Translate totally dropped the ball|first=Mark|last=Abadi|website=Business Insider}}</ref><ref>{{Cite web|url=https://nlab.itmedia.co.jp/nl/articles/1704/16/news013.html|title=回数を重ねるほど狂っていく Google翻訳で「えぐ」を英訳すると奇妙な世界に迷い込むと話題に|website=ねとらぼ}}</ref> the full-length version of the video currently has 6.9 million views {{as of|lc=y|March 2022|post=.}}<ref>{{Cite web|url=https://www.youtube.com/watch?v=3-rfBsWmo0M|title=えぐ|date=12 April 2017 |via=www.youtube.com}}</ref>
 
==Machine translation and signed languages==
{{main|Machine translation of sign languages}}
In the early 2000s, options for machine translation between spoken and signed languages were severely limited. It was a common belief that deaf individuals could use traditional translators. However, stress, intonation, pitch, and timing are conveyed very differently in spoken languages than in signed languages. Therefore, a deaf individual may misinterpret or become confused about the meaning of written text that is based on a spoken language.<ref name="Zhao, L. 2000">Zhao, L., Kipper, K., Schuler, W., Vogler, C., & Palmer, M. (2000). [http://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=hms A Machine Translation System from English to American Sign Language] {{Webarchive|url=https://web.archive.org/web/20180720012839/https://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=hms |date=20 July 2018 }}. Lecture Notes in Computer Science, 1934: 54–67.</ref>
 
Researchers Zhao et al. (2000) developed a prototype called TEAM (translation from English to ASL by machine) that completed English to [[American Sign Language]] (ASL) translations. The program first analyzed the syntactic, grammatical, and morphological aspects of the English text. Following this step, the program accessed a sign synthesizer, which acted as a dictionary for ASL. This synthesizer housed the process one must follow to complete ASL signs, as well as the meanings of these signs. Once the entire text had been analyzed and the signs necessary to complete the translation had been located in the synthesizer, a computer-generated human appeared and used ASL to sign the English text to the user.<ref name="Zhao, L. 2000"/>
 
==Copyright==
Only [[creative work|work]]s that are [[originality|original]] are subject to [[copyright]] protection, so some scholars claim that machine translation results are not entitled to copyright protection because MT does not involve [[creativity]].<ref>{{cite web|url=http://www.seo-translator.com/machine-translation-no-copyright-on-the-result/|title=Machine Translation: No Copyright On The Result?|access-date=24 November 2012|publisher=SEO Translator, citing [[Zimbabwe Independent]]|archive-date=29 November 2012|archive-url=https://web.archive.org/web/20121129042959/http://www.seo-translator.com/machine-translation-no-copyright-on-the-result/|url-status=live}}</ref> The copyright at issue is for a [[derivative work]]; the author of the [[originality|original work]] in the original language does not lose his [[rights]] when a work is translated: a translator must have permission to [[publishing|publish]] a translation.{{Citation needed|date=July 2024}}
 
==See also==
{{div col|colwidth=18em}}
*[[AI-complete]]
*[[Cache language model]]
*[[Comparison of machine translation applications]]
*[[Comparison of different machine translation approaches]]
*[[Computational linguistics]]
*[[Computer-assisted translation]] and [[Translation memory]]
*[[Controlled language in machine translation]]
*[[Controlled natural language]]
*[[Foreign language writing aid]]
*[[Fuzzy matching (computer-assisted translation)]]
*[[History of machine translation]]
*[[Human language technology]]
*[[Humour in translation]] ("howlers")
*[[Language and Communication Technologies]]
*[[Language barrier]]
*[[List of emerging technologies]]
*[[List of research laboratories for machine translation]]
*[[Mobile translation]]
*[[Neural machine translation]]
*[[OpenLogos]]
*[[Phraselator]]
*[[Postediting]]
*[[Pseudo-translation]]
*[[Round-trip translation]]
*[[Statistical machine translation]]
*{{section link|Translation#Machine translation}}
*[[ULTRA (machine translation system)]]
*[[Universal Networking Language]]
*[[Universal translator]]
{{div col end}}
 
==Notes==
{{reflist}}
 
==Further reading==
* {{Citation |last=Cohen |first=J. M. |contribution=Translation |title=Encyclopedia Americana |year=1986 |volume=27 |pages=12–15 |ref=none|title-link=Encyclopedia Americana }}
* {{Cite book |last1=Hutchins |first1=W. John |url=https://archive.org/details/introductiontoma0000hutc |title=An Introduction to Machine Translation |last2=Somers |first2=Harold L. |publisher=Academic Press |year=1992 |isbn=0-12-362830-X |location=London |author-link=W. John Hutchins |url-access=registration}}
* {{Cite magazine |last=Lewis-Kraus |first=Gideon |date=7 June 2015 |title=Tower of Babble |magazine=New York Times Magazine |pages=48–52}}
* {{Cite journal |last1=Weber |first1=Steven |last2=Mehandru |first2=Nikita |date=2022 |title=The 2020s Political Economy of Machine Translation |journal=Business and Politics |language=en |volume=24 |issue=1 |pages=96–112 |doi=10.1017/bap.2021.17|arxiv=2011.01007 |s2cid=226236853 }}
 
==External links==
{{Wikiversity|Topic:Computational linguistics}}
* [http://www.omniglot.com/language/articles/machinetranslation.htm The Advantages and Disadvantages of Machine Translation]
* [http://www.eamt.org/iamt.php International Association for Machine Translation (IAMT)] {{Webarchive|url=https://web.archive.org/web/20100624162302/http://www.eamt.org/iamt.php |date=24 June 2010 }}
*[http://www.mt-archive.info Machine Translation Archive] {{Webarchive|url=https://web.archive.org/web/20190401232615/http://www.mt-archive.info/ |date=1 April 2019 }} by [[W. John Hutchins|John Hutchins]]. An electronic repository (and bibliography) of articles, books and papers in the field of machine translation and computer-based translation technology
*[http://www.hutchinsweb.me.uk/ Machine translation (computer-based translation)] – Publications by John Hutchins (includes [[PDF format|PDF]]s of several books on machine translation)
*[https://web.archive.org/web/20080120192259/http://bowland-files.lancs.ac.uk/monkey/ihe/mille/paper2.htm Machine Translation and Minority Languages]
*[http://www.foreignword.com/Technology/art/Hutchins/hutchins99.htm John Hutchins 1999] {{Webarchive|url=https://web.archive.org/web/20070907182416/http://www.foreignword.com/Technology/art/Hutchins/hutchins99.htm |date=7 September 2007 }}
*[https://slator.com/machine-translation/ Slator News & analysis of the latest developments in machine translation]
*[https://www.machinetranslation.com/blog/how-machine-translation-is-changing-the-landscape-of-foreign-language-learning From Classroom to Real World: How Machine Translation is Changing the Landscape of Foreign Language Learning]
{{Natural Language Processing}}
{{Approaches to machine translation}}
{{emerging technologies|topics=yes|infocom=yes}}
 
{{Authority control}}
 
[[Category:Machine translation| ]]
[[Category:Applications of artificial intelligence]]
[[Category:Computational linguistics]]
[[Category:Computer-assisted translation]]
[[Category:Tasks of natural language processing]]
[[Category:Automation software]]

Latest revision as of 16:54, 10 May 2025

File:WordLensDemo5Feb2012.jpg
A mobile phone app translating Spanish text into English

Machine translation is the use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages.

Early approaches were mostly rule-based or statistical. These methods have since been superseded by neural machine translation<ref>Template:Cite web</ref> and large language models.<ref>Template:Cite web</ref>

History

Origins

The origins of machine translation can be traced back to the work of Al-Kindi, a ninth-century Arabic cryptographer who developed techniques for systematic language translation, including cryptanalysis, frequency analysis, and probability and statistics, which are used in modern machine translation.<ref>Template:Cite web</ref> The idea of machine translation later appeared in the 17th century. In 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol.<ref>Template:Cite book</ref>

The idea of using digital computers for translation of natural languages was proposed as early as 1947 by England's A. D. Booth<ref>Template:Cite book</ref> and, in the same year, by Warren Weaver at the Rockefeller Foundation. "The memorandum written by Warren Weaver in 1949 is perhaps the single most influential publication in the earliest days of machine translation."<ref>Template:Cite book</ref><ref>Template:Cite web</ref> Others followed. A rudimentary translation of English into French was demonstrated in 1954 on the APEXC machine at Birkbeck College (University of London). Several papers on the topic were published at the time, along with articles in popular journals (for example, an article by Cleave and Zacharov in the September 1955 issue of Wireless World). A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.

1950s

The first researcher in the field, Yehoshua Bar-Hillel, began his research at MIT (1951). A Georgetown University MT research team, led by Professor Michael Zarechnak, followed (1951) with a public demonstration of its Georgetown-IBM experiment system in 1954. MT research programs were started in Japan<ref>Template:Cite book</ref><ref>Template:Cite web</ref> and Russia (1955), and the first MT conference was held in London (1956).<ref name="Nye">Template:Cite journal</ref><ref name="Babel">Template:Cite book</ref>

David G. Hays "wrote about computer-assisted language processing as early as 1957" and "was project leader on computational linguistics at Rand from 1955 to 1968."<ref>Template:Cite news</ref>

1960–1975

Researchers continued to join the field as the Association for Machine Translation and Computational Linguistics was formed in the U.S. (1962) and the National Academy of Sciences formed the Automatic Language Processing Advisory Committee (ALPAC) to study MT (1964). Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced.<ref name="ueno">Template:Cite book</ref> According to a 1972 report by the Director of Defense Research and Engineering (DDR&E), the feasibility of large-scale MT was reestablished by the success of the Logos MT system in translating military manuals into Vietnamese during the Vietnam War.

The French Textile Institute also used MT to translate abstracts from and into French, English, German and Spanish (1970); Brigham Young University started a project to translate Mormon texts by automated translation (1971).

1975 and beyond

SYSTRAN, which "pioneered the field under contracts from the U.S. government"<ref name="MT1998.EmptyAtlantic">Template:Cite magazine</ref> in the 1960s, was used by Xerox to translate technical manuals (1978). Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation. MT became more popular after the advent of computers.<ref>Template:Cite book</ref> SYSTRAN's first implementation was deployed in 1988 by Minitel, the online service of the French Postal Service.<ref>Template:Cite book</ref> Various computer-based translation companies were also launched, including Trados (1984), which was the first to develop and market Translation Memory technology (1989), though this is not the same as MT. The first commercial MT system for Russian / English / German-Ukrainian was developed at Kharkov State University (1991).

By 1998, "for as little as $29.95" one could "buy a program for translating in one direction between English and a major European language of your choice" to run on a PC.<ref name=MT1998.EmptyAtlantic/>

MT on the web started with SYSTRAN offering free translation of small texts (1996) and then providing this via AltaVista Babelfish,<ref name=MT1998.EmptyAtlantic/> which racked up 500,000 requests a day (1997).<ref>Template:Cite web</ref> The second free translation service on the web was Lernout & Hauspie's GlobaLink.<ref name=MT1998.EmptyAtlantic/> Atlantic Magazine wrote in 1998 that "Systran's Babelfish and GlobaLink's Comprende" handled "Don't bank on it" with a "competent performance."<ref>and gave other examples too</ref>

Franz Josef Och (the future head of Translation Development at Google) won DARPA's speed MT competition (2003).<ref>Template:Cite book</ref> More innovations during this time included MOSES, the open-source statistical MT engine (2007), a text/SMS translation service for mobiles in Japan (2008), and a mobile phone with built-in speech-to-speech translation functionality for English, Japanese and Chinese (2009). In 2012, Google announced that Google Translate translates roughly enough text to fill 1 million books in one day.

Approaches

Before the advent of deep learning methods, statistical methods required a lot of rules accompanied by morphological, syntactic, and semantic annotations.

Rule-based

The rule-based machine translation approach was used mostly in the creation of dictionaries and grammar programs. Its biggest drawback was that everything had to be made explicit: orthographical variation and erroneous input had to be handled by the source language analyser, and lexical selection rules had to be written for all instances of ambiguity.

Transfer-based machine translation

Transfer-based machine translation was similar to interlingual machine translation in that it created a translation from an intermediate representation that simulated the meaning of the original sentence. Unlike interlingual MT, it depended partially on the language pair involved in the translation.

Interlingual

Interlingual machine translation was one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, was transformed into an interlingual language, i.e. a "language neutral" representation that is independent of any language. The target language was then generated out of the interlingua. The only interlingual machine translation system that was made operational at the commercial level was the KANT system (Nyberg and Mitamura, 1992), which was designed to translate Caterpillar Technical English (CTE) into other languages.

Dictionary-based

Dictionary-based machine translation relied on dictionary entries, which means that words were translated individually, as a dictionary would translate them.
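
A minimal Python sketch of the dictionary lookup idea is given below. The toy English-to-French lexicon and the function name are hypothetical and chosen only for illustration; a real system would use a full bilingual dictionary and would usually add morphological processing.

```python
# Hypothetical toy lexicon, assumed only for this example.
LEXICON = {"the": "le", "cat": "chat", "eats": "mange", "fish": "poisson"}


def translate_word_by_word(sentence, lexicon=LEXICON):
    """Translate each word independently; unknown words pass through unchanged."""
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())


print(translate_word_by_word("The cat eats fish"))  # prints "le chat mange poisson"
```

The output shows the limitation of the approach: each word is replaced in isolation, so the article that French grammar would require before "poisson" is missing and no agreement or reordering takes place.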

Statistical

Statistical machine translation tried to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus (the English-French record of the Canadian parliament) and EUROPARL (the record of the European Parliament). Where such corpora were available, good results were achieved translating similar texts, but such corpora were rare for many language pairs. The first statistical machine translation software was CANDIDE from IBM. In 2005, Google improved its internal translation capabilities by using approximately 200 billion words from United Nations materials to train its system; translation accuracy improved.<ref>Template:Cite web</ref>

SMT's biggest drawbacks included its dependence on huge amounts of parallel text, its problems with morphology-rich languages (especially when translating into such languages), and its inability to correct singleton errors.

Some work has been done on the use of multiparallel corpora, that is, bodies of text that have been translated into three or more languages. Using these methods, versions of a text in two or more languages may be used in combination to provide a more accurate translation into a third language than if just one of those source languages were used alone.<ref>Template:Cite conference</ref><ref>Template:Cite conference</ref><ref>Template:Cite journal</ref>

Neural MT

A deep learning-based approach to MT, neural machine translation has made rapid progress in recent years. However, the current consensus is that the so-called human parity achieved is not real, being based wholly on limited domains, language pairs, and certain test benchmarks;<ref>Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? reassessing claims of human parity in neural machine translation. CoRR, abs/1808.10432.</ref> that is, the claims lack statistical power.<ref>Template:Cite arXiv</ref>

Translations by neural MT tools like DeepL Translator, which is thought to usually deliver the best machine translation results as of 2022, typically still need post-editing by a human.<ref>Template:Cite journal</ref><ref>Template:Cite web</ref><ref>Template:Cite news</ref>

Instead of training specialized translation models on parallel datasets, one can also directly prompt generative large language models like GPT to translate a text.<ref name="Hendy2023">Template:Cite arXiv</ref><ref>Template:Cite news</ref><ref name="arxiv221014250">Template:Cite arXiv</ref> This approach is considered promising,<ref name="WMT2023">Template:Cite conference</ref> but is still more resource-intensive than specialized translation models.

Issues

File:Stir Fried Wikipedia.jpg
Machine translation can produce unintelligible phrases, such as the Chinese name of the mushroom Macrolepiota albuminosa being rendered as "wikipedia".
File:Machine translation in Bali.jpg
Broken Chinese "Template:Lang" produced by machine translation in Bali, Indonesia. The broken Chinese sentence reads roughly as "there does not exist an entry" or "have not entered yet".

Studies using human evaluation (e.g. by professional literary translators or human readers) have systematically identified various issues with the latest advanced MT outputs.<ref name="arxiv221014250"/> Common issues include the translation of ambiguous parts whose correct translation requires common sense-like semantic language processing or context.<ref name="arxiv221014250"/> There can also be errors in the source texts or a lack of high-quality training data, and the severity or frequency of several types of problems may not be reduced by the techniques used to date, so some level of active human participation is still required.

Disambiguation


Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by Yehoshua Bar-Hillel.<ref>Milestones in machine translation – No.6: Bar-Hillel and the nonfeasibility of FAHQT Template:Webarchive by John Hutchins</ref> He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word.<ref>Bar-Hillel (1960), "Automatic Translation of Languages". Available online at http://www.mt-archive.info/Bar-Hillel-1960.pdf Template:Webarchive</ref> Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches.

Shallow approaches assume no knowledge of the text. They simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.<ref>Template:Cite book</ref>
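A shallow approach of this kind can be sketched in a few lines of Python. The sense inventory and its co-occurrence counts below are invented for illustration; a real system would estimate them from a corpus. The function scores each sense of an ambiguous word by how often the neighbouring words were seen with that sense, and picks the highest score.

```python
from collections import Counter

# Hypothetical sense inventory: for each sense of "bank", the number of times
# a context word was seen near it in some (imaginary) training corpus.
SENSES = {
    "bank": {
        "bank/finance": Counter({"money": 5, "loan": 4, "account": 4, "deposit": 3}),
        "bank/river": Counter({"river": 5, "water": 3, "fishing": 2, "shore": 2}),
    }
}


def disambiguate(word, sentence, window=4):
    """Pick the sense whose co-occurrence counts best match the nearby words."""
    tokens = sentence.lower().split()
    idx = tokens.index(word)
    # Words within `window` positions on either side of the ambiguous word.
    context = tokens[max(0, idx - window):idx] + tokens[idx + 1:idx + 1 + window]
    scores = {
        sense: sum(counts[token] for token in context)
        for sense, counts in SENSES[word].items()
    }
    return max(scores, key=scores.get)


print(disambiguate("bank", "she opened an account at the bank to deposit money"))
print(disambiguate("bank", "we sat on the bank of the river fishing"))
```

The sketch uses no grammatical or world knowledge at all, which is what makes it "shallow": it only counts words around the ambiguous one.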

Claude Piron, a long-time translator for the United Nations and the World Health Organization, wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve ambiguities in the source text, which the grammatical and lexical exigencies of the target language require to be resolved:

Template:Blockquote

The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of AI than has yet been attained. A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would have a reasonable chance of guessing wrong fairly often. A shallow approach that involves "ask the user about each ambiguity" would, by Piron's estimate, only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.

Non-standard speech


One of the major pitfalls of MT is its inability to translate non-standard language with the same accuracy as standard language. Heuristic or statistics-based MT takes input from various sources in the standard form of a language. Rule-based translation, by nature, does not include common non-standard usages. This causes errors when translating from a vernacular source or into colloquial language. Limitations on translation from casual speech present issues for the use of machine translation on mobile devices.

Named entities


In information extraction, named entities, in a narrow sense, refer to concrete or abstract entities in the real world such as people, organizations, companies, and places that have a proper name: George Washington, Chicago, Microsoft. The term also covers expressions of time, space and quantity such as 1 July 2011, $500.

In the sentence "Smith is the president of Fabrionix" both Smith and Fabrionix are named entities, and can be further qualified via first name or other information; "president" is not, since Smith could have earlier held another position at Fabrionix, e.g. Vice President. The term rigid designator is what defines these usages for analysis in statistical machine translation.

Named entities must first be identified in the text; if not, they may be erroneously translated as common nouns, which would most likely not affect the BLEU rating of the translation but would change the text's human readability.<ref>Template:Cite conference</ref> They may be omitted from the output translation, which would also have implications for the text's readability and message.

Transliteration involves finding the letters in the target language that most closely correspond to the name in the source language. This, however, has been cited as sometimes worsening the quality of translation.<ref>Hermjakob, U., Knight, K., & Daumé III, H. (2008). Name Translation in Statistical Machine Translation Learning When to Transliterate Template:Webarchive. Association for Computational Linguistics. 389–397.</ref> For "Southern California", the first word should be translated directly, while the second word should be transliterated. Machines often transliterate both because they treat them as one entity. Words like these are hard for machine translators, even those with a transliteration component, to process.

A second approach is a "do-not-translate" list, which has the same end goal (transliteration as opposed to translation).<ref name="singla">Template:Citation</ref> It still relies on correct identification of named entities.
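One way such a list can be applied is to mask the listed terms with placeholders before translation and restore them afterwards. The Python sketch below illustrates the idea; the placeholder format, the helper names, and the word-for-word `toy_translate` function (a stand-in for a real MT system) are all assumptions made for this example.

```python
import re


def protect_terms(text, do_not_translate):
    """Replace listed terms with opaque placeholders; return text and mapping."""
    mapping = {}
    # Longest terms first, so a longer name is masked before any shorter
    # name contained in it.
    for i, term in enumerate(sorted(do_not_translate, key=len, reverse=True)):
        placeholder = f"__DNT{i}__"
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(text):
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = term
    return text, mapping


def restore_terms(text, mapping):
    """Put the original terms back in place of their placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


def toy_translate(text):
    """Stand-in for an MT system: word-for-word English-to-German lookup."""
    lexicon = {"smith": "Schmied", "works": "arbeitet", "at": "bei"}
    return " ".join(lexicon.get(token.lower(), token) for token in text.split())


source = "Smith works at Fabrionix"
unprotected = toy_translate(source)  # the surname is treated as a common noun
masked, mapping = protect_terms(source, ["Smith", "Fabrionix"])
protected = restore_terms(toy_translate(masked), mapping)
print(unprotected)
print(protected)
```

Without the list, the toy translator renders the surname "Smith" as the common noun "Schmied"; with it, both names pass through unchanged. The sketch also shows the limitation noted above: the list only helps for names that were identified in advance.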

A third approach is a class-based model. Named entities are replaced with a token representing their "class"; "Ted" and "Erica" would both be replaced with a "person" class token. The statistical distribution and use of person names in general can then be analyzed, instead of the distributions of "Ted" and "Erica" individually, so that the probability of a given name in a specific language does not affect the assigned probability of a translation. A Stanford study on improving this area of translation gives the example that different probabilities will be assigned to "David is going for a walk" and "Ankit is going for a walk" for English as a target language, because the two names occur a different number of times in the training data. A frustrating outcome of the same study (and of other attempts to improve named-entity translation) is that including methods for named entity translation often decreases the BLEU scores for translation.<ref name="singla" />
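The effect of a class token can be illustrated with a toy unigram language model in Python. The name list, the four-sentence corpus, and the add-one smoothing are assumptions chosen for brevity and are not the setup of the cited study. With word-level counts, the sentence containing the more frequent name gets a higher probability; once names are mapped to a shared class token, both sentences score identically.

```python
from collections import Counter

PERSON_NAMES = {"David", "Ankit", "Ted", "Erica"}  # hypothetical name list


def to_classes(tokens):
    """Replace every known person name with a shared class token."""
    return ["<PERSON>" if token in PERSON_NAMES else token for token in tokens]


def train_unigram(corpus, use_classes):
    """Count token occurrences, optionally after class replacement."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        counts.update(to_classes(tokens) if use_classes else tokens)
    return counts


def sentence_prob(sentence, counts, use_classes):
    """Unigram probability of a sentence with add-one smoothing."""
    tokens = sentence.split()
    if use_classes:
        tokens = to_classes(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    prob = 1.0
    for token in tokens:
        prob *= (counts[token] + 1) / (total + vocab)
    return prob


corpus = [
    "David is going for a walk",
    "David is reading",
    "David is here",
    "Ankit is here",
]
word_counts = train_unigram(corpus, use_classes=False)
class_counts = train_unigram(corpus, use_classes=True)

p_david = sentence_prob("David is going for a walk", word_counts, False)
p_ankit = sentence_prob("Ankit is going for a walk", word_counts, False)
c_david = sentence_prob("David is going for a walk", class_counts, True)
c_ankit = sentence_prob("Ankit is going for a walk", class_counts, True)
print(p_david > p_ankit, c_david == c_ankit)
```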

Applications


While no system provides the ideal of fully automatic high-quality machine translation of unrestricted text, many fully automated systems produce reasonable output.<ref>Template:Cite book</ref><ref>Template:Cite web</ref><ref>Template:Cite web</ref> The quality of machine translation is substantially improved if the domain is restricted and controlled.<ref>Template:Cite web</ref> This enables using machine translation as a tool to speed up and simplify translations, as well as producing flawed but useful low-cost or ad-hoc translations.

Travel


Machine translation applications have also been released for most mobile devices, including mobile telephones, pocket PCs, PDAs, etc. Due to their portability, such instruments have come to be designated as mobile translation tools, enabling mobile business networking between partners speaking different languages, or facilitating both foreign language learning and unaccompanied travel to foreign countries without the need for a human translator as intermediary.

For example, the Google Translate app allows foreigners to quickly translate text in their surroundings via augmented reality, using the smartphone camera to overlay the translated text onto the original text.<ref>Template:Cite news</ref> It can also recognize speech and then translate it.<ref>Template:Cite news</ref>

Public administration


Despite their inherent limitations, MT programs are used around the world. Probably the largest institutional user is the European Commission. In 2012, aiming to replace its rule-based MT system with the newer, statistics-based MT@EC, the European Commission contributed 3.072 million euros (via its ISA programme).<ref>Template:Cite web</ref>

Wikipedia


Machine translation has also been used for translating Wikipedia articles and could play a larger role in creating, updating, expanding, and generally improving articles in the future, especially as MT capabilities improve. There is a "content translation tool" which allows editors to more easily translate articles across several select languages.<ref>Template:Cite news</ref><ref>Template:Cite news</ref><ref>Template:Cite web</ref> English-language articles are thought to usually be more comprehensive and less biased than their non-translated equivalents in other languages.<ref>Template:Cite web</ref> As of 2022, English Wikipedia has over 6.5 million articles while, for example, the German and Swedish Wikipedias each have only a little over 2.5 million articles,<ref>Template:Cite web</ref> each often far less comprehensive.

Surveillance and military


Following terrorist attacks in Western countries, including 9/11, the U.S. and its allies have been most interested in developing Arabic machine translation programs, but also in translating the Pashto and Dari languages.[citation needed] Within these languages, the focus is on key phrases and quick communication between military members and civilians through the use of mobile phone apps.<ref>Template:Cite journal</ref> The Information Processing Technology Office in DARPA hosted programs like TIDES and Babylon translator. The US Air Force has awarded a $1 million contract to develop a language translation technology.<ref>Template:Cite web</ref>

Social media


The notable rise of social networking on the web in recent years has created yet another niche for the application of machine translation software – in utilities such as Facebook, or instant messaging clients such as Skype, Google Talk, MSN Messenger, etc. – allowing users speaking different languages to communicate with each other.

Online games


Lineage W gained popularity in Japan because of its machine translation features allowing players from different countries to communicate.<ref>Template:Cite web</ref>

Medicine


Although machine translation was labelled an unworthy competitor to human translation in 1966 by the Automatic Language Processing Advisory Committee, put together by the United States government,<ref>Template:Cite report</ref> its quality has now improved to such a level that its application in online collaboration and in the medical field is being investigated. The application of this technology in medical settings where human translators are absent is another topic of research, but difficulties arise due to the importance of accurate translations in medical diagnoses.<ref>Template:Cite journal</ref>

Researchers caution that the use of machine translation in medicine could risk mistranslations that can be dangerous in critical situations.<ref name=":02">Template:Cite journal</ref><ref>Template:Cite journal</ref> Machine translation can make it easier for doctors to communicate with their patients in day-to-day activities, but it is recommended that machine translation be used only when there is no other alternative, and that translated medical texts be reviewed by human translators for accuracy.<ref>Template:Cite journal</ref><ref>Template:Cite journal</ref>

Law


Legal language poses a significant challenge to machine translation tools due to its precise nature and atypical use of normal words. For this reason, specialized algorithms have been developed for use in legal contexts.<ref name=":1">Template:Cite web</ref> Due to the risk of mistranslations arising from machine translators, researchers recommend that machine translations be reviewed by human translators for accuracy, and some courts prohibit the use of machine translation in formal proceedings.<ref>Template:Cite journal</ref>

The use of machine translation in law has raised concerns about translation errors and client confidentiality. Lawyers who use free translation tools such as Google Translate may accidentally violate client confidentiality by exposing private information to the providers of the translation tools.<ref name=":1" /> In addition, there have been arguments that consent for a police search that is obtained with machine translation is invalid, with different courts issuing different verdicts over whether or not these arguments are valid.<ref name=":02"/>

Ancient languages


The advancements in convolutional neural networks in recent years and in low resource machine translation (when only a very limited amount of data and examples are available for training) enabled machine translation for ancient languages, such as Akkadian and its dialects Babylonian and Assyrian.<ref>Template:Cite journal</ref>

Evaluation


There are many factors that affect how machine translation systems are evaluated. These factors include the intended use of the translation, the nature of the machine translation software, and the nature of the translation process.

Different programs may work well for different purposes. For example, statistical machine translation (SMT) typically outperforms example-based machine translation (EBMT), but researchers found that when evaluating English to French translation, EBMT performs better.<ref name="Way 295–309">Template:Cite journal</ref> The same concept applies for technical documents, which can be more easily translated by SMT because of their formal language.

In certain applications, however, e.g., product descriptions written in a controlled language, a dictionary-based machine-translation system has produced satisfactory translations that require no human intervention save for quality inspection.<ref>Muegge (2006), "Fully Automatic High Quality Machine Translation of Restricted Text: A Case Study Template:Webarchive," in Translating and the computer 28. Proceedings of the twenty-eighth international conference on translating and the computer, 16–17 November 2006, London, London: Aslib. Template:ISBN.</ref>

There are various means for evaluating the output quality of machine translation systems. The oldest is the use of human judges<ref>Template:Cite web</ref> to assess a translation's quality. Even though human evaluation is time-consuming, it is still the most reliable method to compare different systems such as rule-based and statistical systems.<ref>Anderson, D.D. (1995). Machine translation as a tool in second language learning Template:Webarchive. CALICO Journal. 13(1). 68–96.</ref> Automated means of evaluation include BLEU, NIST, METEOR, and LEPOR.<ref>Han et al. (2012), "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors Template:Webarchive," in Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450, Mumbai, India.</ref>
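To give a sense of how an automated metric such as BLEU works, the following Python sketch computes a simplified sentence-level score against a single reference: the geometric mean of modified n-gram precisions (up to 4-grams) multiplied by a brevity penalty. It is an illustration under simplifying assumptions (whitespace tokenization, one reference, no smoothing), not a replacement for the standard tools; real evaluations are usually done at corpus level with standardized tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Modified precision: each candidate n-gram is credited at most as
        # often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))
print(bleu("the cat sat on a mat", reference))
print(bleu("dogs bark loudly", reference))
```

An exact match scores 1, a candidate sharing no words with the reference scores 0, and a near match falls in between. Because the score depends only on surface n-gram overlap, a name mistranslated as a common noun changes it very little.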

Relying exclusively on unedited machine translation ignores the fact that communication in human language is context-embedded and that it takes a person to comprehend the context of the original text with a reasonable degree of probability. It is certainly true that even purely human-generated translations are prone to error. Therefore, to ensure that a machine-generated translation will be useful to a human being and that publishable-quality translation is achieved, such translations must be reviewed and edited by a human.<ref>J.M. Cohen observes (p.14): "Scientific translation is the aim of an age that would reduce all activities to techniques. It is impossible however to imagine a literary-translation machine less complex than the human brain itself, with all its knowledge, reading, and discrimination."</ref> The late Claude Piron wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve ambiguities in the source text, which the grammatical and lexical exigencies of the target language require to be resolved. Such research is a necessary prelude to the pre-editing necessary in order to provide input for machine-translation software such that the output will not be meaningless.<ref name="NIST">See the annually performed NIST tests since 2001 Template:Webarchive and Bilingual Evaluation Understudy</ref>

In addition to disambiguation problems, decreased accuracy can occur due to varying levels of training data for machine translating programs. Both example-based and statistical machine translation rely on a vast array of real example sentences as a base for translation, and when too many or too few sentences are analyzed accuracy is jeopardized. Researchers found that when a program is trained on 203,529 sentence pairings, accuracy actually decreases.<ref name="Way 295–309"/> The optimal level of training data seems to be just over 100,000 sentences, possibly because as training data increases, the number of possible sentences increases, making it harder to find an exact translation match.

Flaws in machine translation have been noted for their entertainment value. Two videos uploaded to YouTube in April 2017 involve two Japanese hiragana characters えぐ (e and gu) being repeatedly pasted into Google Translate, with the resulting translations quickly degrading into nonsensical phrases such as "DECEARING EGG" and "Deep-sea squeeze trees", which are then read in increasingly absurd voices;<ref>Template:Cite web</ref><ref>Template:Cite web</ref> the full-length version of the video has 6.9 million views Template:As of.<ref>Template:Cite web</ref>

Machine translation and signed languages


In the early 2000s, options for machine translation between spoken and signed languages were severely limited. It was a common belief that deaf individuals could use traditional translators. However, stress, intonation, pitch, and timing are conveyed very differently in spoken languages than in signed languages. Therefore, a deaf individual may misinterpret or become confused about the meaning of written text that is based on a spoken language.<ref name="Zhao, L. 2000">Zhao, L., Kipper, K., Schuler, W., Vogler, C., & Palmer, M. (2000). A Machine Translation System from English to American Sign Language Template:Webarchive. Lecture Notes in Computer Science, 1934: 54–67.</ref>

Zhao et al. (2000) developed a prototype called TEAM (translation from English to ASL by machine) that translated English into American Sign Language (ASL). The program first analyzed the syntactic, grammatical, and morphological aspects of the English text. It then accessed a sign synthesizer, which acted as a dictionary for ASL. This synthesizer stored the process one must follow to complete ASL signs, as well as the meanings of these signs. Once the entire text had been analyzed and the signs necessary to complete the translation had been located in the synthesizer, a computer-generated human figure appeared and signed the English text to the user in ASL.<ref name="Zhao, L. 2000"/>


Only works that are original are subject to copyright protection, so some scholars claim that machine translation results are not entitled to copyright protection because MT does not involve creativity.<ref>Template:Cite web</ref> The copyright at issue is for a derivative work; the author of the original work in the original language does not lose their rights when a work is translated: a translator must have permission to publish a translation.[citation needed]


Notes


Template:Reflist
