Machine translation
The purpose of the Wikipedia Machine Translation Project is to develop ideas, methods and tools that can help translate Wikipedia into non-English languages.
Motivation
Wikipedias in smaller languages can't produce articles as fast as the English Wikipedia because they have too few Wikipedians. One solution is to translate the English Wikipedia, but some languages will not have enough translators either. Machine translation can improve the productivity of the community.
- Manual translation can still be applied later, for a more accurate text.
TradWiki/WikiTran
TradWiki/WikiTran (WikipediaTranslator/WikiTranslator/BabelWiki) is a yet-to-be-written wiki that helps Wikipedians translate articles from English into other languages.
- I rather like WikiTran myself. --Stephen Gilbert
License
All code and data should be released under a free licence (the GFDL).
Advantages
- faster translation of Wikipedia
- generation of large amounts of useful data (corpora).
- creation of a useful tool
Lexical, syntactic and semantic analysis of Wikipedia content
The first step towards Wikipedia translation is an analysis of Wikipedia's content. This analysis will determine the following (a rough counting sketch appears after the list):
- Number of words and sentences
- Word distribution
- Frequency of the most popular sentences and expressions
- Semantic relations between words and between sentences
- Syntactic analysis of all sentences
- It would be interesting if the user could click on any word in an article to reach its Wiktionary definition when there is no corresponding Wikipedia article, and could tell the software (with a right mouse click) to translate the word into another language.
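A minimal sketch of the counting part of such an analysis, assuming a plain-text dump of the articles; the file name and the naive tokenization are illustrative assumptions, not project decisions:

```python
# Rough sketch: word and sentence frequency counts over a Wikipedia text dump.
# "articles.txt" is an assumed plain-text export; tokenization is deliberately naive.
import re
from collections import Counter

with open("articles.txt", encoding="utf-8") as f:
    text = f.read()

words = re.findall(r"[a-z']+", text.lower())
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

word_freq = Counter(words)
sentence_freq = Counter(sentences)

print("total words:", len(words))
print("distinct words:", len(word_freq))
print("most frequent words:", word_freq.most_common(20))
print("most repeated sentences:", sentence_freq.most_common(10))
```

The sentence counts are what would feed the expression database discussed below: any sentence or phrase that recurs often only needs to be translated once.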
Information about the most popular sentences and expressions can be used to create a translation database of such expressions so translators don't need to repeat a translation.
- Yes, a database of idioms
- You mean like a w:translation memory system?
Resources:
- General
- Fagan Finder Translation Wizard - single interface to many free online translators
- Controlled Translation
- Dictionaries
- Dutch to English Translation Tools (source available)
- English dictionary
- Portuguese dictionary
- English-Portuguese dictionary
- Ergane (free dictionary, several languages)
- Translation rules
- Code
- GPLTran (Translator under GPL)
- http://www.translator.cx
- Supposed to translate paragraphs or entire webpages
- Paragraph translation is spotty and buggy
- Web translation doesn't seem to work at all.
- Actually this isn't machine translation, it is a literal word-for-word translation
- Download code at http://www.translator.cx/dist/
- Linguaphile (Translator under GPL)
- http://linguaphile.sourceforge.net/
- open source, platform independent, and programmed in Perl
- simplistic and easy to use command line translator
- 56 languages
- Traduki
- C/Lua-based project; uses the metalanguage approach, with Esperanto for lexical content (to some extent)
- Project restarted in 2003, currently being developed
- http://traduki.sourceforge.net (version 0.2 released, and translates "The dog eats the apple" to Esperanto: "La hundo mangxas la pomon")
- I like the idea of using Traduki. One could use Traduki keys to establish relations between words in different languages; i.e. hundo is the key for en:dog, es:perro and so on. So, going to hundo, you could add a translation for another language without adding language: links in the es:perro article, for example.
- http://www.link.cs.cmu.edu/link/ -- Link Grammar
- Databases
- http://www.cogsci.princeton.edu/~wn/links/ -- WordNet, a lexical database for the English language.
TradWiki/WikiTran - Translation memory approach
A translation memory is a computer program that uses a database of old translations to help a human translator. If this approach is followed, WikipediaTranslator will need the following features (a rough sketch of the underlying lookup follows the list):
- visualization of translated and original versions
- splitting of the original version into several parts for individual translation
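A minimal sketch of the lookup at the heart of a translation memory, assuming stored (source, target) segment pairs and simple fuzzy matching; the data structure and the 0.7 threshold are illustrative assumptions:

```python
# Rough sketch of a translation memory: store old (source, target) segment
# pairs and suggest the closest stored translation for a new segment.
from difflib import SequenceMatcher

memory = {}  # source segment -> target segment

def remember(source, target):
    memory[source] = target

def suggest(segment, threshold=0.7):
    """Return (stored_source, stored_target, score) for the best fuzzy match, or None."""
    best = None
    for src, tgt in memory.items():
        score = SequenceMatcher(None, segment, src).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (src, tgt, score)
    return best

remember("The free encyclopedia", "A enciclopédia livre")
print(suggest("The free encyclopedia project"))
```

Even near-misses are useful: the human translator sees the old translation alongside the new segment and only edits the differences.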
Links
- General
- Links on Machine translation (MT): http://www.ife.dk/url-mt.htm
- Machine translation (MT), and the future of the translation industry http://accurapid.com/journal/15mt.htm
- Machine Translation: an Introductory Guide: http://clwww.essex.ac.uk/MTbook/
- Visual Interactive Syntax Learning: http://visl.sdu.dk/visl/
- Wikipedia articles
- Free translations on the web
- http://www.google.com/language_tools (uses Systran software)
- http://www.freetranslation.com/
- http://www.systransoft.com/
- http://babelfish.altavista.com/ (uses Systran software)
- http://www.babylon.com/
- http://www.translator.cx (GPLTran)
- http://www.reverso.net/textonly/default_ie.asp
- http://www.worldlingo.com/en/microsoft/computer_translation.html Works well, many languages (at least partly Systran software)
- Neural nets
- Machine translation
- Translation memories
- Wired magazine
- Portuguese
- Processamento Computacional do Português http://www.linguateca.pt/index.html
- Meta-language
- http://www.unl.ias.unu.edu A United Nations project based on an artificial, machine-readable language (UNL). The idea is to semi-automatically create a UNL text from, say, English, then have it fully automatically translated into up to 150 languages on the fly.
- The World Wide Translator (The Tragedy of the Anticommons of translation memories)
Are you sure the other-language Wikipedias would rather translate text than write it themselves? It seems to me that it's almost more effort to translate an article than to write one yourself. For instance, I run the Esperanto Wikipedia (eo:), and I think we appreciate the international nature of our articles. I wonder whether other second-language Wikipedias feel the same way. In other words, do the other-language Wikipedias even want this?
- Initially, it's better to have something than nothing. So I prefer a translation when there is no original Esperanto article. Remember, you can always wiki (modify) the translated article later.
- I'm an active contributor/translator for the Vietnamese Wikipedia, and I find that it's okay to simply translate most articles to Vietnamese, but with articles that relate more to Vietnamese-speakers (such as articles on Vietnam, Vietnamese, Vietnamese-Americans, etc.), it's better to write an article (almost) from scratch. Because the international Wikipedias are more than internationalization – they're also about localization. – Minh Nguyễn 22:50, 15 Apr 2004 (UTC)
It's true that the effort to write articles is almost the same as the effort to translate them, but there are some exceptions: if you are not an expert in the topic, it's easier to translate than to write. On the other hand, it's clear to me that the number of contributors to the Portuguese encyclopedia is very small. You must consider the fact that Portuguese is not a second language but the first language of millions of people. Unfortunately, very few of those millions have access to the internet and/or an education. A free encyclopedia would be an extraordinary resource for those people, so every effort to speed up the creation of the Portuguese version is welcome. Of course, people who write for a second-language Wikipedia like Vikipedio have different purposes, do it for fun, and are not interested in machine translation.
PS: You may be interested to know that the Traduki project uses Esperanto for the deeper word representation to achieve machine translation. user:joao.
Point well made. It would be especially good for the minority languages. I was aware of the Traduki project and it looks interesting, although it looks like nothing has happened on the project lately... maybe I'm wrong; I actually looked at the pages again yesterday. And since I'm going to start learning Portuguese soon (I plan to visit Brazil next August), I'll probably take a closer look at it later. I now know Sim, Não and Obrigado. :)
Now that I think more about it, I'd like to see auto-translation so I can get rough translations of articles in non-English, non-Esperanto Wikipedias. In a future version we could have a drop-down list on each page that could translate the page for us and also link to the article on another-language Wikipedia if it exists. I think there are already free services that do this; does anyone know? --Chuck Smith
There are some links to such services above, under "Free translations on the web". Joao
Has anyone seen Google Translate at http://translate.google.com/translate_t ?
Would an automatic translation script be run only once for each article, multiple times at an interval, immediately when changes are made, or on demand by a reader? If only once or at an interval, how would article conflicts be handled? 24.198.63.192 03:52 Oct 18, 2002 (UTC)
- How about never, because machine translation produces utterly crap results? There are plenty of crap online translators out there that people can feed pages into if they simply must; I don't think we should encourage it that much. --Brion VIBBER 06:01 Oct 18, 2002 (UTC)
- Seconded! -- Tarquin
Yeah, I'm not so big on the idea anymore either. I do think it's interesting as an extremely long-term project, though.
Machine translation can give the best of both worlds:
- point your machine translator at the English version (probably) and get the local-language version with the most content, but with a slight US/European bias.
- point your normal browser at your own-language version, and get (sometimes) less content, but more readable text, with a local bias as well as locale-specific content. User:Willsmith
I've noticed that machine translation is adequate for getting the general meaning across but isn't very pleasing to the eye. If it has to be used, it'd probably best be used to populate a blank page so that native speakers of the language can clean it up in normal wikiwiki style. -- Daniel Thomas
I only regret that this whole discussion is in English only, and that the participants start from the assumption that the English-language Wikipedia is the main source of culture. It would indeed be worthwhile to translate already-existing articles, but in all directions (not necessarily only from English). And so far my experience is that automatic translators give terrible results. Arno Lagrange
When I use automated translation, I usually observe two problems:
- The sentence structure cannot be parsed correctly, and
- the meaning of certain words is misunderstood by the translation program.
Both seem to be caused by ambiguities.
So, my idea is:
- Run the texts through a parser that displays all ambiguities.
- Create a disambiguated version (maybe using an artificially enhanced grammar) and recheck it using the parser.
- Then, automatically translate the disambiguated text into every other language.
- Finally, for each language, merge the translated text with already existing text (and maybe correct some oddities).
This would require adding, for each language, two wikis alongside the 'presentable' version: one for disambiguated texts, and one as a collection pool for raw translations. Sloyment 12:47, 22 Oct 2003 (UTC)
Some examples of how the above procedure could work:
- The German sentence "Die Katze hatte Klaus bereits verspeist" can mean both "Klaus had already eaten the cat" and "The cat had already eaten Klaus". So, in this case, the parser would say: "In the sentence blahblah, I don't understand who eats whom." For these situations, there could be case marks, like {1}: nominative, {2}: genitive, {3}: dative, {4}: accusative. If we change the sentence to "Die Katze{4} hatte Klaus{1} bereits verspeist", it will be clear that the cat gets eaten. (BTW: Google translates "verspeisen" as "feed" -- which is wrong.)
- In some cases, it might be necessary to just disambiguate the hierarchy within complex expressions. The structure could be disambiguated using brackets, e.g. "{{{parallel port} {{flat bed} scanner}} {{reverse engineering} howto}}".
- Words with several meanings (e.g. "port") could be clarified in a definition section.
The assumption behind this idea is that it is easier to disambiguate a text than to translate it, and that it is easier to correct an automated translation with only a few mistakes in it than to correct the rubbish that current translation programs produce. Sloyment 14:59, 22 Oct 2003 (UTC)
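A minimal sketch of how a pre-translation pass might read the case marks proposed above; the {1}..{4} notation comes from the example, while the function and its output format are illustrative assumptions:

```python
# Rough sketch: extract the proposed case marks ({1} nominative ... {4} accusative)
# from a disambiguated sentence and recover the plain text.
import re

CASES = {"1": "nominative", "2": "genitive", "3": "dative", "4": "accusative"}

def read_case_marks(sentence):
    """Return (plain_sentence, [(word, case_name), ...])."""
    marks = [(w, CASES[c]) for w, c in re.findall(r"(\w+)\{([1-4])\}", sentence)]
    plain = re.sub(r"\{[1-4]\}", "", sentence)
    return plain, marks

plain, marks = read_case_marks("Die Katze{4} hatte Klaus{1} bereits verspeist")
print(plain)   # Die Katze hatte Klaus bereits verspeist
print(marks)   # [('Katze', 'accusative'), ('Klaus', 'nominative')]
```

The translator would feed the plain sentence to the MT engine and use the extracted marks to force the correct reading.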
There are other problems. Some languages may not have words or phrases for certain technical concepts because no native speaker has ever needed them before. This is particularly true of languages with small numbers of native speakers in rural settings. It may be difficult to automatically translate an article on co-routines, for instance, because ideas like subroutine, co-routine, time-sharing and multi-tasking have never been put into words in that particular language before. A human translator can normally use a bit of imagination to invent a new term or reuse a term previously used for an analogous existing concept and if the translator is any good, the result will fit into the language reasonably well. However a machine can do little better than to leave the untranslatable term untranslated and mark it for human attention. -- Derek Ross 16:05, 26 Mar 2004 (UTC)
Three main things I'm wondering about:
- I don't remember where it was, but I know for sure that I recently saw an article on using neural networks so that "computers can learn languages" and translate between them; the original purpose was translation from/to minority or ancient languages that few people see the point of writing a translation program for.
- Any viable MT project would do itself much good by including the Reverso method in some way (Reverso method = find a translation that back-translates as close as possible to the original; see the sketch after this list).
- I am a member of the UNDL foundation, or whatever the hell they call it. I have access to all their documentation and everything. Really a very nice thing. If anybody wants the documents, I think I'm allowed to give them out provided you promise to use them for personal/not-for-profit purposes only. I don't think we'd be allowed the use of the Enconverters/Deconverters, but programming those ourselves should be fairly easy; the main thing that the people working on that project are working on is the enconversion/deconversion RULES for different languages. I think that an open-source program which incorporated UNL would be perfectly legal. If anybody is interested, though, I'll have to check all the licencing material they sent me.
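A minimal sketch of the back-translation idea mentioned above; the difflib similarity score and the toy word-for-word dictionary are illustrative assumptions, and a real system would call an actual MT engine in both directions:

```python
# Rough sketch of the "Reverso method": among candidate translations, pick the
# one whose back-translation is closest to the original text.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def best_by_back_translation(original, candidates, back_translate):
    """candidates: target-language strings; back_translate: target -> source."""
    return max((similarity(original, back_translate(c)), c) for c in candidates)

# Toy demo with a hypothetical word-for-word Esperanto dictionary.
rev = {"la": "the", "hundo": "dog", "kato": "cat"}
def back_translate(s):
    return " ".join(rev.get(w, w) for w in s.split())

print(best_by_back_translation("the dog", ["la hundo", "la kato"], back_translate))
# (1.0, 'la hundo') -- "la hundo" back-translates exactly to the original
```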
So essentially, if I knew any programming language other than HTML (hey, I'm only 14, though I am going to begin taking CC courses in C or some such over the summer) and I were to make MT software, it would incorporate all three of these. I think that a lot of the programming behind neural networks is available for free online to plug into whatever you want, so that (afaik) wouldn't be very hard, except maybe the customization part.
UNL, at its best, claims a 99% accuracy rate. I have seen UNL at work. The English deconversions are fantastic, though they do leave something to be desired. As far as I can tell from what others have told me, though, the deconversions for languages such as Russian and Italian are - though one can get what they say - totally ungrammatical.--Node_ue 03:11, 7 Apr 2004 (UTC)
- Then imagine what deconversions to Asian languages would be like… :) – Minh Nguyễn
- Yes, I have seen the results of Japanese and Chinese deconversions. They're actually OK (in the case of Chinese; the Japanese ones are grammatically OK but most of the vocabulary seems to be missing), but the emphasis here is that UNL is a work in progress and should not be judged in its present form. It may be at least semi-sucky at the stage it is in right now, but hopefully it will improve. Also, I've noticed lots of users on this page say stuff about how MT produces sucky results. Well, if you think about the advance of MT technology like the advance of computers, it makes more sense. Looking up each word in a dictionary is the original method of machine translation. Adding some grammar is what comes next. Adding more grammar and even some context-sensitivity is better still. Then come the more advanced things: the "reverso method" (trying to get a translation whose back-translation matches the original as closely as possible), neural networks, UNL, etc. Each of these methods produces much better results than the ones before it, and so on. Node