==Wiktionary data in natural language processing==
Wiktionary has [[semi-structured data]].{{sfn|Meyer|Gurevych|2012|p=140}} Wiktionary [[Lexicography|lexicographic]] data can be converted to a [[Machine-readable data|machine-readable format]] for use in [[natural language processing]] tasks.{{sfn|Zesch|Müller|Gurevych|2008|p=4|loc=Figure 1}}{{sfn|Meyer|Gurevych|2010|p=40}}{{sfn|Krizhanovsky, Transformation|2010|p=1}}

[[Data mining]] of Wiktionary is a complex task, with the following difficulties:{{sfn|Hellmann|Auer|2013|p=302|loc=p. 16 in PDF}}
* (1) the constant and frequent changes to data and schemata;
* (2) the heterogeneity of schemata across Wiktionary language editions;{{efn|E.g. compare the entry structure and formatting rules in [[wikt:Wiktionary:Entry layout explained|English Wiktionary]] and [[wikt:ru:Викисловарь:Правила оформления статей|Russian Wiktionary]].}} and
* (3) the human-centric nature of a [[wiki]].

There are several [[Parsing|parsers]] for different Wiktionary language editions:{{sfn|Hellmann|Brekle|Auer|2012|p=3|loc=Table 1}}
* DBpedia Wiktionary:<ref>{{Cite web|url=http://dbpedia.org/Wiktionary|archiveurl=https://web.archive.org/web/20130504235547/http://dbpedia.org/Wiktionary|url-status=dead|title=DBpedia Wiktionary|archivedate=May 4, 2013}}</ref> a subproject of [[DBpedia]]. The data are extracted from the English, French, German, and Russian Wiktionaries and include language, [[Part of speech|parts of speech]], definitions, [[Semantic relationship|semantic relations]] and translations. A declarative description of the [[Page schematic|page schema]],{{sfn|Hellmann|Brekle|Auer|2012|pp=8–9}} [[regular expression]]s{{sfn|Hellmann|Brekle|Auer|2012|p=10}} and a [[finite state transducer]]{{sfn|Hellmann|Brekle|Auer|2012|p=11}} are used to extract information.
* JWKTL ([[Java (programming language)|Java]] Wiktionary Library):<ref>{{Cite web|url=https://dkpro.github.io/dkpro-jwktl/|title=Welcome|website=DKPro JWKTL|access-date=June 23, 2019|archive-date=January 23, 2021|archive-url=https://web.archive.org/web/20210123133521/https://dkpro.github.io/dkpro-jwktl/|url-status=live}}</ref> provides access to English Wiktionary and German Wiktionary dumps via a Java [[Ubiquitous Knowledge Processing Lab#Wiktionary API|Wiktionary API]].{{sfn|Zesch|Müller|Gurevych|2008}} The data include language, parts of speech, definitions, quotations, semantic relations, etymologies and translations. JWKTL is distributed under the [[Apache License]].
* wikokit:<ref>{{Cite web|url=https://github.com/componavt/wikokit|title=Wikokit – Machine-readable Wiktionary|date=December 19, 2022|via=GitHub|access-date=November 7, 2015|archive-date=October 2, 2020|archive-url=https://web.archive.org/web/20201002225056/https://github.com/componavt/wikokit|url-status=live}}</ref> a [[parser]] of English Wiktionary and Russian Wiktionary.{{sfn|Krizhanovsky, Transformation|2010}} The parsed data include language, parts of speech, definitions, quotations,{{sfn|Smirnov et al.|2012}}{{efn|Quotations are extracted only from Russian Wiktionary.{{sfn|Smirnov et al.|2012}}}} semantic relations{{sfn|Krizhanovsky, Comparison|2010}} and translations. It is [[Multi-licensing#License compatibility|multi-licensed]] [[Open source|open-source]] software.
* [[Etymology|Etymological]] entries have been parsed in the Etymological [[WordNet]] project.<ref>{{Cite web|url=http://gerard.demelo.org/berkeley/|title=Gerard de Melo's Research at ICSI, Berkeley|website=gerard.demelo.org|access-date=March 6, 2023|archive-date=March 27, 2023|archive-url=https://web.archive.org/web/20230327013529/http://gerard.demelo.org/berkeley/|url-status=live}}</ref>

[[Natural language processing]] tasks that have been addressed with the help of Wiktionary data include:
* [[Rule-based machine translation]] between [[Dutch language|Dutch]] and [[Afrikaans]]; data from English Wiktionary, Dutch Wiktionary and Wikipedia were used with the [[Apertium]] [[machine translation]] platform.{{sfn|Otte|Tyers|2011}}
* Construction of a [[machine-readable dictionary]] by the parser NULEX, which integrates open linguistic resources: English Wiktionary, [[WordNet]], and [[VerbNet]].{{sfn|McFate|Forbus|2011}} NULEX [[Web scraping|scrapes]] English Wiktionary for tense information (verbs) and for plural forms and parts of speech (nouns).
* [[Speech recognition]] and [[Speech synthesis|synthesis]], where Wiktionary was used to automatically create pronunciation dictionaries.{{sfn|Schlippe|Ochs|Schultz|2012}} Word-pronunciation pairs were retrieved from six Wiktionary language editions ([[Czech language|Czech]], English, French, [[Spanish language|Spanish]], Polish, and German). Pronunciations are given in the [[International Phonetic Alphabet]].{{efn|If there were several IPA notations on a Wiktionary page, either for different languages or for pronunciation variants, then the first pronunciation was extracted.{{sfn|Schlippe|Ochs|Schultz|2012|p=4802}}}} The [[Speech recognition|ASR]] system based on English Wiktionary had the highest word error rate, with every third [[phoneme]] having to be changed.{{sfn|Schlippe|Ochs|Schultz|2012|p=4804}}
* [[Ontology engineering]]{{sfn|Meyer|Gurevych|2012}} and [[semantic network]] construction.<ref>{{Cite web |title=ConceptNet 5 |url=http://conceptnet5.media.mit.edu/ |url-status=dead |archive-url=https://web.archive.org/web/20111019152920/http://conceptnet5.media.mit.edu/ |archive-date=2011-10-19 |access-date=2023-09-23 |website=conceptnet5.media.mit.edu}}</ref>
* [[Ontology alignment|Ontology matching]].{{sfn|Lin|Krizhanovsky|2011}}
* [[Text simplification]]. Medero & [[Mari Ostendorf|Ostendorf]]{{sfn|Medero|Ostendorf|2009}} assessed vocabulary difficulty ([[Readability|reading level]] detection) with the help of Wiktionary data. They investigated properties of words extracted from Wiktionary entries: definition length and the numbers of [[Part of speech|parts of speech]], senses, and translations. Medero & Ostendorf expected that
** (1) very common words will be more likely to have multiple parts of speech,
** (2) common words will be more likely to have multiple senses,
** (3) common words will be more likely to have been translated into multiple languages.
*: These features extracted from Wiktionary entries were useful in distinguishing word types that appear in [[Simple English Wikipedia]] articles from words that appear only in the comparable Standard English articles.
* [[Part-of-speech tagging]]. Li et al. (2012){{sfn|Li|Graça|Taskar|2012}} built multilingual POS-taggers for eight resource-poor languages on the basis of English Wiktionary and [[Part-of-speech tagging#Use of hidden Markov models|hidden Markov models]].{{efn|The source code and the results of POS-tagging are available at https://code.google.com/p/wikily-supervised-pos-tagger}}
* [[Sentiment analysis]].{{sfn|Chesley|Vincent|Xu|Srihari|2006}}

"[[Wikidata]]:Lexicographical data" was started in 2018 to provide structured data support to Wiktionaries. It stores word data for all languages in a machine-readable data model, under a dedicated "[[Lexeme]]" namespace in Wikidata. As of October 2021, the project had amassed over 600,000 lexeme entries in various languages.<ref>{{cite web|url=https://www.wikidata.org/w/index.php?title=Wikidata:Wiktionary&oldid=1510363143|title=Wikidata:Wiktionary|access-date=12 October 2012|archive-date=January 3, 2023|archive-url=https://web.archive.org/web/20230103132433/https://www.wikidata.org/w/index.php?title=Wikidata:Wiktionary&oldid=1510363143|url-status=live}}</ref>