Correcting misspellings is one of the important copyedit tasks for Wikipedia text. Previous exploratory analysis suggests that lists of common misspellings could be an effective way to approach this problem at scale (e.g. across many languages). The main challenge, however, is how to curate such lists. For a few languages, the communities have compiled them (see for example English or German), but for the vast majority of languages no such lists are readily available.
In this project, we use Wiktionary to automatically extract misspellings in different languages. Specifically, we take advantage of the misspelling_of template, which provides structured information about words that are misspellings, indicating the language as well as the correct spelling. Several other language Wiktionaries contain this template (Q50368067). The main idea is to parse Wiktionary to collect the misspellings for all languages and use this list to detect misspellings in the text of Wikipedia articles.
Each page in Wiktionary defines a word (e.g. accessive). If the word defined by a page is a misspelling, the page identifies it using the misspelling_of template. The template contains the correct word and the language of the word. Note that a word can be both a correct word and a misspelling of another word. Also note that English Wiktionary is not restricted to English words, and an English word can be a misspelling of a non-English word as well (as identified by the language parameter in the template). The same applies to other language Wiktionaries.
The misspelling_of template is available in 16 languages; Q50368067 lists most of them. One complication is that the template name differs between languages, e.g. bn: টেমপ্লেট:এর ভুল বানান, ca: Plantilla:forma-inc, en: Template:misspelling of. We collect:
- Templates from Q50368067
- Templates with the name "Template:misspelling of" (even in other languages)
- Templates that are redirects of "Template:misspelling of" (e.g. in Bengali, Template:misspelling of redirects to টেমপ্লেট:এর ভুল বানান)
Although we only work with the misspelling_of template in this work, there are several other related templates. See this Issue for a non-comprehensive list. There could also be more such templates in other languages.
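The first collection source can be sketched roughly as follows. This is a minimal offline illustration that assumes the sitelinks of Q50368067 are read from a Wikidata `wbgetentities` API response; the helper name `wiktionary_templates` is hypothetical, and the sample dictionary is hand-written to mimic the response shape rather than fetched data. Finding same-name templates and resolving redirects would need further API calls that are not shown here.

```python
def wiktionary_templates(entity):
    """Map Wiktionary dbname -> local template title for one Wikidata entity.

    `entity` is assumed to be the per-item dictionary found under
    response["entities"][<QID>] in a wbgetentities response.
    """
    sitelinks = entity.get("sitelinks", {})
    return {
        site: link["title"]
        for site, link in sitelinks.items()
        if site.endswith("wiktionary")  # keep Wiktionary projects only
    }


# Hand-written sample using the three template names quoted in the text.
sample_entity = {
    "id": "Q50368067",
    "sitelinks": {
        "enwiktionary": {"site": "enwiktionary", "title": "Template:misspelling of"},
        "bnwiktionary": {"site": "bnwiktionary", "title": "টেমপ্লেট:এর ভুল বানান"},
        "cawiktionary": {"site": "cawiktionary", "title": "Plantilla:forma-inc"},
    },
}

templates = wiktionary_templates(sample_entity)
for dbname, title in sorted(templates.items()):
    print(dbname, title)
```

Keying the result by dbname keeps it aligned with the dbname column of the collected misspelling list.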
Now that we have the list of templates in various Wiktionaries, we can parse the Wiktionary pages to extract the misspellings.
- We consider all variations of the misspelling_of template (e.g. missp, misspell), which we find by following the redirect links.
- We collect all the Wiktionary pages that contain these templates.
- We use mwparserfromhell to parse the collected pages. Level-2 (L2) headings are typically the name of a language. We consider each L2 heading separately, since a word could be a misspelling in one language but not in another. Within each section:
  - Numbered items (lines starting with #) typically contain definitions and misspelling templates. If a page contains definitions for a word, we can assume it is a correct spelling of some word even if it is also a misspelling of another (e.g. inducably).
  - We therefore count the list items and the misspelling_of templates in those list items. If the number of list items equals the number of templates, we can assume the word is a ‘proper’ misspelling, meaning it is not a correct form of any existing word.
- We merge the misspellings collected from all languages and remove duplicates.
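The per-section counting heuristic above can be sketched as follows. To stay dependency-free, this sketch swaps in regular expressions for mwparserfromhell, so it is a simplification of the approach rather than the project's parser. The template names in `TEMPLATE_NAMES`, the `proper` flag, the function names, and the sample page (title "exampel") are all assumptions made for illustration.

```python
import re

# Assumed template names; the real pipeline derives them from redirects.
TEMPLATE_NAMES = ("misspelling of", "missp", "misspell")

# Level-2 heading such as "==English==" (but not "===Noun===").
L2_HEADING = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)

# {{misspelling of|<lang>|<correct word>|...}} without nested templates.
TEMPLATE = re.compile(
    r"\{\{\s*(?:" + "|".join(map(re.escape, TEMPLATE_NAMES)) + r")\s*\|([^{}]*)\}\}"
)


def is_definition_line(line):
    """True for numbered items ("#", "##"), false for "#:" and "#*" sub-lines."""
    if not line.startswith("#"):
        return False
    return not line.lstrip("#").startswith((":", "*"))


def extract_misspellings(page_title, wikitext):
    """Return one record per misspelling template, grouped by L2 section."""
    parts = L2_HEADING.split(wikitext)  # [preamble, name1, body1, name2, body2, ...]
    records = []
    for section, body in zip(parts[1::2], parts[2::2]):
        items = [line for line in body.splitlines() if is_definition_line(line)]
        hits = []
        for item in items:
            match = TEMPLATE.search(item)
            if not match:
                continue
            # Positional parameters only: language code, then correct word.
            positional = [p.strip() for p in match.group(1).split("|") if "=" not in p]
            if len(positional) >= 2:
                hits.append((positional[0], positional[1]))
        # 'Proper' misspelling: every numbered item is a misspelling template.
        proper = bool(items) and len(items) == len(hits)
        for ln, word in hits:
            records.append({"page_title": page_title, "section": section,
                            "ln": ln, "word": word, "proper": proper})
    return records


# Invented sample page, not real Wiktionary content.
SAMPLE = """==English==
===Noun===
# {{misspelling of|en|example}}

==French==
===Verb===
# An invented ordinary definition.
# {{missp|fr|exemple}}
"""

records = extract_misspellings("exampel", SAMPLE)
for record in records:
    print(record)
```

In this sample the English section counts as a ‘proper’ misspelling (one item, one template), while the French section does not, because it also carries an ordinary definition.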
A list of all the collected misspellings can be found here: Misspellings parsed from all Wiktionaries. It includes the Wiktionary each entry was collected from (dbname), the misspelling (page_id, page_title), the language (ln), and the correct word (word).
Detecting misspellings in Wikipedia
So far, we have collected a list of misspellings from Wiktionary. Now we use this list to detect misspellings in Wikipedia. Each collected misspelling (irrespective of the language of the Wiktionary) has a language and the associated correct word. We use this language to detect misspellings in the corresponding language Wikipedia.
- We parse Wikipedia pages using mwparserfromhell.
- For now we tokenize on spaces; this will not work well for languages that do not use spaces as word delimiters.
- We look up each token in the list of misspellings collected from Wiktionary.
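- We also record where in the page the misspelling occurred (as a list of tags) and whether the flagged word actually belongs to a different language. Tags that appear in the examples on this page include is_text, is_list, is_table, is_quote, is_text_formatting, is_externallink, is_first_cap, and is_different_language.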
Misspellings detected in all language Wikipedias can be found here: List of detected misspellings in all language Wikipedias.
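The lookup step can be illustrated with a short sketch. It assumes whitespace tokenization, an exact (case-sensitive) dictionary lookup, and stripping of leading and trailing ASCII punctuation; the punctuation stripping, the function name `detect_misspellings`, and the `context_chars` parameter are assumptions for illustration, and wikitext parsing and tagging are left out. The two dictionary entries are taken from the example rows further down this page.

```python
import re
import string


def detect_misspellings(text, misspellings, context_chars=30):
    """Return a record for every whitespace-delimited token found in `misspellings`."""
    hits = []
    for match in re.finditer(r"\S+", text):
        # Assumption: drop surrounding punctuation such as a trailing full stop.
        token = match.group().strip(string.punctuation)
        if token in misspellings:
            start = max(0, match.start() - context_chars)
            end = match.end() + context_chars
            hits.append({
                "misspelling": token,
                "suggestions": misspellings[token],
                "surrounding_text": text[start:end],
            })
    return hits


misspellings = {"youre": ["you're"], "Youtube": ["YouTube"]}
text = "I love what youre doing right now. Charly video on Youtube."

hits = detect_misspellings(text, misspellings)
for hit in hits:
    print(hit["misspelling"], "->", hit["suggestions"])
```

Because the lookup relies on whitespace, it inherits the limitation noted above for languages that do not separate words with spaces.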
Having collected misspellings from all available languages and template variations, we find that most misspellings are listed in enwiktionary. The following table lists the number of unique "misspelling - correct word" pairs per Wiktionary. For each non-English Wiktionary, it also lists the number of misspellings not found in enwiktionary.
|Wiktionary||Misspellings Count||Not in enwiktionary|
We assess the usefulness of the collected list by comparing it with existing misspelling lists. The list collected from enwiktionary contains many misspellings that those lists do not cover: for English, even in the worst case, 87% of the words in our list do not appear in the list we compare against (80% for French). The Wiktionary misspellings can therefore complement existing systems and help detect more misspellings.
Comparison to Lists_of_common_misspellings:
- Share of misspellings in the community-curated list that are covered by our list: 7.36%
- Share of misspellings in our list that also appear in the community-curated list: 5.98% (94% of our misspellings are not covered by it)
English AutoWikiBrowser comparison:
- Share of misspellings in the AWB list that are covered by our list: 15.92%
- Share of misspellings in our list that also appear in the AWB list: 12.38% (87% of our misspellings are not covered by it)
French AutoWikiBrowser comparison:
- Share of misspellings in the AWB list that are covered by our list: 1.35%
- Share of misspellings in our list that also appear in the AWB list: 20.78% (80% of our misspellings are not covered by it)
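The two ratios reported for each comparison can be read as set overlaps in both directions. The sketch below assumes the lists are compared on the misspelled string alone; the function name `coverage` and the toy word sets are invented for illustration and are not the project's data.

```python
def coverage(ours, other):
    """Overlap between two sets of misspellings, expressed in both directions."""
    shared = ours & other
    return {
        # Share of the other list that our list also contains.
        "other_covered_by_ours": len(shared) / len(other),
        # Share of our list that the other list already contains.
        "ours_in_other": len(shared) / len(ours),
    }


# Toy data only.
ours = {"teh", "recieve", "adress", "definately"}
other = {"teh", "wich"}

ratios = coverage(ours, other)
print(ratios)
```

One minus `ours_in_other` is the share of our list that the other list misses, which is the figure quoted in parentheses above.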
Detecting misspellings in Wikipedia
Using the list of common misspellings collected from Wiktionary, we surface misspellings in Wikipedia. The following table shows a few words detected using the enwiktionary list. Each row contains the detected misspelling, the suggested correct word, the page it occurs in, and some surrounding text. We also try to detect whether the flagged word actually belongs to a different language, and if so we provide the language detected by the model and its confidence score.
|3523667||Charly (song)||external links||Youtube||['YouTube']||['is_externallink', 'is_list', 'is_first_cap']||7||Charly video on Youtube|
|51123936||Holy Key||background||youre||["you're"]||['is_text_formatting', 'is_quote']||5||Whilst in the process of recording "Holy Key", Khaled told Sean; " Yo. This record? I don’t want no rules, no regulations, I don’t want no regular song structure. I want you to just go bad. Catch the Holy Ghost and just… This is your chance to spit them bars. Even though you spit them all the time. Big this is a special one we gonna do it on. " Khaled approached Kendrick Lamar about doing a feature on the song at a basketball game in LA they were both attending. Lamar responded to the idea of a possible collaboration; " No doubt, Khaled. I love what youre doing right now. I love how you're inspiring the world and the kids with your music. Send it through. " Khaled was hesitant to send the song immediately as he "wanted to make sure it was the right record", and with Lamar's schedule being busy, Khaled feared the record not being done on time. Lamar's verse was sent two days after Khaled spoke to him. In the days leading up to the release of "Holy Key", DJ Khaled went on SnapChat and teased by yelling " Ay Juan, did the Kendrick vocals get here yet? " and that Kendrick would have the most talked about verse on his album|
|859594||1992 NHL Entry Draft||round nine||Malmo||['Malmö']||['is_table', 'is_first_cap']||5||199 Jonas Hakansson Right Wing Philadelphia Flyers Malmo IF (Sweden) 200 Daniel Paradis Centre|
|18737478||1964 Gabonese coup d'état||notes||fait||['fate']||['is_text', 'is_list', 'is_quote', 'is_different_language']||4||0.9799692||[a] "Tout Gabonais a deux patries : la France et le Gabon." [b] "Se voulant et se croyant sincèrement démocrate, au point qu'aucune accusation ne l'irrite davantage que celle d'être un dictateur, il n'en a pas moins eu de cesse qu'il n'ait fait voter une constitution lui accordant pratiquement tous les pouvoirs et réduisant le parlement au rôle d'un décor coûteux que l'on escamote même en cas de besoin." [c] "Le jour J est arrivé, les injustices ont dépassé la mesure, ce peuple est patient, mais sa patience a des limites... il est arrivé à bout."||fr||"Tout Gabonais a deux patries : la France et le Gabon." [b] "Se voulant et se croyant sincèrement démocrate, au point qu'aucune accusation ne l'irrite davantage que celle d'être un dictateur, il n'en a pas moins eu de cesse qu'il n'ait fait voter une constitution lui accordant pratiquement tous les pouvoirs et réduisant le parlement au rôle d'un décor coûteux que l'on escamote même en cas de besoin." [c] "Le jour J est arrivé, les injustices ont dépassé la mesure, ce peuple est patient, mais sa patience a des limites... il est arrivé à bout."|
|25458415||Awaé||references||primature||['premature']||['is_externallink', 'is_list', 'is_different_language']||9||0.91887516||Site de la primature – Élections municipales 2002 Contrôle de gestion et performance des services publics communaux des villes camerounaises - Thèse de Donation Avele, Université Montesquieu Bordeaux IV Charles Nanga, La réforme de l’administration territoriale au Cameroun à la lumière de la loi constitutionnelle n° 96/06 du 18 janvier 1996 , Mémoire ENA.||fr||Site de la primature – Élections municipales 2002|
Editors can peruse the provided list to find misspellings they would like to correct in Wikipedia. The formatting column can be used to further filter the list; for example, we can avoid correcting misspellings that occur in quotes, tables, or lists. The surrounding_text can be read to view the word in context. When sure, editors can use the page_title to navigate to the Wikipedia page and correct the word.
The following image shows the frequency of misspellings in English Wikipedia. For readability, only words that occur more than 10 times are shown.
Some misspellings are not meant to be corrected, for example when the word is part of a paper or book title, a poem, or any other quoted text. Likewise, if the misspelled word carries text formatting (bold, italics, underline, etc.), we can assume the editor paid extra attention to that word while formatting it and would have noticed a misspelling; if the spelling remained, we assume it was intended. None of these are hard rules, just simple heuristics to ensure that a flagged word really is a misspelling worth correcting. Following this idea, we ignore detected misspellings that have any of the following tags: "is_list", "is_table", "is_quote", "is_text_formatting", "is_different_language".
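A minimal sketch of this filter, assuming each detection carries its tags as a list of strings. The record layout and the function name `keep_detection` are assumptions; the first two rows reuse tags from the example table, and the third row is hypothetical.

```python
EXCLUDED_TAGS = {
    "is_list",
    "is_table",
    "is_quote",
    "is_text_formatting",
    "is_different_language",
}


def keep_detection(tags):
    """Keep a detection only if none of its tags is in the excluded set."""
    return EXCLUDED_TAGS.isdisjoint(tags)


detections = [
    {"misspelling": "Youtube", "tags": ["is_externallink", "is_list", "is_first_cap"]},
    {"misspelling": "Malmo", "tags": ["is_table", "is_first_cap"]},
    {"misspelling": "exampel", "tags": ["is_text"]},  # hypothetical row
]

kept = [d for d in detections if keep_detection(d["tags"])]
print([d["misspelling"] for d in kept])
```

A single excluded tag is enough to drop a detection, so the two rows from the example table are filtered out here and only the hypothetical plain-text row remains.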
The figure above shows the total number of misspellings and the number of unique misspellings detected. enwiki and simplewiki (Simple English Wikipedia) contain the most unique detected misspellings, possibly because English has the most collected misspellings. Our tool also detects misspellings in several other language Wikipedias, even though the collected list for those languages is not very large.
In a given language Wikipedia, a few misspelled words might occur very often while the rest occur rarely (a long-tail distribution). To understand the distribution of occurrences, we calculate the Gini coefficient. A Gini coefficient of 0 reflects perfect equality, where every word occurs equally often, while a Gini coefficient of 1 reflects maximal inequality, i.e. only one or two words account for all of the detected misspellings.
From the figure above, we find that in almost all languages the Gini coefficient is very high, indicating that a few misspellings account for most detections. After applying the filtering discussed above, the Gini coefficient decreases for some wikis, i.e. we are now detecting a greater variety of misspellings.
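For reference, one common way to compute a Gini coefficient from per-word occurrence counts is the rank-weighted formula over the sorted counts. The source does not say which formulation was used, so the sketch below is an assumption, and the function name `gini` and the toy counts are illustrative only.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = all equal)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, with ranks starting at 1.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even = gini([5, 5, 5, 5])     # every word occurs equally often
skewed = gini([0, 0, 0, 10])  # a single word accounts for every occurrence
print(even, skewed)
```

With this formula the maximum for n words is (n - 1) / n, so a value of exactly 1 is only approached when the number of distinct words is large.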
This CSV file contains stats for all Wikipedias, including the number of misspellings detected, the number of unique misspellings detected, Gini coefficients, etc.
- Wiktionary is a good starting point for compiling lists of common misspellings in various languages.
- Currently most detected misspellings come from English Wiktionary. As Wiktionaries in other languages develop, we can expect better results for those languages as well.
- Comparing the Wiktionary-based misspelling list with existing lists in English and French shows that Wiktionary contains many misspellings not covered by the existing lists. This demonstrates the usefulness of our approach.
- Finally, using our collected list, we surface misspellings in all language Wikipedias. The detected misspellings are currently available as CSV files, which editors can use to find misspellings to correct (taking into account the context, language, and text formatting).
- Repository: gitlab:repos/research/copyedit-common-misspellings
- List of all misspellings collected from Wiktionary: gitlab:repos/research/copyedit-common-misspellings/-/blob/main/resources/all_wiki_common_misspellings.tsv
- Misspellings detected in various language Wikipedias: gitlab:repos/research/copyedit-common-misspellings/-/tree/main/outputs
- Parser Used: mwparserfromhell