Small wiki audit/audits/Malagasy Wiktionary

From Meta, a Wikimedia project coordination wiki

Audit of Malagasy Wiktionary


Written by Metaknowledge, with help from Surjection, AryamanA, Erutuon, and Smashhoof, along with input from a fluent speaker of Malagasy who wishes to remain anonymous.


Bot-Jagwar is a bot account run by Jagwar. At mg.wikt, it has made 22,828,226 edits (and counting), catapulting mg.wikt to be the second-biggest Wiktionary, with a total of 6,103,961 entries (and counting). (Note that as bot edits are continuing, all these numbers will be outdated.) Jagwar has a secondary bot account, Bot-Jagwar II, which has only made 6,976 edits. Another major bot contributing to mg.wikt, making the exact same type of edit, is Ikotobaity, with 2,456,748 edits (run by Lohataona until 2017; now inactive). These three bots have created 6,076,769 new mainspace pages (and counting), which is 99.23% of all mainspace pages on mg.wikt. (Jagwar also ran bot edits on his main account, so the true number of bot-created entries is about 50,000 higher.)

In this blog post, he details the history of his bot and mg.wikt. He uses NLP and automated translation in order to generate new entries, without any human intervention or oversight. To quote Jagwar himself: "But as time passes a lot of pages get created, and even with a lot rate of error, you end up with thousands of pages of potentially wrong information." (emphasis not mine) So he knows these entries are wrong, but simply doesn't care.

The reason that no action has been taken at mg.wikt is that Jagwar is the sole admin who has made edits, and there is no active editing community. Jagwar himself has only made 6 edits in the last 90 days, of which only 3 were in mainspace. Even an editing community of the size of the biggest Wiktionary, en.wikt, would not be able to clean up after these bots by hand.

Problems with non-Malagasy entries on mg.wikt

Of the 4,953,779 (and counting) non-Malagasy entries on mg.wikt, the vast majority were created by these bots based on automatic translation from other Wiktionaries, chiefly en.wikt and fr.wikt. These translations can be wanting in various ways. Some of them have nearly correct definitions, but are missing important lexicographical information that makes the entry as a whole misleading, e.g. mg:wikt:nigger is translated as mainty, which is an adjective that simply means "black" — this is obviously problematic coverage of a highly offensive word. Others are incorrect because only one part of the entry is translated, e.g. mg:wikt:cirugía plástica (Spanish for "plastic surgery") is translated as fandidiana, which just means "surgery". Still others are incorrect because the entry was parsed incorrectly, e.g. mg:wikt:match#Espaniola (Spanish for "match", as in a sporting match) is translated as mahaleo, afokasoka, which is nonsensical — the first word means "to be equal (to), to match" and the second "match [device used to light a fire]". Here the bot was trying to hedge its bets by giving multiple, mutually exclusive interpretations of what English "match" could mean, and yet both are incorrect! Many others are not wildly wrong, but still useless, e.g. mg:wikt:duniani (Swahili locative form meaning "in/on the world") is translated as giloby, which means "globe".

Inflected forms of words in non-Malagasy languages were bot-created for various languages, including Spanish. Many of these are basically correct in their content, but the presentation is misleading at best; at mg:wikt:afilan, two definitions are given, but one points to the suffix -ar rather than the word itself, and the other uses the English word "default" in the definition, inaccurately. However, a significant portion of non-lemma entries seem to be incorrect, due to bizarre bot errors, e.g. mg:wikt:consorcíate, which tries to link to an obviously incorrect entry "sense=affirmative". There are 24,953 entries linking to "formal=n", 17,847 entries linking to "formal=y", 23,337 entries linking to "person=1", and many thousands more with similar errors.
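Link targets such as "formal=n" or "person=1" look like template parameters that leaked into the entry as wikilinks. As a rough illustration only, and not a description of how the counts above were obtained, the sketch below scans a piece of wikitext for links whose target is a bare key=value pair. The function name, the regular expressions, and the sample wikitext are all hypothetical.

```python
import re

# Matches [[target]] or [[target|label]] and captures only the target part.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

# Assumption: a leaked template parameter looks like a lowercase key,
# an equals sign, and a value without spaces, e.g. "formal=n" or "person=1".
PARAM_LIKE = re.compile(r"^[a-z]+=[^=\s]+$")


def leaked_parameters(wikitext):
    """Return link targets that look like template parameters, in order."""
    targets = (t.strip() for t in WIKILINK.findall(wikitext))
    return [t for t in targets if PARAM_LIKE.match(t)]


# Invented sample text, not copied from any real mg.wikt entry.
sample = "# [[sense=affirmative]] [[formal=n]] of [[consorciar|consorciarse]]"
print(leaked_parameters(sample))
```

A check of this kind could only flag candidates for review; a real title containing an equals sign would also match.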

Some entries were not created based on other Wiktionaries, but seemingly based on dictionary entries, causing bizarre errors like mg:wikt:singing traditional sakalava accompany the drum, which is claimed to be a word in French (!).

When an entry on another Wiktionary is deleted, renamed, or corrected, the copy of it made on mg.wikt is never modified, leading to yet another source of error, although likely a much smaller one. For example, the Kinyarwanda section on mg:wikt:bogobogo is incorrect, because it is based on fr:wikt:bogobogo, which was deleted earlier this year, but the bot had already created an entry on mg.wikt.

Relatively few of these entries carry any warning for the reader. Of those that do (633,326 in total), 406,725 are marked as translated from en.wikt, 107,307 as translated from fr.wikt, and 119,294 as translated from other Wiktionaries. The likely reason so few are marked is that the category for entries needing verification, which is accompanied by a template warning the reader that the entry has been translated, is a recent addition: older translated entries lack it.

Quantifying the rate of error

Only careful inspection can reveal the extent of the errors, and that is not feasible for the millions of entries on mg.wikt. I assessed a random subsample of 100 pages with at least one non-Malagasy lemma entry. The full list of entries with their assessments, including details on any problems, is at Small wiki audit/Malagasy Wiktionary/100. I found that 49/100 were essentially unusable, as they had serious errors or omissions. A further 29/100 were only partially usable, due to significant omissions that did not rise to the level of being outright errors. Only 22/100 appear to be fully correct and usable, of which 2 are uncertain and included to be generous. Assuming the subsample is representative, and there is no reason to think otherwise, this suggests that around half of all non-Malagasy lemma entries are incorrect, and only around a fifth are fully usable (and even many of these have minor errors!). This kind of consistently low quality would be grounds for blocking if done by a human editor on any Wiktionary.
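To give a sense of how much uncertainty a sample of 100 carries, the sketch below computes 95% Wilson score intervals for the three proportions reported above. This is an added illustration, not part of the original assessment: it assumes the 100 pages were a simple random sample, and the choice of the Wilson method and the 95% level are assumptions made here.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


SAMPLE_SIZE = 100
# Counts taken from the assessment of the 100-page subsample.
COUNTS = {"unusable": 49, "partially usable": 29, "fully usable": 22}

intervals = {label: wilson_interval(k, SAMPLE_SIZE) for label, k in COUNTS.items()}
for label, (low, high) in intervals.items():
    print(f"{label}: {COUNTS[label]}/{SAMPLE_SIZE}, 95% interval {low:.1%} to {high:.1%}")
```

With only 100 pages the intervals are wide, roughly ten percentage points either side of the 49% figure, so "around half" and "around a fifth" are best read as approximate estimates rather than precise rates.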

Problems with Malagasy entries on mg.wikt

There are 41,902 entries categorised as lacking any definition, most of which seem to be Malagasy entries, and around 30,000 of which are the result of the definitions being removed due to copyright violation many years ago. Although there are 1,150,182 Malagasy entries in total, most of these are inflected forms, which can generally be safely created by bots. The definitionless entries are not, strictly speaking, incorrect, but providing definitions is the central function of a dictionary, so these entries fail to be a useful part of the dictionary as a whole.

Additionally, there are 6,319 effectively definitionless Malagasy entries not counted in the table below, like mg:wikt:matoanteny, where the word to be defined is given as its own definition, or mg:wikt:mahaketrona, where the definition is blank. Some cases, like mg:wikt:tamboho, have two identical Malagasy sections, each of which simply gives the word to be defined (listed twice) as the definition. This kind of entry is not even categorised as needing a definition, but is just as useless as a dictionary entry, and the duplicated sections reflect the bots' inability to follow basic Wiktionary formatting.

The bot-added translation sections in Malagasy entries are also largely incorrect. For example, mg:wikt:ny#Malagasy means "the", but among the translations given are "so, her, him, them" for English, "Internet" for Afrikaans, "Herodianus, sarawakensis, bogotensis, beijingensis, herous, colon, parasceve" for Latin, "orchestra, banana, ataraxia" for Romanian, and many, many more examples of absurd mistranslation on that one entry alone.

A fluent Malagasy speaker was consulted in order to assess the correctness and grammaticality of the Malagasy used in definitions. He concurred with the basic problems identified here, and stated that some Malagasy entries, like mg:wikt:ady fom-pananana, are defined with incomplete sentences. In regard to both Malagasy and non-Malagasy entries, he said that they are "hit or miss on whether the information is useful or not", without assessing the accuracy of the information. In addition to these content issues, some bot-created Malagasy entries, like mg:wikt:navadika, may have correct content but are so misformatted that they are hardly recognisable as Wiktionary entries.

Quantifying the rate of definitionless entries

There are at least 47,379 definitionless Malagasy entries in total (along with 24,626 definitionless non-Malagasy entries). This total and the table below do not include about 1,423 Malagasy entries of the type shown by mg:wikt:ambaratonga, whose definitions are circular: the dictionary provides synonyms, but the entries themselves are effectively definitionless.

Part of speech | Entries | No definition | Notes
Nouns | 71,757 | 8,036 |
Verbs | 31,347 | 4,989 |
Phrases | 6,637 | 2,843 |
Adjectives | 4,146 | 687 |
Proper nouns | 1,187 | 3 | All of these are defined as "name of person", "name of place", etc.
Roots | 304 | 0 | All defined as root forms from verbs.
Adverbs | 14 | 0 |
Infixes | 3 | 0 |
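The table gives raw counts only. As an added illustration, the sketch below works out the share of entries without a definition for each part of speech, using the figures from the table above. The variable and function names are hypothetical, and the calculation covers only the rows listed in the table.

```python
# Rows copied from the table above:
# (part of speech, total entries, entries with no definition).
ROWS = [
    ("Nouns", 71_757, 8_036),
    ("Verbs", 31_347, 4_989),
    ("Phrases", 6_637, 2_843),
    ("Adjectives", 4_146, 687),
    ("Proper nouns", 1_187, 3),
    ("Roots", 304, 0),
    ("Adverbs", 14, 0),
    ("Infixes", 3, 0),
]


def definitionless_share(rows):
    """Map each part of speech to the fraction of its entries lacking a definition."""
    return {pos: missing / entries for pos, entries, missing in rows}


shares = definitionless_share(ROWS)
# Print the parts of speech from the highest share to the lowest.
for pos, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pos:<13}{share:6.1%}")
```

On these figures, phrases have by far the highest share of missing definitions (roughly 43%), against roughly 11% for nouns.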

Recommendations

So far, no external action has been taken: despite discussions, Jagwar continues to run his bot without consequences. To quote him, "But this mass-adding content, especially in language I didn’t speak at all, seemed to annoy people that have decided to discuss about the case on MetaWiki forum. No concluding results was given, and things were as they were before." We need to change this.

I strongly recommend that all non-Malagasy entries created on mg.wikt by Bot-Jagwar, Bot-Jagwar II, Ikotobaity, and Jagwar's bot run under his own account be deleted, and all the translation sections in Malagasy entries be removed. I further strongly recommend that the owners of these bots, Jagwar and Lohataona, be warned not to use them to create more entries at any Wiktionary ever again, or else the bots will be globally blocked.

I weakly recommend that all definitionless Malagasy entries on mg.wikt created by these bots be deleted. These entries do not actively harm the dictionary in the same way as incorrect content, but they lower its signal-to-noise ratio and usefulness.

Further work: Problems at other Wiktionaries

Jagwar ran his bot at some other language Wiktionaries, in some cases using the same automated translations and producing questionable content that those Wiktionaries have not checked.

  • 218,156 edits at chr.wikt from 2012 to 2014, almost all unedited by humans. These populate the category chr:wikt:Category:Entry to be checked, which currently contains 185,434 entries. There are no active editors at chr.wikt.
  • 127,389 edits at ku.wikt from 2012 to 2013, almost all unedited by humans. The Malagasy entries here include a large number of verbs that are simply defined as the present tense of that very verb, thus lacking any actual definition, e.g. ku:wikt:mivoendre. Although not incorrect, these are essentially undefined entries.

Edits on other Wiktionaries were primarily adding Malagasy lemmas (at fr.wikt and at en.wikt, where he used his main account to run bot edits) or adding interwiki links, so no major harm seems to have been done elsewhere. However, editors at those Wiktionaries should still be advised to look over his edits, as they still contain frequent errors in definitions, part of speech assignment, and more.

Previous discussions

See: