Requests for comment/Cross-wiki management of Wiktionary headwords using a Wikidata-like approach

From Meta, a Wikimedia project coordination wiki

This is a subpage; for more information, see the Requests for comments page.


Proposal

I propose an item-based approach to managing entries across multiple Wiktionaries. Each headword in each language would be assigned a unique identifier, e.g. "D1" for English "aardvark" ... "D1000" for Spanish "verde", "D1001" for Portuguese "verde", "D1002" for Italian "verde" ... "D7575757" for Zulu "zungeza" (see wikt:Wiktionary:Statistics for an estimate of how many headwords currently exist on the English Wiktionary).
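As a rough illustration of the identifier scheme, here is a minimal sketch in Python. The `HeadwordRegistry` class and its `get_or_assign` method are hypothetical names invented for this example, and the sequential numbering is only an assumption about how identifiers could be allocated; it is not a description of any existing tool.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class HeadwordRegistry:
    """Hypothetical registry: one "D" identifier per (language, headword) pair."""

    _ids: dict = field(default_factory=dict)
    _counter: count = field(default_factory=lambda: count(1))

    def get_or_assign(self, language: str, headword: str) -> str:
        # The key is the pair, so the same spelling in two languages
        # receives two different identifiers.
        key = (language, headword)
        if key not in self._ids:
            self._ids[key] = f"D{next(self._counter)}"
        return self._ids[key]


registry = HeadwordRegistry()
aardvark_id = registry.get_or_assign("English", "aardvark")
verde_es = registry.get_or_assign("Spanish", "verde")
verde_pt = registry.get_or_assign("Portuguese", "verde")
```

Asking for Spanish "verde" a second time returns the identifier that was already assigned, so the identifier stays stable however many Wiktionaries contain the entry.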

See also phabricator:T987 and its related tasks that are still open.

Why use such an approach?

Although mw:Extension:Cognate already exists to handle cross-wiki links, Cognate only matches words by page title and cannot provide more specific information on entries, e.g. which Wiktionary has an entry for a word in a particular language. This is because Wiktionary places headwords from different languages that share the same spelling on the same wiki page. For example, the word "agua" has entries in 14 languages on the English Wiktionary. However, someone who wants to know which Wiktionaries have an entry for Swahili "agua" would need to visit the "agua" page on every Wiktionary to look for the Swahili entry.

By having a centralized place to check whether a language-specific entry exists on other Wiktionaries, editors can work on expanding content in other Wiktionaries. For example, some Wiktionaries may only have Spanish "agua" but not Swahili "agua". Perhaps it is even possible to load the entry for Swahili "agua" from different Wiktionaries on the same page so editors can easily copy citations from one Wiktionary to another.
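The kind of lookup such a central place would allow can be sketched as follows. The sample data, the wiki codes, and the function names `wikis_with` and `wikis_missing` are all made up for illustration; a real index would be filled by a bot rather than written by hand.

```python
from collections import defaultdict

# Hypothetical sample data: (wiki code, language, headword) triples.
entries = [
    ("en", "Spanish", "agua"),
    ("en", "Swahili", "agua"),
    ("fr", "Spanish", "agua"),
    ("de", "Spanish", "agua"),
    ("de", "Swahili", "agua"),
]

# Map each (language, headword) pair to the set of wikis that have it.
coverage = defaultdict(set)
for wiki, language, headword in entries:
    coverage[(language, headword)].add(wiki)


def wikis_with(language, headword):
    """Return the wikis that already have this language-specific entry."""
    return sorted(coverage.get((language, headword), set()))


def wikis_missing(language, headword, all_wikis):
    """Return the wikis from all_wikis that lack this entry."""
    return sorted(set(all_wikis) - coverage.get((language, headword), set()))
```

With this sample data, `wikis_missing("Swahili", "agua", ["en", "fr", "de"])` points an editor straight at the French Wiktionary, without visiting each "agua" page in turn.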

Examples of how this can be useful

Imagine that someone had accidentally created an English entry for "seperating" on en.wiktionary.org, and ten other xx.wiktionary.org sites had duplicated the same entry without realizing the mistake. What would happen after "seperating" is renamed to its correct spelling "separating" on en.wiktionary.org? The other Wiktionaries would still contain the error. However, if the erroneous English word "seperating" had a unique identifier such as D666666, editors from one wiki could ping admins from other wikis to deal with the erroneous entry.

By synchronizing the existence of language-specific entries across multiple Wiktionaries, it would be easier to detect instances of fake words, vandalism and incidents of cross-wiki abuse. Other enhancements include making these possible: phab:T13996 phab:T13998 phab:T14213 phab:T38881. If users can select only the languages they are interested in, pages would load faster because only specific languages are requested. This would involve transclusion of entries using the unique identifier and individual wikis can choose whether they want to do it or not.

It is hard to imagine words that only exist on one Wiktionary but not another. Some of these words may be genuine mistakes, while some others may be caused by inconsistencies in the choice of Unicode characters. If an item-based approach is available, words using different spelling conventions or different Unicode spellings can be marked as similar items. This would make it easier to locate and view cross-wiki entries that have different conventions in spelling words, e.g. usage of combining forms.
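One possible way to detect headwords that differ only in their choice of Unicode characters is to compare them after Unicode normalization. The sketch below uses Python's standard `unicodedata.normalize` with the NFC form; the helper names `normalization_key` and `group_similar` are hypothetical, and NFC only covers canonical equivalence (precomposed versus combining characters), not every difference in spelling convention.

```python
import unicodedata
from collections import defaultdict


def normalization_key(headword: str) -> str:
    """Return the NFC form, used here as a key for grouping spellings."""
    return unicodedata.normalize("NFC", headword)


def group_similar(headwords):
    """Group headwords whose NFC forms are identical; drop singletons."""
    groups = defaultdict(list)
    for word in headwords:
        groups[normalization_key(word)].append(word)
    return [group for group in groups.values() if len(group) > 1]


precomposed = "caf\u00e9"   # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
similar = group_similar([precomposed, decomposed, "verde"])
```

The two spellings of "café" are different strings, yet they end up in the same group, which is the kind of pair that could be marked as similar items.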

Comparison with Lexeme namespace on Wikidata

This proposal is about cross-wiki management of existing words on Wiktionary and is not related to the Lexeme namespace on Wikidata, which is a separate project that involves the representation of dictionary entries (lexical items) using structured data. The suggestion here is for an automated bot to record all entries available on all Wiktionaries, sort them by their language headers, and assign one unique identifier to each headword in each language. For instance, English "car", English "cars", French "car", and French "cars" would all get separate identifiers.
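The step of sorting entries by their language headers could look roughly like the sketch below. It assumes the English Wiktionary layout, where each language section starts with a level-2 heading such as `==English==`; other Wiktionaries may mark language sections differently, so a real bot would need per-wiki rules. The function name `headword_keys` and the sample wikitext are invented for this example.

```python
import re

# Matches level-2 headings only ("==English=="), not "===Noun===".
LANGUAGE_HEADING = re.compile(r"^==[ \t]*([^=\n]+?)[ \t]*==[ \t]*$", re.MULTILINE)


def headword_keys(page_title, wikitext):
    """Return one (language, headword) pair per language section on a page."""
    return [(m.group(1), page_title) for m in LANGUAGE_HEADING.finditer(wikitext)]


# Hypothetical, heavily shortened wikitext for the page "car".
sample = """==English==
===Noun===
car

==French==
===Conjunction===
car
"""

keys = headword_keys("car", sample)
```

Each pair returned for a page would then receive its own identifier, so the English and French sections of "car" are recorded as two separate items.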

Since the Wikidata infrastructure already exists, I suggest using a separate Wikidata namespace for the cross-wiki management of Wiktionary headwords. KevinUp (talk) 06:20, 20 June 2020 (UTC)

Discussion

Two strategies were framed about related issues:
A wikibase could be an interesting option for better interwiki transmission of templates and inflection frames. For definitions of lexemes, it is more complex and heavy to rely purely on a strict data structure. For the purpose described here, that implies a huge database with few relations on one side (similar to Cognate but heavier) and a large modification of pages in Wiktionaries on the other side. I think the latter part makes this project too complicated for the benefits planned. -- Noé (talk) 09:12, 23 June 2020 (UTC)
The large-scale modification of pages in Wiktionaries can be made optional. The idea is to use a bot to document all headwords that exist in different Wiktionaries and store them in a central wikibase so that editors from different Wiktionaries can go to the wikibase to discover new words that can be added to their own Wiktionaries or to resolve ambiguities that exist in different Wiktionaries. KevinUp (talk) 11:06, 23 June 2020 (UTC)

Could we get most of the benefit by adding a property to Wikidata that states on a Lexeme "represented in Wiktionary" and takes a URL pointing to the right section? E.g. imagine such a new property, and on d:Lexeme:L3760 it would link to https://de.wiktionary.org/wiki/land#Substantiv_2 , https://en.wiktionary.org/wiki/land#Noun , https://fr.wiktionary.org/wiki/land#Nom_commun_2 , https://es.wiktionary.org/wiki/land#Sustantivo , etc.? --denny (talk) 16:08, 23 June 2020 (UTC)

Such a property can be created on Wikidata. However, the concepts of a "lexeme" and a "headword" are not the same. A lexeme such as "swim" would also represent the past tense "swam" and the present participle "swimming", so multiple headwords would be placed under the same lexeme. I don't think this is suitable for languages such as Pali that are written in multiple scripts. Having multiple verb forms written in multiple scripts linked to multiple Wiktionaries all under the same lexeme page would lead to a very cluttered appearance. There is also the issue of Wikidata lexemes lagging behind the number of entries that are already available on Wiktionary. Currently, it is not possible to automatically import Wiktionary entries as Wikidata lexemes, and it would take years before every Wiktionary entry is available as a Wikidata lexeme. KevinUp (talk) 17:20, 23 June 2020 (UTC)
@KevinUp why do you claim that it is not possible? How about projects like Lexicator by @Yurik (https://github.com/nyurik/lexicator)? Gower (talk) 16:39, 10 November 2022 (UTC)
I made a similar suggestion here -- Noé (talk) 17:35, 23 June 2020 (UTC)
This approach would solve many issues (T150841 is another example), so I support it. Pamputt (talk) 05:45, 24 June 2020 (UTC)
Support. It would solve this one too: T165061. Taylor 49 (talk) 17:41, 21 May 2021 (UTC)
Support, as I have run into similar problems during my (occasional) inter-wiktionary manual standardisation fixes myself. Zezen (talk) 02:53, 29 November 2021 (UTC)
Support. Let us stop excessively duplicating the same things across different Wiktionaries.--Jusjih (talk) 01:24, 3 October 2022 (UTC)