Abstract Wikipedia/Updates/2021-09-17

Abstract Wikipedia Updates

Lexemes and paradigms

Last week we discussed how to implement paradigms in Wikifunctions. This week, let’s discuss a few ideas on how this could be used.

One may ask why this is useful, given that we are collecting all the different forms in the lexicographic data in Wikidata anyway. We don’t need to generate the forms if we have a full set of forms in Wikidata, surely?

There are several possible use cases:

First, we will probably never achieve full coverage in Wikidata of all forms in all languages. In some languages the number of forms may be prohibitively high, and we, like every other dictionary, might need to select which forms to store. Often the forms not stored are highly regular.

Second, even if we have really good coverage, occasionally you will need to introduce words that are not in the dictionary: when displaying neologisms, when generating a new lexeme by conversion from another grammatical category (for example: verbing nouns in English, or using place names to make demonyms), or when using loanwords from other languages. Fortunately, such words are often regular, and having smart paradigms as described last time can take us pretty far.

Third, the paradigms can be linked from the actual Lexemes in Wikidata. For example, on a Lexeme such as "cat" we could link to a paradigm that we developed last week, either the add s function or the English regular plural function. Linking the Lexeme to the function allows individual forms to be re-generated, which in turn means the stored forms can be checked for correctness, thus improving data quality. The English regular plural function can tell us that the plural of "pasty" should be "pasties", whereas the Wikidata Lexeme previously gave it as "pastiest". The plural of "strawman" should be "strawmen", not "strawmans"; the plural of "Frenchwoman" should be "Frenchwomen", not "Frenchwoman".
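The checking idea can be sketched in a few lines of Python. This is only an illustration: `english_regular_plural` is a simplified, hypothetical stand-in for the paradigm from last week, `find_mismatches` is an invented helper, and the stored plurals are a hand-written dictionary rather than data fetched from Wikidata.

```python
def english_regular_plural(lemma: str) -> str:
    """Simplified, hypothetical stand-in for an English regular plural paradigm."""
    vowels = "aeiou"
    # consonant + y -> ies (pasty -> pasties)
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in vowels:
        return lemma[:-1] + "ies"
    # sibilant endings take -es (box -> boxes)
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    return lemma + "s"


def find_mismatches(stored: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {lemma: (stored_plural, generated_plural)} wherever the two differ."""
    mismatches = {}
    for lemma, stored_plural in stored.items():
        generated = english_regular_plural(lemma)
        if generated != stored_plural:
            mismatches[lemma] = (stored_plural, generated)
    return mismatches


# Hand-written sample data (an assumption, not a Wikidata query result).
stored_plurals = {"cat": "cats", "pasty": "pastiest", "box": "boxes"}
print(find_mismatches(stored_plurals))
# {'pasty': ('pastiest', 'pasties')}
```

A mismatch only flags an entry for human review. A correct irregular plural stored on a Lexeme that is linked to a purely regular paradigm would be flagged as well, so the result is a list of candidates, not of confirmed errors.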

One question is: if we have a paradigm that can create the forms, why even create and store the forms in Wikidata in the first place? That’s a great question, and a decision that can indeed be revisited by the community. Personally, I think we need both: forms stored explicitly in Wikidata, and generative paradigms. Without stored forms, it's not clear how we would handle irregular forms. Would the onus lie on the paradigms? That seems messy. Conversely, paradigms are crucial when, for example, a Lexeme has thousands of possible forms. If these forms are always regular, the community might decide not to materialize them all, especially if many Lexemes follow the same regular morphological pattern.
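One way to picture how stored forms and paradigms could complement each other is a lookup that prefers an explicitly stored form and falls back to the paradigm otherwise. The sketch below is an assumption about how such a lookup might work, not a description of Wikidata or Wikifunctions; `add_s`, `STORED_PLURALS`, and `plural` are hypothetical names, and the stored entries are hand-written examples.

```python
from typing import Callable, Dict


def add_s(lemma: str) -> str:
    """Hypothetical minimal paradigm: append "s"."""
    return lemma + "s"


# Explicitly stored forms, here standing in for irregular entries (sample data).
STORED_PLURALS: Dict[str, str] = {"child": "children", "mouse": "mice"}


def plural(
    lemma: str,
    stored: Dict[str, str] = STORED_PLURALS,
    paradigm: Callable[[str], str] = add_s,
) -> str:
    """Prefer the stored form; generate one from the paradigm only if none is stored."""
    if lemma in stored:
        return stored[lemma]
    return paradigm(lemma)


print(plural("child"))  # children (stored, irregular)
print(plural("cat"))    # cats (generated by the paradigm)
```

With this division of labour, the paradigm never needs to know about exceptions, and regular forms never need to be materialized.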

This also seems to be the case for English nouns: almost all of the English nouns in Wikidata have two forms, even though one could argue that English nouns have four forms (including the possessive forms). However, the English possessive forms are generated so regularly that, so far, Wikidata contributors seem to consider them unnecessary and usually omit them.
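To illustrate how regular the possessive forms are, here is a minimal sketch that derives all four forms from the two that are usually stored. The function names are hypothetical, and the rule used (add 's, but only an apostrophe after a plural that already ends in s) is one common convention; style guides differ for some singular nouns ending in s.

```python
def possessive(form: str, is_plural: bool) -> str:
    """Add 's, or only an apostrophe after a plural that already ends in s."""
    if is_plural and form.endswith("s"):
        return form + "'"
    return form + "'s"


def noun_forms(singular: str, plural: str) -> dict[str, str]:
    """Derive the four forms of an English noun from its two stored forms."""
    return {
        "singular": singular,
        "plural": plural,
        "singular possessive": possessive(singular, is_plural=False),
        "plural possessive": possessive(plural, is_plural=True),
    }


print(noun_forms("cat", "cats"))
# {'singular': 'cat', 'plural': 'cats', 'singular possessive': "cat's", 'plural possessive': "cats'"}
print(noun_forms("child", "children")["plural possessive"])
# children's
```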

Fourth, the paradigms can also be used to propose a starting point when entering the data. Imagine the Wikidata Lexeme Forms tool allowing you to select a function on Wikifunctions that, given the lemma, generates all likely forms for an entry. The Lexeme Forms tool has already improved the creation of Lexemes considerably, making the entries much more consistent and complete. If, in addition, most of the forms could be generated automatically, data entry would become much faster and, at the same time, less prone to errors.

Besides all these immediate improvements, there might be many further advantages. For example, storing an offline dictionary would require much less storage space if we use paradigms. Developing paradigms for currently under-resourced languages might create aids for working with those languages. Having a knowledge base of paradigms across languages may be interesting from the perspective of linguistic research.

Once Wikifunctions has launched, we hope that the community will develop a library of morphological paradigms and connect them with the lexicographical data in Wikidata. Besides being a very helpful step on our path to Abstract Wikipedia, we think that this will considerably expand the content of the lexicographical data in Wikidata. That, together with enabling access to the lexicographic data from within the Wiktionaries, will significantly empower the contributors to Wiktionary, particularly to the smaller Wiktionaries and to the languages with fewer contributors in all Wiktionaries.

Thanks to User:YULdigitalpreservation, who created EntitySchema E327 on Wikidata for English Nouns with Genitives, and to User:VIGNERON for creating French plural morphology on NotWikiLambda, and User:Strobilomyces for collaborating on that.