Abstract Wikipedia/Updates/2021-09-03

Abstract Wikipedia Updates

Generating text with Ninai and Udiron

The update this week was written by Mahir Morshed. Mahir is a long-time contributor to Wikidata, and in particular to its lexicographical data. He has developed a prototype that generates natural language in Bengali and Swedish from an abstract content representation, with the goal that it could be implemented within Wikifunctions. In this newsletter, Mahir describes the prototype.

Discussion of Abstract Wikipedia's natural language generation capabilities has revolved around the presence of abstract constructors and concrete renderers per language, while also noting the use of Wikidata items and lexemes as a basis for mapping concepts to language. To make this connection a bit easier to imagine, I have started to build a text generation system. It uses items, lexemes, and wrappers for them as building blocks, and these blocks are then assembled into syntactic trees, based in part on the Universal Dependencies syntactic annotation scheme.

(If this seems like a different approach from what was discussed in a newsletter two months prior, that's because it is. Feel free to drop me a message if you'd like to discuss it.)

The system is composed of three parts, where the last is likely to be something we could skip in a port to Wikifunctions:

Ninai (from the Classical Tamil for "to think") holds all constructors, logic at a sufficiently high level for renderers, and a resolution system from items (each wrapped in a "Concept" object) to sense IDs for a given language. Decisions and actions in Ninai are meant to be agnostic to the methods for text formation underneath, which are supplied by...
Udiron (from the Bengali pronunciation of the Sanskrit for "communicating, saying, speaking"), which holds lower-level text manipulation functions for specific languages. These functions operate on syntactic trees of lexemes (each lexeme wrapped in a "Clause" object). These lexemes are imported via...
tfsl (from "twofivesixlex"), a lexeme manipulation tool, which is intended to be akin to pywikibot but with a specific focus on the handling of Wikibase objects. Both of the above components depend on this one, although if 'native' access to and manipulation of items and lexemes become possible with Wikifunctions built-ins, tfsl could possibly be omitted.

Some design choices in this system worth noting are as follows:

Constructors, while being language-agnostic and falling within some portion of a class hierarchy, are purely containers for their arguments, carrying no other logic within. This means, for example, that an instance of a constructor Existence(subject), to indicate that the subject in question exists, only holds that subject within that instance, and does nothing else until a renderer encounters that constructor.
Every constructor allows, in addition to any required inputs, a list of extra modifiers in any order (the 'scope' of the idea represented by that constructor). This means, for example, that a constructor Benefaction(benefactor, beneficiary) might be invoked with extra arguments for the time, place, mode, and other specifiers after the beneficiary.
When one 'renders' a composition of constructors, a Clause object (representing the root of a syntactic tree) is returned; turning it into a string of text is done with Python's str() built-in applied to that object.
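
The three design choices above can be illustrated with a small sketch. This is not Ninai's or Udiron's actual code: the class layout, the `required` attribute, the `TOY_LEXICON` table, the `render` function, and the use of plain string item IDs are all assumptions made for illustration, and the renderer produces toy English rather than Bengali or Swedish. Only the names `Concept`, `Existence`, `Locative`, and `Clause` come from the description above.

```python
class Constructor:
    """Pure container: stores its arguments and does nothing else."""
    required = 0  # how many leading arguments are mandatory (hypothetical convention)

    def __init__(self, *args):
        if len(args) < self.required:
            raise TypeError(f"{type(self).__name__} needs {self.required} argument(s)")
        self.core = args[:self.required]    # required inputs
        self.scope = args[self.required:]   # optional modifiers, in any order


class Concept(Constructor):
    required = 1  # an item ID, here simplified to a string such as "Q319"


class Existence(Constructor):
    required = 1  # the subject that exists


class Locative(Constructor):
    required = 1  # a place, usable as a scope modifier


class Clause:
    """Root of a toy syntactic tree: a head word with dependents on either side."""

    def __init__(self, word, before=(), after=()):
        self.word = word
        self.before = list(before)
        self.after = list(after)

    def __str__(self):
        # In-order traversal: dependents before the head, the head, dependents after.
        parts = [str(c) for c in self.before] + [self.word] + [str(c) for c in self.after]
        return " ".join(parts)


# Stand-in for sense resolution; a real system would look up lexeme senses.
TOY_LEXICON = {"Q319": "Jupiter", "Q544": "the Solar System"}


def render(node):
    """Toy English renderer: all logic lives here, none in the constructors."""
    if isinstance(node, Concept):
        return Clause(TOY_LEXICON[node.core[0]])
    if isinstance(node, Locative):
        return Clause("in", after=[render(node.core[0])])
    if isinstance(node, Existence):
        return Clause("exists",
                      before=[render(node.core[0])],
                      after=[render(m) for m in node.scope])
    raise NotImplementedError(type(node).__name__)


composition = Existence(Concept("Q319"), Locative(Concept("Q544")))
clause = render(composition)  # a Clause object, not yet a string
print(str(clause))            # Jupiter exists in the Solar System
```

Keeping the constructors free of logic means the same `composition` object could be handed to a renderer for any language; only `render` and the lexicon would need to change.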

At the moment, there are just enough constructors to represent Sentence 1.1 from the Jupiter examples, as well as renderers in Bengali and Swedish for those constructors (thanks to Bodhisattwa, Jan, and Dennis for feedback on those). Building up to the Jupiter sentence should demonstrate how these work:

Building up to the Jupiter sentence step by step
Constructor text	Bengali output	Swedish output	Gloss (not renderer output!)	Notes
Identification( Concept(Q(319)), Concept(Q(634)))	বৃহস্পতি গ্রহ।	Jupiter är klot.	Jupiter is planet.	We start by simply identifying the two concepts of Jupiter (Q319) and planet (Q634) as being equal.
Identification( Concept(Q(319)), Instance( Concept(Q(634))))	বৃহস্পতি একটা গ্রহ।	Jupiter är ett klot.	Jupiter is a planet.	Instead of equating the concepts alone, we might instead equate "Jupiter" with an instance of "planet".
Identification( Concept(Q(319)), Instance( Concept(Q(634)), Definite()))	বৃহস্পতি গ্রহটি।	Jupiter är klotet.	Jupiter is the planet.	We may further refine that by making clear that "Jupiter" is a definite instance of "planet".
Identification( Concept(Q(319)), Instance( Attribution( Concept(Q(634)), Concept(Q(59863338))), Definite()))	বৃহস্পতি বড় গ্রহটা।	Jupiter är det stora klotet.	Jupiter is the large planet.	Now we might ascribe an attribute to the definite planet instance in question, this attribute being large (Q59863338).
Identification( Concept(Q(319)), Instance( Attribution( Concept(Q(634)), Superlative( Concept(Q(59863338)))), Definite()))	বৃহস্পতি সবচেয়ে বড় গ্রহটি।	Jupiter är det största klotet.	Jupiter is the largest planet.	This attribute being superlative for Jupiter can be marked by modifying the attribute.
Identification( Concept(Q(319)), Instance( Attribution( Concept(Q(634)), Superlative( Concept(Q(59863338)), Locative( Concept(Q(544))))), Definite()))	বৃহস্পতি সৌরমণ্ডলে সবচেয়ে বড় গ্রহ।	Jupiter är den största planeten i solsystemet.	Jupiter is the largest planet in the solar system.	Once we specify the location where Jupiter being the largest applies (that is, in the Solar System (Q544)), we're done!
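
To make the nesting in the table easier to follow, here is a sketch that rebuilds the six compositions step by step and prints their constructor text. The `Node` class and the `make` helper are hypothetical stand-ins that only record a name and arguments; they are not the prototype's classes, and nothing is rendered to Bengali or Swedish here.

```python
class Node:
    """Generic stand-in for a constructor: a name plus its arguments."""

    def __init__(self, name, *args):
        self.name = name
        self.args = args

    def __repr__(self):
        return f"{self.name}({', '.join(map(repr, self.args))})"


def make(name):
    """Return a factory so that make('Concept')(...) reads like a constructor call."""
    return lambda *args: Node(name, *args)


Identification, Concept, Instance, Attribution = map(
    make, ["Identification", "Concept", "Instance", "Attribution"])
Superlative, Locative, Definite, Q = map(
    make, ["Superlative", "Locative", "Definite", "Q"])

jupiter, planet = Concept(Q(319)), Concept(Q(634))
large, solar_system = Concept(Q(59863338)), Concept(Q(544))

# Each step wraps or extends part of the previous composition.
steps = [
    Identification(jupiter, planet),
    Identification(jupiter, Instance(planet)),
    Identification(jupiter, Instance(planet, Definite())),
    Identification(jupiter, Instance(Attribution(planet, large), Definite())),
    Identification(jupiter, Instance(Attribution(planet, Superlative(large)), Definite())),
    Identification(jupiter, Instance(
        Attribution(planet, Superlative(large, Locative(solar_system))), Definite())),
]

for step in steps:
    print(step)
```

Note how `Definite()` and `Locative(...)` appear after the required arguments of `Instance` and `Superlative`: they are the optional scope modifiers described earlier.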

Note that the sense resolution system does not have enough information to choose between '-টা' and '-টি' (for Bengali) or between 'klot' and 'planet' (for Swedish) in some of these examples, so the prototype currently picks one at random. Re-rendering any example that involves such a choice may therefore produce different output.
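
A minimal sketch of this behaviour, assuming a hypothetical `CANDIDATES` table and `resolve` function (neither is taken from the prototype): when several lexemes match an item, one is drawn at random. Passing a seeded generator is my own suggestion for getting repeatable output while testing, not something the prototype is known to do.

```python
import random

# Hypothetical candidate table: (item ID, language code) -> matching lemmas.
CANDIDATES = {("Q634", "sv"): ["klot", "planet"]}


def resolve(item_id, language, rng=random):
    """Pick one candidate at random; rng may be a seeded random.Random."""
    candidates = CANDIDATES[(item_id, language)]
    return rng.choice(candidates)


# Unseeded: the result may differ from run to run.
print(resolve("Q634", "sv"))

# Seeded: two generators with the same seed make the same choice.
first = resolve("Q634", "sv", random.Random(42))
second = resolve("Q634", "sv", random.Random(42))
print(first == second)
```

Once senses carry enough information to distinguish the candidates, the random draw could be replaced by a deterministic rule without changing the callers.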

Besides this, there is clearly a lot more functionality to be added, and because Bengali and Swedish are both Indo-European languages (however distant), there are likely linguistic phenomena that won't be considered simply by developing renderers for those two languages alone. If there's something particular in your language that isn't present in those two languages, this may then raise the question: what can you do for your language?

I can think of at least four things, not in any particular order:

Create lexemes and add senses to them! What matters most to the system is that words have meanings (possibly in some context, and possibly with equivalents in other languages or to Wikidata items) so that those words can be properly retrieved based on those equivalences; that these words might have a second-person plural negative past conditional form is largely secondary!
Think about how you might perform some basic grammatical tasks in your language: how do you inflect adjectives? add objects to verbs? indicate in a sentence where something happened?
Think about how you might perform higher-level tasks involving meaning: what do you do to indicate that something exists? to indicate that something happened in the past but is no longer the case? to change a simple declarative sentence into a question?
If you have some ideas on how to render the Jupiter sentence in your language, and the lexemes you would need to build that sentence exist on Wikidata, and those lexemes have senses for the meanings those lexemes take in that sentence, let me know!

We'd love to hear your thoughts on this prototype, and what it might mean for realizing Abstract Wikipedia through Wikidata's lexicographic data and Wikifunctions's platform.

Thank you, Mahir, for the great update! If you too want to contribute to the weekly, get in touch. This is a project we all build together.

In addition, this week Slate published a great explanatory article on the goals of Abstract Wikipedia and Wikifunctions: Wikipedia Is Trying to Transcend the Limits of Human Language