Abstract Wikipedia/Updates/2022-08-19

Abstract Wikipedia Updates

Template language draft and Wikimania 2022

Our Google.org fellow, Ariel Gutman, and Prof. Maria Keet, who is devoting part of her sabbatical year to working with Abstract Wikipedia's Natural Language Generation workstream, have recently authored a detailed specification of a template language. It aims to let Wikifunctions contributors easily create renderers of abstract content. For instance, Wikidata asserts that entity Q7259 has property P106 pointing to Q5482740, and with all the machinery in place, this may render as, e.g., “Ada Lovelace was a programmer.” The template language helps specify the structure for generating sentences, so that the structured content can be displayed as text in a natural language of one’s choice.

You may recall from the architecture proposal that every constructor (which typically aims to capture the meaning of a single phrase or sentence structure) will be matched with a specific template to render that constructor as text. The templates will reside in Wikifunctions, and each will be parsed into Composition syntax so that it can act as a Renderer. An initial version of this parser has already been implemented as part of the Wikifunctions CLI tool, which you can toy around with.

What do these templates look like? A template is a combination of text and slots, where slots can refer to other templates or functions from Wikifunctions, allowing for dynamic content. The specification of grammatical constraints is done through dependency relations (using, for instance, the UD formalism for grammar annotations) specified as labels within the slots. As for the text, it may represent static text, which will be kept untouched throughout the rendering, or it may represent lexemes that can assume different forms according to the neighboring syntactic and phonological constraints.

For starters, let's look at an example template to generate a sentence describing the age of a person, e.g. "Dan is 20 years old.", given a constructor with two fields: entity (the Q-id of the person) and years (the age). In English, this template may look like this:

{Person(entity)} is {nummod:Cardinal(years)} {root:Lexeme(L2505)} old.

There are three slots, which are delimited by curly brackets:

{Person(entity)} resolves to the name of the person.
{nummod:Cardinal(years)} resolves to the number of years. It is marked as the "numeral modifier" of the third slot.
{root:Lexeme(L2505)} fetches Lexeme L2505, whose lemma is "year", from Wikidata. Since the slot is marked as root, it will be linked to the previous slot, allowing for the selection of the right form of the lexeme: "year" or "years".
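
To make the slot structure concrete, here is a minimal Python sketch that splits the example template into static text and slots. It is not the parser from the Wikifunctions CLI tool; the `Slot` class, the `parse_template` function and the regular expression are hypothetical, and they assume only the simple slot shape seen in the example above (an optional label, a function name, and a single argument).

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Slot:
    label: Optional[str]  # dependency label such as "nummod" or "root", if present
    function: str         # name of the invoked function, e.g. "Cardinal"
    argument: str         # raw argument text, e.g. "years" or "L2505"


# Assumed slot shape: {label:Function(argument)}, where "label:" is optional.
SLOT_RE = re.compile(r"\{(?:(\w+):)?(\w+)\(([^)]*)\)\}")


def parse_template(template: str) -> List[Union[str, Slot]]:
    """Split a template into static text pieces and Slot objects, in order."""
    parts: List[Union[str, Slot]] = []
    pos = 0
    for match in SLOT_RE.finditer(template):
        if match.start() > pos:
            # Text between two slots is kept as static text.
            parts.append(template[pos:match.start()])
        parts.append(Slot(match.group(1), match.group(2), match.group(3)))
        pos = match.end()
    if pos < len(template):
        parts.append(template[pos:])
    return parts


parts = parse_template(
    "{Person(entity)} is {nummod:Cardinal(years)} {root:Lexeme(L2505)} old."
)
for part in parts:
    print(part)
```

Running this lists three `Slot` objects interleaved with the static text pieces. The real template language is richer than this regular expression can handle, so treat the sketch only as an illustration of the text-plus-slots idea.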

The remaining text in the template – "is" and "old" – is in this case static text. In other cases, we might need to specify that the verb "is" can inflect as well, or that the number needs some additional processing to render properly. We would then use similar dependency labels to mark subject-verb agreement and other types of agreement across the sentence’s constituents.
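
As a rough illustration of how the numeral modifier could drive the choice of lexeme form, the following Python sketch renders the age sentence for English only. Everything in it is an assumption made for illustration: `YEAR_FORMS` is a hand-written stand-in for the forms of Lexeme L2505, and `grammatical_number` and `render_age` are hypothetical helpers, not Wikifunctions functions. A real renderer would look up the person's name and the lexeme forms in Wikidata.

```python
from typing import Dict

# Hand-written stand-in for the forms of Lexeme L2505 ("year").
YEAR_FORMS: Dict[str, str] = {"singular": "year", "plural": "years"}


def grammatical_number(count: int) -> str:
    """English-only simplification: exactly 1 is singular, anything else plural."""
    return "singular" if count == 1 else "plural"


def render_age(name: str, years: int) -> str:
    """Fill the three slots of the example template.

    The numeral (the nummod slot) constrains which form of the
    root lexeme is selected; "is" and "old." stay static.
    """
    noun = YEAR_FORMS[grammatical_number(years)]
    return f"{name} is {years} {noun} old."


print(render_age("Dan", 20))
print(render_age("Dan", 1))
```

Other languages have different number systems and agreement rules, which is why the template language expresses the link between the slots through dependency labels rather than hard-coding a rule like the one above.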

In the document, similar but more complex examples are given for four other languages (Swedish, French, Hebrew and isiZulu). Each presents its own peculiarities and challenges, yet all of them can be captured successfully with the proposed template language. We invite you to read the document, provide feedback, and try to come up with challenging examples in other languages that may prove difficult to render using this formalism, so that we can improve it and achieve the broadest possible applicability, ideally to all natural languages.

Wikimania 2022

Last week was Wikimania 2022, the annual event for Wikimedians from all over the world to meet and discuss. There were two sessions related to Wikifunctions: one on Wikifunctions itself, led by the team, and one on Ninai and Udiron, led by Mahir Morshed.

Our session consisted of a short introduction to Wikifunctions by Denny, followed by a pre-recorded section, where several team members had short deep dives into different topics.

We had:

James Forrester on the technical architecture
Amin Al Hazwani on the design language
Genoveva Galarza Heredero on the content model
Julia Kieserman on Codex
Cory Massaro on knowledge equity
Ariel Gutman on natural language generation, an intro to the first part of the newsletter above
Ali Assaf on formalizing the function model

You can watch this pre-recorded segment on Commons.

As with all Wikimania sessions, collaborative note taking was enabled. The notes on the session also contain all questions that were asked and answered in the closing part of the session, following the video. A full video of the session is available on YouTube, but note that playback of the pre-recorded video ran into a number of technical issues during the session, so you might want to watch the Commons video instead. Uploads of individual sessions are expected to become available later.

Mahir Morshed had a Wikimania session about Ninai and Udiron and the recording starts here. Ninai and Udiron are tools for natural language generation, and we have introduced them in earlier newsletters.

Workstream updates as of August 12, 2022

Performance

Started performance analysis methodology documentation
Set up health-check API endpoint for Wikilambda

Natural language generation

Not too much progress due to team members' vacation time
Started adding noun class information for isiZulu, Mboshi, and Kiswahili

Meta-data

Finished display of metadata dialog on tester page
Created some new PHP utilities for ZMaps

Experience

Fixed and merged Beta launch blockers
Made great progress on fixing various bugs
Began researching diffing options