Abstract Wikipedia/Natural language generation system architecture proposal

From Meta, a Wikimedia project coordination wiki

Proposal by Ariel Gutman

This document describes a proposed architecture for a natural language generation (NLG) system for Abstract Wikipedia. The architecture of an NLG system needs to take the following considerations into account:

  1. Modularity: the system should be modular, in that various aspects of NLG (e.g. morphosyntactic and phonotactic rules) can be modified independently.
  2. Lexicality: the system should be able to both fetch lexical data (separate from code), and rely on productive language rules to generate such data on the fly (e.g. inflecting English plurals with an -s).
  3. Recursivity: due to the compositional and recursive nature of most languages,[1] an effective NLG system would need to be recursive itself.
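
The lexicality requirement can be illustrated with a minimal Python sketch. The table of irregular plurals is a hypothetical in-memory stand-in for lexical data (which the proposal would fetch from Wikidata), and the productive rule is deliberately simplified:

```python
# Hypothetical stand-in for stored lexical data; in the proposal such
# forms would be fetched from Wikidata lexemes rather than hard-coded.
IRREGULAR_PLURALS = {"child": "children", "mouse": "mice"}


def productive_plural(lemma):
    """Simplified productive rule for English plurals (-s / -es)."""
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    return lemma + "s"


def plural(lemma):
    """Prefer stored lexical data; otherwise generate the form on the fly."""
    stored = IRREGULAR_PLURALS.get(lemma)
    return stored if stored is not None else productive_plural(lemma)
```

The point of the split is that the data and the rule can be maintained independently: adding an irregular form does not require touching the rule.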

In the context of Abstract Wikipedia, another constraint comes to mind:

  1. Extensibility: the system should be amenable to extension both by linguistic experts and technical contributors as well as by non-technical and non-expert contributors, working on different parts of the system.

From the above constraints, it seems reasonable to assume that a single Wikifunctions (=WF) function cannot effectively capture the complexity of such a modular NLG system, but rather multiple such functions need to be involved, each responsible for a different step in an NLG pipeline. In the current design of WF individual functions cannot:

  • Invoke other WF functions.
  • Fetch data from external sources such as Wikidata.
  • Alter some global state of the system.

To overcome these limitations, this document proposes an NLG pipeline to be run by the WF Orchestrator, which is not subject to any of these. Moreover, to enable non-technical contributors to participate, the creation of an in-house templating language is proposed, which could be run by a custom-made WF evaluator.

An alternative approach would be to remove the design limitations of WF evaluators, in order to encapsulate the entire NLG pipeline within a single WF function (which would then invoke other WF functions). While this approach would change some aspects of the implementation of the system (e.g. the pipeline orchestration would itself be editable by WF contributors), the conceptual architecture would still largely stay the same.

At the end of the document, a short comparison to other suggested approaches is given.

Architecture overview

As explained above, the full NLG pipeline cannot be encapsulated within a single Wikifunctions (=WF) function, but rather must be run by the WF orchestrator, which would allow fetching data from external sources (in particular Wikidata), invoking different WF functions (defined by contributors) and keeping the necessary state while doing so. The envisaged architecture is presented in the following diagram, where the dark blue shapes are elements which would be created by contributors to Wikifunctions (rectangles) or Wikidata (rounded rectangles), while the light blue elements represent functions or data living within the WF orchestrator, and thus not directly amenable to community contribution.

[Figure: A proposal of an NLG architecture for Abstract Wikipedia]

Let’s detail the steps:

  1. Given a constructor type, a specific renderer is selected,[2] and the data contained in the given constructor is passed to the renderer as its function arguments.
  2. The renderer is basically a template: a combination of static text, and slots which can be filled with the Renderer’s arguments, lexemes from Wikidata, or the output of other renderers. Templates are relatively easy to understand and write, and thus the authoring of renderers will be accessible to non-technical contributors.
  3. The output of the renderer is a dependency syntax tree (using for instance Universal Dependencies (UD) or Surface-Syntactic Universal Dependencies (SUD) formalisms)[3] in which the nodes are non-inflected lexemes (identified by their lemmas), augmented with some morphological constraints.  In practice the tree doesn’t need to be fully specified; in particular, static text doesn’t necessarily need to be part of the tree.
  4. Relying on a language-specific grammar specification, the morphological constraints coupled with the structure of the syntactic tree allow the inflection of the lemmas, according to the lexical data present in Wikidata, or using inflectional tables of the grammar specification. The output of this step is a linear sequence of text, minimally annotated with part-of-speech information (i.e. whether a word represents a noun, a verb, a preposition etc.).
  5. At this step phonotactic constraints are applied, handling language-specific sandhi phenomena. These can include the selection of contextual forms (e.g. in English a/an) or contraction/crasis of adjacent forms (e.g. French de + le = du).
  6. As a final clean-up step, spacing, capitalization and punctuation may need to be adjusted in order to render the final text to be stored in a Wikipedia article. This step can be modeled in a language-agnostic way, by using (language-dependent) annotations from the previous steps.  
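
Steps 5 and 6 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the Token class, the function names and the contraction table are hypothetical, and the a/an choice is made from spelling alone, which mishandles words such as "hour" or "university" (a real system would rely on phonological information).

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech annotation produced by the inflection step


def apply_sandhi(tokens):
    """Step 5 (toy): pick English 'a'/'an' from the next word's first letter."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.pos == "DET" and tok.text in ("a", "an") and i + 1 < len(tokens):
            next_word = tokens[i + 1].text
            form = "an" if next_word[:1].lower() in "aeiou" else "a"
            out.append(Token(form, tok.pos))
        else:
            out.append(tok)
    return out


# Hypothetical contraction table, e.g. French de + le = du.
FRENCH_CONTRACTIONS = {("de", "le"): "du"}


def contract(tokens, table):
    """Step 5 (toy): merge adjacent forms listed in a contraction table."""
    out = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i].text, tokens[i + 1].text) if i + 1 < len(tokens) else None
        if pair in table:
            out.append(Token(table[pair], tokens[i].pos))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def clean_up(tokens):
    """Step 6: fix spacing around punctuation and capitalize the first letter."""
    text = ""
    for tok in tokens:
        if tok.pos == "PUNCT" or not text:
            text += tok.text
        else:
            text += " " + tok.text
    return text[:1].upper() + text[1:]
```

Note that clean_up only consults the part-of-speech annotation, which is what allows it to stay language-agnostic while the sandhi rules remain language-specific.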

In the above architecture, there are three components which need to be curated by community members:

  1. Templatic renderers - these make up the bulk of the needed work, as every constructor needs one templatic renderer per language (though re-use of renderers for parts of sentences is possible). Note that the term Renderer is used here in a narrower sense than in Architecture for a Multilingual Wikipedia. In the latter, the term Renderer refers to an end-to-end data-to-text function, while here we use the term Renderer to refer to a specific component of the NLG pipeline, namely a template. This is no coincidence, since in the above architecture, the other parts of the pipeline are relatively fixed, and don’t need constant curation by community members.
  2. Grammar specifications - these would have to specify the relevant morphological features needed for each language, their hierarchy and how these manifest themselves via dependency relations. These specifications may either be stored as data in Wikidata, or as functions in Wikifunctions (to be decided). It is probable that the creation and curation of these grammars will require substantial linguistic and technical knowledge, but since they are created once per (human) language, this is deemed acceptable.
  3. Wikidata lexemes - these will be curated as today, but it would be important that the features they use are in line with the grammar specifications of each language.

Structure of templates

Since the bulk of needed work by community contributors would be the creation of templatic renderers, it is important to make this task as easy as possible, and in particular avoid requiring any coding experience.

Similar to the Composition “language” in Wikifunctions, we can develop an in-house templating language.[4] The templating language should allow specifying a linguistic tree (with UD annotations) over three types of arguments:[5]

  • Static text
  • Terminal functions fetching lemmas from Wikidata, or creating lemmas on the fly from other arguments (e.g. numbers[6]).
  • Other renderers

The templating language will have a dedicated evaluator module, called by the WF orchestrator. The latter will be responsible for passing the output through the various modules of the NLG pipeline outlined above.


Let’s assume we have a simple Constructor conveying the age of a person:[7]

  Entity: Malala Yousafzai (Q32732)
  Age_in_years: 24

To render such a Constructor in English, we will use a templatic notation similar to the following (being a Z14/Implementation type):

 {
   "type": "implementation",
   "implements": "Age_renderer_en",
   "template": {
     "parts": [
       {
         "role": "subject",  # Part 1: grammatical subject
         "type": "function call",
         "function": "Resolve_Lexeme",
         "lexeme": {"reference": "Entity"}
       },
       {
         "role": "root",  # Part 2: root of the clause
         "type": "function call",
         "function": "Resolve_Lexeme",
         "lexeme": {"value": "be"}  # replace with L-id
       },
       {
         "role": "num",  # Part 3: numerical modifier
         "of": 4,  # Part 4 (“year”)
         "type": "function call",
         "function": "Cardinal_number",
         "number": {"reference": "Age_in_years"}
       },
       {
         "role": "npadvmod",  # Part 4
         "of": 5,  # Part 5 (“old”)
         "type": "function call",
         "function": "Resolve_Lexeme",
         "lexeme": {"value": "year"}  # replace with L-id
       },
       {
         "role": "acomp",  # Part 5
         "type": "string",
         "value": "old"
       }
     ]
   }
 }

Some of the syntactic roles (npadvmod, acomp) have in fact no agreement effect, so one can leave them out.
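
To make the evaluation of such a template more concrete, here is a rough Python sketch that turns the parts of the Age template into a flat list of tree nodes (lemma, role, head). Everything in it is an assumption made for illustration: the Node class, the evaluate function and the two resolver functions are hypothetical stand-ins, and the lexeme lookup simply returns the given value instead of querying Wikidata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lemma: str           # non-inflected lemma (or static text)
    role: str            # dependency relation of this part
    head: Optional[int]  # 1-based index of the governing part, if given


def resolve_lexeme(lexeme, args):
    # Hypothetical stand-in: a real version would fetch a Wikidata lexeme.
    if "reference" in lexeme:
        return str(args[lexeme["reference"]])
    return lexeme["value"]


def cardinal_number(number, args):
    # Hypothetical stand-in: renders the number with digits only.
    return str(args[number["reference"]])


def evaluate(parts, args):
    """Fill each slot of the template, keeping the template's word order."""
    nodes = []
    for part in parts:
        if part["type"] == "string":
            lemma = part["value"]
        elif part["function"] == "Resolve_Lexeme":
            lemma = resolve_lexeme(part["lexeme"], args)
        elif part["function"] == "Cardinal_number":
            lemma = cardinal_number(part["number"], args)
        else:
            raise ValueError("unknown function: " + part["function"])
        nodes.append(Node(lemma, part["role"], part.get("of")))
    return nodes


# The Age template from above, written as a list of Python dictionaries.
AGE_TEMPLATE_EN = [
    {"role": "subject", "type": "function call",
     "function": "Resolve_Lexeme", "lexeme": {"reference": "Entity"}},
    {"role": "root", "type": "function call",
     "function": "Resolve_Lexeme", "lexeme": {"value": "be"}},
    {"role": "num", "of": 4, "type": "function call",
     "function": "Cardinal_number", "number": {"reference": "Age_in_years"}},
    {"role": "npadvmod", "of": 5, "type": "function call",
     "function": "Resolve_Lexeme", "lexeme": {"value": "year"}},
    {"role": "acomp", "type": "string", "value": "old"},
]

nodes = evaluate(AGE_TEMPLATE_EN,
                 {"Entity": "Malala Yousafzai", "Age_in_years": 24})
```

The resulting nodes still carry uninflected lemmas ("be", "year"); producing "is" and "years" is the job of the later inflection step, which uses the roles and heads recorded here.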

Structure of grammar

The grammar needs to include the following information:

  1. Which parts of speech the language has.
  2. Which grammatical features are appropriate for each part of speech.
  3. (Possibly) a type hierarchy of the features.
  4. How grammatical relations (i.e. dependency relations) interact with grammatical features and parts of speech.

Note that the first points can be inferred from the Wikidata lexemes available for a given language, but it would be useful to make them explicit as part of a grammar definition, which would also enforce/validate the Wikidata lexeme definitions.[8] One could write such a validator per language as a WF function, which would then run on the Wikidata lexemes to mark if they are correctly annotated according to the language's schema.
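
A per-language validator of this kind could look roughly like the following Python sketch. The schema, its feature names and the validate_form function are hypothetical and only meant to show the idea: each part of speech lists its features and their allowed values, and a form's annotations are checked against that list.

```python
# Hypothetical grammar schema for English: part of speech -> feature -> values.
ENGLISH_SCHEMA = {
    "verb": {
        "number": {"singular", "plural"},
        "person": {"first-person", "second-person", "third-person"},
        "tense": {"present", "past"},
    },
    "noun": {
        "number": {"singular", "plural"},
    },
}


def validate_form(pos, feature_values, schema):
    """Return a list of problems found in a form's annotations (empty if none)."""
    allowed = schema.get(pos)
    if allowed is None:
        return ["unknown part of speech: " + pos]
    # Map every allowed value back to the feature it belongs to.
    value_to_feature = {
        value: feature
        for feature, values in allowed.items()
        for value in values
    }
    problems = []
    seen = set()
    for value in feature_values:
        feature = value_to_feature.get(value)
        if feature is None:
            problems.append("unexpected feature value: " + value)
        elif feature in seen:
            problems.append("feature specified twice: " + feature)
        else:
            seen.add(feature)
    return problems
```

Under this assumed schema, an annotation such as "simple present" would be flagged, while "present" would pass; which of the two labels is the right one is a decision for the grammar's authors.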

As for the grammatical relations, these can be encoded either as data in Wikidata or as functions in WF. Dependency relations can be implemented as the unification of the grammatical features of their nodes: one could implement each relation as a Composition WF function, using the Unify operator as a built-in function. For instance, a "subj" relation for English would be implemented as follows (using shorthand notation):

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    Unify(noun.number, verb.number);
    Unify(noun.person, verb.person);
    Unify(noun.case, NOMINATIVE);

Note that, implemented like this, the subj function is not a pure function, since it modifies its input arguments (in fact, the return value is not used unless a unification error occurs). To keep things simple, this special behavior would need to be supported by the function evaluator.
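
The side-effecting behavior can be made explicit with a small Python sketch. It is not the proposed WF implementation: the Feature and Word classes and the unify function are hypothetical, and the unification is simplified (two features that are both unspecified are not kept linked, which a full implementation would need to do).

```python
class UnificationError(Exception):
    """Raised when two specified feature values conflict."""


class Feature:
    """A mutable cell holding a feature value, or None when unspecified."""

    def __init__(self, value=None):
        self.value = value


class Word:
    def __init__(self, lemma, **features):
        self.lemma = lemma
        for name in ("pos", "number", "person", "case"):
            setattr(self, name, Feature(features.get(name)))


def unify(left, right):
    """Unify two features in place; a plain string acts as a constant."""
    lval = left.value if isinstance(left, Feature) else left
    rval = right.value if isinstance(right, Feature) else right
    if lval is not None and rval is not None and lval != rval:
        raise UnificationError(f"cannot unify {lval} with {rval}")
    merged = lval if lval is not None else rval
    for side in (left, right):
        if isinstance(side, Feature):
            side.value = merged  # the side effect on the input arguments


def subj_en(noun, verb):
    unify(noun.pos, "NOUN")  # validate types
    unify(verb.pos, "VERB")
    unify(noun.number, verb.number)
    unify(noun.person, verb.person)
    unify(noun.case, "NOMINATIVE")


noun = Word("Malala Yousafzai", pos="NOUN", number="SINGULAR", person="THIRD")
verb = Word("be", pos="VERB")
subj_en(noun, verb)  # the verb now carries the subject's number and person
```

After the call, the verb node has acquired the features needed to select the form "is", even though subj_en returned nothing.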

One may want to bundle together features which get unified together. For instance, if we observe that number and person are often unified together, we may define a sub-function such as the following:

agr(left, right):
    Unify(left.number, right.number);
    Unify(left.person, right.person);

Then we can redefine the subj relation above as follows:

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    agr(noun, verb);
    Unify(noun.case, NOMINATIVE);

Modular design of grammars and renderers

It is often the case that languages from the same language family exhibit some grammatical and structural similarities. One may take advantage of this phenomenon by defining a hierarchy of languages and language-families,[9] and allow the NLG system to use dynamic dispatch to the most concrete implementation of a (sub)-renderer or a (sub)-relation.
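
Such a dispatch mechanism could be sketched in Python as below. The parent table is a small illustrative excerpt using ISO 639-5 style family codes, and the registry contents and the resolve function are hypothetical; the sketch only shows the lookup walking from the most concrete language up to its families.

```python
# Illustrative excerpt of a language hierarchy: code -> parent code.
PARENT = {
    "en": "gmw",   # English -> West Germanic
    "de": "gmw",   # German -> West Germanic
    "gmw": "gem",  # West Germanic -> Germanic
    "gem": "ine",  # Germanic -> Indo-European
    "fr": "roa",   # French -> Romance
    "roa": "itc",  # Romance -> Italic
    "itc": "ine",  # Italic -> Indo-European
}


def resolve(name, language, registry):
    """Return the most concrete implementation of `name` for `language`."""
    code = language
    while code is not None:
        implementation = registry.get((name, code))
        if implementation is not None:
            return implementation
        code = PARENT.get(code)  # fall back to the parent family
    raise LookupError(f"no implementation of {name} for {language}")


# Hypothetical registry: one generic Germanic relation, one English override.
def subj_gem(noun, verb):
    return "generic Germanic subj"


def subj_en(noun, verb):
    return "English-specific subj"


REGISTRY = {("subj", "gem"): subj_gem, ("subj", "en"): subj_en}
```

With this registry, English uses its own override, German falls back to the Germanic definition, and French raises a LookupError because nothing is defined anywhere along its chain.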

Other approaches

To date, I'm aware of two other systems that have been proposed to handle the NLG of Abstract Wikipedia.

  1. Grammatical framework (GF) is an established functional programming language intended to support multilingual natural language generation and understanding (see newsletter description). It has a thriving community of computer scientists, linguists and other enthusiasts who contribute to it.
  2. Ninai/Udiron is a Python-based NLG system built by community member Mahir Morshed, which uses lexeme data from Wikidata and combines them using UD trees. The system has been built with the Abstract Wikipedia project in mind. Some interesting examples of constructors and how they are rendered can be found in the Ninai demonstrations.

While the two systems are different, they can be contrasted with the proposal outlined in this document along similar axes:

  • Both systems are geared toward converting relatively abstract & compositional semantic representations into grammatical structure and then text.
  • They require mastering some programming skills, be it a domain-specific language (GF) or a general programming language (Python).
  • The ordering of the words in the output text is determined by the entire NLG pipeline (e.g. adding a Question operator could change the word order in English).
  • Insofar as the grammar definitions are correct, the output is guaranteed to be grammatical.

The proposal outlined in this document, on the other hand, is specifically intended to make it as easy as possible for people without prior technical knowledge to make contributions. This implies the following:

  • It can work with concrete, non-compositional, semantic representations (as the Age example above). This however does not exclude handling more abstract representations.
  • At the entry level, almost no programming skills are required to write templatic renderers. Knowledge of linguistics (in particular dependency annotations) can be useful to achieve grammatical output, and is necessary in order to write the grammar specifications themselves.
  • The ordering of words is determined by the templates themselves, and is not changed later in the pipeline.
  • Output can be ungrammatical, if a template has not been designed correctly.


  1. The question whether recursion exists in all languages has seen heated debate in recent years.
  2. It may be useful to allow rendering a constructor either nominally (e.g. “Marie’s marriage to Pierre”) or verbally (“Marie got married to Pierre”). In that case, more than one renderer per constructor would be needed.
  3. The SUD formalism is simpler and possibly more adequate for NLG tasks. Osborne & Gerdes (2019) provide a discussion of the shortcomings of UD. See also https://surfacesyntacticud.github.io/conversions/ for a comparison of the two formalisms. In either case, we may need to extend the set of dependency relations in order to capture some patterns required for NLG, such as pronominal cross-reference.
  4. The templating language could be designed to be "syntactic sugar" above the Composition language, and thus it could probably be run by the same evaluator as the Composition language.
  5. See the poster "Using Dependency Grammars in guiding Natural Language Generation" (A. Gutman, A. Ivanov, J. Kirchner, 2019) as well as the corresponding working paper.
  6. One can use Unicode’s Common Locale Data Repository (CLDR) library to render cardinals and ordinals in different languages, as well as other data types such as dates.
  7. In practice the age should probably be calculated from the birthdate, but for the sake of example, it is specified in the constructor. We may moreover envisage a dynamic constructor in which part of the data is calculated on the fly.
  8. Currently there is no consistency in the annotation of lexemes, even in a single language. For example, the form "has" is annotated as "singular, third-person, simple present" while the form "is" is annotated as "third-person singular, indicative present".
  9. Depending on the needed granularity, one may use the existing hierarchical codes as defined in the ISO 639-5 standard, or alternatively rely on the existing language-hierarchy defined in MediaWiki.