Abstract Wikipedia/Natural language generation system architecture proposal

From Meta, a Wikimedia project coordination wiki

Proposal by Ariel Gutman

This document outlines a proposed architecture for a natural language generation (NLG) system for Abstract Wikipedia. When designing the architecture of an NLG system, the following considerations should be taken into account:

  1. Modularity: the system should be modular, since different aspects of NLG (e.g. morphosyntactic and phonotactic rules) can be modified independently.
  2. Lexicality: the system should be able both to fetch lexical data (kept separate from code) and to rely on productive rules of the language to generate such data on the fly (e.g. inflecting English plurals with -s).
  3. Recursivity: due to the compositional and recursive nature of most languages,[1] an effective NLG system needs to be recursive.

In the context of Abstract Wikipedia, another constraint comes to mind:

  1. Extensibility: the system should be open to extension both by linguistic experts and technical contributors and by non-technical, non-expert contributors, each working on different parts of the system.

Given the above constraints, it seems reasonable to assume that a single Wikifunctions (=WF) function cannot effectively capture the complexity of such a modular NLG system. Rather, several such functions need to be employed, each responsible for a distinct step in the NLG pipeline.

In the current design of WF, individual functions cannot:

  • Call other WF functions.
  • Fetch data from external sources, such as Wikidata.
  • Modify any global state of the system.

To overcome these limitations, this document proposes an NLG pipeline managed by the WF Orchestrator, which is not subject to any of these limitations. Moreover, to allow non-technical contributors to participate, it proposes creating an in-house templating language, which could be handled by a dedicated WF evaluator.

An alternative approach would be to lift the design limitations of WF evaluators, so that the entire NLG pipeline could be encapsulated in a single WF function (which would then call other WF functions). Although this approach would change some implementation aspects of the system (e.g. the orchestration of the pipeline would itself be editable by WF contributors), the conceptual architecture would remain the same.

At the end of the document, a short comparison with other proposed approaches is given.

Architecture overview

As explained above, the full NLG pipeline cannot be encapsulated within a single Wikifunctions (=WF) function, but rather must be run by the WF orchestrator, which would allow fetching data from external sources (in particular Wikidata), invoking different functions of WF (defined by contributors) and keeping the necessary state while doing so. The envisaged architecture is presented in the following diagram, where the dark blue forms are elements which would be created by contributors to Wikifunctions (rectangles) or Wikidata (rounded rectangles), while the light blue elements represent functions or data living within the WF orchestrator, and thus not directly amenable to community contribution.

[Figure: A proposal of an NLG architecture for Abstract Wikipedia]

Let’s detail the steps:

  1. Given a constructor type, a specific renderer is selected,[2] and the data contained in the given constructor is passed to the renderer as its function arguments.
  2. The renderer is basically a template: a combination of static text and slots which can be filled with the renderer’s arguments, lexemes from Wikidata, or the output of other renderers. Templates are relatively easy to understand and write, and thus the authoring of renderers will be accessible to non-technical contributors.
  3. The output of the renderer is a dependency syntax tree (using for instance Universal Dependencies (UD) or Surface-Syntactic Universal Dependencies (SUD) formalisms)[3] in which the nodes are non-inflected lexemes (identified by their lemmas), augmented with some morphological constraints. In practice the tree doesn’t need to be fully specified; in particular, static text doesn’t necessarily need to be part of the tree.
  4. Relying on a language-specific grammar specification, the morphological constraints coupled with the structure of the syntactic tree allow the inflection of the lemmas, according to the lexical data present in Wikidata, or using inflectional tables of the grammar specification. The output of this step is a linear sequence of text, minimally annotated with part-of-speech information (i.e. whether a word represents a noun, a verb, a preposition etc.).
  5. At this step phonotactic constraints are applied, realizing language-specific sandhi phenomena. These can include the selection of contextual forms (e.g. in English a/an) or contraction/crasis of adjacent forms (e.g. French de + le = du).
  6. As a final clean-up step, spacing, capitalization and punctuation may need to be adjusted in order to render the final text to be stored in a Wikipedia article. This step can be modeled in a language-agnostic way, by using (language-dependent) annotations from the previous steps.
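
Steps 5 and 6 above can be illustrated with a small Python sketch. It is not part of the proposal: the function names (select_article_en, contract_fr, clean_up) are hypothetical, the a/an rule is a purely orthographic approximation, and the French rule ignores the part-of-speech annotations that a real implementation would need (for instance to avoid contracting "de" with a pronoun "le").

```python
import re

VOWELS = set("aeiou")

# Assumption: a tiny, hard-coded contraction table; a real grammar would hold many more.
FRENCH_CONTRACTIONS = {("de", "le"): "du", ("de", "les"): "des"}


def select_article_en(tokens):
    """Step 5 (English): turn 'a' into 'an' before a vowel letter.

    Orthographic approximation only: it gets 'a hour' and 'an university' wrong.
    """
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "a" and nxt[:1].lower() in VOWELS:
            tok = "an"
        out.append(tok)
    return out


def contract_fr(tokens):
    """Step 5 (French): merge adjacent forms listed in FRENCH_CONTRACTIONS."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in FRENCH_CONTRACTIONS:
            out.append(FRENCH_CONTRACTIONS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def clean_up(tokens):
    """Step 6: join tokens, attach punctuation, capitalize the first letter."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text[:1].upper() + text[1:]


print(clean_up(select_article_en(["she", "is", "a", "engineer", "."])))
print(clean_up(contract_fr(["le", "livre", "de", "le", "professeur"])))
```

Keeping the two steps as separate functions mirrors the modularity requirement: the sandhi rules are language-specific, while the clean-up function is shared.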

In the above architecture, there are three components which need to be curated by community members:

  1. Templatic renderers - these make up the bulk of the needed work, as every constructor needs one templatic renderer per language (though re-use of renderers for parts of sentences is possible). Note that the term Renderer is used here in a narrower sense than in Architecture for a Multilingual Wikipedia. In the latter, the term Renderer refers to an end-to-end data-to-text function, while here we use the term Renderer to refer to a specific component of the NLG pipeline, namely a template. This is no coincidence, since in the above architecture, the other parts of the pipeline are relatively fixed, and don’t need constant curation by community members.
  2. Grammar specifications - these would have to specify the relevant morphological features needed for each language, their hierarchy and how these manifest themselves via dependency relations. These specifications may either be stored as data in Wikidata, or as functions in Wikifunctions (to be decided). It is probable that the creation and curation of these grammars will require substantial linguistic and technical knowledge, but since they are created once per (human) language, this is deemed acceptable.
  3. Wikidata lexemes - these will be curated as today, but it would be important that the features they use are in line with the grammar specifications of each language.

Structure of templates

Since the bulk of needed work by community contributors would be the creation of templatic renderers, it is important to make this task as easy as possible, and in particular avoid requiring any coding experience.

Similar to the Composition “language” in Wikifunctions, we can develop an in-house templating language.[4] The templating language should allow specifying a linguistic tree (with UD annotations) over three types of arguments:[5]

  • Static text
  • Terminal functions fetching lemmas from Wikidata, or creating lemmas on the fly from other arguments (e.g. numbers[6]).
  • Other renderers

The templating language will have a dedicated evaluator module, called by the WF orchestrator. The latter will be responsible for passing the output through the various modules of the NLG pipeline outlined above.


Let’s assume we have a simple Constructor conveying the age of a person:[7]

  Entity: Malala Yousafzai (Q32732)
  Age_in_years: 24

To render such a Constructor in English, we will use a templatic notation similar to the following (being a Z14/Implementation type):

 {
   "type": "implementation",
   "implements": "Age_renderer_en",
   "template": [
     {
       "role": "subject",  # grammatical subject
       "type": "function call",
       "function": "Resolve_Lexeme",
       "lexeme": {"reference": "Entity"}
     },
     {
       "role": "root",  # root of the clause
       "type": "function call",
       "function": "Resolve_Lexeme",
       "lexeme": {"value": "be"}  # replace with L-id
     },
     {
       "role": "num",  # numerical modifier
       "of": 4,  # Part 4 (“year”)
       "type": "function call",
       "function": "Cardinal_number",
       "number": {"reference": "Age_in_years"}
     },
     {
       "role": "npadvmod",
       "of": 5,  # Part 5 (“old”)
       "type": "function call",
       "function": "Resolve_Lexeme",
       "lexeme": {"value": "year"}  # replace with L-id
     },
     {
       "role": "acomp",
       "type": "string",
       "value": "old"
     }
   ]
 }

Some of the syntactic roles (npadvmod, acomp) have in fact no agreement effect, so one can leave them out.
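
To make the evaluation of such a template more concrete, here is a minimal Python sketch of how an evaluator might walk the parts of the Age template and produce a sequence of uninflected nodes. Everything beyond the names taken from the template is an assumption: the parts are stored as a list of dicts, LEXICON is a hard-coded stand-in for Wikidata, and resolve_lexeme and cardinal_number are toy stubs for Resolve_Lexeme and Cardinal_number.

```python
# Hypothetical stand-in for Wikidata: maps an item ID or a lemma to a lemma.
LEXICON = {"Q32732": "Malala Yousafzai", "be": "be", "year": "year"}


def resolve_lexeme(value):
    """Toy version of Resolve_Lexeme: look the value up in LEXICON."""
    return {"lemma": LEXICON[value]}


def cardinal_number(value):
    """Toy version of Cardinal_number: render the number as digits."""
    return {"lemma": str(value)}


# Function name -> (name of its argument slot in the template, implementation).
FUNCTIONS = {
    "Resolve_Lexeme": ("lexeme", resolve_lexeme),
    "Cardinal_number": ("number", cardinal_number),
}

AGE_TEMPLATE_EN = [
    {"role": "subject", "type": "function call",
     "function": "Resolve_Lexeme", "lexeme": {"reference": "Entity"}},
    {"role": "root", "type": "function call",
     "function": "Resolve_Lexeme", "lexeme": {"value": "be"}},
    {"role": "num", "of": 4, "type": "function call",
     "function": "Cardinal_number", "number": {"reference": "Age_in_years"}},
    {"role": "npadvmod", "of": 5, "type": "function call",
     "function": "Resolve_Lexeme", "lexeme": {"value": "year"}},
    {"role": "acomp", "type": "string", "value": "old"},
]


def evaluate(template, constructor):
    """Return one node per part: lemma, role and (1-based) head index."""
    nodes = []
    for part in template:
        if part["type"] == "string":
            node = {"lemma": part["value"]}
        else:
            slot, func = FUNCTIONS[part["function"]]
            arg = part[slot]
            # A "reference" points into the constructor; a "value" is literal.
            value = constructor[arg["reference"]] if "reference" in arg else arg["value"]
            node = func(value)
        node["role"] = part["role"]
        node["head"] = part.get("of")
        nodes.append(node)
    return nodes


constructor = {"Entity": "Q32732", "Age_in_years": 24}
nodes = evaluate(AGE_TEMPLATE_EN, constructor)
print([n["lemma"] for n in nodes])
# ['Malala Yousafzai', 'be', '24', 'year', 'old']
```

The output still contains lemmas only ("be", "year"); producing "is" and "years" is the job of the later inflection step, which uses the roles and head indices recorded here.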

Structure of grammar

The grammar needs to include the following information:

  1. What parts of speech the language has.
  2. What grammatical features are appropriate for each part of speech.
  3. (Possibly) a type hierarchy of the features.
  4. How grammatical relations (i.e. dependency relations) interact with grammatical features and parts of speech.

Note that the first points can be inferred from the Wikidata lexemes available for a given language, but it would be useful to make them explicit as part of a grammar definition, which would also enforce/validate the Wikidata lexeme definitions.[8] One could write such a validator per language as a WF function, which would then run on the Wikidata lexemes to mark if they are correctly annotated according to the language's schema.
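
As a rough illustration of such a validator, the following Python sketch checks the features of a single lexeme form against a per-language schema. The schema contents, the feature names and the function name validate_form are all assumptions made for this example; they do not reflect the actual Wikidata lexeme data model.

```python
# Hypothetical schema for English: allowed feature values per part of speech.
SCHEMA_EN = {
    "VERB": {
        "number": {"singular", "plural"},
        "person": {"first", "second", "third"},
        "tense": {"present", "past"},
    },
    "NOUN": {
        "number": {"singular", "plural"},
    },
}


def validate_form(pos, features, schema=SCHEMA_EN):
    """Return a list of problems; an empty list means the form conforms."""
    allowed = schema.get(pos)
    if allowed is None:
        return [f"unknown part of speech: {pos}"]
    problems = []
    for name, value in features.items():
        if name not in allowed:
            problems.append(f"feature {name!r} not allowed for {pos}")
        elif value not in allowed[name]:
            problems.append(f"value {value!r} not allowed for {name!r}")
    return problems


print(validate_form("VERB", {"number": "singular", "person": "third", "tense": "present"}))
print(validate_form("VERB", {"tense": "simple present"}))
```

A form annotated with a value outside the schema (here "simple present" rather than "present") is reported instead of silently accepted, which is the kind of inconsistency the grammar definition is meant to surface.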

As for the grammatical relations, these can be encoded either as data in Wikidata or as functions in WF. Dependency relations can be implemented as unification of the grammatical features of their nodes: one could implement each relation as a Composition WF function, using the Unify operator as a built-in function. For instance, a "subj" relation for English would be implemented as follows (using shorthand notation):

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    Unify(noun.number, verb.number);
    Unify(noun.person, verb.person);
    Unify(noun.case, NOMINATIVE);

Note that, implemented like this, the subj function is not a pure function, since it modifies its input arguments (in fact the return value is not used, unless a unification error occurs). To keep things simple, this special behavior would need to be supported by the function evaluator.

One may want to bundle together features which get unified together. For instance, if we observe that number and person are often unified together, we may define a sub-function such as the following:

agr(left, right):
    Unify(left.number, right.number);
    Unify(left.person, right.person);

Then we can redefine the subj relation above as follows:

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    agr(noun, verb);
    Unify(noun.case, NOMINATIVE);
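
The shorthand definitions of subj_en and agr above can be made executable with a small Python sketch. It is one possible reading, not the proposed implementation: the Feature and Node classes, the union-find style linking of unbound features, and the UnificationError exception are assumptions introduced here to show how Unify could modify its arguments in place.

```python
class UnificationError(Exception):
    """Raised when two bound features carry different values."""


class Feature:
    """A feature slot, unbound (value None) or bound to a value."""

    def __init__(self, value=None):
        self.value = value
        self.parent = None  # set once this slot is merged into another

    def find(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return node


def unify(a, b):
    """Unify two features, or a feature and a constant, in place."""
    a = a.find() if isinstance(a, Feature) else Feature(a)
    b = b.find() if isinstance(b, Feature) else Feature(b)
    if a is b:
        return
    if None not in (a.value, b.value) and a.value != b.value:
        raise UnificationError(f"{a.value} conflicts with {b.value}")
    if a.value is None:
        a.value = b.value
    b.parent = a  # both slots now share one value


class Node:
    """A tree node: a lemma plus a bag of features."""

    def __init__(self, lemma, **features):
        self.lemma = lemma
        self.features = {k: Feature(v) for k, v in features.items()}

    def feat(self, name):
        return self.features.setdefault(name, Feature())

    def get(self, name):
        return self.feat(name).find().value


def agr(left, right):
    unify(left.feat("number"), right.feat("number"))
    unify(left.feat("person"), right.feat("person"))


def subj_en(noun, verb):
    unify(noun.feat("pos"), "NOUN")  # validate types
    unify(verb.feat("pos"), "VERB")
    agr(noun, verb)
    unify(noun.feat("case"), "NOMINATIVE")


noun = Node("Malala Yousafzai", pos="NOUN", number="SINGULAR", person="THIRD")
verb = Node("be", pos="VERB")
subj_en(noun, verb)
print(verb.get("number"), verb.get("person"), noun.get("case"))
# SINGULAR THIRD NOMINATIVE
```

After the call, the verb node carries the number and person of its subject, which is what the inflection step needs in order to select the form "is". Applying subj_en to nodes with conflicting features (e.g. a plural noun and a verb already marked singular) raises UnificationError.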

Modular design of grammars and renderers

It is often the case that languages from the same language family exhibit some grammatical and structural similarities. One may take advantage of this phenomenon by defining a hierarchy of languages and language-families,[9] and allow the NLG system to use dynamic dispatch to the most concrete implementation of a (sub)-renderer or a (sub)-relation.
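
The dispatch idea can be sketched in a few lines of Python. The hierarchy below uses ISO 639-5 family codes but is deliberately simplified, and the registry contents and the function name resolve are hypothetical; the sketch only shows the lookup strategy of walking from a language up through its families until an implementation is found.

```python
# Simplified, illustrative hierarchy: each code maps to its parent (None at the root).
PARENT = {
    "en": "gmw", "de": "gmw",   # English, German -> West Germanic
    "gmw": "gem",               # West Germanic -> Germanic
    "gem": "ine",               # Germanic -> Indo-European
    "fr": "roa",                # French -> Romance
    "roa": "itc",               # Romance -> Italic
    "itc": "ine",               # Italic -> Indo-European
    "ine": None,
}

# Hypothetical registry: (relation or renderer name, language/family code) -> implementation.
REGISTRY = {
    ("subj", "en"): "subj_en",
    ("subj", "gem"): "subj_gem",
    ("det", "ine"): "det_ine",
}


def resolve(name, lang, registry=REGISTRY, parent=PARENT):
    """Return the most specific implementation of `name` available for `lang`."""
    code = lang
    while code is not None:
        impl = registry.get((name, code))
        if impl is not None:
            return impl
        code = parent.get(code)
    raise LookupError(f"no implementation of {name!r} for {lang!r}")


print(resolve("subj", "en"))  # subj_en
print(resolve("subj", "de"))  # subj_gem
print(resolve("det", "fr"))   # det_ine
```

With this scheme, a relation written once at the family level serves every language below it, and a language only needs its own implementation where it diverges from its relatives.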

Other approaches

To date, I'm aware of two other systems that have been proposed to handle the NLG of Abstract Wikipedia.

  1. Grammatical Framework (GF) is an established functional programming language intended to support multilingual natural language generation and understanding (see newsletter description). It has a thriving community of computer scientists, linguists and other enthusiasts who contribute to it.
  2. Ninai/Udiron is a Python-based NLG system built by community member Mahir Morshed, which uses lexeme data from Wikidata and combines them using UD trees. The system has been built with the Abstract Wikipedia project in mind. Some interesting examples of constructors and how they are rendered can be found in the Ninai demonstrations.

While the two systems are different, they can be contrasted with the proposal outlined in this document along similar axes:

  • Both systems are geared toward converting relatively abstract & compositional semantic representations into grammatical structure and then text.
  • They require mastering some programming skills, be it a domain-specific language (GF) or a general programming language (Python).
  • The ordering of the words in the output text is determined by the entire NLG pipeline (e.g. adding a Question operator could change the word order in English).
  • Insofar as the grammar definitions are correct, the output is guaranteed to be grammatical.

The proposal outlined in this document, on the other hand, is specifically intended to make it as easy as possible for people without prior technical knowledge to make contributions. This implies the following:

  • It can work with concrete, non-compositional semantic representations (such as the Age example above). This, however, does not exclude handling more abstract representations.
  • At the entry level, almost no programming skills are required to write templatic renderers. Knowledge of linguistics (in particular dependency annotations) can be useful to achieve grammatical output, and is necessary in order to write the grammar specifications themselves.
  • The ordering of words is determined by the templates themselves, and is not changed later in the pipeline.
  • Output can be ungrammatical, if a template has not been designed correctly.


  1. The question of whether recursion exists in all languages has been hotly debated in recent years.
  2. It may be useful to allow rendering a constructor either nominally (e.g. “Marie’s marriage to Pierre”) or verbally (“Marie got married to Pierre”). In that case, more than one renderer per constructor would be needed.
  3. The SUD formalism is simpler and possibly more adequate for NLG tasks. Osborne & Gerdes (2019) provide a discussion of the shortcomings of UD. See also https://surfacesyntacticud.github.io/conversions/ for a comparison of the two formalisms. In either case, we may need to extend the set of dependency relations in order to capture some patterns required for NLG, such as pronominal cross-reference.
  4. The templating language could be designed to be "syntactic sugar" above the Composition language, and thus it could probably be run by the same evaluator as the Composition language.
  5. See the poster "Using Dependency Grammars in guiding Natural Language Generation" (A. Gutman, A. Ivanov, J. Kirchner, 2019) as well as the corresponding working paper.
  6. One can use Unicode’s Common Locale Data Repository (CLDR) library to render cardinals and ordinals in different languages, as well as other data types such as dates.
  7. In practice the age should probably be calculated from the birthdate, but for the sake of example, it is specified in the constructor. We may moreover envisage a dynamic constructor in which part of the data is calculated on the fly.
  8. Currently there is no consistency in the annotation of lexemes, even within a single language. For example, the form "has" is annotated as "singular, third-person, simple present" while the form "is" is annotated as "third-person singular, indicative present".
  9. Depending on the needed granularity, one may use the existing hierarchical codes as defined in the ISO 639-5 standard, or alternatively rely on the existing language-hierarchy defined in MediaWiki.