Abstract Wikipedia/Template Language for Wikifunctions/Transformation of template syntax to Wikifunctions

From Meta, a Wikimedia project coordination wiki

A template written in the template syntax can be parsed into composition syntax (and thus into a Z7 function invocation object), which can then be evaluated directly by the Wikifunctions Orchestrator.

Output type of the evaluation

Per the proposed NLG architecture, the output of the templatic renderer should be a "Lemma tree". In the implementation proposed below, however, the required output is not a fully specified tree but a lexeme list, where a lexeme[Note 1] is a Z-type with the following fields:[Note 2]

  • Lemma: citation form of the lexeme (needed mostly for debugging purposes)
  • Language code (mostly this is identical to the target language code of the template, but may vary in special cases)
  • Part-of-speech
  • List of unifiable grammatical features applicable to all forms
  • List of unifiable grammatical features serving as constraints on the forms (populated by the application of dependency rules)
    • In practice, the two above lists can be implemented as a single list of grammatical features.
  • List of forms, the latter being a Z-type having the following fields:
    • Orthography
    • Space handling (whether spaces should be suppressed before/after the form, to handle various types of clitics/affixes)
    • List of unifiable grammatical features specific to the form

The unifiable grammatical feature type mentioned above has three fields:

  • Unification index: serves to identify these features with other lexemes in the derivations.
  • Grammatical category: the type of the feature, e.g. number or gender
  • Grammatical feature (proper): the feature itself, e.g. plural or masculine

The lexeme list type is a list whose items can be either of the lexeme type or be lexeme lists themselves. Moreover, each such list is augmented with the index of the root lexeme, possibly through the Z22 (Pair) type. The result is a partially specified dependency tree of lexemes, in which only arcs between the roots of subtemplates and their supertemplates are recorded, but arcs within any single template are not kept.[Note 3]
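The types described above could be modeled roughly as follows. This is a minimal Python sketch, not the Wikifunctions Z-type definitions: all class and field names are assumptions made for illustration, the two space-handling flags are only one possible encoding of "space handling", the two feature lists are merged into one (as the text allows), and the root index is assumed to be 1-based.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Feature:
    """A unifiable grammatical feature (hypothetical field names)."""
    category: str                            # e.g. "number" or "gender"
    value: str                               # e.g. "plural" or "masculine"
    unification_index: Optional[int] = None  # links features across lexemes


@dataclass
class Form:
    orthography: str
    features: list[Feature] = field(default_factory=list)
    no_space_before: bool = False            # assumed encoding of space handling
    no_space_after: bool = False


@dataclass
class Lexeme:
    lemma: str
    language: str
    part_of_speech: str
    features: list[Feature] = field(default_factory=list)  # single merged list
    forms: list[Form] = field(default_factory=list)


@dataclass
class LexemeList:
    """A list of lexemes or nested lexeme lists, plus the index of its root."""
    items: list[Union[Lexeme, LexemeList]]
    root: int = 1                            # assumed to be 1-based

    def root_lexeme(self) -> Lexeme:
        """Follow root indexes through nested lists down to a single lexeme."""
        item = self.items[self.root - 1]
        return item.root_lexeme() if isinstance(item, LexemeList) else item


# Usage: a sub-list for "deux ans" nested inside a larger list.
deux = Lexeme("deux", "fr", "numeral", forms=[Form("deux")])
an = Lexeme("an", "fr", "noun", forms=[Form("an"), Form("ans")])
year = LexemeList(items=[deux, an], root=2)
outer = LexemeList(items=[Lexeme("avoir", "fr", "verb"), year], root=2)
```

Only the root index of each (sub)list is stored, which mirrors the partially specified tree described above: the arc from a sublist's root to its parent list can be recovered, while arcs inside a single list are not represented.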

Construction of the composed function

The composed function call is built as follows:[Note 4]

  • Every template element is transformed into a function call, as follows:
    • Textual elements, including punctuation marks, are passed as arguments to a TemplateText function, e.g. TemplateText("is"), TemplateText(".").
      • The TemplateText function transforms every word (or punctuation mark) into a lexeme (or more precisely, a lexeme list of length 1). In most cases that lexeme will contain exactly one form, namely the text provided, effectively simulating static text. However, in select cases, this gives an opportunity to expand certain words to a list of forms (e.g. the English determiner "a" could be expanded to {"a", "an"}), or apply some relevant annotation, for easier access to certain common grammatical patterns.
    • Slots with strings or interpolation of string arguments are also transformed to TemplateText calls, e.g. {"text"} → TemplateText("text"), {string_field} → TemplateText(string_field).
    • Interpolations of other argument types will be handled similarly, but with other functions (TBD).
      • Numeric types, for instance, may be converted to lexemes using an implicit Cardinal function. Besides converting the numeric value to a lexeme (e.g. the number 1 to the string "1"), this function would enrich the lexeme with the grammatical number feature corresponding to the numeric value, according to the grammar of the language. For example, a 1 quantifying a count noun would force singular rendering of the noun it is associated with, and a 2 would force plural rendering.
    • Slots with function invocations will simply use the corresponding function call (which may be a sub-template), e.g. {Lexeme(entity)} → Lexeme(entity). Note that these functions must return a lexeme list type.
    • Any conditional function would be applied on top of the previously given function call, e.g. {Lexeme(entity)|Elide_if(ellipsis)} → Elide_if(ellipsis, Lexeme(entity)).
  • The list of resulting function calls is passed into a Template function, together with the 1-based index of the root-labeled slot.[Note 5] For example:
    Hello {root:Person(entity)|Elide_if(ellipsis)}! →
    Template(2, [TemplateText("Hello") Elide_if(ellipsis, Person(entity)) TemplateText("!")])
  • The dependency relations are applied as further function invocations on top of the Template invocation, together with the 1-based indexes of the target and source labels. Note that this means that each dependency role should correspond to a Wikifunctions function transforming the lemma tree to reflect the application of the given role; e.g. an amod role would enforce agreement between an adjective and the noun it modifies on the lemma tree. For example:
    Bonjour {det:DefiniteArticle()} {amod:Lexeme(L10098)} {root:Person(entity)}! →
    amod(3, 4, det(2, 4, Template(4, [
    TemplateText("Bonjour")
    DefiniteArticle()
    Lexeme(L10098) -- adjective "petit"
    Person(entity)
    TemplateText("!")
    ])))
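The construction steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation referenced in Note 4: the Element class and both function names are hypothetical, the template is assumed to be already parsed into elements, list items are joined with commas for readability, and only dependency labels whose head is the root slot are handled (source-label syntax such as concord_1<nummod is out of scope).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """One parsed template element (hypothetical parser output)."""
    call: str                        # e.g. 'TemplateText("Hello")' or 'Person(entity)'
    label: Optional[str] = None      # dependency label, e.g. "root" or "det"
    condition: Optional[str] = None  # e.g. 'Elide_if(ellipsis)'


def apply_condition(element: Element) -> str:
    """Wrap the call in its conditional function, if there is one."""
    if element.condition is None:
        return element.call
    # Assumes the condition has the shape Name(args) with non-empty args.
    name, _, rest = element.condition.partition("(")
    args = rest[:-1]  # drop the closing parenthesis
    return f"{name}({args}, {element.call})"


def to_composition(elements: list[Element]) -> str:
    """Build the composed call for a template with root-headed relations."""
    # The root defaults to the first slot when no slot is labeled "root".
    root = next(
        (i for i, e in enumerate(elements, start=1) if e.label == "root"), 1
    )
    calls = ", ".join(apply_condition(e) for e in elements)
    result = f"Template({root}, [{calls}])"
    # Each labeled non-root slot wraps the result, in left-to-right order,
    # so the first label ends up innermost.
    for i, e in enumerate(elements, start=1):
        if e.label not in (None, "root"):
            result = f"{e.label}({i}, {root}, {result})"
    return result


hello = to_composition([
    Element('TemplateText("Hello")'),
    Element("Person(entity)", label="root", condition="Elide_if(ellipsis)"),
    Element('TemplateText("!")'),
])
print(hello)
```

Wrapping the labels in left-to-right order reproduces the nesting used in the examples on this page, where the relation of the leftmost labeled slot is the innermost call.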

Note that the Template and TemplateText functions, as well as the functions derived from the dependency relation names, are all expected to return a lexeme list. Thus, when transforming a Constructor to a template, a further function needs to be applied, tentatively named Render, which should process the lemma tree through the other modules of the NLG pipeline and return the realized text as a string.

Language-specific function dispatch

In Wikifunctions, the language-specific function dispatch will be implemented by augmenting every function label with a language-code suffix. For instance, the English, German and Brazilian Portuguese versions of TemplateText will be named (in English) TemplateText_en, TemplateText_de and TemplateText_pt-BR.[Note 6] It will be the task of the template parser, when delabeling the function names, to look up the most language-specific version of a given function.
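One way to read "most language-specific version" is as a fallback chain over the subtags of the language code. The following Python sketch illustrates that reading; the function name resolve_function, the set of available labels, and the final fallback to an unsuffixed label are all assumptions for illustration and are not specified by the proposal.

```python
def resolve_function(name: str, language: str, available: set[str]) -> str:
    """Return the most language-specific label available for `name`."""
    parts = language.split("-")
    # Try "pt-BR" first, then "pt", dropping one subtag at a time.
    while parts:
        candidate = f"{name}_{'-'.join(parts)}"
        if candidate in available:
            return candidate
        parts.pop()
    # Assumed last resort: a language-independent version without suffix.
    if name in available:
        return name
    raise LookupError(f"no version of {name} found for language {language}")


labels = {"TemplateText", "TemplateText_de", "TemplateText_pt"}
print(resolve_function("TemplateText", "pt-BR", labels))
```

With these labels, a pt-BR template falls back to the generic Portuguese version because no TemplateText_pt-BR exists, while a language with no dedicated version at all falls back to the unsuffixed function.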

Examples of transformation to composition syntax

We repeat here the examples given in the template specification document, together with their transformation to Composition syntax.

Swedish

For convenience of reading, we repeat the template syntax:

Age_renderer_sv(Entity, Age_in_years): "{Person(Entity)} är {Age_in_years} år gammal ."

The corresponding composition syntax (short-hand):

Age_renderer_sv(Entity, Age_in_years): Template(1, [
Person(Entity)
TemplateText("är")
TemplateText(Age_in_years)
TemplateText("år")
TemplateText("gammal")
TemplateText(".")
])

French

Template syntax:

Age_renderer_fr(Entity, Age_in_years):
"{Person(Entity)} a {Year(Age_in_years)}."
Year_fr(years): "{nummod:Cardinal(years)} {root:Lexeme(L10081)}"

Composition syntax:

Age_renderer_fr(Entity, Age_in_years): Template(1, [
Person(Entity)
TemplateText("a")
Year_fr(Age_in_years)
TemplateText(".")
])
Year_fr(years): nummod(1, 2, Template(2, [
Cardinal_fr(years)
Lexeme_fr(L10081)
]))

Hebrew

Template syntax:

Age_renderer_he(Entity, Age_in_years):
"{subj:Person(Entity)} {root:GenderedLexeme(L64310, L64399)} {gmod:Year(Age_in_years)}."
Year_he(years):
"{nummod:Cardinal(years)|Elide_if(years<=2)} {root:Lexeme(L68440)|Elide_if(years>2)}"

Composition syntax:

Age_renderer_he(Entity, Age_in_years):
gmod(3, 2, subj(1, 2, Template(2, [
Person(Entity)
GenderedLexeme(L64310, L64399)
Year_he(Age_in_years)
TemplateText(".")
])))
Year_he(years):
nummod(1, 2, Template(2, [
Elide_if(years<=2, Cardinal_he(years))
Elide_if(years>2, Lexeme(L68440))
]))

Zulu

Template syntax:

Age_renderer_zu(Entity, Age_in_years):
"{subj:Person(Entity)} {root:SubjectConcord()}na{Year(Age_in_years)}."
Year_zu(years):
"{root:Lexeme(L686326)} {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounConcord()}-{nummod:Cardinal(years)}"

Composition syntax:

Age_renderer_zu(Entity, Age_in_years):
subj(1, 2, Template(2, [
Person(Entity)
SubjectConcord()
TemplateText("na")
Year_zu(Age_in_years)
TemplateText(".")
]))
Year_zu(years):
nummod(6, 1, concord(4, 6, concord(2, 1, Template(1, [
Lexeme(L686326)
RelativeConcord()
Copula()
NounConcord()
TemplateText("-")
Cardinal(years)
]))))

Breton

Template syntax:

Age_renderer_br(Entity, Age_in_years):
"{Cardinal(Age_in_years)} {Lexeme(L45068)} eo {Person(Entity)} ."

Composition syntax:

Age_renderer_br(Entity, Age_in_years):
Template(1, [
Cardinal_br(Age_in_years)
Lexeme(L45068)
TemplateText("eo")
Person(Entity)
TemplateText(".")
])

Footnotes

  1. The term "lexeme" is used here quite broadly. It may refer to a typical content word of a language, a grammatical morpheme, or even (in certain circumstances) a grammatical phrase. The essential thing is that it exhibits a list of one or more forms, characterized by grammatical features.
  2. Note the close resemblance of the data type definition with the Lexeme definition of Wikidata, yet these are distinct objects.
  3. Since subtemplates do not necessarily correspond to any linguistic unit (although ideally they should correspond to linguistic constituents of a dependency tree), and moreover their use is at the discretion of the template author, this partial tree specification doesn't necessarily correspond to any specific linguistic logic, but rather reflects the organization of the template in terms of subtemplate use.
  4. An initial implementation of this can be found in Gerrit.
  5. If no slot is labeled root (which is allowed only if there are no other labels as well), the first slot will be identified as the root.
  6. Note that this is independent of the fact that every Wikifunctions function may have labels in different languages. All of these labels should follow the convention of marking the language code that the implementation handles.