Abstract Wikipedia/Template Language for Wikifunctions

Note: If you are mostly interested in an implementation of the language, see the Implementations section below.

The goal of this document is to propose a template language syntax to be used in the Abstract Wikipedia project. This is an elaboration of the Natural language generation (NLG) system architecture document, and of its “templatic renderer” component in particular. The language can be implemented on the Wikifunctions platform by transforming it to the "composition" syntax, or in another programming environment, such as Wikipedia's Scribunto environment. It may also be mapped or transformed into other NLG systems or realisers.

Key changes relative to the first version are as follows: 1) a generalization of the realization algorithm, with the one tailored to Wikifunctions as an appendix; 2) more explicit statements of some design choices, addressing comments, suggestions, and examples raised by community members; 3) further illustrative extensions, such as relating it to other NLG realizers and implementations.

Introduction

Van Deemter, Krahmer & Theune (2003)^[1] describe template-based systems as follows:

Template-based systems are natural language generating systems that map their non-linguistic input directly (i.e., without intermediate representations) to the linguistic surface structure (cf., Reiter and Dale (1997), p.83-84^[2]). Crucially, this linguistic structure may contain gaps; well-formed output results when the gaps are filled or, more precisely, when all the gaps have been replaced by linguistic structures that do not contain gaps.

In other words, templates can be understood as sequences of (static) text interspersed with slots or gaps to be filled by some structured content (Mahlaza & Keet, 2021)^[3] – be this data from, e.g., a database, knowledge from an ontology, or some other information source with structured content^{[Note 1]} – or the result of some computation (possibly the output of another realized template or function). In the proposed design for the NLG module for Wikifunctions (WF), an additional element was added (following Gutman et al., 2019^[4]; 2022^[5]), namely Universal Dependencies (UD) or Surface-Syntactic Universal Dependencies (SUD) annotations linking the slots of the template, so that the resulting output is not merely a string of text, but rather a syntactic tree (or a forest of trees) defined over the slots/text pieces. A slot can correspond to different levels of the organization of a language's syntax, such as a sentence fragment, a word, or a sub-word morpheme. Depending on the syntax of the target natural language, the dependency link may thus be defined between different linguistic elements of the resulting sentence, according to the granularity needed.

Templates can be reframed as Context-Free Grammar (CFG) rules, in which the text serves as terminal symbols and the slots as non-terminals.^{[Note 2]} For this reason, templates are at least as expressive as CFG-based systems, which are known to cover most (but not all) phenomena of language. The difference between the two lies in intent: the designers of a template system do not initially set out to use any intermediate linguistic representations for the task of NLG. In practice, however, as a template system grows and becomes domain-independent, it may acquire templates which resemble a formal grammar of the target language.

In the following we present the proposed template language syntax for Wikifunctions, first as a free-text description, and subsequently using formal notation as a CFG, and as a logical model rendered diagrammatically.^{[Note 3]}

Template Language Syntax

A template consists of a combination of:

Free Text, including punctuation marks and lexemes (words) with special behavior;
Slots, which are interpolations of arguments, or invocations of lexical functions and sub-templates, possibly annotated with a dependency relation.

For the gist of it, a non-normative diagram with the key features of the template language in ORM notation is as follows:

Figure: An illustrative diagram with the key features of the template language in ORM notation. Legend: rectangle with solid line = entity type; rectangle with dashed line = value type (roughly: attribute); rectangle with dividers and a label next to it = relationship; arrow = subsumption; purple blob = mandatory constraint; purple line over a rectangle = uniqueness constraint; purple circle with blob = either is mandatory; purple circle with line = external uniqueness.

We shall name distinct pieces of text (separated by spaces) and slots collectively as template elements. Each template has at least one element, though a sub-template might be empty for implementation reasons.^{[Note 4]} The template language allows specifying dependency arcs between the slots in such a way that every template may be interpreted as a tree over the participating slots. Since sub-templates are not necessarily connected to this tree and a sub-template may have its own root indicated, the template language permits specification of a forest over the slots in the various sub-templates, which then can be connected to the super-template’s root, if desired. The intention of this structure is to provide a dependency grammar analysis of the productive elements in the template, to guide realization. Popular dependency grammar frameworks that may be used for this are, e.g., UD or SUD, but note that these typically only provide analysis at the word level. If the slots are morphemic instead, one would either have to use another grammar framework or extend the UD/SUD dependency grammar framework (e.g., as recently proposed for UD^{[Note 5]} or the closely associated catena).

The slots are demarcated by curly brackets and are composed of two parts, in accordance with the following syntax (square brackets indicate optionality):

{[dependencyLabel:]invocation}

dependencyLabel: The optional dependencyLabel part has the following syntax, being composed of three elements, followed by a colon: role[_index][<sourceLabel]:
- role indicates the (incoming) dependency relation of the slot, and is thus a mandatory component of the dependencyLabel.
  - The slot acting as the root (a.k.a. head) of the dependency tree must have the role root. If dependency labels are used in a template, there must eventually be exactly one root for the sentence.
  - Each role should correspond to a Wikifunctions function transforming the lemma tree to reflect the application of the given role; e.g. a subj role would enforce subject-verb agreement on the lemma tree. These Wikifunctions functions are language-specific and may moreover have labels in the said language (e.g., "onderwerp" for subj in Dutch).
- _index is an optional positive integer following an underscore, which allows differentiating multiple slots with the same role (e.g. two adjectival modifiers: amod_1, amod_2).
  - The combination of role and index must be unique in each template.
  - If role itself is unique within a template, the index part is not needed.
- <sourceLabel is an optional string which allows specifying the source of the incoming relation given as the label of the source slot (which is its role followed by the optional index). If the source is not given, the root slot is taken to be the source implicitly.
- If present, the dependency label part is followed by a colon to separate it from the invocation.
invocation: The invocation part can be one of three types: functionInvocation(), interpolation, or "string".
- functionInvocation() is, as its name suggests, an invocation of a function. The function can take between its parentheses a comma-separated list of arguments, which may themselves be invocations, as defined above. Note that the invoked function can be a sub-template.
- interpolation amounts to filling the slot by a formal argument of the template, which may be either filling the slot directly or filling the argument of a function in the slot.^{[Note 6]} As the argument is generally not of the required return type of a slot, in order for this to work, an implicit conversion function is required. In particular, fields of the Constructor, which are themselves sub-constructors, when interpolated, will cause their corresponding template to be invoked with their fields, to allow for compositionality.
- "string" (any text enclosed by quotation marks) is equivalent to using the same string as free text, but with the difference that it must be labeled with a dependency role.
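The slot syntax described above can be illustrated with a small parser. The following is a minimal sketch in Python, not part of the proposal: the names `Slot` and `parse_slot` are hypothetical, role names are assumed to consist of ASCII letters only, and the invocation part is kept as an unparsed string.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A label is a role with an optional positive index, e.g. "amod_2".
# Assumption: role names consist of ASCII letters only.
_LABEL = r"[A-Za-z]+(?:_[1-9][0-9]*)?"
_SLOT = re.compile(
    rf"^(?:(?P<label>{_LABEL})(?:<(?P<source>{_LABEL}))?:)?(?P<invocation>.+)$",
    re.DOTALL,
)

@dataclass
class Slot:
    label: Optional[str]    # role plus optional index, or None if unlabeled
    source: Optional[str]   # label of the source slot, or None (implicit root)
    invocation: str         # function call, interpolation, or "string"

def parse_slot(text: str) -> Slot:
    """Parse the content found between the curly brackets of one slot."""
    match = _SLOT.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid slot: {text!r}")
    return Slot(match.group("label"), match.group("source"),
                match.group("invocation"))

print(parse_slot("concord_1<nummod:NounConcord()"))
print(parse_slot("Age_in_years"))
```

Because a label must be followed directly by a colon, an unlabeled interpolation such as Age_in_years is not mistaken for a label with an index.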

Formalisation as a CFG

The above syntax can be formalized as a set of CFG rules. Non-terminal symbols are given in italics, while terminal symbols (as well as sets of terminal symbols) are given in monospace.

Rule	Comment
Template → Element
Element → `{`Slot`}` | Text | Element Element
Text → `lexeme` | `punctuation` | `string`	`lexeme` is the set of all forms that have the same meaning (e.g., sit, sits and sitting) and eventually leads to a lemma, so it is used here as shorthand for Text ending in a terminal. `punctuation` is likewise a set that eventually leads to a terminal symbol, and is shorthand notation too. `string` can be any combination of characters, excluding spaces and `{ }`.
Slot → DependencyLabel `:` Invocation | Invocation
DependencyLabel → Label `<` SourceLabel | Label | `root`
Label → Role `_` Index | Role
Role → …	finite set of role names (excluding root), for instance taken from UD/SUD or an extension thereof
Index → `1` | `2` | …	finite set of positive integers
SourceLabel → Label	the source label can only be one of those that have been used for Label elsewhere in the template; a different name is used here mainly for clarity
Invocation → FunctionInvocation | Interpolation | String
FunctionInvocation → F(ArgumentList) | F()	a function with a variable, possibly empty, number of arguments
ArgumentList → Invocation | Invocation , ArgumentList	any invocation can act as an argument
F → functionName | templateName	these are names of functions (or templates acting as functions) available in the system
Interpolation → `interpolation`	set of names (fields in the constructor), which are only terminals, so shorthand notation here
String → "`string`"	also a set of terminals (placed within quotation marks), shorthand notation
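As an illustration of the Element rule, a template body can be split into slots and text elements with a single regular expression. This is only a sketch under simplifying assumptions: the function name `split_elements` is hypothetical, quoted strings containing curly brackets are not supported, and the original spacing between elements is discarded.

```python
import re

# A slot is anything between curly brackets; any other run of characters
# without spaces or brackets is one text element.
_ELEMENT = re.compile(r"\{([^{}]*)\}|([^\s{}]+)")

def split_elements(template: str) -> list[tuple[str, str]]:
    """Return (kind, content) pairs, where kind is 'slot' or 'text'."""
    elements = []
    for match in _ELEMENT.finditer(template):
        if match.group(1) is not None:
            elements.append(("slot", match.group(1)))
        else:
            elements.append(("text", match.group(2)))
    return elements

for element in split_elements("{Person(Entity)} är {Age_in_years} år gammal ."):
    print(element)
```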

There are three template-level features that have to be recorded with each template but that do not affect the parts that will be subject to rendering. They are:

The name/identifier of the template (mandatory). In a given scope of templates (which is defined by the implementation), the name must be unique.
The language that it has to be rendered in (mandatory). The language code should match with those used, planned to be used, or may in the future be used in Wikidata.
The preconditions (optional). A constructor may be associated with more than one template, i.e., have a list of variant templates related to it. Such a list of variant templates will have one or more preconditions for selection of the appropriate template. The first variant template that fulfills the precondition(s) is selected and realized. Preconditions could include parameters such as:
- Random seed conditions or ranking of template variants for injecting variation;
- Global, language-specific, or article-specific arguments for the realization, such as level of formality or verbosity;
- Whether specific fields of the constructor are present or not, or conditions on their values;
- The grammatical role of a sub-template within a super-template (as defined by a dependency label).
- And more may be specified as the need arises.
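The selection of a variant template by preconditions can be sketched as follows. All names here (`TemplateVariant`, `select_variant`, the `formality` argument, and the two English template bodies) are hypothetical illustrations; preconditions are modelled simply as predicates over a dictionary of arguments.

```python
from dataclasses import dataclass, field
from typing import Callable

Precondition = Callable[[dict], bool]

@dataclass
class TemplateVariant:
    name: str                 # unique within its scope
    language: str             # language code
    body: str                 # the template text itself
    preconditions: list[Precondition] = field(default_factory=list)

def select_variant(variants: list[TemplateVariant], args: dict) -> TemplateVariant:
    """Return the first variant whose preconditions all hold for args."""
    for variant in variants:
        if all(check(args) for check in variant.preconditions):
            return variant
    raise LookupError("no template variant matches the given arguments")

# Hypothetical variants: the formal one is listed first, the last one is a
# catch-all without preconditions.
variants = [
    TemplateVariant("Age_formal_en", "en",
                    "{Person(Entity)} is {Age_in_years} years of age .",
                    [lambda a: a.get("formality") == "formal"]),
    TemplateVariant("Age_default_en", "en",
                    "{Person(Entity)} is {Age_in_years} years old ."),
]

print(select_variant(variants, {"formality": "formal"}).name)
print(select_variant(variants, {}).name)
```

Since the first matching variant wins, the order of the list encodes the ranking of the variants.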

Extension by conditional functions

One may extend the syntax with so-called ‘syntactic sugar’, i.e., short-hand notations for more elaborate templates with multiple subtemplates. For instance, the slot's function invocation can be extended by one or more conditional functions, like this:

{[dependencyLabel:](invocation)[|conditionalFunction]*}

The optional conditionalFunction part, which may be repeated by separating it from the previous function via a pipe symbol, specifies a conditional function invocation that takes, in addition to the given arguments, the result of the invocation before the pipe symbol. In other words, it is a function that takes a list of lexemes and returns a modified list of lexemes. Some possible usages are:

This allows for conditional eliding of slots (while possibly still keeping their grammatical features) or conditional pronominalization (referring expression generation); e.g., “Simba the lion … ; he …” cf repetition “Simba the lion … ; Simba …”
It allows for more flexibility in sentence structure, such as choosing between alternative words in a slot or common sentence reorderings; e.g., “because x, we did y” and “we did y, because of x”.
and possibly other operations.

The reason for this syntax, rather than simple composition of functions, is to increase readability of the template, as the main invocation part should clearly indicate the type of rendered result we expect in the slot. Without this syntactic sugar, one could embed one function within the other, yielding {ConditionalFunction(invocation, conditional_arguments)} or add a conditional layer on top of (sub-)template selection so as to be able to process such preconditions.
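The equivalence between piping and plain function composition can be sketched in Python. The representation of a lexeme as a dictionary with `text` and `features` keys, and the names `elide_if` and `evaluate_slot`, are assumptions made for this illustration only.

```python
from functools import reduce
from typing import Callable

Lexemes = list[dict]  # assumed shape of a lexeme: {"text": str, "features": dict}
Conditional = Callable[[Lexemes], Lexemes]

def elide_if(condition: bool) -> Conditional:
    """Blank the text of every lexeme when condition holds, keeping features."""
    def apply(lexemes: Lexemes) -> Lexemes:
        if not condition:
            return lexemes
        return [{**lexeme, "text": ""} for lexeme in lexemes]
    return apply

def evaluate_slot(result: Lexemes, *conditionals: Conditional) -> Lexemes:
    """Feed the invocation result through the piped functions, left to right."""
    return reduce(lambda acc, fn: fn(acc), conditionals, result)

years = 2
cardinal = [{"text": "2", "features": {"pos": "numeral"}}]
# Corresponds to a slot of the form {Cardinal(years)|elide_if(years<=2)}.
print(evaluate_slot(cardinal, elide_if(years <= 2)))
```

Each piped function receives the output of the previous one, so the main invocation stays visible at the start of the slot.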

Examples of the syntax are included further below.

Core functions

A core element of the template language is the function invocation within a slot. These function invocations should return either lexemes (paradigms of inflected forms associated with grammatical features, similar to Wikidata lexemes) or partially-specified syntactic trees of lexemes, where the root lexeme of each sub-tree is minimally given.

The core functions of the template language, however, return single lexemes. These core functions will be written by Abstract Wikipedia contributors, so there is no closed list. Examples of functions that may be considered core functions, upfront or over time, are:

Lexeme(L-id, …): Fetches a Wikidata lexeme associated with an L-id (in a given language), transforming it to the Lexeme type. Further arguments can be Q-ids of grammatical features that constrain the form choice.
Label(entity) or Label(entity, language): Fetches the label associated with a Q-id in a given language, which defaults to the realization language if not specified.
Cardinal(integer): Creates a lexeme from the integer (possibly spelled-out) with the corresponding grammatical number (singular/plural/etc.).^{[Note 7]}
Ordinal(integer): Creates a lexeme corresponding to the desired ordinal number (possibly with inflections, depending on language).
TemplateText(string): Creates a lexeme corresponding to the input string.

Additionally, one may implement functions corresponding to specific semantic domains, for example:

Person(entity): Like Label, but also populates grammatical gender in accordance with the person's gender, possibly creating also pronominalized lexeme forms.

Language-specific function dispatch

All the functions mentioned above may (and most probably will) have a language-specific implementation. How the language-specific implementation will be chosen depends on the implementation.
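Since the dispatch mechanism is left to the implementation, the following is only one possible approach, sketched in Python: a registry keyed by function name and language code, with a language-independent fallback. The registry, the `"*"` fallback key, and the toy English number words are all assumptions of this sketch, and functions return plain strings rather than lexemes for brevity.

```python
from typing import Callable

# Hypothetical registry: (function name, language code) -> implementation.
# The language code "*" marks a language-independent fallback.
_REGISTRY: dict[tuple[str, str], Callable[..., str]] = {}

def register(name: str, language: str = "*"):
    """Decorator registering an implementation for a function and language."""
    def decorator(fn):
        _REGISTRY[(name, language)] = fn
        return fn
    return decorator

def dispatch(name: str, language: str, *args) -> str:
    """Call the language-specific implementation, or the fallback if any."""
    fn = _REGISTRY.get((name, language)) or _REGISTRY.get((name, "*"))
    if fn is None:
        raise LookupError(f"no implementation of {name} for {language!r}")
    return fn(*args)

@register("Cardinal")
def cardinal_digits(n: int) -> str:
    return str(n)  # generic fallback: numeric output

@register("Cardinal", "en")
def cardinal_en(n: int) -> str:
    words = {1: "one", 2: "two", 3: "three"}  # toy coverage only
    return words.get(n, str(n))

print(dispatch("Cardinal", "en", 2))
print(dispatch("Cardinal", "sv", 24))
```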

Treatment of comments

One may want to add comments to a template or to an element of a template. In the current specification, there are no separate annotation fields for the elements. If desired, an implementation can provide for comments in the same way as comments are added in code: they can be placed anywhere in the specification, preceded by one or more suitable reserved characters. The reserved string depends on the implementation where the template specifications are stored; e.g., // comment here (as used in the CFG specification above) or % comment here. For instance, in Wikifunctions, comments are also shown on the function's page, which could also be adopted for templates, or shown in another additional interface component.

Realization algorithm

The realization of a template specified using the above notation happens in five distinct phases.

Phase 1

All template elements are expanded by evaluating their content: textual elements, slot-strings, and string interpolations are transformed into lexemes (using either a generic or a language-specific function)^{[Note 8]} and all slot functions are evaluated with their arguments. Note that when subtemplates are evaluated, they need to be evaluated recursively up to phase 2 (detailed below).

The evaluation of each text element or of a slot should yield either a lexeme or a lexeme list datatype:

A lexeme consists of a collection of inflectional forms, each having its own spelling associated with a list of grammatical features, as well as a list of common grammatical features (applicable to all forms or constraining them). Every grammatical feature consists of an identifier (e.g. “singular”) associated with a grammatical category (e.g. “number”). Among these grammatical features a part-of-speech feature should be obligatory. Note that there is some confusion regarding the above terms, as different sources use the term feature either as category or as a specific value of a category (see Kibort & Corbett, 2008^[6], for a discussion of this terminology). Here we use the term feature in a narrow sense to designate the value of a category (e.g. plural) and in a broad sense (written in bold) to designate the above datatype indicating both the category and its assigned value.
All textual elements and most core functions should evaluate to the lexeme datatype.
A lexeme list is simply an ordered list of items, of which one is identified as the root of the list. Each item is itself either a lexeme or a lexeme list. This means that a lexeme list is a partially-specified tree of lexemes. By recursively diving into the root of a lexeme list one eventually gets to a single lexeme, which is termed the root lexeme of the list.
The evaluation of sub-templates should in general yield a lexeme list (except in the special case where the subtemplate evaluates to a single lexeme), where each item in that list corresponds to an element of the subtemplate. Moreover, the element which is given the root label should correspond to the item which is identified as the root of the list.
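The two datatypes and the notion of a root lexeme can be sketched with Python dataclasses. The class and field names below are assumptions of this sketch, and features are simplified to a flat mapping from category to value. The French forms an/ans are taken from the French example in this document.

```python
from dataclasses import dataclass, field

@dataclass
class Form:
    spelling: str
    features: dict            # category -> value, e.g. {"number": "plural"}

@dataclass
class Lexeme:
    forms: list               # list of Form
    features: dict = field(default_factory=dict)  # common features, incl. "pos"

@dataclass
class LexemeList:
    items: list               # each item is a Lexeme or a nested LexemeList
    root_index: int = 0       # position of the item identified as the root

def root_lexeme(node):
    """Follow the root items downwards until a single lexeme is reached."""
    while isinstance(node, LexemeList):
        node = node.items[node.root_index]
    return node

an = Lexeme([Form("an", {"number": "singular"}),
             Form("ans", {"number": "plural"})], {"pos": "noun"})
cardinal = Lexeme([Form("24", {})], {"pos": "numeral", "number": "plural"})
year_phrase = LexemeList([cardinal, an], root_index=1)   # root is the noun
wrapper = LexemeList([year_phrase])                      # nested list

print(root_lexeme(wrapper).forms[0].spelling)
```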

Phase 2

In this phase, the dependency labels of the template are evaluated. The order of evaluation is immaterial. For each template slot with a label, we call the slot itself the target slot. The slot referred to by the source label part of the label is called the source slot. If no source label is given, the source slot is always taken to be the root slot of the template (the one identified by the root label).

The target lexeme is the lexeme resulting from the evaluation of the target slot. If that evaluation yields a lexeme list, we select the root lexeme of the list. Similarly we find the source lexeme.

Each dependency label should correspond to a function which takes two lexeme arguments, namely the source and target lexemes. If there is no such function, the dependency label is inert. In order to ensure that the order of evaluation of dependency label functions is immaterial, each such function is only allowed to use the following three operations:

Verification of the part-of-speech of the lexeme with respect to the dependency label (i.e. the part-of-speech should be subsumed by some part-of-speech type).^{[Note 9]}
Sharing of some features, identified by their categories, between the two lexemes, by means of unification. Features can only be unified if they are compatible according to a given hierarchy of features. Once they are unified they are shared across both lexemes, and any subsequent change to that feature through either of the lexemes (e.g. by means of another dependency relation function) would affect the other lexeme as well.
Modifying some features of any of the lexemes with a predetermined value, such as assigning “nominative” to the element that is the subject. This is achieved by unifying the feature, identified by its category, with a predetermined value.

Note that the above unification operations operate only on the lexeme-level grammatical features, and not on the features at the form level. After this stage, the common grammatical features of each lexeme are not necessarily true for all the forms, but rather represent grammatical constraints on the forms.
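The sharing and assignment operations can be sketched with shared feature cells, which is one way to make the outcome independent of evaluation order. This is a simplified illustration: the names `Cell`, `share`, `assign`, and `value_of` are hypothetical, lexeme-level features are a dictionary from category to cell, and the feature hierarchy is reduced to plain equality (two values are compatible only if they are identical).

```python
class Cell:
    """A feature value that may be shared by several lexemes."""
    def __init__(self, value=None):
        self.value = value
        self.parent = None    # set once this cell has been merged into another

    def find(self):
        cell = self
        while cell.parent is not None:
            cell = cell.parent
        return cell

def unify_cells(a: Cell, b: Cell) -> None:
    """Merge two cells; fail if both carry different values."""
    ra, rb = a.find(), b.find()
    if ra is rb:
        return
    if ra.value is not None and rb.value is not None and ra.value != rb.value:
        raise ValueError(f"cannot unify {ra.value!r} with {rb.value!r}")
    if ra.value is None:
        ra.value = rb.value
    rb.parent = ra

def share(source: dict, target: dict, category: str) -> None:
    """Share one category between the features of two lexemes."""
    unify_cells(source.setdefault(category, Cell()),
                target.setdefault(category, Cell()))

def assign(features: dict, category: str, value: str) -> None:
    """Unify a category with a predetermined value."""
    unify_cells(features.setdefault(category, Cell()), Cell(value))

def value_of(features: dict, category: str):
    cell = features.get(category)
    return None if cell is None else cell.find().value

noun, numeral = {}, {}
share(noun, numeral, "number")        # e.g. the effect of a nummod relation
assign(numeral, "number", "plural")   # assigned after the sharing took place
print(value_of(noun, "number"))
```

Performing the assignment before the sharing gives the same result, which is the property that makes the order of evaluation immaterial.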

Phase 3

At this point, the tree structure of the lexeme list is no longer needed and the list can be flattened. Then, the resulting lexeme list of the template is traversed and the list of forms of each lexeme is pruned according to the grammatical constraints, and, in case of grammatical under-specification,^{[Note 10]} sorted lexicographically according to a predetermined relative importance of grammatical categories and features, ensuring that forms with default or unmarked features (e.g. “nominative”) are given priority.

The pruning of the forms can be done in two ways, depending on the consistency and cleanliness of the lexical data:

If the lexical data is known to be consistent and clean, in that there are forms for all linguistically-possible combinations of features, and exactly the necessary grammatical categories are represented, strict pruning can be done, in which only forms which subsume the constraints are retained.^{[Note 11]} In particular, this means that if a form has a grammatical category that isn’t mentioned in the constraints, it will be pruned. Among the resulting forms, we moreover remove the forms which subsume any other forms in that list.^{[Note 12]} This amounts to finding the most specific forms (from the feature-hierarchy point-of-view) which still subsume the constraints.
If the lexical data is only partially known or ‘unclean’, a realizer may implement a best-effort alternative strategy. This may be lenient pruning, in which any form whose features are unifiable with the constraints is retained; some form of human-in-the-loop pruning, where input from a user is sought to fill the gap in order to switch to strict pruning; or some way of graceful degradation to eliminate the affected part.

Both these pruning methods may result in multiple remaining forms. For this reason, it is important to sort the forms according to a canonical order, so that a preferred form can ultimately be chosen.
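The two pruning strategies and the canonical sort can be sketched as follows. This is an illustration under simplifying assumptions: features are flat category-to-value dictionaries without a hierarchy, so that "A subsumes B" reduces to "every feature of A also appears in B". The `ORDER` table of category and value priorities is invented for the example.

```python
Features = dict[str, str]
Form = tuple[str, Features]   # (spelling, features)

def subsumes(general: Features, specific: Features) -> bool:
    """True if every feature of `general` appears unchanged in `specific`."""
    return all(specific.get(cat) == val for cat, val in general.items())

def unifiable(a: Features, b: Features) -> bool:
    """True if the two feature sets agree on every category they share."""
    return all(b.get(cat, val) == val for cat, val in a.items())

def prune_strict(forms: list[Form], constraints: Features) -> list[Form]:
    kept = [f for f in forms if subsumes(f[1], constraints)]
    # Drop forms that are strictly more general than another retained form.
    return [f for f in kept
            if not any(g[1] != f[1] and subsumes(f[1], g[1]) for g in kept)]

def prune_lenient(forms: list[Form], constraints: Features) -> list[Form]:
    return [f for f in forms if unifiable(f[1], constraints)]

# Assumed priorities: categories in order of importance, default value first.
ORDER = {"number": ["singular", "plural"], "case": ["nominative", "accusative"]}

def sort_key(form: Form) -> tuple[int, ...]:
    key = []
    for category, values in ORDER.items():
        value = form[1].get(category)
        if value is None:
            key.append(0)                      # unmarked forms come first
        elif value in values:
            key.append(values.index(value) + 1)
        else:
            key.append(len(values) + 1)        # unknown values come last
    return tuple(key)

forms = [("ans", {"number": "plural"}), ("an", {"number": "singular"})]
print(prune_strict(forms, {"number": "plural", "gender": "masculine"}))
print(sorted(prune_lenient(forms, {}), key=sort_key))
```

With an empty constraint set, lenient pruning keeps both forms, and the sort places the singular form first as the preferred default.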

Phase 4

In this phase, the linear list of lexemes is traversed, empty lexemes (which can be the result of empty sub-template realization or some function invocations) are removed, and phonological constraints are calculated (or looked up). This allows selecting the phonologically-conditioned forms of the various lexemes, effectively modeling sandhi phenomena (both word-internally and at word boundaries).

The specifics of this phase depend on the realization language. In general, those lexemes which have phonologically-conditioned forms must access the phonological representation (or features) of their neighboring lexemes in the linear list of lexemes, and accordingly select the corresponding form. Note that this may amount to a further pruning of the existing list of forms (according to the given phonological constraints) or some logic may be applied to mutate the existing forms.

This phase should also take care of necessary fusions of forms. This is similar to selecting a phonologically-conditioned form, but affects more than one lexeme (typically two). In some cases it can lead to modification of one lexeme form and removal of the second lexeme form, where two forms fuse together.
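As a rough illustration of this phase, the sketch below removes empty lexemes and then lets each lexeme with a selection rule prune its forms according to its right-hand neighbour. The data layout (dictionaries with `forms` and an optional `select` rule) is an assumption, and the English a/an rule is an orthographic approximation chosen only because it is widely known; fusion of forms is not covered.

```python
def is_empty(lexeme: dict) -> bool:
    """A lexeme is empty if it has no forms or its preferred form is blank."""
    return not lexeme["forms"] or lexeme["forms"][0] == ""

def article_rule(forms: list, following) -> list:
    """Toy rule: keep "an" before a vowel letter, "a" otherwise."""
    vowel_next = (following is not None
                  and following["forms"][0].lower().startswith(tuple("aeiou")))
    wanted = "an" if vowel_next else "a"
    return [form for form in forms if form == wanted]

def apply_phonology(lexemes: list) -> list:
    """Drop empty lexemes, then apply each lexeme's selection rule, if any."""
    kept = [lexeme for lexeme in lexemes if not is_empty(lexeme)]
    result = []
    for i, lexeme in enumerate(kept):
        rule = lexeme.get("select")
        following = kept[i + 1] if i + 1 < len(kept) else None
        forms = rule(lexeme["forms"], following) if rule else lexeme["forms"]
        result.append({**lexeme, "forms": forms})
    return result

article = {"forms": ["a", "an"], "select": article_rule}
elided = {"forms": [""]}
noun = {"forms": ["apple"]}
print([lex["forms"] for lex in apply_phonology([article, elided, noun])])
```

Removing the empty lexeme first matters here: the article then sees the noun, not the elided element, as its neighbour.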

Phase 5

In this phase the final realization text is constructed. The preferred forms of all lexemes (first among the pruned and sorted list of forms) are concatenated together, with appropriate spacing in between. In general, the spacing of the original template is respected, though consecutive spans of spaces may be reduced to a single space. Some spaces may, however, be removed, especially near punctuation marks, or lexemes which are marked as (orthographic) clitics.

In this phase, capitalization and punctuation are also handled, removing any punctuation which becomes redundant at the final realization (this can happen if some sub-template realizes as an empty string, for instance), and capitalizing sentence-initial words (in languages where it is needed).
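The final assembly can be approximated with a few string operations. This sketch makes strong simplifying assumptions: all preferred forms are joined with single spaces (the spacing of the original template and clitics are ignored), only a small set of ASCII punctuation marks is handled, and the function name `assemble` is hypothetical.

```python
import re

def assemble(preferred_forms: list[str]) -> str:
    """Join preferred forms into a sentence, tidying spaces and punctuation."""
    text = " ".join(form for form in preferred_forms if form)
    # Remove spaces before punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Collapse punctuation repeated because an element in between was empty.
    text = re.sub(r"([.,;:!?])(?:\s*\1)+", r"\1", text)
    text = re.sub(r"\s{2,}", " ", text).strip()
    # Capitalize the sentence-initial word.
    return text[:1].upper() + text[1:]

print(assemble(["Malala Yousafzai", "är", "24", "år", "gammal", "."]))
print(assemble(["he", "left", ",", "", ",", "quietly", "."]))
```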

Example templates

To illustrate the template syntax and the resulting composition syntax, let's examine the example constructor given in the architecture document:

Age(
    Entity: Malala Yousafzai (Q32732)
    Age_in_years: 24
)

In this simple example, both fields of the constructor are needed to generate a sentence (e.g., “Malala Yousafzai is 24 years old”), but there could be constructors with optional fields, allowing for more flexibility.

We may assume that the constructor is associated (by the Orchestrator, or by a specialized Dispatch function) with a function invocation of the form Age_renderer_xx, where xx stands for the language code. Moreover, the sub-fields of the constructor are available as arguments to these functions. For simplicity of exposition, we use English labels here, but in practice, both the constructor name and its arguments may well be Z-ids.

Let's examine how the templates may look for diverse languages in the template syntax (the derived composition syntax and other mappings can be found in the appendices further below). To make the examples more realistic (and interesting), let's assume that some of them make use of a sub-template invocation Year_xx, whose role is to render just the "n years" noun phrase.

Swedish

In Swedish, there is no (person) verbal inflection. Only the numeral 1 in this context may inflect according to the gender of the noun, but if we are only interested in numeric output, this is irrelevant. So the resulting template is very simple, as no grammatical agreement is needed.

Age_renderer_sv(Entity, Age_in_years): "{Person(Entity)} är {Age_in_years} år gammal ."

French

Assuming the Age constructor always refers to people, the main verb avoir can be kept in its singular form a for this constructor, so there is no need for verbal agreement. On the other hand, the noun for year, an, may inflect for number (ans in plural). To encode this we use a grammatical dependency relation, nummod. The effect of this relation (gender and number agreement between noun and quantifier) is defined elsewhere (in the corresponding grammatical function).

Age_renderer_fr(Entity, Age_in_years):

"{Person(Entity)} a {Year(Age_in_years)}."

Year_fr(years): "{nummod:Cardinal(years)} {root:Lexeme(L10081)}"

Hebrew

In Hebrew there are two extra complications. The first is that the main predicate (a pseudo-verbal particle) needs to agree in gender with the subject (and takes the number of years as a genitival complement, marked using the dependency role gmod). The second is that the construction expressing "n years" (in the context of age) realizes only n or only the word for year, depending on the value of n. We may assume that a built-in conditional function called Elide_if allows conditional elision of the textual content of a list of lexemes, without removal of the grammatical features of their root. Using the pipe syntax, this can easily be applied on top of the basic slot invocation.

Age_renderer_he(Entity, Age_in_years):

"{subj:Person(Entity)} {root:GenderedLexeme(L64310, L64399)} {gmod:Year(Age_in_years)}."

Year_he(years):

"{nummod:Cardinal(years)|Elide_if(years<=2)} {root:Lexeme(L68440)|Elide_if(years>2)}"

Zulu

In the Zulu language (a.k.a. isiZulu) the general structure of the sentence is akin to "Person has year(s) that are N" (Output for the example: UMalala Yousafzai uneminyaka engama-25). There are several levels of concord that need to be taken care of:

The Person function for Zulu should fetch the person's name together with the appropriate nominal marker (which is always u for people, written as u- before vowels).
The verb "have" (na-) needs to agree, by means of a subject concord marker, with the subject. While in general people always have subject marker u-, for the sake of the example we allow here for a more general pattern of subject concord, by enforcing a subj relation between the subject slot and a slot fetching the Subject Concord morpheme (by means of a function of the same name). The Subject Concord morpheme is governed by the noun class of the noun in the subject role.
The word for year unyaka (pl. iminyaka), represented by L-id L686326, agrees in number (singular/plural) with the age. This is achieved by enforcing a nummod relation between the year slot and the slot of the age cardinal.
The relative concord links the years to the number of years. It is determined by the noun class of the word for years (which is different for the singular and plural forms). This is achieved through the concord relation applied on a function fetching the Relative Concord marker, having the lexeme "year" as its source, that, in turn, has as property a noun class it is of.
The age numeral itself is treated as a noun and needs the noun prefix. Again, this is achieved using the concord relation, this time having its source in the age cardinal number. We assume here that the Cardinal function for the Zulu language enriches the cardinal lexeme with the appropriate noun class, according to rules of the language.

On top of these morphosyntactic agreement patterns, there are several phonological adjustments that need to happen. These are taken care of in the subsequent parts of the NLG pipeline, but are listed here for completeness:

The verb na ("have") is written as a single word together with the following word iminyaka ("years"). The /a/ is merged (i.e., vowel coalescence) with the /i/ to form an /e/.
The copula ba (“to be”) is realized in this context as y if it is followed by an /i/, and as ng otherwise. Here it is represented by the function Copula().

Age_renderer_zu(Entity, Age_in_years):

"{subj:Person(Entity)} {root:SubjectConcord()}na{Year(Age_in_years)}."

Year_zu(years):

"{root:Lexeme(L686326)} {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounConcord()}-{nummod:Cardinal(years)}"

Breton

(Many thanks to VIGNERON for bringing up this example)

Breton is similar to Swedish in that there is no needed morphological inflection, as nouns always take the singular form after numbers. The resulting sentence for the example constructor should be "25 bloaz eo Malala Yousafzai". For this reason, we don't need to annotate the slot of the "year" lexeme (bloaz, Lexeme L45068) with any dependency relation label, as the singular form would be selected by default. On the other hand, the word bloaz may undergo a phonological mutation (called softening) following certain digits (1, 3, 4, 5, and 9) and the number 1000, to become vloaz. To account for this, we must assume that the Breton implementation of the Cardinal function (Cardinal_br) annotates the rendered number with a phonological feature which indicates that this number may trigger softening on the token following it. If that's the case, the phonotactics module of the NLG architecture would take care of the softening of the following noun, without any dependency label annotations.

Age_renderer_br(Entity, Age_in_years):

"{Cardinal(Age_in_years)} {Lexeme(L45068)} eo {Person(Entity)} ."

Instrumentalization of the Template Syntax

The semantics of the template syntax are given by its realization algorithm, which can be implemented as is (see also implementations), or the template syntax may be used in other ways. In the context of Wikifunctions, the template syntax has to be mapped algorithmically into the Wikifunctions composition syntax. Other approaches may involve mapping the template syntax to other NLG frameworks. A sampling is presented below; more may be added in due course as interest, techniques, and tooling increase.

Transformation of template syntax to Wikifunctions composition syntax

Transformation of template syntax to other formalisms in NLG realizers

Implementations

Currently there is a single implementation of the system, as a Scribunto module.

Footnotes

[Note 1] In the Abstract Wikipedia use case, this would mostly be information originating in Wikidata, including the new Abstract Content which is yet to be defined.
[Note 2] In practice, since the slots may include functions which perform arbitrary computations, they are in fact more expressive than a pure CFG system.
[Note 3] More generally, templates are specified by using a template language, whether formulated descriptively, recast as a Context-Free Grammar (availing of the duality of grammar and language) or in the more common Backus-Naur Form notation, or as a model, such as an XML schema (XSD) and document type definition (DTD) or an ontology (Mahlaza & Keet, 2021).
[Note 4] If the template contains text alone, which is rendered unchanged, it can be considered "canned text", which some authors do not consider a template proper since it does not have at least one slot (see, e.g., ToCT). It is allowed here, as it may be useful in some cases. Similarly, under certain preconditions, one may want a sub-template not to be realized, which can be implemented as the realization of an empty string; alternatively, this can be addressed by processing preconditions differently. Note that elements of text within the template may be changed by the NLG pipeline, if they correspond to certain predetermined lexemes.
[Note 5] See for instance here.
[Note 6] The term "interpolation" is used here in the sense "the insertion of something of a different nature into something else" (as defined by the Oxford Languages dictionary).
[Note 7] As an aside, regarding the data types used in these examples: the actual data types would be those defined in Wikifunctions. As an example, you can see the core types here.
[Note 8] Such as the TemplateText function mentioned above.
[Note 9] Note that verification could also be done in another way, either at this stage or in a separate module. Open questions include how to verify parts of speech and the asserted dependencies, when and where constraint checking has to happen, and how to manage it.
[Note 10] Under-specification may occur if a lexeme has forms which differ according to a grammatical feature which is not specified by the template (directly or indirectly). In particular, this may also occur when Wikidata lexemes specify forms distinguished by phonotactic features. Since phonotactic features have not yet been propagated at this phase, all possible phonologically-conditioned forms will be available after pruning.
[Note 11] See here for an explanation of subsumption of linguistic features.
[Note 12] This can be done while finding suitable candidates which subsume the constraints.

References

[1] Deemter, Kees van; Theune, Mariët; Krahmer, Emiel (2003). "Real versus Template-Based Natural Language Generation: A False Opposition?". Computational Linguistics 31 (1): 15–24. ISSN 0891-2017. doi:10.1162/0891201053630291.
[2] Reiter, Ehud; Dale, Robert (1997). "Building applied natural language generation systems". Natural Language Engineering 3 (1): 83–84. doi:10.1017/S1351324997001502.
[3] Mahlaza, Zola; Keet, C. Maria (2021). "ToCT: A Task Ontology to Manage Complex Templates". Proceedings of the Joint Ontology Workshops 2021, FOIS'21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969.
[4] Gutman, Ariel; Ivanov, Anton; Kirchner, Jess (2019). "Using Dependency Grammars in guiding Natural Language Generation". Poster presented at the Israeli Seminar of Computational Linguistics, IBM Research, Haifa.
[5] Gutman, Ariel; Ivanov, Anton; Saba Ramírez, Jessica (2022). "Using Dependency Grammars in guiding templatic Natural Language Generation". Working paper, retrieved 2022-07-27.
[6] Kibort, Anna; Corbett, Greville G. (2008). "Grammatical Features Inventory: Typology of grammatical features". University of Surrey. doi:10.15126/SMG.18/1.16.
