Abstract Wikipedia/Natural language generation system architecture proposal
Proposal by Ariel Gutman
This document proposes an architecture for a natural language generation (NLG) system for Abstract Wikipedia. When designing the architecture of an NLG system, we have to take the following considerations into account:
- Modularity: the system should be modular, in that various aspects of NLG (e.g. morphosyntactic and phonotactic rules) can be modified independently.
- Lexicality: the system should be able to both fetch lexical data (separate from code), and rely on productive language rules to generate such data on the fly (e.g. inflecting English plurals with an -s).
- Recursivity: due to the compositional and recursive nature of most languages,[1] an effective NLG system would need to be recursive itself.
In the context of Abstract Wikipedia, an additional consideration comes to mind:
- Extensibility: the system should be amenable to extension both by linguistic experts and technical contributors as well as by non-technical and non-expert contributors, working on different parts of the system.
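The lexicality consideration can be illustrated with a small Python sketch: stored lexical data is consulted first, and a productive rule is used only when no stored form exists. The `IRREGULAR_PLURALS` table and the `pluralize_en` function are hypothetical stand-ins introduced for illustration; in the proposed system the stored forms would come from Wikidata lexemes.

```python
# Hypothetical in-memory stand-in for lexical data that would live in Wikidata.
IRREGULAR_PLURALS = {"child": "children", "mouse": "mice", "person": "people"}


def pluralize_en(lemma):
    """Return an English plural: stored form first, productive rule otherwise."""
    if lemma in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lemma]      # lexical data wins when present
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"                  # sibilant endings take -es
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"            # consonant + y becomes -ies
    return lemma + "s"                       # default productive rule


print(pluralize_en("year"))   # productive rule
print(pluralize_en("child"))  # stored irregular form
```

Keeping the table separate from the rule means that contributors can correct lexical data without touching code, which is the separation the lexicality consideration asks for.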
Given the above constraints, it seems reasonable to assume that a single Wikifunctions (=WF) function cannot adequately capture the complexity of such an NLG system. Instead, several such functions should be combined, each responsible for a different step of the NLG pipeline.
In the current execution model of WF functions, a single function cannot:
- Call other WF functions.
- Fetch data from external sources such as Wikidata.
- Alter global state.
To overcome these limitations, this document proposes that the NLG pipeline be orchestrated by the WF Orchestrator, which is not subject to them. Moreover, to facilitate contributions by non-experts, it proposes creating an in-house templating language, which can be evaluated under the control of the WF Orchestrator.
An alternative approach would be to remove the execution limitations of the WF evaluators, so that the entire NLG pipeline could be embedded within a single WF function (which would need to call other WF functions). While this would change certain implementation details of the proposal (for instance, WF contributors could design the templating language themselves), the overall architecture would stay the same.
At the end of the document, a short comparison with other proposed approaches is given.
Architecture overview
As explained above, the entire NLG pipeline cannot be encapsulated within a single Wikifunctions (=WF) function. It must instead be orchestrated by the WF orchestrator, which allows fetching data from external sources (mainly Wikidata), calling different WF functions (defined by contributors) and preserving the necessary state while doing so. The architecture is shown in the following diagram, in which dark blue shapes are elements contributed by the Wikifunctions community (rectangles) or the Wikidata community (rounded rectangles), while light blue elements represent functionality or data living within the WF orchestrator, which are not contributed by the community.

Let us go through its parts in order:
- Given a constructor type, a specific renderer is selected. It may be necessary to allow the renderer to be selected by the requested grammatical type as well. For instance, a constructor may be realized as a noun phrase ("the marriage of Marie and Pierre") or as a verbal phrase ("Marie married Pierre"). In that case, one renderer is needed for each realization. The data contained in the given constructor is passed to the renderer as its function arguments.
- The renderer is basically a template: a combination of static text, and slots which can be filled with the Renderer’s arguments, lexemes from Wikidata, or the output of other renderers. Templates are relatively easy to understand and write, and thus the authoring of renderers will be accessible to non-technical contributors.
- The output of the renderer is a dependency syntax tree (using for instance Universal Dependencies (UD) or Surface-Syntactic Universal Dependencies (SUD) formalisms)[2] in which the nodes are non-inflected lexemes (identified by their lemmas), augmented with some morphological constraints. In practice the tree doesn’t need to be fully specified; in particular, static text doesn’t necessarily need to be part of the tree.
- Relying on a language-specific grammar specification, the morphological constraints coupled with the structure of the syntactic tree allow the inflection of the lemmas, according to the lexical data present in Wikidata, or using inflectional tables of the grammar specification. The output of this step is a linear sequence of text, minimally annotated with part-of-speech information (i.e. whether a word represents a noun, a verb, a preposition etc.).
- In this step, phonotactic constraints are applied to realize language-specific sandhi phenomena. These can include the selection of contextual forms (e.g. in English a/an) or contraction/crasis of adjacent forms (e.g. French de + le = du).
- In the final cleanup step, it may be necessary to adjust spacing, capitalization and punctuation in order to produce the final text for inclusion in a Wikipedia article. This step can be implemented in a language-agnostic way, using the (language-dependent) annotations of the previous steps.
In the above architecture, there are three elements which need to be curated by the community:
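The last two steps of the list above (phonotactics and cleanup) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: the function names and the `FRENCH_CONTRACTIONS` table are hypothetical, the a/an choice is approximated by spelling rather than pronunciation, and only a few well-known French contractions are listed.

```python
import re

# Crude assumption: decide by the first letter, not by the actual sound
# (so words such as "hour" or "university" would be handled wrongly).
STARTS_WITH_VOWEL = re.compile(r"^[aeiou]", re.IGNORECASE)

# Hypothetical contraction table; a real grammar specification would supply it.
FRENCH_CONTRACTIONS = {("de", "le"): "du", ("de", "les"): "des", ("à", "le"): "au"}


def apply_sandhi_en(tokens):
    """Select the contextual form 'a' or 'an' from the following token."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ("a", "an") and i + 1 < len(tokens):
            tok = "an" if STARTS_WITH_VOWEL.match(tokens[i + 1]) else "a"
        out.append(tok)
    return out


def apply_sandhi_fr(tokens):
    """Contract adjacent forms, e.g. de + le = du."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in FRENCH_CONTRACTIONS:
            out.append(FRENCH_CONTRACTIONS[pair])
            i += 2                            # both source tokens are consumed
        else:
            out.append(tokens[i])
            i += 1
    return out


def cleanup(tokens):
    """Language-agnostic cleanup: join, fix spacing, capitalize the start."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return text[:1].upper() + text[1:]


print(cleanup(apply_sandhi_en(["she", "is", "a", "engineer", "."])))
print(cleanup(apply_sandhi_fr(["le", "livre", "de", "le", "garçon", "."])))
```

The sandhi functions are language-specific while `cleanup` is shared, mirroring the split between the phonotactic step and the language-agnostic final step.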
- Templatic renderers - these make up the bulk of the needed work, as every constructor needs one templatic renderer per language (though re-use of renderers for parts of sentences is possible). Note that the term Renderer is used here in a narrower sense than in Architecture for a Multilingual Wikipedia. In the latter, the term Renderer refers to an end-to-end data-to-text function, while here we use the term Renderer to refer to a specific component of the NLG pipeline, namely a template. This is no coincidence, since in the above architecture, the other parts of the pipeline are relatively fixed, and don’t need constant curation by community members.
- Grammar specifications - these would have to specify the relevant morphological features needed for each language, their hierarchy and how these manifest themselves via dependency relations. These specifications may either be stored as data in Wikidata, or as functions in Wikifunctions (to be decided). It is probable that the creation and curation of these grammars will require substantial linguistic and technical knowledge, but since they are created once per (human) language, this is deemed acceptable.
- Wikidata lexemes - these will be curated as today, but it would be important that the features they use are in line with the grammar specifications of each language.
Structure of templates
Since the bulk of the work required from community contributors would be the creation of templatic renderers, it is important to make this task reasonably easy, and in particular not to require programming experience.
Similar to the Composition “language” in Wikifunctions, we can develop an in-house templating language.[3] The templating language should allow specifying a linguistic tree (with UD annotations) over three types of arguments:[4]
- Static text
- Terminal functions fetching lemmas from Wikidata, or creating lemmas on the fly from other arguments (e.g. numbers[5]).
- Other renderers
The templating language would have a dedicated evaluation module, which would be called by the WF orchestrator. The orchestrator would then be responsible for passing the result on to the different modules of the NLG pipeline described above.
Example
Let us take as an example a constructor describing the age of a person:[6]
Age(
    Entity: Malala Yousafzai (Q32732)
    Age_in_years: 24
)
To render such a constructor in English, we would use a templatic notation similar to the following (as it is a kind of Z14/Implementation):
{
    "type": "implementation",
    "implements": "Age_renderer_en",
    "template": {
        "part": {
            "role": "subject",  # grammatical subject
            "type": "function call",
            "function": "Resolve_Lexeme",
            "lexeme": {
                "reference": "Entity"
            }
        },
        "part": {
            "role": "root",  # root of the clause
            "type": "function call",
            "function": "Resolve_Lexeme",
            "lexeme": {
                "value": "be"  # replace with L-id
            }
        },
        "part": {
            "role": "num",  # numerical modifier
            "of": 4,  # Part 4 (“year”)
            "type": "function call",
            "function": "Cardinal_number",
            "number": {
                "reference": "Age_in_years"
            }
        },
        "part": {
            "role": "npadvmod",
            "of": 5,  # Part 5 (“old”)
            "type": "function call",
            "function": "Resolve_Lexeme",
            "lexeme": {
                "value": "year"  # replace with L-id
            }
        },
        "part": {
            "role": "acomp",
            "type": "string",
            "value": "old"
        }
    }
}
Some of the syntactic roles (npadvmod, acomp) are not involved in agreement, so one may omit them.
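To make the example above concrete, here is a minimal Python sketch of how such a template might be evaluated for the Age constructor. Everything beyond the names taken from the template is an assumption made for illustration: the `LABELS` and `FORMS` tables stand in for Wikidata lookups and inflection tables, the parts are held in a list so that the 1-based `of` indices are explicit, cardinals are rendered as digits, and agreement is reduced to copying the number feature along the subject and num relations.

```python
# Hypothetical stand-ins for Wikidata labels and inflection tables.
LABELS = {"Q32732": "Malala Yousafzai"}
FORMS = {
    ("be", "singular"): "is", ("be", "plural"): "are",
    ("year", "singular"): "year", ("year", "plural"): "years",
}

AGE_RENDERER_EN = [  # parts in surface order; "of" is a 1-based head index
    {"role": "subject", "function": "Resolve_Lexeme", "reference": "Entity"},
    {"role": "root", "function": "Resolve_Lexeme", "value": "be"},
    {"role": "num", "of": 4, "function": "Cardinal_number", "reference": "Age_in_years"},
    {"role": "npadvmod", "of": 5, "function": "Resolve_Lexeme", "value": "year"},
    {"role": "acomp", "type": "string", "value": "old"},
]


def render(template, args):
    # Step 1: fill the slots, producing nodes with text or an uninflected lemma.
    nodes = []
    for part in template:
        if part.get("type") == "string":
            node = {"text": part["value"]}               # static text
        elif part["function"] == "Cardinal_number":
            n = args[part["reference"]]
            node = {"text": str(n), "number": "singular" if n == 1 else "plural"}
        elif "reference" in part:                        # entity passed as argument
            node = {"text": LABELS[args[part["reference"]]], "number": "singular"}
        else:                                            # lemma to be inflected later
            node = {"lemma": part["value"]}
        node["role"], node["of"] = part["role"], part.get("of")
        nodes.append(node)

    # Step 2: propagate number along the agreement-bearing relations.
    root = next(n for n in nodes if n["role"] == "root")
    for node in nodes:
        if node["role"] == "subject":
            root["number"] = node["number"]              # verb agrees with subject
        elif node["role"] == "num":
            nodes[node["of"] - 1]["number"] = node["number"]  # noun agrees with numeral

    # Step 3: inflect the lemmas and linearize in template order.
    return " ".join(n.get("text") or FORMS[(n["lemma"], n["number"])] for n in nodes)


print(render(AGE_RENDERER_EN, {"Entity": "Q32732", "Age_in_years": 24}))
# Malala Yousafzai is 24 years old
```

Note that the word order comes directly from the order of the parts, while the inflected forms ("is", "years") are only decided after the relations have been applied.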
Structure of grammar
The grammar needs to encompass the following information:
- Which parts of speech exist in the language.
- Which grammatical features are relevant for each part of speech.
- (Possibly) a hierarchy of features.
- How grammatical relations work (e.g. agreement relations) with respect to the grammatical features and parts of speech.
Note that the first points can be inferred from the Wikidata lexemes available for a given language, but it would be useful to make them explicit as part of a grammar definition, which would also enforce/validate the Wikidata lexeme definitions.[7] One could write such a validator per language as a WF function, which would then run on the Wikidata lexemes to mark if they are correctly annotated according to the language's schema.
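A per-language validator of this kind could look roughly like the following Python sketch. The schema contents, the feature names and the `validate_form` function are hypothetical and chosen only for illustration; an actual schema would be derived from the grammar specification of the language.

```python
# Hypothetical schema: the features each part of speech must (and may only) carry.
EN_SCHEMA = {
    "verb": {
        "number": {"singular", "plural"},
        "person": {"first", "second", "third"},
        "tense": {"present", "past"},
    },
    "noun": {"number": {"singular", "plural"}},
}


def validate_form(pos, features, schema=EN_SCHEMA):
    """Return a sorted list of problems found in one lexeme form's annotation."""
    if pos not in schema:
        return [f"unknown part of speech: {pos}"]
    allowed = schema[pos]
    problems = []
    for name, value in features.items():
        if name not in allowed:
            problems.append(f"unexpected feature: {name}")
        elif value not in allowed[name]:
            problems.append(f"invalid value for {name}: {value}")
    for name in allowed.keys() - features.keys():
        problems.append(f"missing feature: {name}")
    return sorted(problems)


# A form annotated with a feature the schema does not know, and lacking tense.
print(validate_form("verb", {"number": "singular", "person": "third", "mood": "indicative"}))
```

Running such a check over all forms of a language would flag inconsistently annotated lexemes before they reach the inflection step.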
As for the grammatical relations, these can be written either as data in Wikidata or as functions in WF. Since dependency relations can be implemented as the unification of the grammatical features of their nodes, each relation can be implemented as a WF Composition function, using a Unify operator as a building block. For instance, the "subj" relation for English would be implemented as follows:
subj_en(noun, verb):
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    Unify(noun.number, verb.number);
    Unify(noun.person, verb.person);
    Unify(noun.case, NOMINATIVE);
Note that if implemented this way, the subj function is not a pure function, as it alters its input arguments (and in fact its return value is not used, unless a unification error occurred). To make this possible, this special behaviour would require support from the WF orchestrator.
One may want to group features which unify together. For instance, if we observe that number and person often unify together, we can define a small function as follows:
agr(left, right):
    Unify(left.number, right.number);
    Unify(left.person, right.person);
We can then redefine the subj relation above as follows:
subj_en(noun, verb):
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    agr(noun, verb);
    Unify(noun.case, NOMINATIVE);
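The behaviour of the Unify operator can be sketched in Python as follows. This is an illustrative assumption about its semantics, not a definition: nodes are plain dictionaries, a missing feature counts as unspecified, and the hypothetical `unify` helper takes the feature name explicitly. Unlike full unification, two unspecified values are not linked to each other here, so a value assigned later would not propagate.

```python
class UnificationError(Exception):
    """Raised when two incompatible feature values meet."""


def unify(left, feature, right):
    """Unify left[feature] with right, which is another node or a constant.

    Unspecified values are filled in (mutating the nodes); conflicts raise.
    """
    if isinstance(right, dict):
        l, r = left.get(feature), right.get(feature)
        if l is None and r is not None:
            left[feature] = r
        elif r is None and l is not None:
            right[feature] = l
        elif l != r:
            raise UnificationError(f"{feature}: {l} vs {r}")
    else:
        l = left.get(feature)
        if l is None:
            left[feature] = right
        elif l != right:
            raise UnificationError(f"{feature}: {l} vs {right}")


def agr(left, right):
    unify(left, "number", right)
    unify(left, "person", right)


def subj_en(noun, verb):
    unify(noun, "pos", "NOUN")    # validate types
    unify(verb, "pos", "VERB")
    agr(noun, verb)
    unify(noun, "case", "NOMINATIVE")


noun = {"lemma": "girl", "pos": "NOUN", "number": "singular", "person": "third"}
verb = {"lemma": "be", "pos": "VERB"}
subj_en(noun, verb)
print(verb)  # the verb has received number and person from its subject
print(noun)  # the noun has received nominative case
```

The mutation of `noun` and `verb` is exactly the impurity discussed above: the relation communicates its result by updating its arguments, and only signals failure through an error.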
Modular design of grammars and renderers
It is often the case that languages from the same language family exhibit some grammatical and structural similarities. One may take advantage of this phenomenon by defining a hierarchy of languages and language-families,[8] and allow the NLG system to use dynamic dispatch to the most concrete implementation of a (sub)-renderer or a (sub)-relation.
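The dispatch idea can be sketched in Python as a walk up the language hierarchy until an implementation is found. The `PARENT` and `RENDERERS` tables and the `resolve` function are hypothetical; the family codes used (roa, gem, ine) are ISO 639-5 codes, and all renderer names other than Age_renderer_en are invented for the example.

```python
# Hypothetical hierarchy: each code maps to its parent; None marks the top.
PARENT = {
    "pt": "roa", "es": "roa", "roa": "ine",   # Romance
    "en": "gem", "de": "gem", "gem": "ine",   # Germanic
    "ine": None,                              # Indo-European
}

# Hypothetical registry of available (sub)renderers, keyed by constructor and code.
RENDERERS = {
    ("Age", "en"): "Age_renderer_en",
    ("Age", "roa"): "Age_renderer_roa",       # shared by Romance languages
    ("Age", "pt"): "Age_renderer_pt",         # overrides the family renderer
}


def resolve(constructor, language, registry=RENDERERS, parent=PARENT):
    """Return the most specific implementation available for the language."""
    code = language
    while code is not None:
        if (constructor, code) in registry:
            return registry[(constructor, code)]
        code = parent.get(code)               # fall back to the enclosing family
    raise LookupError(f"no renderer for {constructor} in {language}")


print(resolve("Age", "pt"))  # language-specific renderer
print(resolve("Age", "es"))  # falls back to the Romance renderer
```

The same lookup could serve (sub)relations of the grammar, so that a family-level definition is written once and overridden only where a language deviates.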
Other approaches
So far, I am aware of two other systems which have been suggested for use as the NLG system of Abstract Wikipedia.
- Grammatical Framework (GF) is an established functional programming language intended to support multilingual natural language generation and understanding (see newsletter description). It has a thriving community of computer scientists, linguists and other enthusiasts who contribute to it.
- Ninai/Udiron is a Python-based NLG system built by community member Mahir Morshed, which uses lexeme data from Wikidata and combines them using UD trees. The system has been built with the Abstract Wikipedia project in mind. Some interesting examples of constructors and how they are rendered can be found in the Ninai demonstrations.
While these two systems are different, they can be compared with the approach laid out in this document along similar axes:
- Both approaches aim at rendering abstract, compositional content into a grammatical structure and subsequently into text.
- They require mastering programming skills, either in a specialised language (GF) or in a general-purpose programming language (Python).
- The ordering of the words in the output text is determined by the entire NLG pipeline (e.g. adding a Question operator could change the word order in English).
- If the grammatical definitions are correct, they ensure that the output is grammatical.
On the other hand, the approach laid out in this document aims at making contribution easy for non-technical contributors. This means:
- It can work well with non-compositional, shallow, semantic representations (such as the Age example above). However, it does not preclude using other representations.
- At the entry level, it does not require programming skills to write templatic renderers. Linguistic knowledge (in particular of dependency annotation) may be needed to obtain grammatical output, and it is needed to write the grammar specification itself.
- The word order is given by the templates themselves, and does not change later in the pipeline.
- The output may be ungrammatical, if it is not modelled properly.
Footnotes
- ↑ The question of whether recursion exists in all languages has seen heated debate in recent years.
- ↑ The SUD formalism is simpler and possibly more adequate for NLG tasks. Osborne & Gerdes (2019) provide a discussion of the shortcomings of UD. See also https://surfacesyntacticud.github.io/conversions/ for a comparison of the two formalisms. In either case, we may need to extend the set of dependency relations in order to capture some patterns required for NLG, such as pronominal cross-reference.
- ↑ The templating language could be designed to be "syntactic sugar" above the Composition language, and thus it could probably be run by the same evaluator as the Composition language.
- ↑ See the poster "Using Dependency Grammars in guiding Natural Language Generation" (A. Gutman, A. Ivanov, J. Kirchner, 2019) as well as the corresponding working paper.
- ↑ One can use Unicode’s Common Locale Data Repository (CLDR) library to render cardinals and ordinals in different languages, as well as other data types such as dates.
- ↑ In practice, the age should be calculated from the date of birth, but for the sake of the example it is specified in the constructor. We can also imagine a more refined constructor in which calculations on data are done on the fly.
- ↑ Currently there is no consistency in the annotation of lexemes, even in a single language. For example, the form "has" is annotated as "singular, third-person, simple present" while the form "is" is annotated as "third-person singular, indicative present".
- ↑ Depending on the needed granularity, one may use the existing hierarchical codes as defined in the ISO 639-5 standard, or alternatively rely on the existing language-hierarchy defined in MediaWiki.