Module:Sandbox/AbstractWikipedia

From Meta, a Wikimedia project coordination wiki
Module documentation

This module, created by user:AGutman-WMF, is a prototype implementation of Abstract Wikipedia's template language in Scribunto.

For an overview of the logic of the entire system, see the overview documentation page.

The current module handles the high-level verbalization of templates and abstract content. It makes use of several sub-modules, which can be divided into distinct categories:

Core code

This is the core code needed for the system to run. While it may need maintenance, it generally does not need contributions by the template-language users. If the code is to be ported to Wikifunctions, this code can probably live in the back-end code of the Wikifunctions Orchestrator or Evaluator:

  • The current module runs the entire NLG pipeline. It relies on the other modules to do this.
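One stage of that pipeline, the flattening of the lexeme tree into a plain list, can be tried out in isolation. The sketch below is only an illustration in plain Lua: the `text` field and the sample tree are hypothetical, and the value stored in `root` is an assumption (the only thing relied upon is that a node carrying a `root` field is treated as a subtree).

```lua
-- Recursively collect the leaves of a lexeme tree into `result`.
-- A node with a `root` field is treated as a subtree; anything else is
-- treated as a single lexeme.
local function flatten(lexemes, result)
	for _, item in ipairs(lexemes) do
		if item.root then
			flatten(item, result)        -- descend into the subtree
		else
			table.insert(result, item)   -- leaf: a single lexeme
		end
	end
end

-- Hypothetical toy tree; `text` is an illustrative field name only.
local tree = {
	root = 3,
	{ text = "the" },
	{ root = 2, { text = "black" }, { text = "cat" } },
	{ text = "sleeps" },
}

local flat = {}
flatten(tree, flat)

-- Collect the leaf texts in order to inspect the result.
local words = {}
for i, lexeme in ipairs(flat) do
	words[i] = lexeme.text
end
print(table.concat(words, " "))
```

The order of the leaves follows the order of the tree, so later stages (form filtering, phonotactics, text assembly) can work on a simple sequence.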

User-modifiable code & data

The system relies on contributions from users in various ways, both in terms of code and data. When ported to Wikifunctions, these would probably live in the user-visible (and user-modifiable) part of Wikifunctions.

Code

  • Module:Sandbox/AbstractWikipedia/Constructors is a module which fetches or creates abstract content for certain types of items using Wikidata properties. This covers items for which no curated abstract content has been created.
  • Module:Sandbox/AbstractWikipedia/TextAssembler corresponds to the last stage of the architecture, assembling the output text of the pipeline while adjusting punctuation, spacing and capitalization. This is done by the function constructText, which is intended to be language-agnostic. However, some data tables within the module allow adjusting the realization behavior of specific punctuation marks.

Data

  • Module:Sandbox/AbstractWikipedia/GrammaticalFeatures provides tables which link the Q-ids of Wikidata grammatical features and categories to their internal representation. It also provides a canonical ordering of these features, which is necessary for the lexeme form selection algorithm.

Usage

The module can be used in two different modes:

Write content about a Wikidata item

Using the function content, the system will attempt to write some content about the given item. The write-up can either be based on manually curated abstract content or be constructed on the fly from existing Wikidata properties. Either way, the abstract content can only be realized if appropriate renderers for the realization language have already been defined.

The content function takes two arguments:

  • The first argument is the Q-id.
  • The second argument is the required language code (e.g. he). If omitted, the content language of the project will be used.

Example

{{#invoke:Sandbox/AbstractWikipedia|content|Q937|en}}

Albert Einstein was a German physicist. He was born 14 March 1879 in Ulm and died 18 April 1955 in Princeton.

{{#invoke:Sandbox/AbstractWikipedia|content|Q6279|en}}

Joe Biden is an American politician. He was born 20 November 1942 in Scranton.
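The way several realized sentences end up in one paragraph can be sketched in plain Lua. This is a simplified illustration, not the module itself: `joinSentences` is a hypothetical helper, and the input strings stand in for the results of realizing each constructor. Empty results are skipped, and the remaining sentences are separated by a single space (which may need to vary by language).

```lua
-- Hypothetical helper: join non-empty sentence realizations with one space.
local function joinSentences(sentences)
	local content = ''
	for _, result in ipairs(sentences) do
		if #result > 0 then
			if #content > 0 then
				content = content .. ' ' .. result
			else
				content = result
			end
		end
	end
	return content
end

-- The empty string simulates a constructor that could not be realized.
local joined = joinSentences({ "First sentence.", "", "Second sentence." })
print(joined)
```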

Direct Template Realization

Using the function render, one can realize a specific template, passed as an argument, together with any other given arguments.

  • The first argument is the template itself, using a subset of the template syntax described in the Template Language for Wikifunctions proposal (see limitations below).
  • The second argument is the language of rendering, given as a language code (e.g. en). If omitted, the content language of the project will be used.
  • Any following named arguments are arguments to the given template, which can be evaluated using interpolation. In general, the value of these arguments is passed on as plain text, but, when an interpolation is evaluated in the scope of a slot (and not a function argument), it is handled specially:
    • If the value is of the form of an L-id or a Q-id, the relevant function (Lexeme or Label/Person) will be invoked with this value.
    • If the value contains a slot syntax (i.e. anything surrounded by { }) the text will be evaluated as a subtemplate, which itself has access to all arguments of the template.
    • Otherwise, the text will be passed on to the TemplateText function.
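The three cases above can be pictured as a small dispatch on the shape of the argument value. The following is a minimal sketch under stated assumptions: `classifyArgument` is a hypothetical helper, and the patterns used here (an `L` or `Q` followed only by digits, and any `{ }` pair) are guesses at the accepted forms rather than the prototype's actual checks.

```lua
-- Hypothetical helper: decide how a slot-level interpolation value is handled.
-- Returns the name of the handling described in the list above.
local function classifyArgument(value)
	if value:match("^L%d+$") then
		return "Lexeme"          -- value looks like an L-id
	elseif value:match("^Q%d+$") then
		return "Label/Person"    -- value looks like a Q-id
	elseif value:match("{.-}") then
		return "subtemplate"     -- value contains slot syntax
	else
		return "TemplateText"    -- plain text
	end
end

print(classifyArgument("L7"))
print(classifyArgument("Q937"))
print(classifyArgument("a {root:Lexeme(noun)}"))
print(classifyArgument("hello"))
```

Note that this special handling only applies when the interpolation is evaluated in the scope of a slot; as a function argument, the value stays plain text.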

Examples

{{#invoke:Sandbox/AbstractWikipedia|render|{nummod:Cardinal(num)} {root:Lexeme(noun)}|en|num=5|noun=L7}}

will render 5 cats while

{{#invoke:Sandbox/AbstractWikipedia|render|{nummod:Cardinal(num)} {root:Lexeme(noun)}|en|num=1|noun=L1122}}

will render 1 dog.

See more examples in User:AGutman-WMF/Template Examples.

Notes

There are several differences between this prototype and the Template Language for Wikifunctions proposal, or elaborations of points which were not completely specified there.

  • The functions callable within the templates are limited to those defined in Module:Sandbox/AbstractWikipedia/Functions and its submodules. Similarly, all relation functions must be defined in Module:Sandbox/AbstractWikipedia/Relations and its submodules.
  • The implementation of the language-specific function dispatch is different from what is stated in the proposal: instead of adding language-code suffixes to function names, the prototype simply loads the relevant language-specific implementations from the appropriate submodule, e.g. Module:Sandbox/AbstractWikipedia/Functions/en for English. The language-agnostic functions are still available (if not overridden) thanks to Lua's metatable mechanism. One could use the same mechanism to define longer chains of language inheritance.
  • Subtemplates can be defined as functions using the evaluateTemplate call. See for instance the implementation of QuantifiedNoun in the /Functions module. These subtemplates have access both to their own arguments and the global arguments passed to the top-level template.
  • Alternatively, subtemplates can be used as expansion of interpolation arguments, as explained above. These subtemplates have only access to the global arguments.
  • L-ids and Q-ids have special semantics: when they appear within a slot (e.g. {L123}) they expand to the appropriate Lexeme or Label invocation (the latter calls the Person function if the Q-id refers to a human being). This also happens if they are passed as interpolation arguments.
  • Numbers given within a slot (e.g. {5}) are expanded to the Cardinal invocation. This, however, doesn't happen for numeric interpolation arguments.
  • The phonotactics and spacing/capitalization modules haven't been implemented yet.
  • Spans of spaces are conserved by the parser, and are considered as special elements of text (spacing elements).
  • Punctuation is marked specially as punctuation, but there is currently no special treatment of it. To treat punctuation as simple text, one can enclose it in a textual slot (e.g. {"."}). This doesn't work for the colon and the } symbol, due to limitations of the parser.
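
The metatable-based dispatch mentioned in the notes above can be illustrated with a few toy tables. This is only a sketch: the `Greeting` function, the `Cardinal` stub and the tables for the language codes `de` and `de-at` are hypothetical stand-ins and do not reflect the contents of the real /Functions submodules. What it shows is the `__index` fallback itself, including a longer inheritance chain.

```lua
-- Language-agnostic defaults (toy stand-ins, not the real implementations).
local default_functions = {
	Cardinal = function(n) return tostring(n) end,
	Greeting = function() return "hello" end,
}

-- A language-specific table overrides only what it needs.
local functions_de = {
	Greeting = function() return "hallo" end,
}
setmetatable(functions_de, { __index = default_functions })

-- A longer chain: a regional variant falls back to its parent language,
-- which in turn falls back to the defaults.
local functions_de_at = {
	Greeting = function() return "servus" end,
}
setmetatable(functions_de_at, { __index = functions_de })

print(functions_de.Greeting())       -- overridden in the language table
print(functions_de.Cardinal(5))      -- found through the __index fallback
print(functions_de_at.Greeting())    -- overridden in the variant
print(functions_de_at.Cardinal(5))   -- found two levels up the chain
```

A lookup that fails in the most specific table is retried in the table named by `__index`, so each level only needs to define its overrides.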


local p = {}

-- This is the main module for the template evaluation to be invoked from
-- content pages


local evaluator = require("Module:Sandbox/AbstractWikipedia/TemplateEvaluator")
local c = require("Module:Sandbox/AbstractWikipedia/Constructors")
local default_functions = require("Module:Sandbox/AbstractWikipedia/Functions")
local default_relations = require("Module:Sandbox/AbstractWikipedia/Relations")
local t = require("Module:Sandbox/AbstractWikipedia/TextAssembler")

-- global variables (populated below)
-- Note that in Wikifunctions the functions, relations and renderers should be
-- available globally as Wikifunctions functions. Thus, the only global variable
-- needed there is the realization language variable, and that is in fact just
-- for convenience.
functions = {}
relations = {}
renderers = {}
language = ''

applyPhonotactics = function (lexemes) end -- do nothing by default

-- Initializes the above global variables, given an optional language code
-- If no language code is given, defaults to the Wiki's content language
local function initialize ( lang )
	if lang then
		language = lang  -- global variable
	else  -- default to content language
		language = mw.getContentLanguage():getCode()
		mw.log("Using language "..language)
	end
	-- Initialize language-specific functions and relations
	local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Functions/"..language) 
	functions = status and module or {}
	setmetatable(functions, { __index = default_functions } )
	local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Relations/"..language) 
	relations = status and module or {}
	setmetatable(relations, { __index = default_relations } )
	local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Renderers/"..language) 
	renderers = status and module or {}
	-- There are currently no default renderers; to be added if appropriate
	local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Phonotactics/"..language) 
	if status then
		applyPhonotactics = module.applyPhonotactics
	else
		mw.log("No phonotactics module found for language "..language)
	end
end

-- This function flattens the lexeme tree structure to a flat list result
local function flatten ( lexemes, result )
	for _, lexeme_list in ipairs(lexemes) do
		if lexeme_list.root then
			flatten(lexeme_list, result)
		else -- It is a single lexeme
			table.insert(result, lexeme_list)
		end
	end
end

-- This function filters the forms of the lexemes to be consistent with their
-- features (morphosyntactic constraints)
local function applyConstraints ( lexemes )
	for _, lexeme in ipairs(lexemes) do
		lexeme.filterForms()
		lexeme.sortForms()  -- To ensure that canonical forms are preferred
	end
end

local function realizeTemplate(template, language_code, args)
	initialize(language_code)
	-- Keep the tree local so it does not leak into the global environment
	local lexeme_tree = evaluator.evaluateTemplate(template, args)
	local lexemes = {}
	flatten(lexeme_tree, lexemes)
	applyConstraints(lexemes)
	
	applyPhonotactics(lexemes)
	
	return t.constructText(lexemes)
end

-- API function to render a template
function p.render ( frame )
	--frame.args[1] is the template, frame.args[2] is the optional language code
	return realizeTemplate(frame.args[1], frame.args[2], frame.args) 
end

-- API function to write content about a q_id
function p.content ( frame )
	local q_id = frame.args[1] or error "First argument should be Q-id"
	local lang = frame.args[2] -- fallback to content language
	local outline = c.Constructors(q_id)
	local content = ''
	for _, constructor in ipairs(outline) do
		local args = { ["main"] = constructor }
		local result = realizeTemplate("{root:main}", lang, args)
		if #result > 0 then
			if #content > 0 then
				-- Add spacing between sentences: possibly language-dependent
				content = content .. ' ' .. result 
			else
				content = result
			end
		end
	end
	return content
end

return p