Module:Sandbox/AbstractWikipedia/TextAssembler

From Meta, a Wikimedia project coordination wiki
Module documentation

This module is part of the Abstract Wikipedia template-renderer prototype. It corresponds to the last block in the proposed NLG architecture.

It exposes the function constructText, which is responsible for assembling a string of text from a list of lexemes passed to it.

While assembling the text, it takes care of spacing, punctuation and capitalization, according to the information given in the trailing_punctuation and capitalization tables.
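The punctuation handling described above can be illustrated in isolation. The following is a minimal sketch, assuming a rank table shaped like the module's trailing_punctuation table; the helper name supersede is hypothetical and does not exist in the module.

```lua
-- Rank table mirroring the module's: a lower number means a higher rank.
local rank = { ['.'] = 1, [','] = 2 }

-- Hypothetical helper: returns the mark that survives when `incoming`
-- directly follows the still-pending mark `pending` ('' means none).
local function supersede(pending, incoming)
	if pending == '' or rank[pending] > rank[incoming] then
		return incoming
	end
	return pending
end

print(supersede(',', '.')) --> .
print(supersede('.', ',')) --> .
```

In both orders the full stop wins, so a sequence such as ",." or ".," collapses to a single ".".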

Note that previously this code was part of the main module, and as such it is not mentioned in the recorded demo of the prototype.



local p = {}

-- Trailing punctuation marks and their relative rank. A lower rank (i.e. a
-- higher number) means that the mark is superseded by an adjacent mark of
-- higher rank. Between marks of equal rank, the earlier one is kept.
local trailing_punctuation = { ['.'] = 1, [','] = 2 }
-- Punctuation marks that trigger capitalization of the following word:
local capitalization = { ['.'] = true }

-- This function constructs the final string from the lexemes.
-- It reduces spans of multiple spacings to a single one, handles punctuation
-- specially, and concatenates the rest of the text.
-- It also handles capitalization: the first word of the text and any word
-- following a capitalization-triggering punctuation mark are capitalized.
function p.constructText(lexemes)
	local result = ''
	local pending_space = ''
	local pending_punctuation = ''
	for index, lexeme in ipairs(lexemes) do
		local text = tostring(lexeme)
		if lexeme.pos == 'spacing' then
			pending_space = text
		elseif lexeme.pos == 'punctuation' and trailing_punctuation[text] then
			if #pending_punctuation == 0 or trailing_punctuation[pending_punctuation] > trailing_punctuation[text] then
				pending_punctuation = text
			end
			-- Trailing punctuation removes prior space
			pending_space = ''
		elseif text ~= "" then -- Empty text can be ignored
			if result == '' or capitalization[pending_punctuation] then
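				-- NOTE: `language` is not defined in this module; it is presumably
				-- a global language code set by the calling code.
				text = mw.getLanguage(language):ucfirst(text)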
			end
			result = result .. pending_punctuation .. pending_space .. text
			pending_punctuation = ''
			pending_space = ''
		end
	end
	result = result .. pending_punctuation
	return result
end

return p
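
To see the effect of the function end to end outside Scribunto, the following sketch may help. Everything in it is an assumption made for illustration: the mw table is a stand-in stub (ASCII-only ucfirst) that shadows the real library, language is hard-coded to 'en', the lexeme constructor is hypothetical (the module only requires a pos field and a working tostring), and constructText is a condensed copy of the function above rather than the module itself.

```lua
-- Stand-in for Scribunto's mw library; only getLanguage(...):ucfirst is stubbed.
local mw = {
	getLanguage = function(code)
		return {
			ucfirst = function(self, s)
				-- ASCII-only approximation of the real ucfirst
				return (s:gsub('^%l', string.upper))
			end,
		}
	end,
}
local language = 'en' -- assumed language code

-- Hypothetical lexeme: a table with a `pos` field and a __tostring metamethod.
local function lexeme(text, pos)
	return setmetatable({ pos = pos }, { __tostring = function() return text end })
end

local trailing_punctuation = { ['.'] = 1, [','] = 2 }
local capitalization = { ['.'] = true }

-- Condensed copy of constructText so that the sketch runs in plain Lua.
local function constructText(lexemes)
	local result, pending_space, pending_punctuation = '', '', ''
	for _, lex in ipairs(lexemes) do
		local text = tostring(lex)
		if lex.pos == 'spacing' then
			pending_space = text
		elseif lex.pos == 'punctuation' and trailing_punctuation[text] then
			if pending_punctuation == ''
				or trailing_punctuation[pending_punctuation] > trailing_punctuation[text] then
				pending_punctuation = text
			end
			pending_space = ''
		elseif text ~= '' then
			if result == '' or capitalization[pending_punctuation] then
				text = mw.getLanguage(language):ucfirst(text)
			end
			result = result .. pending_punctuation .. pending_space .. text
			pending_punctuation, pending_space = '', ''
		end
	end
	return result .. pending_punctuation
end

local sample = {
	lexeme('paris', 'noun'), lexeme(' ', 'spacing'),
	lexeme(',', 'punctuation'), lexeme('.', 'punctuation'),
	lexeme(' ', 'spacing'), lexeme(' ', 'spacing'),
	lexeme('it', 'pronoun'), lexeme(' ', 'spacing'),
	lexeme('rains', 'verb'), lexeme('.', 'punctuation'),
}
print(constructText(sample)) --> Paris. It rains.
```

The space before the comma is dropped, the comma is superseded by the full stop, the doubled space collapses to one, and "it" is capitalized because it follows a full stop.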