Wikimedia Fellowships/Project Ideas/Tools for text synthesis

From Meta, a Wikimedia project coordination wiki

This is a project to implement the tools needed to generate content for parts of articles, or complete stubs, from structured information stored in templates, tables, external wikis such as Wikipedia in other languages, or perhaps from the new Wikidata store. This is not about translating content from existing articles in other languages; it is about the tools needed to synthesize text from preexisting, well-defined data.

It is possible to automate parts of the stub creation process, especially where sufficient structured information already exists. Previously the structured information was curated data from external sources, which were then transformed into text by ad hoc scripts and uploaded by bots. As more and more new language projects emerge (especially in Wikipedia), it will be possible to build stub articles from information already curated in those existing projects. With the ideas for Wikidata, a new unified data store, this will become even more feasible.
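The ad hoc scripts mentioned above can be sketched as a small data-to-text transform. This is a hypothetical illustration, not any actual bot code; all field names and values are invented for the example.

```python
# Hypothetical sketch of an ad hoc stub-creation script: a curated,
# structured record is turned into stub wikitext by filling a fixed
# sentence pattern. All fields and values are invented examples.

def make_stub(record):
    """Render a minimal stub article from a structured record."""
    return (
        "'''{name}''' is a {category} in {country} "
        "with a population of {population}.".format(**record)
    )

record = {
    "name": "Lakselv",
    "category": "village",
    "country": "Norway",
    "population": 2255,
}

print(make_stub(record))
# Such a fixed pattern works for English, but breaks down for
# languages where "village" and "Norway" must be inflected to fit
# the sentence -- the problem the rest of this proposal addresses.
```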

There is, however, one serious problem that remains unsolved in Wikidata: how to handle text synthesis for languages where complex inflection, derivation, compounding and cliticization are required. It will be easy to include the structured data in infoboxes, but it will be as difficult as before to synthesize new running text.



The primary target for the project is to increase the likelihood that a small lexicon project at Wikipedia becomes successful. To achieve this we speed up the project's initial transition from an empty state to a usable number of articles. This coincides with the Wikimedia Foundation's goals to disseminate knowledge effectively and globally: disseminating knowledge effectively implies writing once and reading in many projects, and disseminating knowledge globally implies using the local languages.

A secondary target is to support the formation of feature-rich dictionary projects. By adding a small tagset they will be better able to describe usage and inflection, and thereby become a more important platform for language research and analysis.


The main outcome of this proposal is to automate stub creation and text synthesis to a degree sufficient for small language projects to be kick-started. To make this possible we believe it is sufficient to integrate a few existing tools, especially existing transducers for natural language processing, and to use them to refine content templates that are populated with structured data from other sources. The parser functions that integrate these tools will be the main deliverables of the project.
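How a parser function might hand template values to a transducer can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the transducer is faked with a lookup table (in practice it would be an external finite-state tool), and the `{{#inflect:...}}` syntax and all names are invented for the example.

```python
import re

# Fake "transducer": maps (lemma, morphological tag) to a surface
# form. A real deployment would call an external finite-state tool.
TRANSDUCER = {
    ("by", "Sg.Def"): "byen",    # Norwegian: "the town"
    ("elv", "Sg.Def"): "elva",   # "the river"
}

def inflect(lemma, tag):
    """Core of a hypothetical {{#inflect:lemma|tag}} parser function."""
    return TRANSDUCER.get((lemma, tag), lemma)  # fall back to the lemma

def expand(template, data):
    """Fill a content template, inflecting tagged fields on the way."""
    def repl(m):
        lemma, tag = data[m.group(1)], m.group(2)
        return inflect(lemma, tag)
    return re.sub(r"\{\{#inflect:(\w+)\|([\w.]+)\}\}", repl, template)

print(expand("{{#inflect:type|Sg.Def}} ligger i Norge.", {"type": "by"}))
# "byen ligger i Norge."
```

The point of the design is that the content template stays declarative; the language-specific knowledge lives entirely in the transducer.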

The outcome is not machine translation, nor building any language-specific transducers or tools for building such transducers. Nor is it creating content templates, except for a few to demonstrate and test the final system.


The project will have an impact on larger projects like the English Wikipedia, as it will make it possible to generate stub articles for subjects foreign to the project, but the largest impact will be on the smaller languages. Instead of starting out with an empty article base where every article must be written from scratch, they can initially define templates for synthesizing stub articles from data stored in Wikidata or similar sources. By defining a few templates for content synthesis, whole classes of articles can be created.

It is important to note that small languages are hindered by a lack of visibility due to the initially small size of the article base; this low visibility makes it difficult to attract enough interest in the project, and because too few people are interested the project does not grow. By transforming structured data into stub articles the project will be jump-started and become more visible, possibly giving it a larger audience and thereby faster growth. It will probably look more like a classical lexicon than the usual Wikipedia projects, but it will have a firm base for future extension.

The stub articles will usually not be created in the target wiki; they will exist as fallbacks for when there is no real article. It is like visiting an image page that does not exist on Wikipedia but still shows up because the image exists on Wikimedia Commons. Likewise, a stub article can show up because the data exists on Wikidata or some other source and a transform exists so the text can be synthesized.
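The fallback lookup described above can be sketched as a simple resolution chain. The stores and the `synthesize()` step here are stand-ins, not real MediaWiki APIs, and all titles and data are invented examples.

```python
# Invented example stores: a local wiki with one real article, and a
# structured data store in the role of Wikidata.
LOCAL_WIKI = {"Oslo": "Oslo er hovedstaden i Norge ..."}
DATA_STORE = {"Bergen": {"name": "Bergen", "category": "city"}}

def synthesize(item):
    # Placeholder for template-driven text synthesis.
    return "{name} is a {category}.".format(**item)

def resolve(title):
    """Return a real article if one exists, else a synthesized stub."""
    if title in LOCAL_WIKI:
        return LOCAL_WIKI[title]              # real, locally written article
    if title in DATA_STORE:
        return synthesize(DATA_STORE[title])  # stub rendered on the fly
    return None                               # nothing to show

print(resolve("Bergen"))  # synthesized: "Bergen is a city."
```

A locally written article always shadows the synthesized stub, so editors can replace a stub simply by writing the real article.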


The main difficulty in creating a sustainable community effort is to make the necessary tools simple enough that an editor with only limited knowledge of computational linguistics can use them to create templates for text synthesis. It seems possible to make sufficiently simple parser functions for some languages, that is, languages where most of the text synthesis can be done as inflection, but there might be others where this approach is too simplistic, that is, those where derivation and compounding are important.

Most of the drive for developing the external tools for this project exists outside the Wikimedia community; the community will probably just reuse existing tools. Some of the tools are, however, rather simple, and it could be possible to build the necessary datasets from inside the projects, but this is not necessary for the project to be successful.


Scalability within the movement, that is, across all the projects in every language, is limited by the availability of transducers for the given language. The MediaWiki-specific tools can be made available for various languages through localization, but localization covers only the parser functions themselves; for them to do anything useful there must be transducers available for the language. A lot of such transducers exist, and more become available each year, but they still cover only a few languages. Even if not all of them have the same feature set, and some might even be incomplete and unusable for machine translation, they might still be sufficient for text synthesis in our context. As new transducers are made available, the necessary parser functions can be enabled for the language, and missing classes of articles will become available as soon as the content templates for text synthesis are created.


The primary measure of success for the project is functional code. A secondary measure of success is whether content templates are created by the community on Wikipedia for a sufficient number of article classes. A third measure is whether this leads to a detectable increase in use of the wiki. A fourth measure is whether the community grows, which is extremely important for sustainability.

At Wiktionary the impact would be rather large for projects where freely licensed wordnet definitions and similar datasets from other projects exist. Data from those projects can then be reused both to initialize new dictionary entries and to extend existing ones. This should lead to measurable growth in the dictionaries.
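Seeding a dictionary entry from a wordnet-style dataset could look roughly like this. The synset structure and the entry layout are simplified inventions for illustration, not the format of any real WordNet export or Wiktionary template.

```python
# Invented, simplified synset record in the role of a freely licensed
# wordnet dataset entry.
synset = {
    "lemma": "river",
    "pos": "noun",
    "gloss": "a large natural stream of water",
    "synonyms": ["stream", "watercourse"],
}

def to_entry(s):
    """Render a minimal dictionary-style wikitext entry from a synset."""
    lines = [
        "== {} ==".format(s["lemma"]),
        "=== {} ===".format(s["pos"].capitalize()),
        "# {}".format(s["gloss"]),
        "==== Synonyms ====",
    ]
    lines += ["* [[{}]]".format(w) for w in s["synonyms"]]
    return "\n".join(lines)

print(to_entry(synset))
```

Unlike encyclopedia stubs, definitions need little or no inflection, so this reuse path is largely language-independent.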

Known problems

It is very difficult to build a transducer for named entities, as there will be a lot of exceptions. This could make it necessary to build some kind of specialized transducer, or an exception rule set for an existing transducer, into the MediaWiki-specific stub creation engine.
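One way such an exception rule set could sit in front of an existing transducer is sketched below. Both tables and the naive genitive rule are invented stand-ins for a real transducer and rule file.

```python
# Invented exception rules: named entities whose inflection the
# general transducer would get wrong, keyed by (name, tag).
EXCEPTIONS = {
    ("Los Angeles", "Gen"): "Los Angeles'",  # name left un-inflected
}

def generic_inflect(lemma, tag):
    # Stand-in for a general transducer: naive genitive "-s" rule.
    return lemma + "s" if tag == "Gen" else lemma

def inflect_entity(lemma, tag):
    """Check the exception rules before falling back to the transducer."""
    return EXCEPTIONS.get((lemma, tag)) or generic_inflect(lemma, tag)

print(inflect_entity("Oslo", "Gen"))         # regular rule applies
print(inflect_entity("Los Angeles", "Gen"))  # exception overrides it
```

The exception table is checked first, so editors could extend it on-wiki without touching the transducer itself.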

A more generic problem description is that it might be necessary to have a domain-specific fallback mechanism.

Submitted by

I am Jeblad and have been a contributor to several projects under the Wikimedia umbrella since the summer of 2005. Initiating growth in small wikis has been one of my pet projects for several years, starting with some ideas for the Northern Sami Wikipedia. I'm pretty sure that the most effective solution to the small-language Wikipedia problem is to make the few editors at those small wikis more effective. They should not have to translate articles from other wikis to build stub articles; they should focus on writing unique and important articles for the local community. The common bulk of stub articles should be created by (semi)automatic processes.


This section is for endorsements by Wikimedia community volunteers. Please note that this is not a debate, vote, or poll, but is rather a space for volunteers to describe in detail why they think a project idea is of value. If you have concerns or questions rather than an endorsement to make, please use the idea Talk page. Endorsements by volunteers willing to work in collaboration with a fellowship recipient on a project are highly encouraged.