A proposal towards a multilingual Wikipedia

Wikipedia provides knowledge in more than 200 languages. Whereas a small number of languages are fortunate enough to have a large Wikipedia, many of the language editions are far from providing a comprehensive encyclopedia by any measure. There are several approaches towards closing this gap, mostly focusing on increasing the number of contributors to the small language editions or on improving the provision of automatic or semi-automatic translations of articles. Both are viable. In the following we present a proposal for a different approach, which is based on the idea of a multilingual Wikipedia.

Imagine a small extension to the template system, where a template call like {{F12}} would not be expanded by a call to the template Template:F12, but rather by a call to Template:F12/en, i.e. the template name followed by the language code selected by the reader of the page. A template call such as {{F12:Q64|Q5119|Q183}} can be expanded by Template:F12/en into “Berlin is the capital of Germany.” and by Template:F12/de into “Berlin ist die Hauptstadt Deutschlands.” (in the example, the template parameters Q64, Q5119 and Q183 refer to the Wikidata items for Berlin, capital and Germany respectively, which the templates query for the label in the respective language). A simple article could thus be built up sentence by sentence.
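The expansion described above can be sketched in a few lines of Python. This is only an illustration, not a proposed implementation: the LEXICON table stands in for Wikidata labels and word forms, and the names form, f12_en, f12_de and expand are hypothetical. The German frame hard-codes the article “die” and asks for a genitive form, which hints at why access to word forms is needed.

```python
from typing import Callable, Dict, List, Tuple

# Toy data standing in for Wikidata labels and word forms (assumed shape).
LEXICON: Dict[str, Dict[str, Dict[str, str]]] = {
    "Q64": {"en": {"nominative": "Berlin"}, "de": {"nominative": "Berlin"}},
    "Q5119": {"en": {"nominative": "capital"}, "de": {"nominative": "Hauptstadt"}},
    "Q183": {
        "en": {"nominative": "Germany"},
        "de": {"nominative": "Deutschland", "genitive": "Deutschlands"},
    },
}


def form(item: str, lang: str, case: str = "nominative") -> str:
    """Look up the form of an item's label in a given language and case."""
    return LEXICON[item][lang][case]


def f12_en(subject: str, role: str, obj: str) -> str:
    """Hypothetical Template:F12/en."""
    return f"{form(subject, 'en')} is the {form(role, 'en')} of {form(obj, 'en')}."


def f12_de(subject: str, role: str, obj: str) -> str:
    """Hypothetical Template:F12/de; the article is hard-coded for this sketch."""
    return f"{form(subject, 'de')} ist die {form(role, 'de')} {form(obj, 'de', 'genitive')}."


# Frames are keyed by (frame name, language code), mirroring Template:F12/en.
FRAMES: Dict[Tuple[str, str], Callable[..., str]] = {
    ("F12", "en"): f12_en,
    ("F12", "de"): f12_de,
}


def expand(frame: str, args: List[str], lang: str) -> str:
    """Expand a template call using the reader's language."""
    return FRAMES[(frame, lang)](*args)


print(expand("F12", ["Q64", "Q5119", "Q183"], "en"))
print(expand("F12", ["Q64", "Q5119", "Q183"], "de"))
```

The same call and the same parameters yield one sentence per language; only the frame differs.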

Such a wiki would consist of content, i.e. the article pages, possibly just a simple series of template calls, and frames, i.e. the templates that lexicalize the parameters of a given template call into a sentence. (Note that “sentence” here should not be taken literally: it could be a table, an image, anything.) The frames can be implemented in normal wiki template syntax, in Lua, in a novel mechanism, or in a mix of these. This would be up to the communities creating them.

Technical implementation

A precondition for this proposal is the support of Wiktionary data in Wikidata, in a way that allows us to access arbitrary forms of words from any template. This is important because we need to be able to choose the right case for a word, or the right form of an adjective based, e.g., on the grammatical gender of a noun. While this precondition is not yet fulfilled, work on goal P1 can already proceed (it requires moderation so that the necessary decisions are reached in a timely manner).

Goals: Preparation

P1: Design restrictions for wikitext in the different namespaces. Designing appropriate restrictions for the different namespaces of the project is considered crucial for the sustainable growth of the project. This is discussed below.

Goals: Necessities

These goals are all required in order for the project to start.

N1: Implement the restrictions developed in P1. This also includes handling pagenames and their relation to Wikidata items.
N2: Language-selective templates aka frames. Based on the user's language setting, the corresponding language version of the template will be called. Note that it must be ensured that all language versions of a frame have the same signature. At the same time, it must be possible to change the signature of a frame.
N3: Design and provide a fallback mechanism. Specific language versions of a frame might be missing, and this situation has to be dealt with.
N4: Implement a language-sensitive search over the generated results.
N5: Caching. Since the content of a page will depend on the language settings of the user, an appropriate caching mechanism needs to be designed and deployed. One possibility would be to redirect to the appropriate subdomain, and thus use the standard caching mechanism.
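
Goals N2 and N3 can be illustrated with a minimal Python sketch. It assumes that frames are plain functions, that a frame's signature is its list of parameter names, and that each language has a fixed fallback chain; the FALLBACKS table and the FrameRegistry class are hypothetical and only show one way the requirements could fit together.

```python
import inspect
from typing import Callable, Dict, List, Optional, Tuple

# Assumed fallback chains; the real chains would be a community decision.
FALLBACKS: Dict[str, List[str]] = {"de-at": ["de", "en"], "de": ["en"], "hr": ["en"]}


class FrameRegistry:
    """Holds all language versions of each frame and checks their signatures."""

    def __init__(self) -> None:
        self._frames: Dict[str, Dict[str, Callable[..., str]]] = {}

    def register(self, name: str, lang: str, fn: Callable[..., str]) -> None:
        """Add a language version; reject it if its parameters differ (N2)."""
        versions = self._frames.setdefault(name, {})
        params = list(inspect.signature(fn).parameters)
        for other_lang, other in versions.items():
            if list(inspect.signature(other).parameters) != params:
                raise ValueError(
                    f"{name}/{lang} does not match the signature of {name}/{other_lang}"
                )
        versions[lang] = fn

    def resolve(self, name: str, lang: str) -> Optional[Tuple[str, Callable[..., str]]]:
        """Return (language used, frame), walking the fallback chain (N3)."""
        versions = self._frames.get(name, {})
        for candidate in [lang] + FALLBACKS.get(lang, []):
            if candidate in versions:
                return candidate, versions[candidate]
        return None  # no version available: the caller decides what to show


registry = FrameRegistry()
registry.register("F12", "en", lambda subject, role, obj: "English sentence")
print(registry.resolve("F12", "de-at")[0])  # falls back along the chain
```

Because resolve also reports which language was actually used, a cache keyed on page and language (N5) could record when a page was rendered through a fallback. Changing the signature of a frame would, in this sketch, require re-registering all of its language versions together.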

Goals: Improvements

These goals all considerably improve the project, but are – strictly speaking – optional and also independent of each other.

O1: Analyze the restrictions of N1. Consider relaxing or tightening them.
O2: Provide an internationalized form-based sentence editor. Basically a UI showing the text, where the user can select any sentence (or part of the text), and then, based on the template that generated it, sees a form for that template and can edit the parameters. The resulting text is displayed in real time. (The Visual Editor team is thankfully implementing most pieces of this part already.)
O3: Generator variables. In order to effectively use anaphora, pronouns and better style, it must be possible for the generators to set context variables that are in turn accessible by the subsequent generators.
O4: Allow textual additions and overrides. For each language, allow contributors to override a lexicalization or add a simple textual sentence. These can be used by more experienced contributors to adapt existing frames or create new ones.
O5: Optimize Wikidata data access in the Wikidata client. Based on how Wikidata is actually used, analyze where data access can be improved and implement these improvements.
O6: Provide integration patterns into Wikipedias. Provide a small set of patterns that each Wikipedia can choose to take in order to integrate content from the multilingual Wikipedia in their own content, if they so want.
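
Goal O3 can be made concrete with a small Python sketch. It assumes that each generator receives a shared, mutable context and that a single “topic” variable is enough to decide between a name and a pronoun; the names refer, capital_of, largest_city_of and render are hypothetical, and the pronoun is fixed to “It” for brevity.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, str]

# Toy English labels standing in for Wikidata label lookups.
LABELS: Dict[str, str] = {"Q64": "Berlin", "Q183": "Germany"}


def refer(item: str, ctx: Context) -> str:
    """Return a pronoun if the item is already the topic, else its label."""
    if ctx.get("topic") == item:
        return "It"
    ctx["topic"] = item  # context variable visible to subsequent generators
    return LABELS[item]


def capital_of(ctx: Context, city: str, country: str) -> str:
    return f"{refer(city, ctx)} is the capital of {LABELS[country]}."


def largest_city_of(ctx: Context, city: str, country: str) -> str:
    return f"{refer(city, ctx)} is the largest city of {LABELS[country]}."


def render(calls: List[Tuple[Callable[..., str], Tuple[str, ...]]]) -> str:
    """Run the generators in order, threading one context through all of them."""
    ctx: Context = {}
    return " ".join(frame(ctx, *args) for frame, args in calls)


article = render([
    (capital_of, ("Q64", "Q183")),
    (largest_city_of, ("Q64", "Q183")),
])
print(article)  # Berlin is the capital of Germany. It is the largest city of Germany.
```

The second generator does not know which sentence preceded it; it only reads the context variable set by the first one.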

A few possible questions

What do you mean by restricting the namespaces? I like my powerful wikitext!
In order to simplify the further development, it is very strongly suggested to heavily restrict the different namespaces when creating the multilingual Wikipedia. For example, the content pages could start by allowing only template calls and nothing else. The templates could start with restrictions as well, e.g. not to create fragments of wikitext or HTML, but necessarily to come to a well-defined output, etc. Note that we are not taking any existing possibilities away, as the project does not exist yet.
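As an illustration of how simple such a restriction could be to enforce, here is a minimal Python sketch of a check that a content page contains nothing but template calls. The call syntax follows the {{F12:Q64|Q5119|Q183}} example; the function names and the decision to reject nested calls are assumptions made only for this sketch.

```python
import re
from typing import List, Tuple

# One template call: double braces around a non-empty body without further braces.
CALL = re.compile(r"\{\{([^{}]+)\}\}")


def is_valid_content_page(text: str) -> bool:
    """True if the page consists only of template calls and whitespace."""
    remainder = CALL.sub("", text)
    return remainder.strip() == ""


def parse_calls(text: str) -> List[Tuple[str, List[str]]]:
    """Split every call into its frame name and its list of parameters."""
    calls = []
    for match in CALL.finditer(text):
        name, _, rest = match.group(1).partition(":")
        calls.append((name, rest.split("|") if rest else []))
    return calls


page = "{{F12:Q64|Q5119|Q183}}\n{{F12:Q64|Q5119|Q183}}"
print(is_valid_content_page(page))
print(parse_calls(page)[0])
print(is_valid_content_page("Some free text. {{F12:Q64|Q5119|Q183}}"))
```

A page mixing free text with calls is rejected, which is exactly the kind of restriction that could later be relaxed once the consequences are understood.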
Why take a language-independent internal representation? Why not just use English and translate to the reader’s language when necessary?
Several reasons: natural language processing technology, especially automatic translation, is simply not good enough to provide sufficient text quality in many of the languages we aim to support. We should also avoid a de facto primacy of a single language embedded in the technology.
Natural language is ambiguous, and this can make automatic, fine-grained translations impossible. “The Seine is a river.” cannot be translated into French without further knowledge, because in French one uses a different word for rivers that flow into the sea (“fleuve”) than for those that flow into other rivers (“rivière”). “I went to the garden.” cannot be translated into Croatian without knowing whether the speaker is female or male (“Išla sam u vrt.” / “Išao sam u vrt.”). “He succeeded his uncle.” cannot be translated into Uzbek without knowing whether the uncle was his mother’s or his father’s brother (“togʻa” / “amaki”). And this is not due to any insufficiencies in the English language: all languages have such ambiguities that make the translation of certain sentences into specific languages impossible without further information.
Editing the multilingual Wikipedia would be extremely hard!
Yes, this is an acknowledged and real drawback of the proposal. The form-based editing of sentences can potentially be experienced as extremely slow and unnerving, especially compared to normal text editing.
One advantage is that improvements to the editing side of the project can be explored in several independent and different paths, by researchers and volunteers alike. Hopefully one of these paths will lead to a solution that reduces the pain considerably. We expect that the visibility and possible impact of the project will incentivize the required research and experimentation.
But I thought our goal is to increase the number of contributors and simplify editing. This proposal has the opposite effect: it will likely make editing much harder and decrease the number of contributors and their diversity.
No. Since the current Wikipedias would be basically unaffected by this proposal, editing for current contributors does not change at all. Editing the multilingual Wikipedia would indeed be very different, but this is a completely new project.
Also, our primary mission is allowing everyone to share in the sum of all human knowledge. Currently we fail to provide knowledge in many languages. This proposal has the potential to significantly increase the amount of knowledge available to many readers. Making it easy to edit is a secondary goal, considered a necessity in order to achieve our mission, but not our primary goal.
I heard that natural language requires a real artificial intelligence (AI). Since this proposal is not building an AI, it must fail.
Understanding natural language is considered by many researchers to be solvable only by a real AI. Fortunately, we have a much simpler goal in mind, which is generating natural language out of a machine-readable representation.
Why do we still split knowledge into articles? Wouldn’t a question answering system with a single unified knowledge base be better? It could also accept input in form of natural language, and based on that, adapt its knowledge base.
As we do not aim at understanding the sentences but merely at representing them, we will not automatically have all the necessary means to provide a question answering system that creates answers fitted to a question. We also cannot simply integrate knowledge offered to the system. It is conceivable, though, that the proposed multilingual Wikipedia might be an important step towards such a system.
To create and maintain the frames is very hard, too hard for our contributors.
I have more confidence in our contributors. Also, with an increased growth of the content of the multilingual Wikipedia, the payoff for creating frames will increase. The growth of content and frames will give positive feedback to each other. Also, every single language creates such a positive feedback loop independently – which means that the project has a broad foundation to grow in a sustainable way.
In a multilingual project, edit wars and discussions will be unresolvable as editors will not be able to engage in discussions in a common language.
This will indeed be hard. Three comments: first, discussions could be held using the multilingual system itself. Second, even if those discussions are very visible, they concern only a small percentage of the content. A very useful resource can grow around mostly undisputed knowledge. Third, existing multilingual projects like Commons and Wikidata show us that the community can deal with these effects.
This is impossible! (Or: the underlying theory of language is too naïve, or: poetry can never be caught in a language-independent form and generated, etc.)
There is no need to be able to represent the whole breadth of human language. The multilingual Wikipedia will take time to grow, and will in the beginning often be ridiculed for glaring omissions or terrible text style. But the community can continuously and iteratively improve these parts. Existing stub creation bots demonstrate the effectiveness of text generation, and research and applications of frames and text generation are abundant.
Isn’t Wikidata doing all of this already? Why another new project?
Wikidata has well-defined goals, and it is starting to fulfill these goals. But Wikidata cannot be used to tell stories, offer context, and provide knowledge in the way prose does. Wikidata is a necessary precondition (in particular support for Wiktionary data, which is currently still in the planning stage). The multilingual Wikipedia goes well beyond the goals of Wikidata. Having said that, the multilingual Wikipedia can indeed be a part of the Wikidata wiki. This would allow for a less exposed project start and a simple deployment.
How will the current Wikipedias interact with the multilingual Wikipedia?
This will be decided by the current Wikipedias. One possibility would be to access the multilingual Wikipedia transparently in cases where there is no article covering the topic in the given Wikipedia. This way the community of a Wikipedia can decide to concentrate on a smaller set of topics that really interest them and choose the multilingual Wikipedia as a backup to provide wider coverage. No Wikipedia with only half a dozen active editors would feel obliged to create articles about all countries, elements, cities, monarchs, species – something that is increasingly often created by bots anyway, but in a static way and without a clear sustainability and update strategy.
This is replacing our beloved Wikipedias!
The number of major breakthroughs that would be required to achieve a full replacement of the large Wikipedias is breathtaking. Consider the side effects if we ever really reach that point: everyone in the world could communicate with everyone else, and instead of this future being mediated by the big tech companies for a handful of languages, a community of volunteers would have achieved this goal, for hundreds of languages, and with the technology and data available to everyone. We would have the depth and breadth not only of the English Wikipedia, but of all Wikipedias combined, available in every language.
So the question is not whether we risk what we have achieved so far, but whether we risk where we could be in two or three years.

Related and previous work

This is an incomplete list, but it can be used as an entry point to research in natural language generation and frames.

FrameNet by Charles Fillmore, http://en.wikipedia.org/wiki/FrameNet and http://framenet.icsi.berkeley.edu
KPML by John Bateman, http://www.fb10.uni-bremen.de/anglistik/langpro/kpml/README.html
Multilingual Document Authoring by XEROX, see e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.7353
AceWiki-GF: http://eswc-conferences.org/sites/default/files/papers2013/kaljurand.pdf