Scribe

Helping editors of under-resourced languages to create new high-quality Wikipedia articles

About Scribe:

Scribe is an editing tool for underserved language Wikipedias. It gives editors a starting point for new articles when translation is not possible, allowing them to choose subjects according to their community's interests and notability criteria, regardless of whether the topic already exists in other Wikipedias.

Overview of the enhanced editing experience

Scribe will be implemented as a gadget and will provide editors in underserved communities with: section planning, reference collection, and important key points for each section.

It supports editors by planning the layout of a typical Wikipedia article based on existing articles in their language. Further, users can discover and interact with references in their language, to encourage the writing of high-quality articles. These references, collected from online sources, are summarized into important key points for each section. We will rely on Wikidata both for planning the sections and for integrating references and facts from its existing information.

Our tool will use techniques from information retrieval and natural language processing research, such as reference collection, document planning, and extractive summarization.

Document Planning with Scribe
Content selection and reference suggestion with Scribe

Document planning

In the content translation tool, the structure of the source Wikipedia article is translated and suggested to the editor in the target Wikipedia. To achieve the same for topics without a source article, we will generate a suggested structure from similar Wikipedia articles in the underserved community's Wikipedia. The similarity score will be calculated from the overlap between the Wikidata relationships of the two entities. There is plenty of work in the academic literature on automatically suggesting structure for Wikipedia articles. We intend to follow a line of research similar to Sauper et al.[1] and Piccardi et al.[2], adapted to newly created articles: we fall back on Wikidata to calculate similarity between the newly created article and existing articles in the target language, and recommend structure from the most similar ones. Calculating similarity scores between knowledge base entities is a well-studied area of research[3][4][5], which we will exploit to rank the top section recommendations for new articles.
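
To illustrate the idea, the minimal Python sketch below compares a new topic with existing articles by the Jaccard overlap of their item-valued Wikidata statements, fetched through the public wbgetentities API. The Jaccard measure, the function names, and the candidate items are illustrative assumptions, not the final ranking method.

  # Sketch: rank existing articles in the target Wikipedia by how much their
  # Wikidata statements overlap with those of the new topic.
  import requests

  API = "https://www.wikidata.org/w/api.php"

  def statement_pairs(qid):
      """Return the set of (property, item-value) pairs for a Wikidata item."""
      data = requests.get(API, params={
          "action": "wbgetentities", "ids": qid,
          "props": "claims", "format": "json",
      }).json()
      pairs = set()
      for prop, claims in data["entities"][qid].get("claims", {}).items():
          for claim in claims:
              snak = claim["mainsnak"]
              if snak.get("snaktype") == "value" and snak.get("datatype") == "wikibase-item":
                  pairs.add((prop, snak["datavalue"]["value"]["id"]))
      return pairs

  def similarity(qid_a, qid_b):
      """Jaccard overlap of the two items' statements (illustrative measure)."""
      a, b = statement_pairs(qid_a), statement_pairs(qid_b)
      return len(a & b) / len(a | b) if (a or b) else 0.0

  new_topic = "Q42"                              # hypothetical new-article topic
  candidates = ["Q5879", "Q9068", "Q42511"]      # hypothetical existing articles
  ranked = sorted(candidates, key=lambda q: similarity(new_topic, q), reverse=True)

The sections of the highest-ranked existing articles would then be aggregated into the structure suggestion.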

Important key information with references in the target language

Wikidata is already used in many Wikipedias [1] (e.g. the Fromage infobox on French Wikipedia). While many small Wikipedias lack content, Wikidata contains information on over 50 million data items. Of those entities, over 30 million have at least 5 statements (excluding external identifiers). While this data is a good starting point, Wikidata only holds factual statements, not the extensive contextual information usually contained in a Wikipedia article, so on its own it covers only a small part of any high-quality article. On the other hand, the online sources that are typically used as references for articles contain more of this contextual information. These references are the backbone of any high-quality article on Wikipedia.
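
As a small illustration of what counts as usable factual material, the sketch below counts an item's Wikidata statements while skipping external identifiers, mirroring the "at least 5 statements" figure above; the helper name and the threshold check are assumptions made for illustration.

  # Sketch: count an item's Wikidata statements, skipping external identifiers,
  # to check whether it has enough factual material to seed an article.
  import requests

  API = "https://www.wikidata.org/w/api.php"

  def substantive_statements(qid):
      entity = requests.get(API, params={
          "action": "wbgetentities", "ids": qid,
          "props": "claims", "format": "json",
      }).json()["entities"][qid]
      return sum(
          1
          for claims in entity.get("claims", {}).values()
          for claim in claims
          if claim["mainsnak"].get("datatype") != "external-id"
      )

  print(substantive_statements("Q42") >= 5)   # illustrative threshold from the text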

Therefore, we propose to support editors in under-resourced Wikipedias by suggesting additional information from external sources, beyond the facts already in Wikidata. When editing, we display extracts from online resources as bullet points, together with a link to the respective resource. This has the advantage that all external information has to be validated by an editor before publication on Wikipedia.

We plan to follow techniques from the multi-document summarization literature[6], especially those that rely on a structure-aware approach to content selection[7]. To eliminate information redundancy, we will rely on techniques from sentence aggregation[8].
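
A minimal sketch of the redundancy-elimination step follows, assuming a simple greedy filter over TF-IDF cosine similarity (scikit-learn); the 0.8 threshold and the specific vectorizer are illustrative choices, not the aggregation method of the cited work.

  # Sketch: greedy redundancy filtering of candidate key points extracted
  # from several references. Threshold and vectorizer are illustrative.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  def deduplicate(sentences, threshold=0.8):
      """Keep a sentence only if it is not too similar to one already kept."""
      vectors = TfidfVectorizer().fit_transform(sentences)
      kept = []
      for i in range(len(sentences)):
          if all(cosine_similarity(vectors[i], vectors[j])[0, 0] < threshold for j in kept):
              kept.append(i)
      return [sentences[i] for i in kept]

  print(deduplicate([
      "Douglas Adams was born in Cambridge in 1952.",
      "Adams was born in Cambridge, England, in 1952.",
      "He wrote The Hitchhiker's Guide to the Galaxy.",
  ]))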

For sentence realisation we rely on an extractive summarization approach, following the line of work by Nallapati et al.[9] and Kedzie et al.[10]. Our choice of an extractive rather than an abstractive approach is due to the following reasons: abstractive approaches are more prone to generating semantically incorrect information (hallucinations[11]), so the quality of the summaries cannot be guaranteed, especially when the same techniques are applied to under-resourced languages with less training data. Additionally, an abstractive model has to compute a distribution over a large target vocabulary for each generated word, which is very slow in practice. Finally, and most importantly, we design our tool so that the generated summaries are not imported into the target article as they are; instead they aid editors by highlighting the key points in the suggested references, which editors can then reuse when creating their article and rephrase according to their community's editing standards.
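
For illustration only, the following toy extractive summarizer scores sentences by average word frequency and keeps the top-k as bullet-point candidates; it is a deliberately simple stand-in for the neural extractive models cited above, not our actual implementation.

  # Toy extractive summarizer: score each sentence by the average corpus
  # frequency of its words and keep the k best, in their original order.
  import re
  from collections import Counter

  def extract_key_points(text, k=3):
      sentences = re.split(r"(?<=[.!?])\s+", text.strip())
      freq = Counter(re.findall(r"\w+", text.lower()))
      def score(sentence):
          tokens = re.findall(r"\w+", sentence.lower())
          return sum(freq[t] for t in tokens) / (len(tokens) or 1)
      top = set(sorted(sentences, key=score, reverse=True)[:k])
      return [s for s in sentences if s in top]

  reference_text = (
      "Douglas Adams was an English author. He was born in Cambridge in 1952. "
      "He is best known for The Hitchhiker's Guide to the Galaxy. "
      "The series began as a BBC radio comedy in 1978."
  )
  for point in extract_key_points(reference_text, k=2):
      print("-", point)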

Technical Details

The tool will mainly make use of existing computational resources; no additional computational power will need to be added for this tool alone. The computation splits into two aspects: (1) client-side computation and (2) service-based computation. (1) The client side (i.e. the browser) will mainly handle lightweight frontend functionality, such as the generation of Wikitext and the drag and drop. These functionalities are similar to existing ones and can be processed by any computer. (2) The gadget will make API calls to web services hosted on an external server, a process similar to what existing gadgets such as ProveIt [2] already do. The server side will be responsible for tasks such as querying and filtering references, calculating similarity between Wikidata entities for section suggestion, and performing extractive summarization. To reduce the online computation load, we intend to cache results for a large number of potential topics in each target language (existing Wikidata IDs without articles) before the service goes public.
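
A minimal sketch of the service side follows, assuming a small Flask endpoint in front of a precomputed cache keyed by language and Wikidata ID; the route name, cache layout, and framework are assumptions made for illustration, not a fixed design.

  # Sketch of the server side: a small web service returning precomputed
  # section suggestions and references for a Wikidata ID.
  from flask import Flask, jsonify

  app = Flask(__name__)

  # In the real service an offline job would fill this cache for every
  # Wikidata ID without an article in the target language; toy entry here.
  CACHE = {
      ("ar", "Q42"): {
          "sections": ["Introduction", "Life", "Works"],
          "references": [{"url": "https://example.org/adams", "key_points": ["..."]}],
      }
  }

  @app.route("/suggest/<lang>/<qid>")
  def suggest(lang, qid):
      """Return the cached section plan and references for one new-article topic."""
      entry = CACHE.get((lang, qid))
      if entry is None:
          return jsonify({"error": "topic not precomputed"}), 404
      return jsonify({"lang": lang, "qid": qid, **entry})

  if __name__ == "__main__":
      app.run()

The gadget would then issue a request such as /suggest/ar/Q42 and render the returned sections and references in the editing interface.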

We will build up the server-side computational facilities incrementally. First, we will make maximum use of the resources already available to us, such as our personal machines and servers we have access to. If more computational power is needed, we will work on collaborations with people who already run tools with similar capabilities, many of whose services we can reuse in our project, e.g. StrepHit [3].

Scribe interface

This is a preliminary design of our gadget. Over the course of the project, we will go through several iterations of this design, testing it and gathering feedback from the editor community. Scribe will be designed to have the following key features:

  • Display of suggested section headers, important key points, references, and images
  • Hovering to expand a reference's content
  • Drag and drop of the suggested headers, references, and images into the editing field (in the Visual Editor as well as Wikitext)
  • Generation of appropriate Wikitext for the dragged and dropped content (sketched below)
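
As a sketch of the Wikitext generation mentioned in the last point, the snippet below turns a suggested section header, its key points, and dragged references into Wikitext; the cite web template and field names follow English Wikipedia conventions, and the exact output Scribe will generate is an assumption for illustration.

  # Sketch: turn a dragged section suggestion plus references into Wikitext.
  def reference_to_wikitext(ref):
      """ref: dict with 'title', 'url', 'date' collected by Scribe (hypothetical fields)."""
      return ("<ref>{{cite web |title=%s |url=%s |access-date=%s}}</ref>"
              % (ref["title"], ref["url"], ref["date"]))

  def section_to_wikitext(header, key_points, refs):
      lines = ["== %s ==" % header]
      for point, ref in zip(key_points, refs):
          lines.append("* %s%s" % (point, reference_to_wikitext(ref)))
      return "\n".join(lines)

  print(section_to_wikitext(
      "Early life",
      ["Born in Cambridge in 1952."],
      [{"title": "Example source", "url": "https://example.org", "date": "2019-03-01"}],
  ))
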
Preliminary Design of the Scribe interface, displaying the main content
Scribe's drag and drop functionality
English translation of the Scribe interface design

Motivation

Distribution of languages in Wikipedia (the same data as the table below, shown as a graph)
All Wikipedia language versions, in terms of articles (red) and active editors (blue). A long tail can be seen, with most languages having few articles and editors.

Bias of language support in Wikipedias

Wikipedia is a widely used resource to access information about a variety of topics. However, while a few language versions cover a variety of topics, other languages are barely supported. As a speaker of Arabic, one of the 5 most widely spoken languages in the world, you will find a drastic lack of information even on basic topics. Compared to English's almost 6 million articles, Arabic Wikipedia has just over 600,000 articles. This reproduces a general lack of information on the web: only 0.6% of online content is in Arabic [4], restricting access to knowledge for a huge part of the world and effectively excluding a large proportion of the non-English-speaking world.

This problem is not confined to Arabic. Hindi is spoken by 370 million people, making it the second most commonly spoken language in the world [5]. However, it is covered by only 120,000 Wikipedia articles - a significant gap in information that needs to be addressed urgently to reach a larger share of the world's population.

The vast majority of Wikipedias have fewer than 100,000 articles. For an encyclopedia containing general knowledge, that is a very small number of topics covered. These underserved Wikipedias need technological and community-based solutions to support access to information for every person in the world, regardless of their native language.

When looking at underserved language communities, we find a severe lack of information. This lack of content in the native language is a considerable barrier to participation online. In the case of Wikipedia, this creates a vicious cycle: few articles mean little attention for that language's Wikipedia; with few people browsing it, there is only a small pool of people from which editors could emerge, which results in few articles, which again leads to little attention, and so on.

Number of Articles | Number of Wikipedias
> 5M | 1
1M - 4M | 14
100K - 1M | 45
10K - 100K | 79
1K - 10K | 111
0 - 100 | 48

Bias in the number of editors for Wikipedias

Following the small number of articles, there are also vast differences in the number of editors maintaining Wikipedias. The large Wikipedias, such as English or German, have large, very active communities contributing to their content. But as visible in Figure 1, most Wikipedias have a small number of editors maintaining them, correlating with the low number of articles.

The largest number of Wikipedias are maintained by fewer than 50 editors (including 11 Wikipedias with one or no active editor). This places a very heavy load on those editors: they have to fill all the various roles of maintaining a Wikipedia and consistently ensuring its high quality, while dividing those time-consuming tasks among a very small number of people.

Having editors predominantly from a small set of language communities also introduces a certain level of topic bias. Topics important in parts of the world with underserved languages find their way into Wikipedia with considerably more difficulty.

Source: http://wikistats.wmflabs.org/display.php?t=wp
Number of Active Editors | Number of Wikipedias
> 100K | 1
10K - 100K | 5
1K - 10K | 21
500 - 1K | 11
100 - 500 | 40
50 - 100 | 30
0 - 50 | 195
Limitations of the content translation tool
Screenshot of the Content Translation Tool, English to Turkish translation

It can be challenging for the limited number of editors in the underserved Wikipedias to create and maintain a large set of high-quality articles. The community has built a set of tools to facilitate article creation, such as the Content Translation Tool. This tool enables editors to easily translate an existing article from one language to another.

However, the content translation tool has the following limitations:

  • The articles that can be translated are selected by their importance to the source Wikipedia community. Topics with significance to the target community do not necessarily have an equivalent in the source Wikipedia; in those cases, the content translation tool offers no starting point. It has been shown that the English Wikipedia is not a superset of all Wikipedias, and that the overlap of content between the different languages is relatively small, indicating cultural differences in the content. Editors should be encouraged to avoid a topical bias and to cover articles important to their communities.
  • From a series of interviews with editors, we found that editors tend to keep the references from the source language (or delete them) when creating articles with the content translation tool. In practice, searching for those references and assigning them to the equivalent article sections is a time-consuming task, yet references are considered the backbone of any high-quality article on Wikipedia.
  • Machine translation is limited, especially for underserved languages. There are few documents available online that are aligned with English, and even fewer aligned with other languages, on which the translation engine can be trained. This leads to frequently criticised translation quality.
  • Monolingual speakers are disadvantaged: they cannot verify the translation against the source language or the references used.

References

  1. Sauper, Christina; Barzilay, Regina (2009). "Automatically Generating Wikipedia Articles: A Structure-Aware Approach" (PDF). Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP: 208–216. 
  2. Piccardi, Tiziano; Catasta, Michele (2018). "Structuring Wikipedia Articles with Section Recommendations" (PDF). SIGIR '18 The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval: 665–674. doi:10.1145/3209978.3209984. Retrieved 2018-06-27. 
  3. Bordes, Antoine (2013). "Translating Embeddings for Modeling Multi-relational Data". Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. 
  4. Lin, Yankai (2015). "Learning Entity and Relation Embeddings for Knowledge Graph Completion". Proceedings of the Twenty-Ninth (AAAI) Conference on Artificial Intelligence. 
  5. Nickel, Maximilian (2016). "A Review of Relational Machine Learning for Knowledge Graphs". Proceedings of the IEEE. doi:10.1109/JPROC.2015.2483592. 
  6. Goldstein, Jade (2000). "Multi-document summarization by sentence extraction". NAACL-ANLP-AutoSum '00 Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. 
  7. Banerjee, Siddhartha (2016). "WikiWrite: Generating Wikipedia Articles Automatically" (PDF). Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16). 
  8. Dalianis, Hercules (1996). "Aggregation in natural language generation". Trends in Natural Language Generation: An Artificial Intelligence Perspective.
  9. Nallapati, Ramesh (2017). "SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents". Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA.
  10. Kedzie, Chris (2018). "Content Selection in Deep Learning Models of Summarization". Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. 
  11. Chisholm, Andrew (2017). "Learning to generate one-sentence biographies from Wikidata". Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers.