Research:Increasing article coverage
We are interested in identifying knowledge gaps in Wikipedia. The results of this research can help the communities and the Foundation focus their efforts on editor acquisition (through edit-a-thons, for example) and on the task recommendation systems currently under development.
The most common approach to identifying knowledge that is missing from Wikipedia but probably should be there is to use red link data together with the corresponding page views. The topical coverage of Wikipedia has been formally studied. For example, it has been shown that Wikipedia's coverage is driven by the interests of its editors and that its completeness therefore depends on the subject area of the articles.
We propose two ways of formalizing the goal of increasing Wikipedia article coverage:
To consider these problems, we envision the following route:
To achieve this, the following steps are currently envisioned:
- 1 Introduction
- 2 Methodology
- 3 Evaluation
- 4 Results
- 5 Presentations
- 6 References
The more than 80 language editions of Wikipedia together represent the largest encyclopedia in human history. However, the language editions vary dramatically in how comprehensive they are. This research aims to identify important content that is available in one language edition but missing from another, as well as the editors who would be interested in translating such articles into the destination language or creating them from scratch.
We divide the problem into four parts and address each separately.
Finding Missing Articles
We use two sources for identifying missing articles: Wikidata, which maps language-independent entities to Wikipedia articles in different languages, and Wikipedia's inter-language links (ILLs). We augment the Wikidata mapping with the ILLs by building a graph G in which the nodes are Wikidata items or articles and the edges are Wikidata links, ILLs, and MediaWiki redirects. We say that an article A is missing in a destination language D if and only if none of the Wikidata items in the same connected component of G as A map to an article in D. Including ILLs reduces the number of entities that are falsely declared to be missing.
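The connected-component test described above can be sketched with a small union-find structure. This is only an illustration: the node encoding (`("item", id)` and `("article", language, title)`), the edge kinds, and the function names are assumptions made for this sketch, not the project's actual data model or code.

```python
from collections import defaultdict

def find(parent, x):
    # Union-find lookup with path halving; unseen nodes start as their own root.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_a] = root_b

def missing_articles(edges, source, destination):
    """Return source-language articles that are missing in `destination`.

    `edges` holds (kind, node_a, node_b) triples, where kind is "wikidata",
    "ill" or "redirect". For "wikidata" edges node_a is the item and node_b
    the article (an assumed convention for this sketch).
    """
    parent = {}
    for _kind, a, b in edges:
        union(parent, a, b)

    # Languages that some Wikidata item in each component maps to.
    covered = defaultdict(set)
    for kind, item, article in edges:
        if kind == "wikidata":
            covered[find(parent, item)].add(article[1])

    return sorted(
        node for node in list(parent)
        if node[0] == "article" and node[1] == source
        and destination not in covered[find(parent, node)]
    )

# Hypothetical example: the ILL joins two components, so "Some Topic"
# is not reported as missing in French.
example_edges = [
    ("wikidata", ("item", "Q1"), ("article", "en", "Some Topic")),
    ("wikidata", ("item", "Q2"), ("article", "fr", "Un Sujet")),
    ("ill", ("article", "en", "Some Topic"), ("article", "fr", "Un Sujet")),
    ("wikidata", ("item", "Q3"), ("article", "en", "Only English")),
]
print(missing_articles(example_edges, "en", "fr"))
```

Dropping the `"ill"` edge from the example makes "Some Topic" appear in the result as well, which is the kind of false positive the ILLs are meant to remove.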
Ranking Missing Articles
For many languages, far more articles could be created than the current volunteer editors can contribute. Moreover, not all articles missing in a destination language are relevant or desired in that language edition. It is therefore important to rank the missing articles in the destination language so that editors' effort can be directed at the most crucial ones first. We currently consider two approaches.
Pageviews as a proxy for importance
We build a linear-regression model that estimates the number of pageviews for an article in the destination language based on features of the corresponding article in the source language. The model is trained on articles that exist in both the source and the destination language. We then run the model on the source articles that are missing in the destination to estimate the number of pageviews these as yet non-existent articles would receive in the destination language if they were to exist. As input variables, the model uses the number of pageviews of the missing article in the source language, its length, and the topics expressed in its source-language text (topics are computed via latent Dirichlet allocation [LDA]). Topical features matter because different languages put different levels of emphasis on different topics (e.g., articles on Italian singers are more relevant for the Italian than for the Chinese Wikipedia).
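A minimal sketch of such a regression, written in plain Python with ordinary least squares solved through the normal equations. Several details are assumptions made for illustration and are not stated above: the log scaling of pageviews and length, the omission of an intercept (LDA topic proportions sum to one, so they already span a constant term), and all function names.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def features(source_views, length, topics):
    # Assumed feature map: log-scaled source pageviews and article length,
    # followed by the LDA topic proportions of the source article.
    return [math.log1p(source_views), math.log1p(length)] + list(topics)

def fit(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, row):
    return sum(w * x for w, x in zip(weights, row))

# Made-up training rows for articles that exist in both languages:
# (source pageviews, length in bytes, topic proportions, destination pageviews).
training = [
    (1000, 5000, (0.9, 0.1), 300),
    (200, 12000, (0.2, 0.8), 40),
    (50000, 3000, (0.5, 0.5), 9000),
    (800, 800, (0.7, 0.3), 150),
    (15000, 20000, (0.1, 0.9), 1200),
]
rows = [features(v, n, t) for v, n, t, _ in training]
targets = [math.log1p(d) for *_, d in training]
weights = fit(rows, targets)

# Estimated (log-scaled) destination pageviews for a hypothetical missing article.
estimate = predict(weights, features(3000, 7000, (0.6, 0.4)))
```

Missing articles can then be ranked by `estimate`; because the logarithm is monotonic, ranking on the log scale gives the same order as ranking on raw predicted pageviews.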
Notability as a proxy for importance
Notability is one of the most important and most challenging measures of importance considered by editors. Given a measurable definition of notability, we can build prediction models that help assess whether a not-yet-existent article in the destination language would be considered notable by the editors of that language.
Please share your thoughts in the talk page about how we can define notability.
Computing Editor-Article Affinity
For a given editor E and missing article A, we are interested in estimating how close A is to E's topical interests (the affinity of editor E for article A). We first embed the documents of the source language S in a topic vector space using LDA. We then compute the affinity that editor E has for article A as a function of the topic vectors of the entities in E's edit history that exist in S and the topic vector of A. More specifically, an editor's interest vector is the normalized sum of the document vectors of the last 15 articles they have edited in the source language that have a corresponding article in the target language, weighted by the log of the number of bytes they have added to each article. The affinity that E has for A is the cosine similarity between E's interest vector and A's topic vector.
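The affinity computation can be sketched as follows. The text does not say how the sum is normalized or how very small byte counts are handled, so this sketch assumes L2 normalization and uses log(1 + bytes) as the weight; the function names are hypothetical.

```python
import math

def interest_vector(history, max_articles=15):
    """Build an editor's interest vector.

    `history` is a list of (topic_vector, bytes_added) pairs, most recent
    first, for edited source-language articles.
    """
    recent = history[:max_articles]
    total = [0.0] * len(recent[0][0])
    for topics, bytes_added in recent:
        weight = math.log1p(bytes_added)  # assumption: log(1 + bytes)
        for i, value in enumerate(topics):
            total[i] += weight * value
    norm = math.sqrt(sum(v * v for v in total))
    return [v / norm for v in total] if norm > 0 else total

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def affinity(history, article_topics):
    # Higher values mean the article is closer to the editor's interests.
    return cosine_similarity(interest_vector(history), article_topics)
```

With non-negative topic proportions the affinity lies between 0 and 1, which is convenient for the matching step because larger values always indicate a better fit.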
Matching Translators with Missing Articles
For each editor, we are interested in finding K important articles for which the editor has high affinity, while ensuring that every article is recommended to only one editor. As a preprocessing step, we remove disambiguation pages and very short articles from the set of missing articles. We then take the N missing articles with the highest estimated future number of pageviews in the destination language and distribute them among the editors in a way that maximizes the total estimated affinity that editors have for their recommendations. This problem can be formulated as an integer max-flow problem that can be solved using linear programming techniques. (One can show that the problem is a min-cost flow problem with integral demands, in which case the relaxed linear program is guaranteed to have an optimal integer solution.)
How well the recommendation algorithm described above works in practice can be assessed by testing it with editors, i.e., selecting one or more source and destination languages, identifying missing content in the destination languages, computing editors' affinity for each missing article in the destination language, and matching important missing articles with editors. We are currently planning to do this in the following sequence, with a subset of editors in each group: 1) internal Wikimedia Foundation trial, 2) French Wikipedia, 3) Spanish Wikipedia.
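To make the flow formulation concrete, here is a small sketch that solves the matching as a min-cost flow with successive shortest paths (Bellman-Ford) rather than with the linear programming relaxation mentioned above. The network layout (source to editors with capacity K, editors to articles with cost equal to negative affinity, articles to sink with capacity 1) and the function name are assumptions for illustration, and affinities are assumed to be non-negative.

```python
def match(affinity, k):
    """Assign up to k articles per editor, each article to at most one editor.

    affinity[e][a] is the affinity of editor e for article a.
    Returns {editor_index: [article_index, ...]} maximizing total affinity.
    """
    n_e, n_a = len(affinity), len(affinity[0])
    src, snk = n_e + n_a, n_e + n_a + 1
    graph = [[] for _ in range(n_e + n_a + 2)]  # adjacency lists of edge ids
    to, cap, cost = [], [], []

    def add_edge(u, v, capacity, weight):
        # Forward edge (even id) followed by its residual twin (id ^ 1).
        graph[u].append(len(to)); to.append(v); cap.append(capacity); cost.append(weight)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-weight)

    for e in range(n_e):
        add_edge(src, e, k, 0.0)
        for a in range(n_a):
            add_edge(e, n_e + a, 1, -affinity[e][a])
    for a in range(n_a):
        add_edge(n_e + a, snk, 1, 0.0)

    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [float("inf")] * len(graph)
        prev_edge = [-1] * len(graph)
        dist[src] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if dist[u] == float("inf"):
                    continue
                for eid in graph[u]:
                    v = to[eid]
                    if cap[eid] > 0 and dist[u] + cost[eid] < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost[eid]
                        prev_edge[v] = eid
                        changed = True
            if not changed:
                break
        if prev_edge[snk] == -1:
            break  # no augmenting path left
        v = snk
        while v != src:  # push one unit of flow along the path
            eid = prev_edge[v]
            cap[eid] -= 1
            cap[eid ^ 1] += 1
            v = to[eid ^ 1]

    result = {e: [] for e in range(n_e)}
    for e in range(n_e):
        for eid in graph[e]:
            if eid % 2 == 0 and cap[eid] == 0:  # saturated editor-to-article edge
                result[e].append(to[eid] - n_e)
    return result

# Hypothetical affinities for two editors and three articles.
scores = [[0.9, 0.8, 0.1],
          [0.85, 0.2, 0.3]]
print(match(scores, k=1))
```

In this example a greedy assignment would give editor 0 their top article (0.9) and leave editor 1 with 0.3, for a total of 1.2; the flow solution instead gives article 1 to editor 0 and article 0 to editor 1, for a total of 1.65.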
Identifying potential contributors
We determine which editors are suitable for receiving recommendations for translating from the source to the target language via two methods. The first is scraping the target users' User pages for a Babel template that indicates that they speak the source language. The second is selecting target users who have an account with the same username in the source language, have made at least one edit in both the source and target Wikipedias, have made at least one edit in either language within the last year and have matching email addresses for the two accounts.
Internal WMF test
The goal of this stage is to receive internal feedback on the recommendation algorithm and to identify bugs and must-have features before introducing the recommendations to other editors. Eleven staff members who speak French, Spanish, or both have volunteered for this stage.
French Wikipedia Test: Template of the recommendation email
Re: Help improve the completeness of French Wikipedia
The Research team at the Wikimedia Foundation (Wikimedia Research) is currently working on identifying popular and important articles in some language editions of Wikipedia that do not yet exist on French Wikipedia. The following five articles exist in the English edition of Wikipedia and are considered important for the other language editions. Based on your contribution history on Wikipedia, we think you are an excellent candidate to contribute to these articles. Starting one of these articles would be a considerable first step toward expanding the knowledge available in French.
(LIST OF 5 RECOMMENDATIONS)
Thank you in advance for your help.
Research Team, Wikimedia Foundation, 149 New Montgomery Street, 6th Floor, San Francisco, CA, 94105, 415.839.6885 (Office).
 We identify important and popular articles using an algorithm. This selection of articles may be either personalized or random. You can learn more about the personalization and the methods used to find important articles at this address.
 The links point to Wikipedia's translation tool (ContentTranslation Tool). This tool is being developed by the Foundation's Language Engineering team (currently in beta in some languages). Learn more: https://www.mediawiki.org/wiki/Content_translation.
 If you would like more information about this research project, you can read this page (in English) and discuss it with us on its talk page (preferably in English, although we will certainly find a translator if you write to us in French :).
 Your opinion is important to us. Share your impressions with us by email at email@example.com.
If you no longer wish to receive email from Wikimedia Research, please send an email with the subject "unsubscribe" to firstname.lastname@example.org.
French Wikipedia Test: Lessons Learned (Draft)
We will continue updating this section as more lessons become available; the items below are not final and will be revised over the next few days. Here is what we have learned so far.
Out of the 12 thousand users we contacted, approximately 0.25% replied saying that they do not have the requisite proficiency in either English or French to attempt a translation. To reduce this number, we have modified our selection criteria. The following is our current proposal; feel free to contribute suggestions.
- The editor must have made an edit of any size in either the source or the target language in the last 12 months. This requirement was included in the frwiki test as well.
- The editor must have made edits of at least 200 bytes in both the source and the target language. Previously, the condition was that the editor made an edit of any size in both the source and the target language and an edit of at least 100 bytes in either of the two. There are, however, editors who will make minor edits such as including image links in Wikipedias whose language they are not proficient in. These editors should be excluded.
- The editor has not indicated that their proficiency in either language is less than intermediate in a Babel template.
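The three criteria above can be read as a simple filter. The sketch below is one possible interpretation: it assumes that "edits of at least 200 bytes" refers to the largest single edit in each language, approximates 12 months as 365 days, and encodes Babel proficiency as a number where 2 stands for intermediate and `None` means no Babel template was found. The record type and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class EditorActivity:
    last_edit: date                      # most recent edit in either language
    max_edit_bytes_source: int           # largest single edit, source wiki
    max_edit_bytes_target: int           # largest single edit, target wiki
    babel_source: Optional[int] = None   # declared Babel level, None if absent
    babel_target: Optional[int] = None

def is_eligible(editor, today, min_bytes=200, min_babel=2):
    # 1. Active in either language within the last 12 months.
    active = today - editor.last_edit <= timedelta(days=365)
    # 2. A substantial edit in both the source and the target language.
    substantial = (editor.max_edit_bytes_source >= min_bytes
                   and editor.max_edit_bytes_target >= min_bytes)
    # 3. No declared proficiency below intermediate in either language.
    declared_low = any(level is not None and level < min_babel
                       for level in (editor.babel_source, editor.babel_target))
    return active and substantial and not declared_low

today = date(2015, 9, 1)
print(is_eligible(EditorActivity(date(2015, 6, 1), 450, 1200, babel_source=3), today))
```

Editors without any Babel template pass the third check, matching the wording "has not indicated"; only an explicit low-proficiency declaration excludes them.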
Finally, to reduce confusion for editors who satisfy the above conditions but are not able to read the target language, we will add a section to the start of the email in the source language, explaining the situation and apologizing for the mistake.
Personalizing the recommendations to an editor's interests requires a substantial amount of additional "algorithmic" work compared to simply recommending articles that are predicted to be widely read. To see whether this additional work is justified (for when we implement the model within the CX tool), some editors did not receive personalized recommendations. We quickly saw that personalized recommendations lead to significantly higher engagement. Nearly all feedback about poor recommendation quality came from editors who did not receive personalized recommendations. Going forward, we will personalize all messages.
- Our article selection methods are not yet advanced enough to exclude all articles that would not be of interest or encyclopedic value in the target language. This is a hard task, one that humans often disagree on. We will:
- Make this explicit in the body of the recommendation email and highlight that it is up to the editors to make the final call on whether the article should exist in the target language.
- Furthermore, the importance threshold for articles to be included in the recommendations was probably too low for the frwiki test and will be raised in the future.
- We are also investigating new ways for improving the algorithm's assessment of article importance.
- The algorithm did not have a condition to filter out articles that are of low quality in the source language. If the recommendations were meant to encourage editors to create articles from scratch, this would matter less. For translation recommendations, however, we should filter low-quality pages out of the article set.
Disambiguation pages were excluded from the list of articles based on the "disambig" template. We are now also using all the disambiguation template variants to filter out disambiguation pages.
The results are shared in detail in a paper that can be accessed at Growing Wikipedia Across Languages via Recommendation.
This formal research collaboration is based on a mutual agreement between the collaborators to respect Wikimedia user privacy and to focus on research that can benefit the community of Wikimedia researchers, volunteers, and the WMF. To this end, the researchers who work with the private data have entered into a non-disclosure agreement as well as a memorandum of understanding.
Below you can find the links to previous presentations about this research:
- [2017-03-15] Presentation as part of CITRIS Research Exchange seminar series. (Slides, Video)
- [2016-04] at WWW2016 Conference
- [2015-08] The first results from this research were presented at Wikimania 2015. You can read the abstract of our presentation here.
- Halavais, Alexander, and Derek Lackaff. "An analysis of topical coverage of Wikipedia." Journal of Computer‐Mediated Communication 13.2 (2008): 429-440.
- Robert West, Doina Precup, and Joelle Pineau: Automatically Suggesting Topics for Augmenting Text Documents. In Proc. 19th ACM Conference on Information and Knowledge Management (CIKM'10), 2010.
- Gerard de Melo, Gerhard Weikum: Untangling the Cross-Lingual Link Structure of Wikipedia. In Proc. 48th Annual Meeting of the Association for Computational Linguistics (ACL’10), 2010.
- B. Hecht and D. Gergle. The tower of Babel meets web 2.0: User-generated content and its applications in a multilingual context. In Proc. CHI, pages 291–300, 2010.
- P. Bao, B. Hecht, S. Carton, M. Quaderi, M. Horn, and D. Gergle. Omnipedia: Bridging the Wikipedia language gap. In Proc. CHI, 2012.