Research:Expanding Wikipedia articles across languages/Inter language approach


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Today Wikipedia contains more than 40 million articles, is actively edited in about 160 languages, and its article pages are viewed 6000 times per second. Wikipedia is the platform for both accessing and sharing encyclopedic knowledge. Despite its massive success, the encyclopedia is incomplete and its current coverage is skewed. To address these knowledge gaps and coverage imbalances (and perhaps for other reasons), Wikipedia needs not only to retain its current editors but also to help new editors join the project. Onboarding new editors, however, requires breaking down tasks for them (either by machines or by humans who help such editors learn to become prolific Wikipedia editors).

Until recently, the task of breaking down Wikipedia contributions and creating a template for contributions has been done mostly by hand. Some editathon organizers report that they extract templates to help new editors learn, for example, what the structure of a biography on Wikipedia looks like. Sometimes, extracting templates manually means going across languages, especially when the language in which editors are being onboarded is a small Wikipedia language (in terms of the number of articles already available in that language). Manual extraction of templates is time-consuming and tedious, and it is a process that can be heavily automated to reduce the workload for editathon organizers and help onboard more newcomers.

This research aims at designing systems that can help Wikipedia newcomers identify missing content from already existing Wikipedia articles and gain insights on basic facts or statistics about such missing components, using both the inter-language and intra-language information already available in Wikipedia about the structure of the articles.

A complementary approach for recommending sections is to use information from other languages. Ideally, when an article A is being created in language X, we would check whether that article already exists in other languages, translate the section names into language X, and recommend them to the editor. The more languages cover article A, the more likely this strategy is to generate good recommendations. There are two main challenges in applying this idea: i) recognizing that an article A is the same across different languages (article alignment), and ii) translating section names into the target language (section alignment).

Article Alignment

A good proxy for solving the article alignment problem is to use the Wikidata information embedded in all Wikipedia articles. Wikidata is a collaboratively edited knowledge base, where each item represents a topic in a language-agnostic way. For example, the topic “Recommender Systems” is represented by item Q554950, which is the same item for "Sistemas de Recomendación" (Spanish) and “Sistemi za preporuku” (Serbian). Every Wikipedia article, in any language, links to one Wikidata item.
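
As a minimal sketch of this idea, assuming we already have hypothetical `(language, title, Wikidata ID)` records extracted from each wiki, articles can be aligned by grouping them on their Wikidata item. The function name and record layout are illustrative assumptions, not part of the project's code.

```python
from collections import defaultdict


def align_articles(records):
    """Group article titles by Wikidata item.

    records: iterable of (lang, title, wikidata_id) tuples (assumed layout).
    Returns {wikidata_id: {lang: title}}.
    """
    aligned = defaultdict(dict)
    for lang, title, wikidata_id in records:
        aligned[wikidata_id][lang] = title
    return dict(aligned)


# Titles and the item ID are the ones quoted in the text above.
records = [
    ("en", "Recommender Systems", "Q554950"),
    ("es", "Sistemas de Recomendación", "Q554950"),
    ("sr", "Sistemi za preporuku", "Q554950"),
]
aligned = align_articles(records)
print(sorted(aligned["Q554950"]))  # languages that cover this item
```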

Coverage and uniqueness

In order to understand the potential of cross-lingual recommendations, we need to measure how much content overlaps across languages. To this end, we define two metrics:

  • Coverage: The percentage of the total Wikidata items covered (with an article) in a given language.
  • Uniqueness: The percentage of Wikidata items that is covered just by a given language.
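
The two metrics can be sketched as follows. This is an illustration under stated assumptions: the input is a hypothetical mapping from language code to the set of Wikidata item IDs that have an article in that language, both percentages use the union of those sets as the denominator, and the toy IDs are not real measurements.

```python
from collections import Counter


def coverage_and_uniqueness(items_by_lang):
    """Return {lang: (coverage_pct, uniqueness_pct)}.

    Assumption: the "total Wikidata items" are the union of all items
    that appear in at least one of the given languages.
    """
    all_items = set().union(*items_by_lang.values())
    if not all_items:
        return {lang: (0.0, 0.0) for lang in items_by_lang}
    # Number of languages that cover each item.
    lang_count = Counter(
        item for items in items_by_lang.values() for item in items
    )
    result = {}
    for lang, items in items_by_lang.items():
        unique = sum(1 for item in items if lang_count[item] == 1)
        coverage = 100.0 * len(items) / len(all_items)
        uniqueness = 100.0 * unique / len(all_items)
        result[lang] = (coverage, uniqueness)
    return result


# Toy data: five items overall, made up for illustration.
toy = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "es": {"Q1", "Q2"},
    "sr": {"Q1", "Q5"},
}
metrics = coverage_and_uniqueness(toy)
print(metrics["en"])  # (80.0, 40.0)
```

If uniqueness is instead meant relative to each language's own item count, only the denominator in the `uniqueness` line changes.
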
Coverage of Wikidata Items per Language

The figure above shows the languages with the highest coverage. Not surprisingly, English Wikipedia has the highest coverage, being substantially bigger than the others. Interestingly, English Wikipedia does not have the highest (relative) share of unique content. On the other hand, languages with good coverage, like French or Spanish, have a low amount of unique content. The figure below shows that there is no strong correlation between coverage and uniqueness.

Uniqueness vs Coverage

Checking specific language pairs, we can see that no language covers more than 20% of the English content, while English itself covers more than 40% of the content of more than 15 languages. These results suggest that English is a good source, but a bad target, for cross-lingual recommendations. We evaluate the impact of coverage and uniqueness on section recommendations later in this article.

% English items covered by other Wikipedias

Check the code and details: Coverage Notebook

Section Alignment

Our goal is to map a section name in the source language(s) to the target language. As we show below, the performance of commercial machine translation services such as Google Translate varies a lot depending on the language pair, without good results in most cases. Open source alternatives such as Apertium do not have good enough language coverage.

Here we describe a methodology to create section alignments, based solely on open source software, that provides translations of encyclopedic quality.

You can test the current status of this work on this API:

https://secrec.wmflabs.org/API/alignment/sourceLang/TargetLang/SectionTitle

Example: https://secrec.wmflabs.org/API/alignment/es/en/Historia
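
A small sketch of building such a request URL from the pattern above. The helper name is hypothetical, and because the response format is not described here, the sketch stops at constructing the URL rather than parsing a reply.

```python
from urllib.parse import quote

# Base path copied from the URL pattern shown above.
ALIGNMENT_BASE = "https://secrec.wmflabs.org/API/alignment"


def alignment_url(source_lang, target_lang, section_title):
    """Build an alignment URL: <base>/<sourceLang>/<targetLang>/<title>.

    Each path segment is percent-encoded so that titles with spaces or
    non-ASCII characters stay valid.
    """
    segments = [source_lang, target_lang, section_title]
    return "/".join([ALIGNMENT_BASE] + [quote(s, safe="") for s in segments])


print(alignment_url("es", "en", "Historia"))
# https://secrec.wmflabs.org/API/alignment/es/en/Historia
```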

Translation without parallel data

Machine translation algorithms rely on parallel data. Both traditional statistical machine translation and the more recent neural machine translation rely on parallel corpora to train their models; that is, they require two corpora that are exact translations of each other. These corpora need to be large enough to learn from. This is not the case for Wikipedia, where articles in different languages are not necessarily translations of each other. We refer to this kind of dataset (articles about the same topic that are not translations of each other) as comparable data.

In recent years, new approaches using word embedding alignments have been proposed as a solution for translating words and short sentences without requiring parallel data [1] [2]. In this work, we improve those vector alignments using Wikidata and also design a set of features based on the comparable datasets.

Machine Learning Task

Our task is, given the top N most popular sections in a pair of languages x and y, to compute the probability that sections sx and sy are the same section, i.e., the probability that sx is the translation of sy.

Features

We consider three sets of features:

Article to Article Features

Given articles in two languages x and y, where xi and yi represent the same Wikidata item, we compute the following features:

  • Co-occurrence count: We build a weighted bipartite graph where each node is a section name in one language, and each edge is a co-occurrence with a section name in the other language. The weight of an edge is increased each time that pair of sections co-occurs.
  • Co-occurrence tf–idf: Taking the feature above, we represent each section as a bag-of-words, where the term frequency (TF) is the weight of the in-edges.
  • Content embeddings distance: We create an aligned vector for each section (based on [3]) and compute the cosine similarity between all possible pairs in the bipartite graph. Then, for each section pair, we aggregate these results by mean and median.
  • Links similarity: Similarly to the previous feature, for each pair of articles xi and yi, we represent each section as a vector of links (represented as Wikidata items), and compute the Jaccard similarity between all pairs.
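
Two of these features can be sketched in a few lines. This is an illustration rather than the project's pipeline: it assumes each aligned article pair is given as two lists of section names, reads "co-occurrence" as both sections appearing in the two language versions of the same Wikidata item, and uses made-up section and link data.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(aligned_articles):
    """Edge weights of the bipartite section graph.

    aligned_articles: iterable of (sections_x, sections_y), one pair per
    Wikidata item (assumed input layout).
    Returns a Counter keyed by (section_x, section_y).
    """
    weights = Counter()
    for sections_x, sections_y in aligned_articles:
        # Every section in x co-occurs with every section in y.
        for sx, sy in product(set(sections_x), set(sections_y)):
            weights[(sx, sy)] += 1
    return weights


def jaccard(links_a, links_b):
    """Jaccard similarity between two collections of linked Wikidata items."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy aligned articles (Spanish vs. English section names).
pairs = [
    (["Historia", "Referencias"], ["History", "References"]),
    (["Historia", "Economía"], ["History", "Economy"]),
]
weights = cooccurrence_counts(pairs)
print(weights[("Historia", "History")])      # 2
print(weights[("Historia", "Economy")])      # 1
print(jaccard({"Q1", "Q2", "Q3"}, {"Q2", "Q3", "Q4"}))  # 0.5
```

Raw counts favor very frequent sections such as reference lists, which is one reason a tf–idf style reweighting is listed as a separate feature.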

Section heading text

Taking all possible pairs of sections in the two languages, we compute:

  • Aligned Embeddings Distance: We compute the cosine distance between the vector representation of each section title.
  • Edit Distance: We compute the Levenshtein distance between section titles.
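
Both heading-level features are standard measures. A self-contained sketch is shown below; the vectors are toy values, since the aligned embeddings themselves are not reproduced here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal toy vectors)
print(levenshtein("Historia", "Histoire"))      # 3
```

Edit distance is informative mainly for language pairs that share a script and cognates (such as Spanish and French); for pairs like Japanese and Arabic the embedding distance carries the signal.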

Aggregated Section Characterization

For each language, we compute a set of statistics for sections, and then compare them across languages:

  • Section Position
  • Number of links per section
  • Links density per section (number of links divided by section length)
  • Section Frequency (the section that appears most often in the full corpus is number 1)
  • Section relative length
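
Two of these statistics can be sketched as follows. The assumptions are that the corpus is a list of per-article section lists and that section length is measured in characters (the unit is not stated above); the helper names and numbers are hypothetical.

```python
from collections import Counter


def frequency_rank(articles_sections):
    """Rank section names by corpus frequency (most frequent = 1)."""
    counts = Counter(
        name for sections in articles_sections for name in sections
    )
    ranked = [name for name, _ in counts.most_common()]
    return {name: rank for rank, name in enumerate(ranked, start=1)}


def link_density(num_links, section_length):
    """Number of links divided by section length (0.0 for empty sections)."""
    return num_links / section_length if section_length else 0.0


corpus = [
    ["History", "References"],
    ["History", "Economy", "References"],
    ["History"],
]
ranks = frequency_rank(corpus)
print(ranks["History"], ranks["References"], ranks["Economy"])  # 1 2 3
print(link_density(5, 200))
```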

Dataset

We selected six languages: French, Japanese, English, Spanish, Arabic, and Russian, looking for diversity of language families and scripts, as well as good enough coverage of Wikipedia articles. We parsed all the section headings in those languages and ranked them by popularity; then, in a large community effort, we asked volunteers to translate those section titles into the other five languages (T195001).

Results

We perform as well as or better than Google Translate in 82.6% of the language pairs.

Wiki Section Alignment, improvement compared with Google Translate

Section Recommendation

Using the aligned sections across languages, we can provide section recommendations. Our approach works as follows: given an article in the target language T, we retrieve the list of existing sections in that article, as well as the lists of sections in all N source languages, S_n. We count the number of sections per language, and the language with the largest number of sections is used as the template. Next, each section in S_n is mapped to a list of sections in language T. Finally, the template is updated with the sections from all the other source languages.

For example, consider the page Quilombo in the English Wikipedia. According to Wikidata, that page corresponds to the following pages, with the following sections:

  • ru Киломбу: 'Примечания'
  • fr Quilombo (esclavage): 'Histoire', 'Étymologie', 'Organisation', 'Économie', 'Notes et références'
  • es Quilombo: 'Historia', 'Infraestructura', 'Organización', 'Economía', 'Véase también', 'Referencias', 'Bibliografía'

Given that Spanish ('es') is the language with the most sections (7), Spanish is used as the template. Next, each section in each language is mapped to the target language, in this case English. For example, in Russian:

  • Примечания:{'Footnotes': 0.98, 'References and notes': 0.95, 'Citations': 0.94, ...}

Now, each section in each source language corresponds to a list of sections in the target language. We call each of these lists a cluster. The challenge here is to recognize that the cluster of Histoire in French corresponds to the cluster of Historia in Spanish, information that we don't have a priori. To solve this, we compute the dot product between each pair of clusters in each pair of languages. A higher dot product implies a higher probability that the clusters are equivalent. For each section in the template language, we obtain such similarities. Finally, the template is updated using the most similar cluster in each of the remaining languages.
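
The template choice and the cluster matching can be sketched as below. The section lists are the Quilombo lists quoted above; the cluster probabilities for Spanish and French are invented for illustration, and treating each cluster as a sparse vector of target-section probabilities is an assumption about how the dot product is taken.

```python
def choose_template(sections_by_lang):
    """Return the source language with the most sections."""
    return max(sections_by_lang, key=lambda lang: len(sections_by_lang[lang]))


def dot(cluster_a, cluster_b):
    """Dot product of two sparse vectors {target_section: probability}."""
    return sum(p * cluster_b[name]
               for name, p in cluster_a.items() if name in cluster_b)


def best_match(template_cluster, other_clusters):
    """Return (source_section, score) for the most similar cluster."""
    scored = ((name, dot(template_cluster, cluster))
              for name, cluster in other_clusters.items())
    return max(scored, key=lambda pair: pair[1])


sections_by_lang = {
    "ru": ["Примечания"],
    "fr": ["Histoire", "Étymologie", "Organisation", "Économie",
           "Notes et références"],
    "es": ["Historia", "Infraestructura", "Organización", "Economía",
           "Véase también", "Referencias", "Bibliografía"],
}
template_lang = choose_template(sections_by_lang)
print(template_lang)  # es

# Hypothetical clusters: target-language (English) candidates with scores.
es_historia = {"History": 0.97, "Background": 0.60}
fr_clusters = {
    "Histoire": {"History": 0.96, "Background": 0.55},
    "Économie": {"Economy": 0.95, "History": 0.05},
}
match, score = best_match(es_historia, fr_clusters)
print(match)  # Histoire
```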

An online demo app can be found at: http://secrec.wmflabs.org

A demo API can be found here: http://secrec.wmflabs.org/API/recommendation/lang/title

For example: http://secrec.wmflabs.org/API/recommendation/en/Quilombo?verbose=False&blind=False

Where the parameters are:

  • lang: {str} One of the six supported languages [ar, en, es, fr, ja, ru]
  • title: {str} The title of an existing article in the target language.
  • verbose (optional): {Boolean} When True, provides contextual information about the recommendations.
  • blind (optional): {Boolean} When True, gives recommendations without considering the existing sections in the current article. When False, returns only potentially missing sections.
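
A sketch of assembling a recommendation request from these parameters. The helper name is hypothetical, and the language codes are assumed to be the usual two-letter codes for the six languages named in the Dataset section (Arabic, English, Spanish, French, Japanese, Russian). The response format is not documented here, so the sketch only builds the URL.

```python
from urllib.parse import quote, urlencode

# Base path copied from the demo API URL above.
RECOMMENDATION_BASE = "http://secrec.wmflabs.org/API/recommendation"
# Assumed codes for the six languages listed in the Dataset section.
SUPPORTED_LANGS = {"ar", "en", "es", "fr", "ja", "ru"}


def recommendation_url(lang, title, verbose=None, blind=None):
    """Build <base>/<lang>/<title>[?verbose=...&blind=...]."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")
    url = f"{RECOMMENDATION_BASE}/{lang}/{quote(title, safe='')}"
    params = {}
    # Optional flags are only sent when the caller sets them.
    if verbose is not None:
        params["verbose"] = str(bool(verbose))
    if blind is not None:
        params["blind"] = str(bool(blind))
    return f"{url}?{urlencode(params)}" if params else url


print(recommendation_url("en", "Quilombo", verbose=False, blind=False))
# http://secrec.wmflabs.org/API/recommendation/en/Quilombo?verbose=False&blind=False
```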

For more details, check the recommendations code here

As mentioned above you can also experiment with the demo alignment API here

Feedback from experienced editors

In mid-2019, we decided to ask for feedback from experienced editors on the quality of the section recommendations in six languages: English, French, Spanish, Arabic, Russian, and Japanese. We developed a simple web tool to help these editors explore and evaluate the recommendations, and suggest improvements.

We received dozens of ratings through the app, as well as valuable feedback on the feedback talkpage, and via email. This feedback is synthesized below.

Major feedback

  1. There are often obvious redundancies in recommended section titles. In some cases, a recommended section is very similar (contains many of the same words, and performs the same function) to a section that is already in the article. In other cases, multiple recommended sections are redundant with each other. See this feedback from Winged Blades of Godric for an example that includes both types of redundancy. This feedback from Smalyshev indicates that even capitalization differences can trip up the model. Users key in on these and find them distracting.
  2. Some recommended sections are deprecated or not allowed per local policy. Some general section headings like "Biography" and "Overview" are discouraged by a local wiki's manual of style, even if that section heading is common in articles on that topic in that wiki and/or in other wikis. Example from Vort.
  3. The model occasionally recommends sections that are obviously inappropriate for the article. See this feedback from Smalyshev.
  4. Section recommendations may work best on short stubs. See this example from Omotecho. It's good to hear this, because stub expansion is one of our key notional use-cases for section recommendation.
  5. The section recommendation model does not currently take hierarchy into account. See this post by DGG. Some sections generally only exist as sub-sections of other sections. The recommendation model doesn't currently consider section hierarchies, or other dependent relationships between related sections.

Design implications

  1. We should allow Section Recommendation API users to specify a blacklist of section titles or title keywords that they are not interested in. This function will allow tool/bot developers who use the model to screen out sections that they know are against policy on their wiki and/or that their users are not interested in.
  2. We should provide a mechanism in the API for accepting binary feedback on specific section-article pairs. This will allow tool developers to implement flags or relevance ratings into their tools, and have that user feedback captured in a way that facilitates tuning and re-training of the model.
  3. We should make the model better at detecting near-duplicates. This includes section headings that contain words or word stems that appear in other recommended sections or in sections already in the article, as well as headings that differ only in capitalization.
  4. We should allow API users to set a confidence threshold. Give API consumers the ability to control how many recommendations they want to receive, and/or how much they want to "risk" seeing sections that aren't relevant.
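
Until such options exist in the API, a tool developer could approximate three of them on the client side. The sketch below is purely illustrative: the function, its parameters, and the `(title, confidence)` input layout are assumptions, and the duplicate check only catches differences in capitalization and whitespace, not shared word stems.

```python
def filter_recommendations(recommendations, existing_sections,
                           blacklist=(), min_confidence=0.0):
    """Post-filter a list of (title, confidence) recommendations.

    Drops titles that are blacklisted, below the confidence threshold,
    or equal (ignoring case and extra whitespace) to an existing section
    or to a higher-scored recommendation.
    """
    def normalize(title):
        return " ".join(title.casefold().split())

    seen = {normalize(t) for t in existing_sections}
    blocked = {normalize(t) for t in blacklist}
    kept = []
    # Visit the highest-confidence recommendations first.
    for title, confidence in sorted(recommendations,
                                    key=lambda r: r[1], reverse=True):
        key = normalize(title)
        if confidence < min_confidence or key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append((title, confidence))
    return kept


recs = [("History", 0.9), ("history", 0.8), ("Biography", 0.85),
        ("Economy", 0.4), ("See also", 0.7), ("Culture", 0.6)]
filtered = filter_recommendations(recs, existing_sections=["See Also"],
                                  blacklist=["Biography"],
                                  min_confidence=0.5)
print(filtered)  # [('History', 0.9), ('Culture', 0.6)]
```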

Other notes

  1. No explicit feedback on cross-wiki section alignment. There was no feedback that explicitly signalled that the alignment of sections between wikis was off. Of course, our testing environment wasn't designed to evaluate alignment, but it seems to me that if section alignment had been bad enough to be really distracting and disruptive, we would have heard comments to that effect from these editors. So based on the data we have, I believe that section alignment is good enough that it doesn't damage the recommendations.
  2. Many experienced editors don't seem to be super enthusiastic about the tool, but they don't object to it. Some people said they "weren't impressed" or acknowledged that the recommendations were "mixed", but no one objected to the concept of section recommendations or told us they thought the tool was harmful. Some experienced editors acknowledged that they believed the model could be useful for new editors, even though it wasn't useful for them.

Potential research directions

  1. Audit the model for problematic biases (e.g. recommending different sections for male and female biographies)
  2. Test the recommendations with edit-a-thon organizers or WikiEd courses (to get the new editor perspective)
  3. Run a pilot study with new editors, using a MediaWiki gadget to surface recommendations in context.


Code

The full pipeline for section alignment can be found here

Recommendations code is here

Online app and API code is here

Get involved

You can also help us map sections across languages using our easy-to-use app.

References

  1. Conneau, Alexis, et al. "Word translation without parallel data." arXiv preprint arXiv:1710.04087 (2017)
  2. Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." arXiv preprint arXiv:1702.03859 (2017)
  3. Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." arXiv preprint arXiv:1702.03859 (2017)