Research:Expanding Wikipedia articles across languages/Inter language approach

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Today Wikipedia contains more than 40 million articles, is actively edited in about 160 languages, and its article pages are viewed 6000 times per second. Wikipedia is a platform for both accessing and sharing encyclopedic knowledge. Despite its massive success, the encyclopedia is incomplete and its current coverage is skewed. To address these knowledge gaps and coverage imbalances (and perhaps for other reasons), Wikipedia needs not only to retain its current editors but also to help new editors join the project. Onboarding new editors, however, requires breaking down tasks for them (either by machines or by humans who help such editors learn to become prolific Wikipedia editors).

Until recently, the task of breaking down Wikipedia contributions and creating a template for contributions has been done mostly manually. Some editathon organizers report that they extract templates to help new editors learn, for example, what the structure of a biography on Wikipedia looks like. Sometimes, extracting templates manually means working across languages, especially when the editors are being onboarded in a small Wikipedia language (in terms of the number of articles already available in that language). Manual extraction of templates is time-consuming and tedious, and it is a process that can be heavily automated to reduce the workload for editathon organizers and help more newcomers to be onboarded.

This research aims at designing systems that can help Wikipedia newcomers identify missing content from already existing Wikipedia articles and gain insights on basic facts or statistics about such missing components, using both the inter-language and intra-language information already available in Wikipedia about the structure of the articles.

A complementary approach to recommending sections is to use information from other languages. Ideally, when an article A is being created in language X, the system checks whether that article already exists in other languages, translates the section names into language X, and recommends them to the editor. The more languages that cover article A, the more likely this strategy is to generate good recommendations. There are two main challenges in applying this idea: i) recognizing that an article A is the same across different languages (article alignment), and ii) translating section names into the target language (section alignment).

Article Alignment

A good proxy for solving the article alignment problem is the Wikidata information embedded in Wikipedia articles. Wikidata is a collaboratively edited knowledge base, where each item represents a topic in a language-agnostic way. For example, the topic “Recommender Systems” is represented by item Q554950, which is the same item for “Sistemas de Recomendación” (Spanish) and “Sistemi za preporuku” (Serbian). All Wikipedia articles, in all languages, link to one Wikidata item.
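As a minimal sketch of this idea, articles can be grouped by their Wikidata item so that all language versions of a topic end up under one key. The input format (language, title, item ID triples) and the function names `align_articles` and `counterparts` are assumptions made for illustration; only the item Q554950 and its three titles come from the text above.

```python
from collections import defaultdict

# Toy input: (language code, article title, Wikidata item ID).
# Q554950 and its titles come from the text; the last row is a made-up placeholder.
ARTICLES = [
    ("en", "Recommender Systems", "Q554950"),
    ("es", "Sistemas de Recomendación", "Q554950"),
    ("sr", "Sistemi za preporuku", "Q554950"),
    ("en", "Placeholder topic", "Q-placeholder"),
]


def align_articles(articles):
    """Group article titles by Wikidata item, keyed by language."""
    aligned = defaultdict(dict)
    for lang, title, item_id in articles:
        aligned[item_id][lang] = title
    return dict(aligned)


def counterparts(aligned, item_id, target_lang):
    """Return the versions of an item in every language except the target."""
    versions = aligned.get(item_id, {})
    return {lang: title for lang, title in versions.items() if lang != target_lang}


aligned = align_articles(ARTICLES)
# Source articles that could inform a new Serbian article on the same topic.
sources_for_sr = counterparts(aligned, "Q554950", "sr")
```

In a real pipeline the triples would presumably come from Wikidata sitelinks rather than a hard-coded list.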

Coverage and uniqueness

To understand the potential of cross-lingual recommendations, we need to measure how much content overlaps across languages. To this end, we define two metrics:

  • Coverage: The percentage of the total Wikidata items covered (with an article) in a given language.
  • Uniqueness: The percentage of Wikidata items that is covered just by a given language.
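The two metrics can be sketched as follows, assuming each language is given as a set of Wikidata item IDs. Two choices here are assumptions rather than details stated on this page: the "total" is taken to be the union of items covered by at least one of the languages considered, and uniqueness is expressed relative to the language's own items. The function name `coverage_and_uniqueness` and the toy item IDs are hypothetical.

```python
from collections import Counter


def coverage_and_uniqueness(items_by_lang):
    """Return {lang: {"coverage": %, "uniqueness": %}} from sets of item IDs."""
    all_items = set().union(*items_by_lang.values())
    # Number of languages that cover each item.
    lang_count = Counter(i for items in items_by_lang.values() for i in items)
    stats = {}
    for lang, items in items_by_lang.items():
        only_here = {i for i in items if lang_count[i] == 1}
        stats[lang] = {
            # Assumption: denominator is the union of items over the given languages.
            "coverage": 100.0 * len(items) / len(all_items) if all_items else 0.0,
            # Assumption: uniqueness is relative to the language's own items.
            "uniqueness": 100.0 * len(only_here) / len(items) if items else 0.0,
        }
    return stats


# Toy data with made-up item IDs.
toy = {
    "en": {"Qa", "Qb", "Qc", "Qd"},
    "es": {"Qa", "Qb"},
    "sr": {"Qb", "Qe"},
}
stats = coverage_and_uniqueness(toy)
```

If uniqueness is instead meant relative to all items, only the denominator of the second ratio changes.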
Coverage of Wikidata Items per Language

The figure above shows the languages with the highest coverage. Not surprisingly, English Wikipedia has the highest coverage, substantially higher than the others. Interestingly, English Wikipedia does not have the highest (relative) share of unique content. On the other hand, languages with good coverage, like French or Spanish, have a low amount of unique content. The figure below shows that there is no strong correlation between coverage and uniqueness.

Uniqueness vs Coverage

Looking at specific language pairs, we can see that no language covers more than 20% of the English content, while English covers more than 40% of the content of more than 15 languages. These results suggest that English is a good source, but a bad target, for cross-lingual recommendations. We evaluate the impact of coverage and uniqueness on section recommendations later in this article.
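The pairwise comparison can be sketched in the same set-based way. Here "source covers target" is read as the share of the target language's items that also have an article in the source language; this reading, the function name `pairwise_coverage`, and the toy item IDs are assumptions for illustration.

```python
def pairwise_coverage(source_items, target_items):
    """Percentage of target-language items that also exist in the source language."""
    if not target_items:
        return 0.0
    return 100.0 * len(source_items & target_items) / len(target_items)


# Toy data with made-up item IDs: a large and a small language.
large = {"Qa", "Qb", "Qc", "Qd"}
small = {"Qa", "Qb"}

small_covers_large = pairwise_coverage(small, large)  # small language as source
large_covers_small = pairwise_coverage(large, small)  # large language as source
```

The asymmetry between the two directions is what makes a large language a good source and a poor target.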

% English items covered by other Wikipedias

Check the code and details: Coverage Notebook

Section Alignment

Our goal is to map a section name in the source language(s) to the target language. As we show below, the performance of commercial machine translation services such as Google Translate varies widely across language pairs, and results are poor in most cases. Open source alternatives such as Apertium do not have good enough language coverage.

Here we describe a methodology for creating section alignments, based only on open source software, that provides translations of encyclopedic-standard quality.

Translation without parallel data

Machine translation algorithms rely on parallel data. Both traditional statistical machine translation and the more recent neural machine translation rely on parallel corpora to train their models, meaning that they require two corpora that are exact translations of each other in order to learn from them. These corpora need to be large enough to learn from. This is not the case for Wikipedia, where articles in different languages are not necessarily translations of each other. We refer to this kind of dataset (articles about the same topic that are not translations of each other) as comparable data.

In recent years, new approaches using word embedding alignments have been proposed as a solution for translating words and short sentences without requiring parallel data [1] [2]. In this work, we improve those vector alignments using Wikidata and also design a set of features based on the comparable dataset.

Machine Learning Task

Our task is to take the top N most popular sections in a pair of languages x and y and compute the probability that sections sx and sy are the same section, i.e. the probability that sx is the translation of sy.
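The framing of the task can be sketched as follows: pick the top N headings in each language, form all candidate pairs, and score each pair with a probability. The scoring model itself is not specified here, so `probability` is a hypothetical callable supplied by the caller; the function names and toy headings are also assumptions for illustration.

```python
from collections import Counter
from itertools import product


def top_sections(articles, n):
    """Return the n most frequent section headings in a list of articles."""
    counts = Counter(heading for sections in articles for heading in sections)
    return [heading for heading, _ in counts.most_common(n)]


def candidate_pairs(articles_x, articles_y, n):
    """All (sx, sy) pairs between the top-n headings of the two languages."""
    return list(product(top_sections(articles_x, n), top_sections(articles_y, n)))


def best_translation(pairs, probability):
    """For each sx, keep the sy with the highest probability(sx, sy)."""
    best = {}
    for sx, sy in pairs:
        p = probability(sx, sy)
        if sx not in best or p > best[sx][1]:
            best[sx] = (sy, p)
    return best


# Toy corpora: each article is the ordered list of its section headings.
articles_en = [["History", "References"], ["History", "Geography", "References"], ["History"]]
articles_es = [["Historia", "Referencias"], ["Historia", "Geografía", "Referencias"], ["Historia"]]

# Stand-in for a trained model: a hand-written lookup, purely illustrative.
known = {("History", "Historia"), ("References", "Referencias")}
toy_probability = lambda sx, sy: 0.9 if (sx, sy) in known else 0.1

pairs = candidate_pairs(articles_en, articles_es, n=2)
matches = best_translation(pairs, toy_probability)
```

In practice `probability` would be replaced by a classifier trained on the features described in the following sections.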

Features

We consider three sets of features:

Article to Article Features

Given articles in two languages x and y, where xi and yi represent the same Wikidata item, we compute the following features:

  • Co-occurrences count: We build a weighted bipartite graph where each node is a section name in one language, and edges are co-occurrences with section names in the other language. The weight of an edge is increased each time that pair of sections co-occurs.
  • Co-occurrences Tf–idf: Taking the feature above, we represent each section as a bag-of-words, where the Term Frequency (TF) is the weight of the in-edges.
  • Content Embeddings Distance: We create an aligned vector for each section (based on [3]) and compute the cosine similarity between all possible pairs in the bipartite graph. Then, for each section pair, we aggregate these results by mean and median.
  • Links Similarity: Similarly to the previous feature, for each pair of articles xi and yi, we represent each section as a vector of links (represented as Wikidata items) and compute the Jaccard similarity between all pairs.
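Two of the features above, the co-occurrence count and the links similarity, can be sketched with the standard library. The sketch assumes each aligned article pair is given as two lists of section headings and that each section's links are available as a set of Wikidata item IDs; the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(aligned_pairs):
    """Edge weights of the bipartite graph: how often sx and sy appear in aligned articles."""
    weights = Counter()
    for sections_x, sections_y in aligned_pairs:
        # Each aligned article pair contributes at most 1 to any (sx, sy) edge.
        for sx, sy in product(set(sections_x), set(sections_y)):
            weights[(sx, sy)] += 1
    return weights


def jaccard(links_a, links_b):
    """Jaccard similarity between two collections of linked Wikidata items."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy aligned article pairs (sections of xi, sections of yi).
aligned_pairs = [
    (["History", "References"], ["Historia", "Referencias"]),
    (["History", "Geography"], ["Historia", "Geografía"]),
]
weights = cooccurrence_counts(aligned_pairs)

# Toy link sets for one section in each language (made-up item IDs).
similarity = jaccard({"Qa", "Qb"}, {"Qb", "Qc"})
```

True translations tend to accumulate weight across many article pairs, while accidental pairings such as ("History", "Referencias") stay low.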

Section heading text

Taking all the possible pairs of sections in the two languages we compute:

  • Aligned Embeddings Distance: We compute the cosine distance between the vector representations of the two section titles.
  • Edit Distance: We compute the Levenshtein distance between section titles.
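Both heading-level features are standard measures and can be sketched without external libraries. The vectors below are toy values; how title vectors are actually built from aligned embeddings is not shown here, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


distance = levenshtein("History", "Historia")  # 2: substitute y with i, insert a
```

Edit distance is mostly informative for language pairs that share a script and cognates; for pairs such as English and Japanese the embedding-based distance has to carry the signal.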

Aggregated Section Characterization

For each language we compute a set of statistics for sections, and then compare them across languages:

  • Section Position
  • Number of links per section
  • Links density per section (number of links divided by section length)
  • Section Frequency (the section that appears most often in the full corpus is number 1)
  • Section relative length
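Two of these statistics, frequency rank and position, can be sketched per language as below. Measuring position as a relative index between 0 (first section) and 1 (last section) is an assumption, as are the function name `section_stats` and the toy corpus; the link-based and length-based statistics would need article text and are left out.

```python
from collections import Counter, defaultdict


def section_stats(articles):
    """Per-heading frequency rank (1 = most frequent) and mean relative position."""
    freq = Counter()
    positions = defaultdict(list)
    for sections in articles:
        n = len(sections)
        for idx, name in enumerate(sections):
            freq[name] += 1
            # Assumption: position is normalised to [0, 1] within each article.
            positions[name].append(idx / (n - 1) if n > 1 else 0.0)
    # Ties in frequency are broken alphabetically so the ranking is deterministic.
    ranked = sorted(freq, key=lambda s: (-freq[s], s))
    return {
        name: {
            "frequency_rank": rank,
            "mean_position": sum(positions[name]) / len(positions[name]),
        }
        for rank, name in enumerate(ranked, 1)
    }


# Toy corpus: each article is the ordered list of its section headings.
corpus = [
    ["History", "Geography", "References"],
    ["History", "References"],
    ["Biography", "References"],
]
stats = section_stats(corpus)
```

Comparing such profiles across languages gives a signal that does not depend on the heading text: a heading that is very frequent and usually last in one language is a plausible match for a heading with the same profile in another.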

Dataset

We selected 6 languages: French, Japanese, English, Spanish, Arabic and Russian, looking for diversity of language families and scripts, as well as good enough coverage of Wikipedia articles. We parsed all the section headings in those languages and ranked them by popularity; then, in a large community effort, we asked volunteers to translate those section titles into the other 5 languages (T195001).

Results

We perform equal to or better than Google Translate in 82.6% of the language pairs.

Wiki Section Alignment, improvement compared with Google Translate

Code

TODO: Public code will be available soon.

Get involved

You can also help us map sections across languages using our easy-to-use app.


References

  1. Conneau, Alexis, et al. "Word translation without parallel data." arXiv preprint arXiv:1710.04087 (2017)
  2. Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." arXiv preprint arXiv:1702.03859 (2017)
  3. Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." arXiv preprint arXiv:1702.03859 (2017)