Research:Expanding Wikipedia articles across languages/Inter language approach

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Open source project, via GitHub

Today Wikipedia contains more than 40 million articles, is actively edited in about 160 languages, and its article pages are viewed 6000 times per second. Wikipedia is the platform for both accessing and sharing encyclopedic knowledge. Despite its massive success, the encyclopedia is incomplete and its current coverage is skewed. To address these knowledge gaps and coverage imbalances (and perhaps for other reasons), Wikipedia needs not only to retain its current editors but also to help new editors join the project. Onboarding new editors, however, requires breaking down tasks for them (either by machines or by humans who help such editors learn to become prolific Wikipedia editors).

Until recently, the task of breaking down Wikipedia contributions and creating a template for contributions has been done mostly manually. Some editathon organizers report that they extract templates to help new editors learn, for example, what the structure of a biography on Wikipedia looks like. Sometimes, extracting templates manually means going across languages, especially when the language the editors are being onboarded in has a small Wikipedia (in terms of the number of articles already available in that language). Manual extraction of templates is time-consuming and tedious, and the process can be heavily automated to reduce the workload for editathon organizers and help onboard more newcomers.

This research aims at designing systems that can help Wikipedia newcomers identify missing content from already existing Wikipedia articles and gain insights on basic facts or statistics about such missing components, using both the inter-language and intra-language information already available in Wikipedia about the structure of the articles.

A complementary approach for recommending sections is to use information from other languages. Ideally, when an article A is being created in language X, we check whether that article already exists in other languages, translate its section names into language X, and recommend them to the editor. The more languages cover article A, the more likely this strategy is to generate good recommendations. There are two main challenges in applying this idea: i) recognizing that an article A is the same across different languages (article alignment), and ii) translating section names into the target language (section alignment).

Article Alignment

A good proxy for solving the article alignment problem is the Wikidata information embedded in Wikipedia articles. Wikidata is a collaboratively edited knowledge base, where each item represents a topic in a language-agnostic way. For example, the topic “Recommender Systems” is represented by item Q554950, which is the same item for “Sistemas de Recomendación” (Spanish) and “Sistemi za preporuku” (Serbian). Every Wikipedia article, in every language, links to one Wikidata item.

Coverage and uniqueness

To understand the potential of cross-lingual recommendations, we need to measure how much content overlaps across languages. To this end, we define two metrics:

  • Coverage: The percentage of the total Wikidata items covered (with an article) in a given language.
  • Uniqueness: The percentage of Wikidata items that are covered only by a given language.

Coverage of Wikidata Items per Language

The figure above shows the languages with the highest coverage. Not surprisingly, English Wikipedia has the highest coverage, substantially higher than the others. Interestingly, the English Wikipedia does not have the highest (relative) share of unique content. On the other hand, languages with good coverage, like French or Spanish, have a low amount of unique content. The figure below shows that there is no strong correlation between coverage and uniqueness.

Uniqueness vs Coverage

Checking specific language pairs, we can see that no language covers more than 20% of the English content, while English itself covers more than 40% of the content of more than 15 languages. These results suggest that English is a good source, but a bad target, for cross-lingual recommendations. We evaluate the impact of coverage and uniqueness on section recommendations later in this article.

% English items covered by other Wikipedias

Check the code and details: Coverage Notebook

Section Alignment

Our goal is to map a section name in the source language(s) to the target language. As we show below, the performance of commercial machine translation services such as Google Translate varies widely across language pairs, and results are not good in most cases. Open source alternatives such as Apertium do not have good enough language coverage.

Here we describe a methodology for creating section alignments, based only on open source software, that provides translations of encyclopedic quality.

You can test the current status of this work on this API:

https://secrec.wmflabs.org/API/alignment/sourceLang/TargetLang/SectionTitle

Example: https://secrec.wmflabs.org/API/alignment/es/en/Historia
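
As a minimal sketch, the URL pattern above can be filled in programmatically. The helper name alignment_url is hypothetical, and percent-encoding the section title is an assumption about what the demo service accepts. The response format is not documented on this page, so the sketch only builds the URL and does not parse any result.

```python
from urllib.parse import quote

# Base path taken from the URL pattern above.
ALIGNMENT_BASE = "https://secrec.wmflabs.org/API/alignment"


def alignment_url(source_lang: str, target_lang: str, section_title: str) -> str:
    """Build the alignment URL for one section title (hypothetical helper)."""
    # Assumption: titles with spaces or non-ASCII characters are percent-encoded.
    encoded_title = quote(section_title, safe="")
    return f"{ALIGNMENT_BASE}/{source_lang}/{target_lang}/{encoded_title}"


print(alignment_url("es", "en", "Historia"))
```

The resulting string can then be opened in a browser or passed to any HTTP client.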

Translation without parallel data

Machine translation algorithms rely on parallel data. Both traditional statistical machine translation and the more recent neural machine translation rely on parallel corpora to train their models, meaning that they require two corpora that are exact translations of each other in order to learn from them. These corpora need to be large enough to learn from. This is not the case for Wikipedia, where articles in different languages are not necessarily translations of each other. We refer to this kind of dataset (articles about the same topic that are not translations of each other) as comparable data.

In recent years, new approaches using word embedding alignments have been proposed as a solution for translating words and short sentences without requiring parallel data [1] [2]. In this work, we improve those vector alignments using Wikidata and also design a set of features based on the comparable dataset.

Machine Learning Task

Our task is to take the top N most popular sections in a pair of languages x and y and compute, for each pair of sections sx and sy, the probability that they are the same section, i.e. the probability that sx is the translation of sy.

Features

We consider three sets of features:

Article to Article Features

Given articles in two languages x and y, where xi and yi represent the same Wikidata element, we compute the following features:

  • Co-occurrences count: We build a weighted bipartite graph where each node is a section name in one language, and edges are co-occurrences with section names in the other language. The weight of an edge is increased each time that pair of sections co-occurs.
  • Co-occurrences Tf–idf: Taking the feature above, we represent each section as a bag-of-words, where the Term Frequency (TF) is the weight of the in-edges.
  • Content Embeddings Distance: We create an aligned vector for each section (based on [3]) and compute the cosine similarity between all possible pairs in the bipartite graph. Then, for each section pair, we aggregate these results by mean and median.
  • Links Similarity: Similarly to the previous feature, for each pair of articles xi and yi, we represent each section as a vector of links (represented as Wikidata items), and compute the Jaccard similarity between all pairs.
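
The co-occurrence count and the links similarity described above could be sketched as follows. This is an illustration under assumptions, not the project's pipeline: the function names and the toy input are hypothetical, each article pair is assumed to be already matched through its Wikidata item, and returning 0.0 for two empty link sets is a choice made here.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(aligned_articles):
    """Edge weights of the bipartite graph between section names.

    aligned_articles: iterable of (sections_x, sections_y), one pair per
    Wikidata item, holding the section names of the article in each language.
    """
    weights = Counter()
    for sections_x, sections_y in aligned_articles:
        # Every cross-language pair of sections in the same article co-occurs once.
        for sx, sy in product(set(sections_x), set(sections_y)):
            weights[(sx, sy)] += 1
    return weights


def jaccard(links_a, links_b):
    """Jaccard similarity between two collections of Wikidata item ids."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # assumption: two link-free sections count as dissimilar
    return len(a & b) / len(a | b)


# Toy data (made up): two article pairs in Spanish and French.
pairs = [
    (["Historia", "Economía", "Referencias"], ["Histoire", "Économie"]),
    (["Historia", "Referencias"], ["Histoire", "Notes et références"]),
]
counts = cooccurrence_counts(pairs)
print(counts[("Historia", "Histoire")])
print(jaccard({"Q1", "Q2", "Q3"}, {"Q2", "Q3", "Q4"}))
```

Raw counts favor very frequent sections such as references, which is one motivation for the Tf–idf reweighting listed above.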

Section heading text

Taking all the possible pairs of sections in the two languages we compute:

  • Aligned Embeddings Distance: We compute the cosine distance between the vector representation of each section title.
  • Edit Distance: We compute the Levenshtein distance between section titles.
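
Both heading-level features are standard measures. A minimal, dependency-free sketch is given below. The function names are illustrative, and the vectors passed to cosine_distance are assumed to be already aligned embeddings of the section titles, which this sketch does not compute.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


print(levenshtein("Historia", "History"))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Edit distance is mostly informative for language pairs that share a script. For pairs such as Spanish and Russian, the embedding distance carries the signal.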

Aggregated Section Characterization

For each language we compute a set of statistics for sections, and then compare them across languages:

  • Section Position
  • Number of links per section
  • Links density per section (number of links divided by section length)
  • Section Frequency (the section that appears more in the full corpus is number 1)
  • Section relative length
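
Two of these statistics can be sketched as follows. The input layout is hypothetical (each article as a list of section headings), ties in frequency are broken alphabetically as an arbitrary choice, and section length is assumed to be measured in characters, since the unit is not stated here.

```python
from collections import Counter


def frequency_ranks(articles):
    """Rank headings by the number of articles containing them (1 = most frequent)."""
    counts = Counter(heading for sections in articles for heading in set(sections))
    # Assumption: ties are broken alphabetically to keep the ranking deterministic.
    ordered = sorted(counts, key=lambda heading: (-counts[heading], heading))
    return {heading: rank for rank, heading in enumerate(ordered, start=1)}


def link_density(num_links: int, section_length: int) -> float:
    """Number of links divided by section length (assumed to be in characters)."""
    return num_links / section_length if section_length else 0.0


# Toy corpus (made up): three articles and their section headings.
corpus = [
    ["History", "References"],
    ["History", "See also", "References"],
    ["History"],
]
print(frequency_ranks(corpus))
print(link_density(5, 200))
```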

Dataset

We selected 6 languages: French, Japanese, English, Spanish, Arabic and Russian, aiming for diversity of language families and scripts, as well as good enough coverage of Wikipedia articles. We parsed all the section headings in those languages and ranked them by popularity. Then, in a large community effort, we asked volunteers to translate those section titles into the other 5 languages (T195001).

Results

We perform as well as or better than Google Translate in 82.6% of the language pairs.

Wiki Section Alignment, improvement compared with Google Translate

Section Recommendation

Using the aligned sections across languages, we can provide section recommendations. Our approach works as follows: Given an article in the target language T, we retrieve the list of existing sections in that article, as well as the list of sections in each of the N source languages, S_n. We count the number of sections per language, and the language with the largest number of sections is used as the template. Next, each section in S_n is mapped to a list of sections in language T. Finally, the template is updated with all the other source languages.

For example, consider the page Quilombo in the English Wikipedia. According to Wikidata, that page corresponds to the following pages with the following sections:

  • ru Киломбу: 'Примечания'
  • fr Quilombo (esclavage): 'Histoire', 'Étymologie', 'Organisation', 'Économie', 'Notes et références'
  • es Quilombo: 'Historia', 'Infraestructura', 'Organización', 'Economía', 'Véase también', 'Referencias', 'Bibliografía'

Given that Spanish ('es') is the language with the most sections (7), Spanish is used as the template. Next, each section in each language is mapped to the target language, in this case English. For example, in Russian:

  • Примечания:{'Footnotes': 0.98, 'References and notes': 0.95, 'Citations': 0.94, ...}

Now, each section in each source language corresponds to a list of sections in the target language. We call each of these lists a cluster. The challenge here is to recognize that the cluster of Histoire in French corresponds to the cluster of Historia in Spanish, information that we don't have a priori. To solve this, we compute the dot product between each pair of clusters in each pair of languages. A higher dot product implies a higher probability of the clusters being equivalent. For each section in the template language, we obtain such similarities. Finally, the template is updated using the most similar cluster in each of the remaining languages.
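
The template selection and cluster matching steps can be illustrated with a short sketch. It assumes each cluster is a mapping from target-language section titles to scores, as in the Russian example above, and treats the dot product as the sum of score products over shared titles. The function names and the scores in the toy clusters are made up for illustration. Only the section lists come from the Quilombo example.

```python
def pick_template(sections_by_lang):
    """Return the language whose article has the most sections."""
    return max(sections_by_lang, key=lambda lang: len(sections_by_lang[lang]))


def cluster_dot(cluster_a, cluster_b):
    """Dot product of two clusters, viewed as sparse vectors over target titles."""
    return sum(score * cluster_b[title]
               for title, score in cluster_a.items() if title in cluster_b)


def best_match(template_cluster, candidate_clusters):
    """Name of the candidate cluster most similar to the template cluster."""
    return max(candidate_clusters,
               key=lambda name: cluster_dot(template_cluster, candidate_clusters[name]))


# Section lists from the Quilombo example above.
sections_by_lang = {
    "ru": ["Примечания"],
    "fr": ["Histoire", "Étymologie", "Organisation", "Économie", "Notes et références"],
    "es": ["Historia", "Infraestructura", "Organización", "Economía",
           "Véase también", "Referencias", "Bibliografía"],
}
print(pick_template(sections_by_lang))

# Hypothetical clusters; the scores are illustrative only.
historia_es = {"History": 0.97, "Background": 0.60}
clusters_fr = {
    "Histoire": {"History": 0.96, "Origins": 0.50},
    "Économie": {"Economy": 0.95, "Economics": 0.90},
}
print(best_match(historia_es, clusters_fr))
```

Clusters that share no target-language title get a dot product of zero, so they never displace a cluster with real overlap.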

An online demo app can be found at: http://secrec.wmflabs.org

A demo API can be found here: http://secrec.wmflabs.org/API/recommendation/lang/title

For example: http://secrec.wmflabs.org/API/recommendation/en/Quilombo?verbose=False&blind=False

Where the parameters are:

  • lang: {str} One of the six supported languages [ar,en,es,fr,ja,ru]
  • title: {str} An existing article in the target language.
  • verbose (optional): {Boolean} When True, provides contextual information about the recommendations.
  • blind (optional): {Boolean} When True, gives recommendations without considering the existing sections of the current article. When False, returns only potentially missing sections.
  • useMoreLike (optional): {Boolean} Enables a search over similar Wikidata items when there are no sections to recommend. For example, if it is not possible to find recommendations for the en:Pythonides (Q1760915) article, this feature searches for similar Wikidata items using the cirrus morelike API and uses them as seeds for providing recommendations. Note that seed articles can come from any of the supported languages. The seed language is defined by the auxLang parameter.
  • auxLang (optional): {str} Used with useMoreLike to provide the seed in a language other than the target language [ar,en,es,fr,ja,ru]
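
A request URL with these parameters could be assembled as sketched below. The helper name recommendation_url is hypothetical. Serializing Booleans as True/False follows the example URL above, and percent-encoding the title is an assumption. The response format is not documented on this page, so the sketch stops at building the URL.

```python
from urllib.parse import quote, urlencode

# Base path taken from the demo API URL above.
RECOMMENDATION_BASE = "http://secrec.wmflabs.org/API/recommendation"


def recommendation_url(lang: str, title: str, **options) -> str:
    """Build a recommendation URL (hypothetical helper).

    options may include verbose, blind, useMoreLike and auxLang.
    """
    path = f"{RECOMMENDATION_BASE}/{lang}/{quote(title, safe='')}"
    # Booleans become "True"/"False", as in the example URL on this page.
    query = urlencode({name: str(value) for name, value in options.items()})
    return f"{path}?{query}" if query else path


print(recommendation_url("en", "Quilombo", verbose=False, blind=False))
```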


For more details, check the recommendations code here

As mentioned above you can also experiment with the demo alignment API here

Code

The full pipeline for section alignment can be found here

Recommendations code is here

Online app and API code is here

Get involved

You can also help us map sections across languages using our easy-to-use app.

References

  1. Conneau, Alexis, et al. "Word translation without parallel data." arXiv preprint arXiv:1710.04087 (2017)
  2. Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." arXiv preprint arXiv:1702.03859 (2017)
  3. Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." arXiv preprint arXiv:1702.03859 (2017)