Research:Multilingual Readability Research

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T293028
Created
10:20, 9 September 2021 (UTC)
Collaborators
Indira Sen
Mykola Trokhymovych
Duration:  2021-August – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. The next step consists of identifying metrics to quantify the size of these gaps. For some of the gaps in the content dimension, readily available metrics exist (especially around representation gaps such as gender, geography, etc.). However, for other gaps we still lack metrics. This project aims to identify possible metrics for the readability gap in Wikimedia projects, with a focus on supporting multiple languages.

Roughly, readability aims to capture how hard it is for a reader to understand a written text. While there are off-the-shelf readability scores for English (and some other languages), it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
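
For illustration, a standard off-the-shelf score for English is the Flesch reading ease, which combines average sentence length and average syllables per word: score = 206.835 - 1.015 × (words/sentences) - 84.6 × (syllables/words), with higher values indicating easier text. A minimal sketch using the third-party textstat Python package (an illustration, not part of this project's codebase):

```python
# Off-the-shelf English readability score via the "textstat" package
# (pip install textstat). Higher Flesch reading ease = easier text.
import textstat

text = (
    "Wikipedia is a free online encyclopedia. "
    "Anyone can edit most of its articles."
)
print(textstat.flesch_reading_ease(text))  # a score roughly in the 60-90 range for simple prose
```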

The aim of this project is to assess whether it is feasible to automatically evaluate readability of Wikipedia articles across the many languages covered in Wikimedia projects.

Methods[edit]

Timeline[edit]

  • Conduct background research on existing approaches to measuring readability (see Research:Multilingual_Readability_Research/Background_Research)
  • Identify candidate approaches that support multiple languages. Conduct exploratory research with corresponding models and datasets.
  • Specify the task, implement the models, and evaluate performance
  • Make a decision whether an automatic evaluation of readability across projects is feasible.

Results[edit]

Identifying a set of candidate approaches[edit]

In Research:Multilingual_Readability_Research/Background_Research, I surveyed different approaches to measuring readability, with a focus on multilingual approaches. The following two approaches emerged as the most promising candidates:

  • Language-dependent approach using pre-trained multilingual language models. Several studies have shown that standard language models such as BERT can be used to derive features from sentences or texts that capture readability as well as or better than hand-crafted linguistic features. These models do not support all languages, but they do cover on the order of 100.
  • Language-agnostic approach using an entity linker. This approach yields a language-agnostic representation of text as a sequence of entities (instead of words/syllables/etc.). From this we can derive shallow features (e.g., average number of entities per sentence) without language-specific parsing; this has been shown to capture some aspects of readability (see the sketch after this list). This relies on the availability of an entity linker; one promising candidate is DBpedia Spotlight, which is open, exists for several languages, and can be expanded to new languages.

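To make the language-agnostic representation concrete, here is a minimal sketch of shallow features derived from entity annotations. A real pipeline would call an entity linker such as DBpedia Spotlight; the gazetteer lookup below is only a toy stand-in:

```python
# Sketch: shallow, language-agnostic readability features derived from
# entity annotations. A real pipeline would use an entity linker such as
# DBpedia Spotlight; the gazetteer lookup below is only a toy stand-in.

GAZETTEER = {"Wikipedia", "Wikimedia Foundation", "BERT"}  # toy entity list

def link_entities(sentence: str) -> list[str]:
    """Toy stand-in for an entity linker: substring match against a gazetteer."""
    return [e for e in GAZETTEER if e.lower() in sentence.lower()]

def entity_features(sentences: list[str]) -> dict[str, float]:
    """Shallow features analogous to words-per-sentence in classic readability
    formulas, but counting linked entities instead of language-specific tokens."""
    counts = [len(link_entities(s)) for s in sentences]
    return {
        "avg_entities_per_sentence": sum(counts) / len(counts),
        "max_entities_per_sentence": float(max(counts)),
    }

sentences = [
    "Wikipedia is run by the Wikimedia Foundation.",
    "It is easy to read.",
]
print(entity_features(sentences))
# {'avg_entities_per_sentence': 1.0, 'max_entities_per_sentence': 2.0}
```
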
We can apply these two models to different datasets, treating readability prediction as a classification task (a sketch of the pairwise evaluation follows the list):

  • Simple vs. English Wikipedia corpus. Texts with two reading levels (simple, normal). While this covers only English texts, its main advantage is that it provides on the order of 65k articles.
  • Vikidia vs. Wikipedia corpus. Texts with two reading levels (simple, normal). While it is a much smaller corpus, it contains texts in several languages.
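
Because both corpora consist of matched easy/normal texts, a natural evaluation is pairwise accuracy: the fraction of pairs for which a model rates the easy version as more readable. A minimal sketch under that assumption (the toy scorer below simply penalizes long sentences and is purely illustrative):

```python
# Sketch: evaluating a readability scorer on matched (easy, normal) article
# pairs, e.g. Simple English vs. English Wikipedia, or Vikidia vs. Wikipedia.
# "score" can be any model that returns higher values for easier text.

def pairwise_accuracy(pairs, score):
    """Fraction of pairs where the easy version gets the higher score."""
    correct = sum(1 for easy, normal in pairs if score(easy) > score(normal))
    return correct / len(pairs)

def toy_score(text: str) -> float:
    """Toy scorer for illustration: shorter sentences count as easier."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return -len(words) / max(len(sentences), 1)

pairs = [
    ("Cats are small. They purr.",
     "The domestic cat is a small carnivorous mammal often kept as a pet."),
]
print(pairwise_accuracy(pairs, toy_score))  # 1.0 for this toy pair
```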

Generating a multilingual dataset[edit]

Code and data: https://gitlab.wikimedia.org/repos/research/readability

We generate datasets with matched articles on the same topic available at two different readability levels (a sketch of matching via interlanguage links follows the list):

  • Simple-English Wikipedia data: a large corpus of 85,626 articles in English from Simple English Wikipedia and English Wikipedia.
  • Vikidia-Wikipedia data: a much smaller corpus of articles in different languages from Vikidia and Wikipedia; for 8 languages we have at least 100 articles, and for 4 languages we have at least 1,000 articles (fr: 9893, it: 1157, es: 2136, en: 1729, eu: 945, hy: 6, de: 256, ca: 211, scn: 8, ru: 106, pt: 40, el: 33).
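
For illustration, matched pairs can be discovered through the MediaWiki API's interlanguage links; the sketch below asks English Wikipedia for the Simple English counterpart of a title (an illustrative example, not the extraction code in the repository above):

```python
# Sketch: finding the Simple English counterpart of an English Wikipedia
# article via the MediaWiki API's interlanguage links. Illustrative only;
# the project's actual dataset-building code lives in the GitLab repo above.
import requests

def simple_counterpart(title: str) -> str | None:
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": title,
            "lllang": "simple",  # restrict to the Simple English langlink
            "format": "json",
        },
        headers={"User-Agent": "readability-research-example/0.1"},
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    for page in pages.values():
        for link in page.get("langlinks", []):
            return link["*"]  # title of the matched Simple English article
    return None

print(simple_counterpart("Earth"))  # -> "Earth" (if a Simple version exists)
```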

A language-agnostic model for readability[edit]

We develop and evaluate a language-agnostic model for assessing the readability of Wikipedia articles:

  • While there are many different ways to operationalize readability, we start from standard readability formulas, which are known to capture at least some aspects of readability (such as the Flesch reading ease for articles in English Wikipedia). Our focus is on adapting these approaches to more languages by using language-agnostic features.
  • We build a binary classification model that predicts the annotated readability level of an article (easy or difficult). The model’s prediction score (between 0 and 1) can be interpreted as a normalized readability score (see the sketch after this list).
  • We train the model only on English, where we have sufficient ground-truth data from Simple English Wikipedia and English Wikipedia. We test the model in other languages using annotated ground-truth data from Wikipedia and corresponding children’s encyclopedias. This reveals how well the model generalizes to languages without ground-truth data, i.e., without re-training or fine-tuning, which is the situation for the majority of languages in Wikipedia.

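A minimal sketch of this setup with scikit-learn, using randomly generated placeholder features in place of the real language-agnostic features; the probability the classifier assigns to the "easy" class serves as the normalized readability score:

```python
# Sketch: binary readability classifier whose prediction probability is
# read as a normalized readability score. The features are random
# placeholders for the language-agnostic features described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder features: easy articles (label 1) drawn with a lower mean
# "entities per sentence" than difficult articles (label 0).
X_easy = rng.normal(loc=2.0, scale=1.0, size=(100, 1))
X_hard = rng.normal(loc=4.0, scale=1.0, size=(100, 1))
X_train = np.vstack([X_easy, X_hard])
y_train = np.array([1] * 100 + [0] * 100)

model = LogisticRegression().fit(X_train, y_train)

# The probability of the "easy" class (between 0 and 1) is interpreted
# as a normalized readability score: higher = more readable.
new_article_features = np.array([[2.5]])
readability_score = model.predict_proba(new_article_features)[0, 1]
print(round(readability_score, 2))
```
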
In summary, we find that the language-agnostic model constitutes a promising approach for obtaining readability scores of articles in different languages without language-specific fine-tuning or customization.

  • The language-agnostic approach is less precise than the standard readability formulas for English.
  • The language-agnostic approach generalizes better to other languages than the (non-customized) standard readability formulas.
  • The language-agnostic approach performs similarly to, or almost as well as, the customized versions of the standard readability formulas for most languages (noting that such customizations exist only for very few languages).

Read the details: Research:Multilingual Readability Research/Evaluation language agnostic

An improved multilingual model for readability[edit]

We develop an alternative model for assessing the readability of Wikipedia articles, using a pre-trained multilingual language model (BERT). The model supports 104 languages; we therefore call it a multilingual model, in contrast to the language-agnostic model above. While this means the model cannot support all Wikipedia languages, it still has several advantages over the language-agnostic model:

  • the performance on the evaluation dataset improves substantially
  • the number of supported languages actually increases: the language-agnostic model requires an entity-linking step, for which we currently rely on DBpedia Spotlight, which supports only about 20 languages. A sketch of the multilingual setup is shown below.

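For illustration, a minimal sketch of this setup with the Hugging Face transformers library; it shows the general shape of the approach, not the project's exact training configuration (the classification head below is untrained, so its outputs are meaningless until fine-tuned):

```python
# Sketch: readability as binary classification with a pre-trained
# multilingual BERT (covers 104 languages). Illustrative setup only;
# the classification head is randomly initialized until fine-tuned.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # easy vs. difficult
)

texts = ["Cats are small animals.", "Les chats sont de petits animaux."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
# Softmax over the two classes; column 1 is read as the "easy" probability.
scores = torch.softmax(logits, dim=-1)[:, 1]
print(scores)
```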

Read the details: Research:Multilingual Readability Research/Evaluation multilingual model

Deployment[edit]

The model for scoring the readability of Wikipedia articles has been productionized on LiftWing, WMF's Machine Learning Platform.
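
Requests to LiftWing typically follow the pattern below. Note that the model name (readability) and the payload fields (rev_id, lang) are assumptions for illustration; consult the LiftWing documentation for the deployed model's actual endpoint and schema:

```python
# Sketch: querying a model hosted on LiftWing. The endpoint name
# ("readability") and payload fields (rev_id, lang) are assumptions;
# check the LiftWing documentation for the deployed model's actual API.
import requests

url = "https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict"
payload = {"rev_id": 123456, "lang": "en"}  # hypothetical revision ID

resp = requests.post(
    url,
    json=payload,
    headers={"User-Agent": "readability-research-example/0.1"},
    timeout=10,
)
print(resp.json())  # expected to contain a readability score for the revision
```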

Readers' perception of readability[edit]

We plan to evaluate whether the automated scores from the readability models match readers' perception of readability.

Read the details: Research:Understanding_perception_of_readability_in_Wikipedia

Resources[edit]

Subpages[edit]