User:Trokhymovych/drafts/Multilingual readability model card

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Mykola Trokhymovych and Martin Gerlach
Model owner(s): Martin Gerlach
Code: training and inference
Uses PII: No
In production? No
This model uses article text to predict how hard it is for a reader to understand it.



This model generates scores to assess the readability of Wikipedia articles. The readability score is a rough proxy that captures how difficult it is for a reader to understand the text of an article.

Specifically, we propose a multilingual model based on pre-trained mBERT[1]. It does not support all languages, but covers roughly the 100 languages with the largest Wikipedias.

We fine-tune the model using annotated data of articles available at different readability levels. One of the main challenges is that for most languages there is no ground-truth data about the reading level of an article, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We evaluate the model's performance on small annotated datasets available in a few languages, built from different children's encyclopedias (such as Vikidia).

Motivation

As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. One of the goals is to identify metrics to quantify the size of these gaps. This model attempts to provide a metric to measure the readability of articles in Wikimedia projects, with a specific focus on providing multilingual support.

While there are readily available formulas to calculate readability of articles (such as the Flesch-Kincaid score), these formulas are often developed for a specific language (most commonly English). Usually, these formulas cannot be applied out of the box to other languages. As a result, it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
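For reference, the original Flesch–Kincaid grade-level formula for English text is:

FK grade level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

Because word- and syllable-counting rules are language-specific, such formulas do not transfer directly to other languages, which is what motivates the model-based approach described below.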

You can find more details about the project here: Research:Multilingual Readability Research

Users and uses

Use this model for
  • Determining the readability score of a Wikipedia article revision
  • Estimating a Flesch–Kincaid score for an article in a multilingual setting
  • Comparing the readability of different revisions of the same article
Don't use this model for
  • Making predictions on language editions of Wikipedia outside the listed languages, or on other Wikimedia projects (Wiktionary, Wikinews, Wikidata, etc.)
  • Making predictions on namespaces other than 0, on disambiguation pages, or on redirects
Current uses

Ethical considerations, caveats, and recommendations

The model only uses publicly available data of the content (i.e. plain text) extracted from the articles.

Nevertheless, there are certain caveats:

  • Multilingual support: The model has only been trained on English data annotated with different readability levels. Our evaluation shows that the resulting model also works for other languages. However, performance varies across languages (see below). While this is a known issue for mBERT more generally [2], in the context of readability we are unable to systematically evaluate the model for many supported languages due to the lack of ground-truth data. To address this issue, we have started a research project to manually evaluate the model based on readers' perception of readability through surveys (ongoing).

Model

The presented system is based on a fine-tuned mBERT language model[3] along with a CatBoost regressor[4] used as a Flesch–Kincaid scoring model. It is built in the paradigm of having one generalized model for all covered languages. The system includes the following steps:

1. Text feature preparation:

  • Process wikitext and extract the revision text
  • Split text into sentences.

2. Masked Language Models (MLM) outputs extraction:

  • Pass each of the sentences to the pre-trained classification model

3. Final score extraction:

  • Apply mean pooling to the list of sentence scores to obtain the final unified readability score. This corresponds to a binary classification score of whether the article should be annotated with one of the two levels of readability (easy or difficult).
  • Apply the Flesch–Kincaid scoring model on top of the sentence scores. This score corresponds to a predicted Flesch–Kincaid grade level, i.e. a U.S. grade level capturing roughly "the number of years of education generally required to understand this text", that can be applied to other languages. The motivation is to provide a more interpretable score as an alternative to the binary classification score.
System design. Inference
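A minimal sketch of this inference flow is shown below. It is illustrative rather than the production code: the model paths, the helper names, and in particular the aggregation of sentence scores into features for the regressor are assumptions.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from catboost import CatBoostRegressor

# Assumed local artifacts: a fine-tuned mBERT classifier and a trained CatBoost regressor.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
classifier = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-mbert")
fk_regressor = CatBoostRegressor().load_model("path/to/fk_regressor.cbm")

def sentence_scores(sentences):
    # Step 2: probability of "hard to read" for each sentence.
    scores = []
    for sentence in sentences:
        inputs = tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = classifier(**inputs).logits
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return scores

def readability(sentences):
    # Step 3: mean pooling for the binary score, regressor for the FK approximation.
    scores = sentence_scores(sentences)
    probability = float(np.mean(scores))
    features = [np.mean(scores), np.std(scores), np.min(scores), np.max(scores)]  # assumed aggregation
    fk_score = float(fk_regressor.predict([features])[0])
    return {"prediction": probability > 0.5,
            "probability": probability,
            "fk_score": fk_score}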

Performance

We evaluate the model on a binary classification task. For the model probability output, we use the mean pooling of sentence-level MLM scores; to obtain a binary label from it, we apply a threshold of 0.5.

The testing data consist of pairs of texts that correspond to the simple (easy) and difficult (hard) versions of one article (for example, the same article from English Wikipedia and Simple English Wikipedia). Even though we train the model only on English texts, we evaluate performance in other languages. We evaluate model performance using AUC and Accuracy metrics.
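As an illustration, the evaluation described above amounts to the following (a sketch assuming the article-level pooled probabilities and their easy/hard labels are already available):

from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(pooled_probabilities, labels, threshold=0.5):
    # labels: 1 for the difficult (hard) version of a pair, 0 for the simple (easy) one.
    predictions = [int(p > threshold) for p in pooled_probabilities]
    return {"Accuracy": accuracy_score(labels, predictions),
            "AUC": roc_auc_score(labels, pooled_probabilities)}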

Model performance metrics
Testing set  Accuracy  AUC
simple-en-test 0.891352 0.955451
simple-en-validation 0.893358 0.955407
klexikon-de 0.757636 0.948942
vikidia-ca 0.860656 0.914270
vikidia-de 0.690476 0.872446
vikidia-el 0.524390 0.761154
vikidia-en 0.921013 0.982656
vikidia-es 0.702041 0.822553
vikidia-eu 0.579792 0.611134
vikidia-fr 0.731558 0.826539
vikidia-hy 0.535455 0.695755
vikidia-it 0.763791 0.856777
vikidia-oc 0.571429 0.795918
vikidia-pt 0.811037 0.908483
vikidia-ru 0.701923 0.837555
vikidia-scn 0.636364 0.752066
wikikids-nl 0.715346 0.788743
txikipedia 0.425975 0.386073


Implementation

Model architecture

mBERT model fine-tuning:

  • Learning rate: 2e-5
  • Weight Decay: 0.01
  • Epochs: 5
  • Maximum input length: 512
  • Number of encoder attention layers: 12
  • Number of decoder attention layers: 12
  • Number of attention heads: 12
  • Length of encoder embedding: 768
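A sketch of a fine-tuning setup with these hyperparameters is shown below. It uses the Hugging Face Trainer with a toy two-example dataset, so the dataset construction and output directory are placeholders, not the actual training pipeline.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # easy vs. difficult

# Placeholder data; the real training data are sentences from
# English Wikipedia and Simple English Wikipedia article pairs.
examples = Dataset.from_dict({
    "text": ["A short and simple sentence.",
             "A considerably more elaborate and syntactically involved sentence."],
    "label": [0, 1],
})
examples = examples.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True)

training_args = TrainingArguments(
    output_dir="readability-mbert",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=5,
)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=examples, tokenizer=tokenizer)
trainer.train()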

CatBoost:

  • Iterations: 5000
  • Learning Rate: 0.01
  • Loss: RMSE
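And a corresponding sketch of the Flesch–Kincaid regressor. The features derived from the per-sentence scores and the toy training data are assumptions, since the exact feature construction is not documented here.

import numpy as np
from catboost import CatBoostRegressor

def aggregate(sentence_scores):
    # Assumed fixed-length features built from the per-sentence MLM scores.
    return [np.mean(sentence_scores), np.std(sentence_scores),
            np.min(sentence_scores), np.max(sentence_scores)]

# Toy placeholders: per-article sentence scores and target Flesch-Kincaid grade levels.
train_sentence_scores = [[0.10, 0.20, 0.15], [0.80, 0.90, 0.70]]
train_fk_grade_levels = [4.0, 12.0]

X = [aggregate(s) for s in train_sentence_scores]
fk_regressor = CatBoostRegressor(iterations=5000, learning_rate=0.01, loss_function="RMSE")
fk_regressor.fit(X, train_fk_grade_levels, verbose=False)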
Output schema
{
  lang: <language code string>,
  rev_id: <revision_id string>,
  score: {
     prediction: <boolean decision result>,
     probability: <probability of being hard to read>,
     fk_score: <Flesch–Kincaid score approximation>
  }
}
Example input and output

Example input:

curl "https://<endpoint>/v1/models/readability:predict" -X POST -d '{"lang": "en", "rev_id":1161100049}' -H "Host: readability.experimental.wikimedia.org" --http1.1

Experimental endpoint (internal use only): inference-staging.svc.codfw.wmnet:30443

Example output:

{
"model_name":"readability",
"model_version":"2",
"wiki_db":"enwiki",
"revision_id":1161100049,
"output":{
    "prediction":true,
    "probabilities":{
        "true":0.8169194640857833,
        "false":0.1830805359142167
    },
    "fk_score":11.953445079550391
    }
}
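
A minimal Python equivalent of the curl request above (the endpoint is experimental and reachable only from the Wikimedia internal network; certificate handling may need to be adapted):

import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/v1/models/readability:predict",
    headers={"Host": "readability.experimental.wikimedia.org"},
    json={"lang": "en", "rev_id": 1161100049},
)
print(response.json())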

Data

Training data consist of pairs of texts that correspond to the same article in English Wikipedia and Simple English Wikipedia. We treat one of the texts in a pair as simple (easy) and the other as difficult (hard). Each text is represented as a list of sentences.

We split the data into three parts: train (80%), validation (10%), and test (10%). An important detail is that we include different versions of the same article in only one data split (train, validation, or test).
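This article-level constraint can be implemented, for example, with a group-aware splitter; the sketch below is illustrative and uses toy data rather than the actual pipeline.

from sklearn.model_selection import GroupShuffleSplit

# All versions of the same article share a group id, so a pair never straddles two splits.
texts = ["easy version of A", "hard version of A", "easy version of B", "hard version of B"]
labels = [0, 1, 0, 1]
article_ids = ["A", "A", "B", "B"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, holdout_idx = next(splitter.split(texts, labels, groups=article_ids))
# The 20% holdout is then split in half again to obtain validation (10%) and test (10%).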

Apart from the holdout dataset, we evaluate model performance in other languages. In particular, we use Vikidia pairs for it, oc, el, de, ru, es, en, ca, hy, scn, pt, fr, and eu; Klexikon for de; Wikikids for nl; and Txikipedia for eu.

Experiment setup


Data pipeline
Training data
  • Number of samples: 174,642
  • Balance of classes: 1:1
  • Languages: en
Test data
  • Number of samples: 119,536
  • Balance of classes: 1:1
  • Languages: en, de, ca, el, es, eu, fr, hy, it, oc, pt, ru, scn, nl

Licenses

Citation

To be added soon.

  1. https://huggingface.co/bert-base-multilingual-cased
  2. Wu, S., & Dredze, M. (2020). Are All Languages Created Equal in Multilingual BERT? Proceedings of the 5th Workshop on Representation Learning for NLP, 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16
  3. https://huggingface.co/bert-base-multilingual-cased
  4. https://catboost.ai/en/docs/concepts/python-reference_catboostregressor