Research:Improving multilingual support for link recommendation model for add-a-link task

Tracked in Phabricator: Task T342526

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


In a previous project, we developed a machine-learning model to recommend new links for articles[1]: Research:Link_recommendation_model_for_add-a-link_structured_task

The model is used for the add-a-link structured task. The aim of this task is to provide suggested edits to newcomer editors (in this case, adding links) in order to break editing down into simpler and better-defined tasks. The hypothesis is that this leads to a more positive editing experience for newcomers and that, as a result, they will keep contributing in the long run. In fact, the experimental analysis showed that newcomers are more likely to be retained with this feature, and that the volume and quality of their edits increase. As of now, the model is deployed to approximately 100 Wikipedia languages.

However, we have found that the model currently does not work well for all languages. After training the model for 301 Wikipedia languages, we identified 23 languages for which the model did not pass the backtesting evaluation. This means that we think the model's performance does not meet a minimum quality standard in terms of the accuracy of the recommended links. Detailed results: Research:Improving multilingual support for link recommendation model for add-a-link task/Results round-1
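For context, the backtesting evaluation works by hiding links that editors have already added to articles and checking whether the model recovers them; precision and recall of the recovered links are then compared against fixed thresholds. The following is a schematic sketch of this idea in Python; the function names and data structures are illustrative placeholders, not the production evaluation code.

# Schematic sketch of a backtesting evaluation: hide links that editors already
# added, ask the model for suggestions, and measure how many are recovered.
# `model.predict_links` and `strip_links` are illustrative placeholders.

def evaluate_backtesting(model, held_out_articles, threshold=0.5):
    tp = fp = fn = 0
    for article in held_out_articles:
        gold = set(article["links"])                   # (anchor, target) pairs added by editors
        plain_text = strip_links(article["wikitext"])  # article text with links removed
        predicted = {
            (anchor, target)
            for anchor, target, score in model.predict_links(plain_text)
            if score >= threshold
        }
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall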

In this project, we want to improve the multilingual support of the model. This means we want to increase the number of languages for which the model passes the backtesting evaluation such that it can be deployed to the respective Wikipedias.

Methods

We will pursue two different approaches to improving multilingual support.

Improving the model for individual languages.

We will try to fix the existing model for individual languages. From the previous experiments where we trained the model for 301 languages, we gathered some information about potential improvements for individual languages (T309263). The two most promising approaches are:

  • Unicode decode error when running wikipedia2vec to create article embeddings as features. This appeared in fywiki and zhwiki (T325521) and has been documented in the respective GitHub repository, which also proposes a fix; however, that fix has not been merged yet. The idea would be to implement (or adapt, if necessary) the proposed fix.
  • Word-tokenization. Many of the languages that failed the backtesting evaluation do not use whitespace to separate tokens (such as Japanese). The current model relies on whitespace to identify tokens in order to generate candidate anchors for links. Thus, improving the word-tokenization for non-whitespace-delimited languages should improve the performance of the models in these languages. We recently developed mwtokenizer, a package for tokenization in (almost) all Wikipedia languages. The idea would be to integrate mwtokenizer into the tokenization pipeline (see the sketch below this list).
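To illustrate why whitespace-based tokenization is a problem for languages such as Japanese, and how mwtokenizer would slot in, here is a minimal sketch. The mwtokenizer calls follow the package's documented interface as we understand it and may differ slightly in detail.

# Whitespace splitting yields no usable tokens for Japanese text, so no anchor
# candidates can be generated from it.
text_ja = "東京は日本の首都です。"  # "Tokyo is the capital of Japan."
print(text_ja.split())            # ['東京は日本の首都です。'] -- a single "token"

# Sketch of using mwtokenizer instead (interface assumed from its documentation).
from mwtokenizer.tokenizer import Tokenizer

tokenizer = Tokenizer(language_code="ja")
sentences = list(tokenizer.sentence_tokenize(text_ja))
tokens = [t for t in tokenizer.wordpunct_tokenize(text_ja) if t.strip()]
print(tokens)  # word-level tokens that can serve as anchor candidates for links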

Developing a language-agnostic model.

Even if we can fix the model for all of the languages above, the current model architecture has several limitations. Most importantly, we currently need to train a separate model for each language. This creates challenges for deploying the model to all languages, because we would need to train and run 300 or more different models.

In order to simplify the maintenance work, we would ideally like to develop a single language-agnostic model. We will explore different approaches to developing such a model while ensuring the accuracy of the recommendations. Among others, we will use the language-agnostic revert-risk model as an inspiration, where such an approach has been implemented and deployed successfully.


Results

Improving mwtokenizer

We hypothesize that we can improve language support for the add-a-link model by improving tokenization for languages that do not use whitespace to separate words, such as Japanese.

As a first step, we worked on the newly developed mwtokenizer package (as part of Research:NLP Tools for Wikimedia Content), a library to improve tokenization across Wikipedia languages, so that it can be integrated into the add-a-link model. Specifically, we resolved several crucial issues (phab:T346798), such as fixing the regex for sentence-tokenization in non-whitespace languages.
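The core of that problem is that sentence boundaries in non-whitespace languages are marked by fullwidth punctuation (。！？) that is not followed by a space, so a boundary pattern that requires trailing whitespace never fires. Below is a simplified, illustrative version of such a boundary pattern, not the exact regex shipped in mwtokenizer.

import re

# Split after ASCII sentence enders followed by whitespace, OR right after
# fullwidth CJK sentence enders, which are typically not followed by a space.
# (Zero-width splits require Python 3.7+.)
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")

text = "これは一文目です。これは二文目です！And an English sentence. Another one?"
sentences = [s for s in SENTENCE_BOUNDARY.split(text) if s]
print(sentences)
# ['これは一文目です。', 'これは二文目です！', 'And an English sentence.', 'Another one?']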

As a result, we released a new version (v0.2.0) of the mwtokenizer package which contains these improvements.

Improving the model for individual languages

Some of the major changes made to improve the performance of the existing language-dependent models are (phab:T347696):

  • Replacing nltk and manual tokenization with mwtokenizer. This enabled effective sentence and word tokenization of non-whitespace languages and thus improved performance (Merge Request).
  • Fixing a Unicode error that was preventing a few models from running successfully (Merge Request).
  • Fixing a regex that prevented links detected by the model from being placed correctly in the output string for non-whitespace languages (Merge Request); see the sketch below this list.
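The underlying issue is that word-boundary patterns such as \b rely on transitions between word characters and whitespace or punctuation, which never occur between two CJK characters, so the anchor text is never matched and the link is not inserted. Below is a simplified sketch of the general idea, illustrative only and not the actual patch.

import re

def insert_link(text: str, anchor: str, target: str) -> str:
    # A pattern like r"\b" + re.escape(anchor) + r"\b" fails for non-whitespace
    # scripts because \b never fires between two CJK characters. Matching the
    # escaped anchor substring directly works for both kinds of scripts.
    pattern = re.compile(re.escape(anchor))
    return pattern.sub(f"[[{target}|{anchor}]]", text, count=1)

print(insert_link("東京は日本の首都です。", "日本", "日本"))
# 東京は[[日本|日本]]の首都です。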

Having resolved these major errors, we can now run the model for all languages without error, and performance has improved in many of the non-whitespace languages thanks to the improved mwtokenizer. Below are the current results for the languages that did not pass backtesting before. Previous results can be found here: Results round-1.

Table showing change in performance for languages that did not pass backtesting earlier.

wiki | previous precision | new precision | previous recall | new recall | comments | passes backtesting
aswiki | 0.57 | 0.68 | 0.16 | 0.28 | improvement! | borderline (precision is below 75%)
bowiki | 0 | 0.98 | 0 | 0.62 | improvement! | True
diqwiki | 0.4 | 0.88 | 0.9 | 0.49 | recall dropped | True
dvwiki | 0.67 | 0.88 | 0.02 | 0.49 | improvement! | True
dzwiki | - | 1.0 | - | 0.23 | improvement! | True
fywiki | error | 0.82 | error | 0.459 | improvement! | True
ganwiki | 0.67 | 0.82 | 0.01 | 0.296 | improvement! | True
hywwiki | 0.74 | 0.75 | 0.19 | 0.30 | similar results | True
jawiki | 0.32 | 0.82 | 0.01 | 0.35 | improvement! | True
krcwiki | 0.65 | 0.78 | 0.2 | 0.35 | slight improvement | True
mnwwiki | 0 | 0.97 | 0 | 0.68 | improvement! | True
mywiki | 0.63 | 0.95 | 0.06 | 0.82 | improvement! | True
piwiki | 0 | 0 | 0 | nan | only 13 sentences | False
shnwiki | 0.5 | 0.99 | 0.02 | 0.88 | improvement! | True
snwiki | 0.64 | 0.69 | 0.16 | 0.18 | similar results | borderline (precision is below 75%, recall is close to 20%)
szywiki | 0.65 | 0.79 | 0.32 | 0.48 | slight improvement | True
tiwiki | 0.54 | 0.796 | 0.5 | 0.48 | slight improvement | True
urwiki | 0.62 | 0.86 | 0.23 | 0.54 | improvement! | True
wuuwiki | 0 | 0.68 | 0 | 0.36 | improvement! | borderline (precision is below 75%)
zhwiki | - | 0.78 | - | 0.47 | improvement! | True
zh_classicalwiki | 0 | 1.0 | 0 | 0.0001 | improvement, low recall | False
zh_yuewiki | 0.48 | 0.31 | 0 | 0.0006 | low recall | False

The following table shows the current performance of some languages that had passed backtesting earlier. We make this comparison to ensure that the new changes do not deteriorate performance.

Table showing change in performance for some languages that passed backtesting earlier.

wiki | previous precision | new precision | previous recall | new recall | comments
arwiki | 0.75 | 0.82 | 0.37 | 0.36 | improvement
bnwiki | 0.75 | 0.725 | 0.3 | 0.38 | similar results
cswiki | 0.78 | 0.80 | 0.44 | 0.45 | similar results
dewiki | 0.8 | 0.83 | 0.48 | 0.48 | similar results
frwiki | 0.815 | 0.82 | 0.459 | 0.50 | similar results
simplewiki | 0.79 | 0.79 | 0.45 | 0.43 | similar results
viwiki | 0.89 | 0.91 | 0.65 | 0.67 | similar results

Exploratory work for language-agnostic model

Currently, we train a separate model for each language wiki, and each model is served independently. This creates deployment strain and is not easy to manage in the long run. The main goal is to develop a single model that supports all (or as many as possible) languages in order to decrease the maintenance cost. Alternatively, we could develop a few models, each covering a set of compatible languages.

First, we need to ensure that multiple languages can be trained and served using a single model. To test this hypothesis, we performed some exploratory work on language-agnostic models (phab:T354659). Some of the important changes made were:

  • Removing the dependency on Wikipedia2Vec by using outlink embeddings. These embeddings were created in-house from Wikipedia links (Merge Request); see the sketch after this list.
  • Adding a grid-search module to select the best possible model (Merge Request).
  • Adding a feature called `wiki_db` that identifies the wiki (e.g. enwiki, bnwiki). This should ideally help the model when combining multiple languages (Merge Request).
  • Combining training data of multiple languages, training a single model, running the evaluation for each language, and comparing performance with single-language models (Merge Request).
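As a sketch of what "outlink embeddings" can look like: one way to build them is to treat the list of links in each article as a sentence and train a standard word-embedding model over those lists, so that every article (link target) gets a vector. The example below uses gensim and made-up identifiers as an assumption-laden illustration, not necessarily the exact in-house pipeline.

from gensim.models import Word2Vec

# Each "sentence" is the list of link targets found in one article; training a
# skip-gram model over these lists yields one embedding per linked article.
# The identifiers below are made-up examples.
outlink_corpus = [
    ["Tokyo", "Japan", "Capital_city"],   # outlinks of article A
    ["Japan", "Tokyo", "Shinkansen"],     # outlinks of article B
    # ... one list of outlinks per article in the wiki
]

model = Word2Vec(
    sentences=outlink_corpus,
    vector_size=100,  # embedding dimension
    window=5,
    min_count=1,
    sg=1,             # skip-gram
    workers=4,
)

japan_vector = model.wv["Japan"]  # usable as a feature in the link model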

To create a language-agnostic model:

  • We first combined the training data of two unrelated languages, and performance did not drop much. This motivated us to scale the experiment to 11 and then to roughly 50 languages. We trained two models on two sets of ~50 languages: one set had 52 central languages from fallback chains, and the other had 44 randomly selected wikis. We trained a model on all languages in each set and evaluated it on each individual language wiki. The performance comparison of the language-agnostic models and the single-language models can be found here: main_v2 and sec_v2. The performance of the language-agnostic model for both sets of languages is comparable to the single-language versions. This shows that we can, in principle, select any set of wikis, perform combined training, and expect good results.
  • We extended the experiment and trained a model on all (317) language wikis, with a cap of 100k samples per language. The evaluations can be found here: all_wikis_baseline. As before, some languages show a drop in performance, but many of the languages perform almost on par with the single-language models. Specifically, 14% of the languages had a drop of 10% or more in precision, while the rest were close to the precision of the single-language models. We then increased the cap to 1 million samples per language (evaluation here: all_wikis_baseline_1M). The performance remains extremely close to the 100k-samples experiment, with a slight decrease in precision in 4 languages and a slight increase in 4 other languages. A sketch of this combined-training setup is shown after this list.
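The following is a minimal sketch of the combined-training setup described above, with a per-wiki sample cap and the `wiki_db` column added as a feature. `load_training_data` and the default cap are placeholders standing in for the actual data-loading code and experiment settings.

import pandas as pd

SAMPLES_PER_WIKI = 100_000  # cap used above; a later run raised this to 1M

def build_combined_training_set(wikis, cap=SAMPLES_PER_WIKI, seed=0):
    """Concatenate per-wiki training data into one language-agnostic set."""
    parts = []
    for wiki in wikis:
        df = load_training_data(wiki)      # hypothetical per-wiki loader
        if len(df) > cap:
            df = df.sample(n=cap, random_state=seed)
        df["wiki_db"] = wiki               # wiki identifier, used as a model feature
        parts.append(df)
    return pd.concat(parts, ignore_index=True)

# e.g. combined = build_combined_training_set(["enwiki", "jawiki", "bnwiki"])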

Takeaways: Based on our experiments, we confirm that it is indeed possible to combine languages, even randomly selected ones, and obtain performance very close to that of the single-language models. How many models to train, which languages should be trained together, and how many samples to use are all questions that need more experiments to answer, and the answers will mostly depend on the memory and time constraints of training the model(s).

Resources

t.b.a.

References

  1. Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939