Research:Develop an ML-based service to predict reverts on Wikipedia
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
The Research team, in collaboration with the ML-Platform team, is creating a new service to help patrollers detect revisions that are likely to be reverted.
Requirements:
- A single model for all Wikipedia languages.
- The model should be primarily language agnostic.
- The model should be able to run on single revisions or in batches.
- The model should be able to run on LiftWing.
Context
Fighting disinformation and preserving knowledge integrity in our projects is one of the most important and difficult tasks for our movement. Existing content policies have positioned Wikipedia in a central role in the information ecosystem. However, the workload this implies for our communities appears to be one of the main limitations to maintaining and improving content reliability. Machine learning-based tools (a.k.a. AI) are a powerful way to support them, and the Technology department has developed several tools in that direction. However, these tools suffer from several limitations:
- They are highly language dependent.
- They rely on complex manual annotation campaigns that are difficult to scale, especially for small language communities.
- They were created as stand-alone applications, each requiring a dedicated software architecture and data pipelines.
Our Approach
Our proposal is to design a new generation of machine learning models that are primarily language agnostic, based on implicit annotations (e.g. wikitext templates, reverts), and built on a standardized architecture. These models would help the developer community and other WMF teams build tools to sustain and increase knowledge integrity and fight disinformation in Wikimedia projects.
Our recent research has shown that it is possible to build tools based mainly on language-agnostic features that can replace some of the current language-dependent models. Building these tools should (and can) be done while taking into account the differences across projects. Language-agnostic models will help solve the scalability issues raised in the disinformation strategy and provide better support for patrollers in small Wikimedia projects.
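As a concrete illustration of implicit annotations, the sketch below derives "reverted" labels from a page's revision history by detecting identity reverts, i.e. a later revision restoring the exact SHA-1 checksum of an earlier one. The window size, helper names, and the use of the public MediaWiki API are illustrative assumptions, not the project's actual labeling pipeline.

```python
import requests

# Hypothetical sketch: mine implicit "reverted" labels from revision history
# by detecting identity reverts (a later revision restoring an earlier SHA-1).
# Window size and function names are illustrative assumptions.

API = "https://en.wikipedia.org/w/api.php"  # any Wikipedia language edition

def fetch_revisions(title, limit=50):
    """Fetch (rev_id, sha1) pairs for a page, oldest first."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|sha1",
        "rvlimit": limit,
        "rvdir": "newer",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return [(r["revid"], r.get("sha1")) for r in page.get("revisions", [])]

def label_identity_reverts(revisions, radius=15):
    """Mark a revision as reverted if a later revision (within `radius`
    edits) restores the SHA-1 of a revision that preceded it."""
    labels = {rev_id: False for rev_id, _ in revisions}
    for i, (_, sha) in enumerate(revisions):
        for j in range(i + 1, min(i + 1 + radius, len(revisions))):
            # If revision j matches revision i exactly, everything
            # in between was reverted.
            if sha is not None and revisions[j][1] == sha:
                for k in range(i + 1, j):
                    labels[revisions[k][0]] = True
    return labels

revs = fetch_revisions("Coffee")
labels = label_identity_reverts(revs)
print(sum(labels.values()), "of", len(labels), "revisions look reverted")
```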
Revert Risk Model(s)
We have developed two models: a Language Agnostic one, fully based on edit types, and a Multilingual one, based on mBERT.
| Model | All Edits | Anonymous Edits |
|---|---|---|
| Language Agnostic | 0.79 | 0.67 |
| Multilingual | 0.68 | 0.69 |
As the table above shows, the Language Agnostic model performs better on edits overall, while the Multilingual model performs better on anonymous edits.
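This page does not detail the edit-type features themselves. As a rough, hypothetical illustration of what language-agnostic features can look like, the sketch below counts structural wikitext changes (links, templates, references) that carry the same meaning in every language edition and feeds them to a standard classifier. The feature set, the regular expressions, and the gradient-boosted classifier are our assumptions, not the model's actual design.

```python
import re
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical sketch of language-agnostic features: counts of structural
# wikitext elements, which mean the same thing in every language edition.
# Feature set and classifier are illustrative assumptions.

PATTERNS = {
    "links": re.compile(r"\[\[.*?\]\]"),
    "templates": re.compile(r"\{\{.*?\}\}"),
    "refs": re.compile(r"<ref[ >]"),
    "external_links": re.compile(r"https?://\S+"),
}

def structural_counts(wikitext):
    return {name: len(p.findall(wikitext)) for name, p in PATTERNS.items()}

def edit_features(parent_wikitext, new_wikitext):
    """Difference in structural counts, plus raw size change: all
    independent of the language the article is written in."""
    before = structural_counts(parent_wikitext)
    after = structural_counts(new_wikitext)
    feats = [after[k] - before[k] for k in PATTERNS]
    feats.append(len(new_wikitext) - len(parent_wikitext))
    return feats

# Toy training data: X holds feature vectors for past edits, y holds the
# implicit labels (1 = the edit was reverted) mined from revision history.
X = [
    edit_features("[[a]] text", "text"),   # removed a link
    edit_features("text", "text {{cn}}"),  # added a template
]
y = [1, 0]

clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict_proba([edit_features("[[a]][[b]]", "")]))
```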
APIs
The models are available through internal APIs maintained by the ML-Platform team and can be accessed using the following endpoints:
Language Agnostic Model:
curl "https://inference.svc.codfw.wmnet:30443/v1/models/revert-risk-model:predict" -d @input.json -H "Host: revert-risk-model.experimental.wikimedia.org" --http1.1 -k
Multilingual Model:
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revert-risk-model:predict" -d @input.json -H "Host: revert-risk-model.experimental.wikimedia.org" --http1.1 -k
Where the input file input.json follows this format:
{ "lang": "2-digit-wikidb_code", "rev_id": revision_number }
For example:
{ "lang": "ru", "rev_id": 123855516 }
Model Cards
- Multilingual Revert Risk Model Card (Proposed)
- Language Agnostic Revert Risk Model Card (TBA)