Research:Copyediting as a structured task/A model to detect sentences for copyediting

The aim of this project is to develop a model to detect sentences which need copyediting.

Specifically, we train a machine-learning model that takes a sentence as input and yields a single score as output indicating whether the sentence needs improvement in terms of copyediting; i.e., the score reflects the quality of the sentence (in terms of copyediting). Such a model is similar, in spirit, to the model that automatically scores sentences on whether they are missing citations: Research:Identification of Unsourced Statements

The use case for the model is copyediting as a structured task. That is, the aim of the model is not to correct all copyediting errors in Wikipedia. Instead, the aim is to surface high-quality suggestions to newcomer editors, who will then perform the edits themselves. Therefore, the model's output score can be used in different ways:

  • Surface the lowest-scoring sentences (those most in need of copyediting) to newcomer editors as candidates for improvement, without a specific suggestion.
  • Prioritize the lowest-scoring sentences as input to other copyedit tools, such as LanguageTool, which yield specific suggestions for improvement. In a previous analysis we have shown that these tools are too sensitive when applied to Wikipedia articles, so this prioritization will likely improve the accuracy of their copyedit suggestions (a sketch is shown below).
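As a rough sketch of this second use case, the following Python snippet ranks sentences by the model's score and passes only the lowest-scoring ones to LanguageTool via the third-party language_tool_python wrapper. The function score_sentence and the keep_fraction threshold are illustrative placeholders, not part of the actual pipeline.

    # Illustrative sketch: prioritize low-scoring sentences before running LanguageTool.
    # `score_sentence` is a placeholder for the copyedit model described below.
    import language_tool_python

    def prioritize_for_languagetool(sentences, score_sentence, keep_fraction=0.2, lang="en-US"):
        # Rank sentences from lowest score (most in need of copyediting) upwards.
        ranked = sorted(sentences, key=score_sentence)
        n_keep = max(1, int(keep_fraction * len(ranked)))
        candidates = ranked[:n_keep]
        # Only the prioritized candidates are checked with LanguageTool.
        tool = language_tool_python.LanguageTool(lang)
        try:
            return {sentence: tool.check(sentence) for sentence in candidates}
        finally:
            tool.close()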

This is work in progress.

The work has been conducted together with Djellel Difallah and Rashid Alyassi from NYU Abu Dhabi.

Methods[edit]

Data[edit]

In order to train a model for detecting copyedits, we need to identify suitable datasets with annotated information about copyedits. Unfortunately, there are few resources that are both multilingual and large-scale. Therefore, we use a combination of different annotated datasets for training and evaluating the model. The datasets consist of sentence pairs: for each pair, we have one version of the sentence from before the (copy)edit and one from after it (a sketch of such a pair is shown after the list below). Specifically, we consider the following datasets for our model:

  • C4: C4_200M_synthetic is a ground-truth dataset for grammatical error correction created synthetically by corrupting clean sentences. It contains 200M sentence pairs (English only).
  • Wikiedits: a multilingual dataset of millions of sentence pairs extracted from edit diffs. It was generated by adapting the processing pipeline of the WikEd Error Corpus.
  • Copyedits: a small-scale multilingual dataset of sentence pairs from edits that were marked as copyedits. Specifically, these are edits carrying the edit tag “newcomer task copyedit” (see list of tags).
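To make the pair structure concrete, here is a minimal illustrative sketch in Python; the field names and example sentences are hypothetical and do not reflect the actual schema of these datasets.

    # Illustrative sketch of a before/after sentence pair used for training and evaluation.
    from dataclasses import dataclass

    @dataclass
    class SentencePair:
        lang: str     # wiki language code, e.g. "fr"
        source: str   # "c4", "wikiedits", or "copyedits"
        before: str   # sentence before the (copy)edit -> needs copyediting
        after: str    # sentence after the (copy)edit -> does not need copyediting

    example = SentencePair(
        lang="en",
        source="copyedits",
        before="He go to the store yesterday.",
        after="He went to the store yesterday.",
    )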

Model[edit]

Our base model is the multilingual XLM-RoBERTa, which supports around 100 languages. The motivation for this choice is that the model has been shown to perform well on grammar-related tasks such as CoLA.
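A minimal sketch of such a scoring model, using the Hugging Face transformers library with a single-output head on top of xlm-roberta-base, is shown below; the actual architecture and checkpoint used in the project may differ.

    # Sketch: XLM-RoBERTa with a single-output head that assigns one score per sentence.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=1  # one scalar score per sentence
    )

    def score_sentence(sentence: str) -> float:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # shape: (1, 1)
        return logits.item()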

Training. We perform different steps in training the model.

  • We first perform a general pre-training on English data related to grammatical error correction. Specifically, we train the model on 100K sentence pairs from the C4 dataset, 100K pairs from the Wikiedits dataset, and 100K pairs from the Copyedits dataset (all English).
  • For each language, we then perform an additional fine-tuning step with grammatical-error-correction data from the corresponding language. Specifically, we use 100K sentence pairs from the Wikiedits dataset of that language.
  • Thus, we train a separate model for each language; the pre-training component, however, is the same in each case (a sketch of this two-stage setup is shown below).
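The following sketch summarizes how the training data for the two stages could be assembled; load_pairs is a hypothetical helper that returns before/after sentence pairs for a given dataset and language.

    # Sketch of the two-stage training data: a shared English pre-training mix
    # followed by language-specific fine-tuning data.
    def build_training_data(lang, load_pairs):
        # Stage 1: English pre-training mix (identical for every language model).
        pretraining_pairs = (
            load_pairs("c4", "en", n=100_000)
            + load_pairs("wikiedits", "en", n=100_000)
            + load_pairs("copyedits", "en", n=100_000)
        )
        # Stage 2: fine-tuning on Wikiedits pairs from the target language.
        finetuning_pairs = load_pairs("wikiedits", lang, n=100_000)
        return pretraining_pairs, finetuning_pairs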

The desired outcome of the model is a score that can be used to rank sentences by their need for copyediting. Therefore, we use an objective function that takes into account the ranking of the two sentences of a pair based on their assigned scores, following the Bradley-Terry model. That is, the loss is lower when the scores align with the ground truth: the sentence before the edit needs copyediting, while the sentence after the edit does not.
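Concretely, under the Bradley-Terry formulation the probability that the after-edit sentence is ranked above the before-edit sentence is sigmoid(s_after − s_before), and the training loss is the negative log of this probability. A minimal PyTorch sketch, assuming the model outputs one scalar score per sentence as above:

    # Sketch of a Bradley-Terry pairwise ranking loss: the after-edit sentence
    # should receive a higher score than the before-edit sentence.
    import torch
    import torch.nn.functional as F

    def bradley_terry_loss(score_before: torch.Tensor, score_after: torch.Tensor) -> torch.Tensor:
        # P(after ranked above before) = sigmoid(score_after - score_before);
        # the loss is low when the scores agree with the ground-truth ordering.
        return -F.logsigmoid(score_after - score_before).mean()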

Evaluation. We evaluate the model for each language on the sentence pairs from the Copyedits dataset in the corresponding language. After scoring each sentence of a pair, we calculate the accuracy as the fraction of pairs for which the scores correctly distinguish the sentence before the edit (lower score = needs copyediting) from the sentence after the edit (higher score = does not need copyediting).
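A minimal sketch of this pairwise accuracy computation, where score_sentence is the scoring function sketched above:

    # Sketch: a pair counts as correct when the before-edit sentence
    # scores lower than the after-edit sentence.
    def pairwise_accuracy(pairs, score_sentence) -> float:
        correct = sum(
            1 for before, after in pairs if score_sentence(before) < score_sentence(after)
        )
        return correct / len(pairs)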

Results[edit]

We tested the model on 10 different languages for which we have a sufficiently large number of samples in the Copyedits dataset.

Language (wiki)    N. samples   Accuracy (%)
ar (Arabic)             2,844          74.51
bn (Bengali)              408          64.95
cs (Czech)              4,375          63.98
es (Spanish)           32,490          67.86
fr (French)            18,049          81.14
pt (Portuguese)         6,026          70.89
ro (Romanian)           2,006          53.29
sv (Swedish)           19,031          75.94
uk (Ukrainian)         12,116          64.23
vi (Vietnamese)           224          77.23

Conclusions[edit]

  • The accuracy across languages is mixed.
    • There are several languages for which accuracy is better than 70% (Arabic, French, Portuguese, Swedish, Vietnamese).
    • However, there are some languages, such as Romanian, for which accuracy is lower than 60%.
  • This suggests that the model could be a good-enough approach to prioritize the lowest-scoring sentences for copyediting (or, similarly, to serve as an additional filter for sentences identified via other tools). Especially in the context of structured tasks this could be an option, since we do not need high recall but only a small set of high-quality suggestions.
  • It also highlights the lack of high-quality multilingual ground-truth data for copyediting. We have seen anecdotal evidence that not all edits in the Copyedits dataset correct grammatical or spelling mistakes; some correspond to stylistic or other changes. As a result, the low accuracy in some languages may also stem from lower-quality ground-truth data.

Resources[edit]

  • Code/Data: t.b.a.