Research:Copyediting as a structured task/LanguageTool


A detailed analysis of LanguageTool for the copyedit structured task.

Advantages of LanguageTool

There are many reasons why LanguageTool seems a good starting point:

Challenges for LanguageTool

LanguageTool provides a browser interface (https://languagetool.org/) into which text can be pasted for copyediting.

When using LanguageTool for Wikipedia articles, we face several challenges:

  • A Wikipedia article contains not only plain text but also other elements such as tables, infoboxes, references, etc., which we probably do not want to spell-check.
  • A Wikipedia article contains content (text, links, etc.) that is transcluded from, e.g., templates. Fixing potential copyedits in this case is not recommended, as i) the fix would have to be made in the template and not in the article itself; and ii) it would also affect the content of other articles.
  • A Wikipedia article contains many text elements that might look like errors but are in fact correct, such as quotes or uncommon entity names; these should not be highlighted as copyedits.

As an example, when manually pasting the text from the lead section of the article on Roman Catholic Diocese of Bisceglie, LanguageTool yields 7 suggested copyedits, all of which are false positives:

The Diocese of Bisceglie (Latin: Dioecesis Vigiliensis) was a Roman Catholic diocese located in the town of Bisceglie on the Adriatic Sea in the province of Barletta-Andria-Trani, Apulia in southern Italy. It is five miles south of Trani. In 1818, it was united with the Archdiocese of Trani to form the Archdiocese of Trani-Bisceglie.[1][2]

The main challenge is then to ensure that applying LanguageTool to find copyedits in Wikipedia articles yields genuine errors and not too many false positives (highlighted errors that are in fact correct).

API for LanguageTool

In order to investigate LanguageTool in more detail, we set up our own instance to be used via an API.

Endpoint on cloud-vps

We set up a remote server running our own instance of LanguageTool on cloud-vps.

We can then query LanguageTool in the following way:
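For example, a minimal Python sketch, assuming our instance exposes LanguageTool's standard HTTP interface (a POST to /v2/check); the host name below is a placeholder for our cloud-vps endpoint:

    import requests

    # Placeholder host for our cloud-vps instance; the interface is the same
    # as LanguageTool's public API at https://api.languagetool.org/v2/check.
    LT_API = "https://languagetool.wmcloud.org/v2/check"

    text = "This are a example sentence with a speling error."
    resp = requests.post(LT_API, data={"language": "en", "text": text})
    resp.raise_for_status()

    # Each match carries a character offset/length, a message, and suggestions.
    for match in resp.json()["matches"]:
        snippet = text[match["offset"]:match["offset"] + match["length"]]
        suggestions = [r["value"] for r in match["replacements"][:3]]
        print(f"{snippet!r}: {match['message']} -> {suggestions}")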

More documentation is available at: https://github.com/wikimedia/research-api-endpoint-template/tree/language-tool

Frontend on toolforge

We also built an experimental API to run LanguageTool on Wikipedia articles. The tool automates some of the pre- and post-processing:

  • it extracts the plain text of an article. Using the HTML version, we can keep track of the HTML tags encoding whether a piece of text corresponds to, e.g., a link, a quote, a reference, etc.
  • it runs LanguageTool on the extracted plain text using the endpoint on cloud-vps
  • it allows for filtering the copyedits based on heuristics. For example, we filter out errors related to the anchor text of links.

The tool can be queried by specifying the language (e.g. "en") and the title of an article on the corresponding Wikipedia. An example query is sketched below.
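For illustration, a hedged sketch of such a query; the base URL and parameter names below are assumptions (see the repository linked below for the actual interface):

    import requests

    # Hypothetical base URL and parameters for the experimental tool.
    API_URL = "https://copyedit.toolforge.org/api/v1/check"

    resp = requests.get(API_URL, params={"lang": "en",
                                         "title": "Roman Catholic Diocese of Bisceglie"})
    resp.raise_for_status()
    print(resp.json())  # copyedits found in the article, after filtering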

The supported languages are: ar, ast, be, br, ca, da, de, el, en, eo, es, fa, fr, ga, gl, it, ja, km, nl, pl, pt, ro, ru, simple, sk, sl, sv, ta, tl, uk, zh. These correspond to the Wikipedia projects for which there is a supported language in LanguageTool. We always use the generic language code without specifying a variant (e.g. “en” instead of “en-US”). For Simple English Wikipedia (simplewiki) we use LanguageTool with “en”.

More documentation is available at: https://gitlab.wikimedia.org/repos/research/copyedit-api

Evaluation of LanguageTool

In order to evaluate the performance of LanguageTool in detecting errors, we need an annotated dataset with ground-truth errors. By comparing the predicted errors with the true errors, we can calculate precision- and recall-based performance metrics from the counts of true positives (predicted errors that are genuine errors), false positives (predicted errors that do not correspond to a genuine error), and false negatives (genuine errors that are not predicted).
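For reference, a minimal sketch of how these metrics are computed from the raw counts (F0.5 weighs precision twice as much as recall):

    def precision_recall_fbeta(tp: int, fp: int, fn: int, beta: float = 0.5):
        """Compute precision, recall and F_beta from raw error counts."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        b2 = beta ** 2
        fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
                 if (precision + recall) else 0.0)
        return precision, recall, fbeta

    # Example: the A.train / "en" row of the detection table below.
    print(precision_recall_fbeta(2338, 2045, 26734))
    # -> (0.5334..., 0.0804..., 0.2508...)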

The main limitation is that such ground-truth datasets are extremely rare, even more so beyond English or for Wikipedia articles specifically.

Benchmark corpus

One starting point is the NLP task of grammatical error correction, i.e. “the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors.” In the past, different benchmark datasets with ground-truth errors have been compiled to systematically compare approaches to grammatical error correction. However, most of these resources are available only for English.

We evaluate LanguageTool on the W&I benchmark data of the BEA19 shared task using ERRANT. W&I (Write & Improve) is an online web platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned them a CEFR level. Thus, we have annotated errors at three different levels: A (beginner), B (intermediate), C (advanced). My interpretation is that these classes contain errors of increasing complexity.

We then compare the errors from LanguageTool with the ground-truth errors in the benchmark data. For LanguageTool, we use language variants “en” and “en-US”. We evaluate on error detection (only detection) as well as error correction (detection + improvement).
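As an illustration, a minimal sketch of how ERRANT extracts edits between an original and a corrected sentence (assumes the errant package and its English spaCy model are installed; the sentences are made up):

    import errant

    annotator = errant.load("en")
    orig = annotator.parse("This are a example sentence .")
    cor = annotator.parse("This is an example sentence .")

    # Each edit has token spans, the corrected string, and an error type.
    for e in annotator.annotate(orig, cor):
        print(e.o_str, "->", e.c_str, f"({e.type})")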

Evaluation on Error Detection

data     #sents   LT-lang   #TP     #FP     #FN      Prec.    Rec.     F0.5
A.train  10,880   en        2,338   2,045   26,734   0.5334   0.0804   0.2508
A.train  10,880   en-US     4,108   3,200   24,964   0.5621   0.1413   0.3523
B.train  13,202   en        1,363   1,954   22,854   0.4109   0.0563   0.1818
B.train  13,202   en-US     2,586   3,335   21,631   0.4368   0.1068   0.2699
C.train  10,667   en        516     1,362   9,140    0.2748   0.0534   0.1503
C.train  10,667   en-US     924     2,436   8,732    0.275    0.0957   0.2

Evaluation on Error Correction

data     #sents   LT-lang   #TP     #FP     #FN      Prec.    Rec.     F0.5
A.train  10,880   en        1,898   2,481   26,264   0.4334   0.0674   0.2078
A.train  10,880   en-US     2,873   4,431   25,289   0.3933   0.102    0.2504
B.train  13,202   en        1,175   2,136   22,490   0.3549   0.0497   0.1592
B.train  13,202   en-US     1,911   4,004   21,754   0.3231   0.0808   0.2019
C.train  10,667   en        461     1,415   9,017    0.2457   0.0486   0.1357
C.train  10,667   en-US     739     2,619   8,739    0.2201   0.078    0.1613

Summary:

  • Error detection yields a precision of around 55% on the easy corpus (A.train). The difference between language variants is small (53% for en and 56% for en-US).
  • Error detection yields a recall between 8% (en) and 14% (en-US). The en-US language variant is more sensitive, capturing more errors. This means that LanguageTool misses a large fraction of the errors; in absolute numbers, however, it still detects thousands of them.
  • The number of correctly detected errors decreases for the medium (B.train) and hard (C.train) corpora.
  • Error correction is a much harder problem; however, the precision is still around 40% for the easy corpus (A.train).

Wikipedia (English)

We would like to understand how the results from the benchmark corpora generalize when applied to Wikipedia. However, evaluating LanguageTool on Wikipedia articles is more challenging: we do not have a ground-truth dataset of articles with a complete annotation of all grammatical errors. Thus, we cannot simply repeat the analysis from above.

Therefore, we will do an approximation by using annotations on the article level (instead of each single error):

  • Featured articles are considered to be some of the best articles Wikipedia has to offer. As a rough approximation, we assume that these articles are free of errors (as viewed by Wikipedia’s editors); thus, we consider any error found here a false positive. For enwiki, we find 6,090 featured articles with 1,192,369 sentences.
  • Articles with a copyedit-template. Wikipedia’s editors add this template to the top of an article to indicate that it “may require copy editing for grammar, style, cohesion, tone, or spelling.” We assume that these articles have a higher chance of yielding errors. For enwiki, we find 1,024 articles with the copyedit-template, containing 104,403 sentences. (A sketch of how such article sets can be collected follows below.)
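For enwiki, one possible way to collect these article sets is via their tracking categories through the MediaWiki API; a hedged sketch (category names as on enwiki):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def category_members(category, limit=500):
        """Return (up to `limit`) main-namespace pages in a category."""
        params = {"action": "query", "list": "categorymembers",
                  "cmtitle": category, "cmnamespace": 0,
                  "cmlimit": limit, "format": "json"}
        resp = requests.get(API, params=params)
        resp.raise_for_status()
        return [m["title"] for m in resp.json()["query"]["categorymembers"]]

    featured = category_members("Category:Featured articles")
    copyedit = category_members("Category:All articles needing copy edit")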

Running LanguageTool, we get the following statistics:

  • featured_en: 0.06 errors per sentence
  • featured_en-US: 0.792 errors per sentence
  • copyedit-template_en: 0.126 errors per sentence

Summary:

  • How many false positives are there?
    • Using LanguageTool with the language variant “en-US” on featured articles yields an extremely high number of false positives in Wikipedia articles: on average, almost every sentence produces one. This is consistent with qualitative observations when using the browser interface of LanguageTool; in fact, the default language variant in the browser version is “en-US”.
    • Using the language variant “en” on featured articles substantially reduces the occurrence of false positives, by more than 10-fold, to only about 1 false positive every 15 sentences.
  • How precise are the errors highlighted by LanguageTool?
    • Using the language variant “en” on copyedit-template articles, we find a higher rate of errors (0.126 per sentence) than for featured articles (0.06 per sentence). Assuming that the errors in featured articles correspond to a baseline rate of false positives in all articles, we can approximate the precision by subtracting the baseline rate. Thus, we would have 0.126-0.06=0.066 genuine errors per sentence, corresponding to a precision of 0.066/(0.06+0.066)=0.524 (see the sketch after this list).
    • This value for the precision is consistent with the findings on the benchmark corpora (52% vs 53%).
    • The value for the precision is likely a lower bound (in reality it is higher), since we assume that all errors found in featured articles are false positives when, in fact, some of them might be genuine.
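A minimal sketch of this back-of-the-envelope calculation:

    def approx_precision(rate_template: float, rate_featured: float) -> float:
        """Approximate precision, treating the featured-article error rate
        as the baseline rate of false positives."""
        genuine_rate = rate_template - rate_featured
        return genuine_rate / rate_template

    # enwiki with language variant "en":
    print(approx_precision(0.126, 0.06))  # -> 0.5238...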


Error types

We can also look at the types of errors LanguageTool detects in Wikipedia articles.

  • What stands out is the large fraction of “misspelling” errors for featured articles when using “en-US”. One interpretation is that the “misspelling” rule is a main driver of false positives in Wikipedia articles. (A sketch for tallying error types follows below.)
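A small sketch for tallying error types from the JSON matches returned by /v2/check (field names as in LanguageTool's API):

    from collections import Counter

    def tally_error_types(matches):
        """Count LanguageTool matches by rule category and by issue type."""
        by_category = Counter(m["rule"]["category"]["name"] for m in matches)
        by_issue_type = Counter(m["rule"]["issueType"] for m in matches)
        return by_category, by_issue_type
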
Figure: Fraction of error rule types from evaluating LanguageTool on enwiki.
Figure: Fraction of error categories from evaluating LanguageTool on enwiki.

Wikipedia (non-English)

We compare the error rate from LanguageTool in articles with the featured article badge (Q17437796) against articles containing the copyedit-template (Q6292692) in the corresponding language.

wiki_db language-code featured_n-art featured_n-sent featured_n-err template_n-art template_n-sent template_n-err featured_err-per-sent template_err-per-sent prec
enwiki en 6090 1192321 71574 1024 104403 13197 0.06 0.126 0.525
simplewiki en 30 4926 286 15 415 66 0.058 0.159 0.635
arwiki ar 692 154990 310459 512 22594 58990 2.003 2.611 0.233
astwiki ast 325 71918 324711 868 38430 169848 4.515 4.42 0
bewiki be 88 38043 65063 675 36571 61446 1.71 1.68 0
brwiki br 2 223 448 0 0 0 2.009 - -
cawiki ca 764 145185 155363 8 397 538 1.07 1.355 0.21
dawiki da 17 5967 5077 14 819 676 0.851 0.825 0
dewiki de 2730 935452 102807 0 0 0 0.11 - -
elwiki el 129 30611 44224 0 0 0 1.445 - -
eowiki eo 311 70371 136043 0 0 0 1.933 - -
eswiki es 1235 350673 425496 1547 99005 159589 1.213 1.612 0.247
fawiki fa 198 53013 3358 16 1024 141 0.063 0.138 0.54
frwiki fr 2019 679560 749826 0 0 0 1.103 - -
gawiki ga 2 509 1433 0 0 0 2.815 - -
glwiki gl 218 59451 112419 209 11371 21333 1.891 1.876 0
itwiki it 536 124571 207444 720 53209 106962 1.665 2.01 0.172
jawiki ja 92 30542 399 0 0 0 0.013 - -
kmwiki km 21 930 28741 6 53 1506 30.904 28.415 0
nlwiki nl 365 115060 87518 0 0 0 0.761 - -
plwiki pl 944 268900 220568 1 22 27 0.82 1.227 0.332
ptwiki pt 1315 328326 190550 1346 43865 32652 0.58 0.744 0.22
rowiki ro 196 58467 70636 256 15335 28486 1.208 1.858 0.35
ruwiki ru 1627 651035 480016 11 1157 304 0.737 0.263 0
skwiki sk 73 20324 24104 0 0 0 1.186 - -
slwiki sl 381 86241 123624 212 11198 16979 1.433 1.516 0.055
svwiki sv 354 79217 81458 278 11898 16231 1.028 1.364 0.246
tawiki ta 14 3160 787 2 179 161 0.249 0.899 0.723
tlwiki tl 29 3274 13667 3 155 930 4.174 6 0.304
ukwiki uk 233 68208 40630 883 64783 42342 0.596 0.654 0.089
zhwiki zh 929 154844 15396 1761 57344 7259 0.099 0.127 0.215


Summary:

  • The error rates in enwiki and simplewiki are consistent.
  • In most languages, the error rate in articles with the copyedit-template is indeed higher than in featured articles.
  • However, for most languages the precision is below 0.5, and for some languages the error rate in featured articles is similar to that in articles with the copyedit-template.

Takeaway:

  • These results suggest that we should add a post-processing step that filters out some errors, such as certain error types (e.g. spelling) or errors in certain text regions (e.g. the anchor text of links).

Filtering errors

We can now use additional post-processing to filter out certain errors, and use the above evaluation protocol to assess whether this strategy improves the precision. The idea is to design a filter that removes errors that are false positives (those in the featured articles) but keeps as many as possible of those that are genuine (those in the articles with copyedit-templates).

As a first naive attempt, we use the annotations of the text contained in the HTML of the article:

  • we keep track of all substrings that carry any annotation (such as text that is bold, italic, a link, etc.)
  • we filter out an error from LanguageTool if: i) the position of the error overlaps with any of the annotated substrings; or ii) the string of the error matches any of the annotated substrings (e.g., an entity name that also appears elsewhere in the article without markup). A sketch of this heuristic follows below.
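A rough sketch of this heuristic, assuming each error carries the character offset and length reported by LanguageTool:

    def filter_errors(errors, annotated_spans, text):
        """Drop errors that overlap an annotated span or whose text repeats
        an annotated substring (e.g. an entity name that is linked elsewhere)."""
        annotated_strings = {text[s:e] for s, e in annotated_spans}
        kept = []
        for err in errors:
            start, end = err["offset"], err["offset"] + err["length"]
            overlaps = any(s < end and start < e for s, e in annotated_spans)
            repeats = text[start:end] in annotated_strings
            if not (overlaps or repeats):
                kept.append(err)
        return kept
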
wiki_db language-code featured_err-per-sent featured_err-per-sent-filter template_err-per-sent template_err-per-sent-filter prec prec-filter prec-change-ppt
enwiki en 0.06 0.037 0.126 0.072 0.525 0.489 -0.036
simplewiki en 0.058 0.037 0.159 0.133 0.635 0.724 0.089
arwiki ar 2.003 1.01 2.611 1.785 0.233 0.434 0.201
astwiki ast 4.515 1.572 4.42 1.794 0 0.124 0.124
bewiki be 1.71 0.557 1.68 1.039 0 0.464 0.464
brwiki br 2.009 0.377 - - - - -
cawiki ca 1.07 0.291 1.355 0.554 0.21 0.475 0.265
dawiki da 0.851 0.242 0.825 0.336 0 0.28 0.28
dewiki de 0.11 0.042 - - - - -
elwiki el 1.445 0.564 - - - - -
eowiki eo 1.933 0.647 - - - - -
eswiki es 1.213 0.243 1.612 0.624 0.247 0.611 0.363
fawiki fa 0.063 0.014 0.138 0.057 0.54 0.757 0.217
frwiki fr 1.103 0.196 - - - - -
gawiki ga 2.815 1.525 - - - - -
glwiki gl 1.891 0.43 1.876 0.865 0 0.503 0.503
itwiki it 1.665 0.421 2.01 0.916 0.172 0.541 0.369
jawiki ja 0.013 0.012 - - - - -
kmwiki km 30.904 16.081 28.415 18.528 0 0.132 0.132
nlwiki nl 0.761 0.257 - - - - -
plwiki pl 0.82 0.284 1.227 0.591 0.332 0.519 0.187
ptwiki pt 0.58 0.314 0.744 0.461 0.22 0.318 0.098
rowiki ro 1.208 0.287 1.858 1.127 0.35 0.745 0.396
ruwiki ru 0.737 0.31 0.263 0.156 0 0 0
skwiki sk 1.186 0.511 - - - - -
slwiki sl 1.433 0.536 1.516 0.97 0.055 0.447 0.393
svwiki sv 1.028 0.293 1.364 0.487 0.246 0.398 0.152
tawiki ta 0.249 0.168 0.899 0.782 0.723 0.785 0.062
tlwiki tl 4.174 2.009 6 2.142 0.304 0.062 -0.242
ukwiki uk 0.596 0.248 0.654 0.388 0.089 0.362 0.274
zhwiki zh 0.099 0.062 0.127 0.097 0.215 0.355 0.141

Summary:

  • The post-processing step of filtering errors substantially improves the precision for almost all wikis (i.e. it filters relatively more errors in the featured articles than in the copyedit-template articles)
  • Other more nuanced filters could lead to further improvements in the precision of LanguageTool.

Takeaways from the evaluation

  • We can apply LanguageTool to at least 30 Wikipedias by running our own instance. Checking the text of Wikipedia articles requires some preprocessing (e.g. to extract only the raw text and avoid transcluded content from templates) and post-processing to filter out some errors (e.g. to avoid correcting the anchor text of links).
  • LanguageTool can detect a high volume of copyedit errors beyond simple misspellings based on a dictionary lookup.
  • We estimate the precision of LanguageTool's errors for English to be around 50% (or higher).
  • The concern about a large number of false positives can be mitigated by using the generic language variant (e.g. “en” instead of “en-US”).
  • Applying additional filtering of errors can substantially improve the precision of LanguageTool in detecting errors for almost all wikis.

Comparison to spell-checker

In this section, we look at how spellcheckers perform on the same tasks as above. This gives us a good sense of how LanguageTool compares to much simpler spellchecking tools. Specifically, I used the Enchant spell-checking library, which provides uniform access to spellcheckers in different languages via Python. From the projects considered in the evaluation of LanguageTool, I readily found spellcheckers for a subset of projects in ‘enchant.list_languages()’ (though many more languages can be installed): enwiki (en-US), simplewiki (en-US), arwiki (ar), cawiki (ca), dewiki (de_DE), elwiki (el), eswiki (es), fawiki (fa), frwiki (fr), glwiki (gl_ES), itwiki (it_IT), nlwiki (nl), plwiki (pl), ptwiki (pt_BR), rowiki (ro), ruwiki (ru_RU), svwiki (sv), ukwiki (uk).
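A minimal sketch of the Enchant usage (word-level checks; splitting article text into words is a separate preprocessing step):

    import enchant

    print(enchant.list_languages())  # spellchecker dictionaries installed locally

    d = enchant.Dict("en_US")
    for word in ["Wikipedia", "speling"]:
        if not d.check(word):
            print(word, "->", d.suggest(word)[:3])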

Benchmark corpus

Evaluation on Error Detection

data     #sents   lang     #TP     #FP     #FN      Prec.    Rec.     F0.5
A.train  10,880   en_GB    1,878   1,471   27,194   0.5608   0.0646   0.2211
A.train  10,880   en_US    1,925   1,720   27,147   0.5281   0.0662   0.2205
B.train  13,202   en_GB    1,249   1,764   22,968   0.4145   0.0516   0.1722
B.train  13,202   en_US    1,312   2,119   22,905   0.3824   0.0542   0.1729
C.train  10,667   en_GB    423     1,288   9,233    0.2472   0.0438   0.1282
C.train  10,667   en_US    460     1,692   9,196    0.2138   0.0476   0.1259

Evaluation on Error Correction

data     #sents   lang     #TP     #FP     #FN      Prec.    Rec.     F0.5
A.train  10,880   en_GB    872     2,477   27,290   0.2604   0.031    0.1049
A.train  10,880   en_US    897     2,748   27,265   0.2461   0.0319   0.1049
B.train  13,202   en_GB    683     2,330   22,982   0.2267   0.0289   0.0956
B.train  13,202   en_US    685     2,746   22,980   0.1997   0.0289   0.0916
C.train  10,667   en_GB    271     1,440   9,207    0.1584   0.0286   0.083
C.train  10,667   en_US    276     1,876   9,202    0.1283   0.0291   0.0763


Summary:

  • There is little difference between using the en_US or en_GB spellchecker
  • The performance in error detection is comparable to that of LanguageTool
  • The performance in error correction is only about half as good as that of LanguageTool (both in terms of precision and recall)

Wikipedia

wiki_db language-code featured_n-art featured_n-sent featured_n-err template_n-art template_n-sent template_n-err featured_err-per-sent template_err-per-sent prec
enwiki en_US 6090 1235144 1221727 1024 108060 148391 0.989 1.373 0.280
simplewiki en_US 30 5045 1714 15 435 675 0.340 1.552 0.781
arwiki ar 692 173033 1593977 512 22594 136618 9.212 6.047 0.000
cawiki ca 764 145185 182387 8 397 535 1.256 1.348 0.068
dewiki de_DE 2730 935452 1390710 0 0 0 1.487 - -
elwiki el 129 30611 45876 0 0 0 1.499 - -
eswiki es 1235 350673 721562 1547 99005 238857 2.058 2.413 0.147
fawiki fa 198 53013 205747 16 1024 3370 3.881 3.291 0.000
frwiki fr 2019 679560 2332701 0 0 0 3.433 - -
glwiki gl_ES 218 59451 57637 209 11371 11756 0.969 1.034 0.062
itwiki it_IT 536 124571 220426 720 53209 117756 1.769 2.213 0.200
nlwiki nl 365 115060 109633 0 0 0 0.953 - -
plwiki pl 944 268900 198679 1 22 29 0.739 1.318 0.439
ptwiki pt_BR 1315 328326 433376 1346 43865 85232 1.320 1.943 0.321
rowiki ro 196 58467 70200 256 15335 27174 1.201 1.772 0.322
ruwiki ru_RU 1627 651035 1069748 11 1157 845 1.643 0.730 0.000
svwiki sv 354 79217 146754 278 11898 26267 1.853 2.208 0.161
ukwiki uk 233 68208 75141 883 64783 72344 1.102 1.117 0.013

Summary:

  • For featured articles, the error rate from the spellcheckers is much higher than for LanguageTool (e.g. in enwiki we find about 1 error per sentence, compared to 0.06 errors per sentence with LanguageTool). In general, this translates into lower precision for spellcheckers.

Filtering errors

wiki_db language-code featured_err-per-sent featured_err-per-sent-filter template_err-per-sent template_err-per-sent-filter prec prec-filter prec-change-ppt
enwiki en_US 0.989 0.253 1.373 0.521 0.280 0.514 0.235
simplewiki en_US 0.340 0.079 1.552 0.563 0.781 0.859 0.078
arwiki ar 9.212 6.703 6.047 5.007 0.000 0.000 0.000
cawiki ca 1.256 0.268 1.348 0.411 0.068 0.348 0.280
dewiki de_DE 1.487 0.428 - - - - -
elwiki el 1.499 0.600 - - - - -
eswiki es 2.058 0.303 2.413 0.779 0.147 0.611 0.464
fawiki fa 3.881 1.735 3.291 2.102 0.000 0.174 0.174
frwiki fr 3.433 0.680 - - - - -
glwiki gl_ES 0.969 0.182 1.034 0.487 0.062 0.627 0.565
itwiki it_IT 1.769 0.378 2.213 0.929 0.200 0.593 0.393
nlwiki nl 0.953 0.201 - - - - -
plwiki pl 0.739 0.222 1.318 0.591 0.439 0.624 0.185
ptwiki pt_BR 1.320 0.243 1.943 0.618 0.321 0.608 0.287
rowiki ro 1.201 0.257 1.772 1.059 0.322 0.757 0.435
ruwiki ru_RU 1.643 0.465 0.730 0.303 0.000 0.000 0.000
svwiki sv 1.853 0.582 2.208 0.801 0.161 0.273 0.112
ukwiki uk 1.102 0.363 1.117 0.620 0.013 0.415 0.402

Summary:

  • The filtering of errors substantially increases the precision of spellcheckers. The resulting approximated precision after filtering is comparable to that of LanguageTool, but still remains systematically lower.

Takeaways

  • Spellcheckers can also detect and surface many meaningful errors for copyediting.
  • Spellcheckers seem to suffer from a much higher rate of false positives than LanguageTool. This can be partially addressed by imposing aggressive post-processing filters on the surfaced errors.
  • LanguageTool has a clear advantage in suggesting the correct improvement (spellcheckers perform substantially worse in error correction).
  • Since spellcheckers are available in many languages, they can serve as a backup solution for languages that are not supported by LanguageTool.
  • Given the higher rate of false positives when using spellcheckers, it would be desirable to develop a model that assigns a confidence score to the surfaced errors, so that the structured task could prioritize surfacing errors with a high confidence score.