Research:Copyediting as a structured task/Literature review


Literature Review

Recommendations

  • Use simple spell-checkers to detect errors and suggest corrections (LanguageTool, or Enchant in case we want to cover more than the ~35 languages LanguageTool supports); these are the only options that can scale to a sufficient number of languages.
  • Apply a set of filters to decrease the sensitivity for raising errors in the context of Wikipedia articles (e.g. ignore linked entities, quotes, etc.). We want to increase precision in order to make sure that the suggested errors are truly errors (see the sketch after this list).
  • Develop a way to evaluate the accuracy of the raised errors. Most likely, this will require manual evaluation. There are challenges for automatic evaluation in terms of generating a ground-truth dataset that i) is representative of content in Wikipedia and ii) exists for several languages.
  • Long-term: Develop a model to highlight sentences that require editing (without necessarily suggesting a correction) based on copyediting templates. This could provide a set of more challenging copyediting tasks compared to spellchecking. This is also a more research-oriented project.
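To make the filtering recommendation concrete, here is a minimal sketch of such a precision filter, assuming the spellchecker reports character offsets; the regexes and the (offset, length, suggestion) error format are illustrative assumptions, not an existing API:

```python
# Minimal sketch: drop spellchecker hits that fall inside wikitext spans we
# do not want to flag (wikilinks, quotes, templates). All names illustrative.
import re

IGNORE_PATTERNS = [
    re.compile(r"\[\[.*?\]\]"),  # wikilinks, e.g. [[linked entity]]
    re.compile(r'".*?"'),        # quoted text
    re.compile(r"\{\{.*?\}\}"),  # templates
]

def filter_errors(text, errors):
    """Keep only errors (offset, length, suggestion) outside ignored spans."""
    ignored = [(m.start(), m.end())
               for p in IGNORE_PATTERNS for m in p.finditer(text)]
    return [(off, length, sugg) for off, length, sugg in errors
            if not any(start <= off < end for start, end in ignored)]
```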

Summary

There is a variety of approaches to detecting errors, from simple spelling to more advanced grammar correction to fancy AI-based writing aids. The main limitation is that the more advanced tools are usually only available for English and would require substantial effort (if possible at all) to apply to any other language. The main reason is that there is a severe lack of data to train and evaluate such models for essentially any language besides English (though the situation is not great in English either).

In the context of copyediting as a structured task for Wikimedia projects, there are not many options. Simpler tools for copyediting, such as a basic spellchecker (Enchant) or grammar-checker (LanguageTool), are the most promising (or only viable) candidates. In contrast, the added value of state-of-the-art machine-learning-based approaches remains unclear.

  • Basic spell-checkers can be used for many, if not (almost) all, of the language versions of Wikipedia.
  • Basic spell/grammar-checkers are transparent and open/free, in contrast to many of the modern tools using machine learning.
  • Despite the restriction to simpler types of errors, it is very likely that there are enough errors we can surface.
  • Corrections of these errors are valued by the community, which is reflected in the many different tools and projects that organize around such tasks. However, this raises a few concerns:
    • Do we need to build automatic algorithms to surface and correct these errors?
    • Is there potential for conflict if there are already communities dedicated to these types of errors?

The main challenge will be to evaluate any of the approaches (even the simple ones), for two reasons:

  • The general lack of annotated data (even for English), i.e. labeled sentences which contain specific errors
  • There will need to be some tailoring to the context of Wikipedia. Many of the community tools have developed a sometimes rather long set of heuristics to ignore certain errors highlighted by spellcheckers (such as everything in quotes, links, etc.). The main reason is that spell-checkers are often built to achieve high recall in order not to miss potential errors. However, in the context of structured tasks we would probably want to focus on high precision such that the errors we highlight are actually errors. Naturally, one cannot usually increase both precision and recall but needs to find a suitable trade-off (see the toy evaluation sketch below).
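As a toy illustration of the evaluation problem, here is a sketch of how manual judgments would translate into precision and recall estimates (all names illustrative):

```python
# Precision: of the errors we flagged, how many were real (manual review)?
# Recall: of all real errors in a ground-truth sample, how many did we flag?
def precision(judgments):
    """judgments: list of booleans, True if a flagged error was judged real."""
    return sum(judgments) / len(judgments)

def recall(n_true_positives, n_missed):
    """Requires ground truth, i.e. counting errors the checker did not flag."""
    return n_true_positives / (n_true_positives + n_missed)

# e.g. 80 of 100 flagged items judged real, 40 known errors missed:
# precision = 0.80, recall = 80 / (80 + 40) ≈ 0.67
```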

Detailed notes

Copyediting in general

What is copyediting? What are different aspects of copyediting?

Overall, the term copyediting is not very well defined. The Wikipedia page on proofreading mentions that “'copy editors' focus on a sentence-by-sentence analysis of the text to "clean it up" by improving grammar, spelling, punctuation, syntax, and structure.” It contains a good overview of the distinctions (and overlap) between editing, copyediting, proof-editing, and proofreading. Most of the tasks relevant to this project are already captured under proofreading, such as grammar, spelling, and punctuation.

A good introductory overview of grammar and spell-checking is given in Devopedia: Grammar and Spell-checker, covering, for example, common types of mistakes, a timeline of the development of different models, and pointers to common tools.

The simplest case of copyediting might be automatic spell-checking/correction. Some good overviews are Roger Mitton: Spellchecking by computer and Norvig: How to write a spelling corrector. Mitton mentions that in the real world recall is more important than precision since you don't want a false word to slip through; in our use case of structured tasks I think the opposite holds, in that precision is actually more important. It is believed that “Spelling mistakes constitute the largest share of errors in written text” (Jayanthi 2020). Tools for simple spell-checking are ubiquitous.

The main distinction can be made between non-word errors and real-word errors (see Kukich's "Techniques for automatically correcting words in text"[1], link to pdf). Non-word errors can be easily identified (and corrected) by looking up strings in a dictionary and flagging them as misspellings if no match is found. As such, they are easy because it is possible to treat the words in question in isolation. In contrast, real-word errors are a class of errors in which one correctly spelled word is substituted for another, such that isolated-word error detection does not work anymore (e.g. form vs from, or there vs their). In order to detect these errors, one needs information from the surrounding words; thus, it is a much harder problem. Most existing spelling correction techniques focus on isolated words (non-word errors), without taking into account any information that might be gleaned from the linguistic or textual context; for example, they do not account for typographic, cognitive, or grammatical errors that result in other valid words (real-word errors). It is not trivial to estimate the proportions of the error types (e.g. non-word vs real-word errors) due to the lack of automatic tools for detecting such errors; reported estimates say that 25-40% of misspellings were real-word errors. The toy sketch below illustrates the distinction.
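```python
# Toy sketch (the dictionary is obviously illustrative): dictionary lookup
# catches non-word errors ("teh") but passes real-word errors ("form" for
# "from", "there" for "their") unchanged.
DICTIONARY = {"the", "form", "from", "there", "their", "send", "letter"}

def non_word_errors(tokens):
    return [t for t in tokens if t.lower() not in DICTIONARY]

print(non_word_errors("send teh letter".split()))  # ['teh'] -- caught
print(non_word_errors("send form there".split()))  # []      -- missed
```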

Naturally, there are much more fine-grained classifications of errors (often rather qualitative; e.g. see the overview references). The ERRANT tool is a recently developed tool to automatically classify the different types of errors appearing in the NLP problem of grammatical error correction (spelling, punctuation, grammatical, and word choice errors; see more below): “The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework.” The taxonomy contains 25 main error types (see Ch. 5 of Christopher Bryant’s thesis), with SPELL being one of them.

The code for the tool is open-source and it has been used to annotate state-of-the-art benchmark datasets (such as the BEA-2019 Shared Task). However, the description states that the tool has been developed for English, and it remains unclear how well it works for other languages.
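For English, basic usage looks roughly as follows (a sketch following the ERRANT README; requires the errant package and an English spaCy model):

```python
import errant

annotator = errant.load("en")  # English is the officially supported language
orig = annotator.parse("This are a gramatical sentence .")
cor = annotator.parse("This is a grammatical sentence .")

for e in annotator.annotate(orig, cor):
    # e.type is an error category such as R:VERB:SVA or R:SPELL
    print(e.o_str, "->", e.c_str, e.type)
```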

Overall, there are the following classes of models to detect/correct errors (from Bryant’s “Automatic annotation of error types for grammatical error correction”[2]):

  • Rule-based: hard-coded rules. These capture some errors well, but more complex ones are not captured.
  • Statistical models: use large text corpora to measure the probability of observed sequences of text (n-grams). The intuition is that low-probability sequences are much more likely to be errors (a toy sketch follows this list).
  • Classifiers: build classifiers to identify specific error types.
  • Machine translation: treat error correction as a translation problem. Instead of translating a sentence from one language to another, one translates a sentence from ungrammatical to grammatical. Recently, this framework has been adapted using neural networks.
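As a toy illustration of the statistical idea (real systems use much larger corpora plus smoothing; everything here is illustrative):

```python
# Flag bigrams that are rare in a reference corpus as suspicious.
from collections import Counter

corpus = "i sent the letter i sent the form".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

def suspicious_bigrams(sentence, min_count=1):
    tokens = sentence.split()
    return [b for b in zip(tokens, tokens[1:]) if bigram_counts[b] < min_count]

print(suspicious_bigrams("i sent the from"))  # [('the', 'from')] flagged
```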

Copyediting in Wikipedia

What are existing approaches to copyediting in the Wikipedia-world?

The Growth Team has captured some notes on their early conversations around copyediting with community members.

There are guides in the Wikipedia-namespace:

  • Wikipedia:Basic copyediting. This is a guide for new copyeditors. For example, it points to external tools (Grammarly and LanguageTool) that can support copyediting.
  • Wikipedia:Spellchecking. This is a guide on how to do spell-checking in Wikipedia and contains a list of different spellcheckers.

There are several projects/initiatives dedicated to copyediting:

  • Wikipedia:Typo Team. This project focuses on correcting typos and misspellings in articles. It provides a set of tools for searching for typos and correcting them; there are also some reports and links for automatically creating lists of probable typos. One tool being used is the moss tool. Another effort is Wikipedia:Adopt-a-typo, in which users search for and correct specific typos.
  • Wikipedia:WikiProject_TypoScan. A project that uses the tool AutoWikiBrowser/Typos to automatically generate a list of thousands of articles with typos. Not sure if this is still active, as the last such table is from 2012.
  • Wikipedia:WikiProject_Grammar. This project provides a place where Wikipedians can ask about grammar, improve their grammar, or learn how to correct grammar in articles. It encourages the use of the {{copyedit}} template to mark articles that require work. The project page carries a header stating that it is believed to be inactive.
  • Wikipedia:WikiProject_Guild_of_Copy_Editors. This project is dedicated to improving the quality of writing in articles on the English Wikipedia. The project organizes its work using the {{copyedit}} (and related) templates. The work goes beyond typos and grammar to make articles more clear, correct, concise, comprehensible, and consistent. They also keep statistics around the backlog etc.

There are several templates for tracking work around copyediting:

There are different tools that are already being used:

There are several open Phabricator tickets around spell-checking etc.:

Summary

  • Existing approaches to spellchecking mostly use hand-curated lists or very simple heuristics to automatically screen for typos
  • LanguageTool has been requested previously, and there was a running instance on Wikimedia Cloud

Common tools for copyediting

What are the tools (outside of Wikipedia) that are commonly used for automatic copyediting tasks, such as spellchecking?

The most simple approach is manually curated lists of common spelling mistakes, such as Wikipedia:Lists_of_common_misspellings. These are highly specific, capturing only a small subset of mistakes, but for those probably very effective. They often exist in many different languages and are also commonly used by Wikipedia tools (some of the bots). There is a large variety of more or less simple spellcheckers. Below is a selection of some of the possible options:

  • Enchant. Probably the most promising spellchecker since it is a library (and command-line program) that wraps a number of different spelling libraries and programs with a consistent interface (hunspell, aspell, etc.). By using Enchant, you can use a wide range of spelling libraries, including some specialised for particular languages, without needing to program to each library's interface. Since it wraps different libraries, we can cover essentially any language (via hunspell, aspell, ispell, etc.); in particular, we can also use special dictionaries developed for agglutinating languages (e.g. Finnish). See the sketch after this list.
  • Hunspell is the spellchecker used by LibreOffice, Firefox, Chrome, etc. and is under a free license. It uses Unicode UTF-8 encoded dictionaries and supports many languages (e.g. see this list of dictionaries): English, German, French, Portuguese, Czech, Vietnamese, Arabic, Bengali, etc.
  • Aspell is also open/free (GNU) and supports many languages with their dictionaries (English, German, French, Portuguese, Arabic, Bengali, Czech, Vietnamese). It was designed to replace Ispell, and unlike the latter, Aspell can easily check UTF-8 documents without having to use a special dictionary. Ispell will only suggest corrections that are based on a Damerau–Levenshtein distance of 1; it will not attempt to guess more distant corrections based on English pronunciation rules.
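A minimal sketch of what using Enchant looks like from Python via the pyenchant bindings (assuming the relevant dictionary, e.g. hunspell's en_US, is installed):

```python
import enchant

d = enchant.Dict("en_US")  # any installed language code works, e.g. "fi_FI"
print(d.check("Helo"))     # False -> flagged as a non-word error
print(d.suggest("Helo"))   # suggestions, e.g. ['Hello', 'Helot', ...]
```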

There are more complex grammar-checking tools; the most useful seems to be LanguageTool. This is still a rule-based checker but goes beyond simple spelling mistakes and also highlights issues with grammar etc. The code is free/open-source. While it supports 25+ languages (such as English, German, French, Portuguese, and Arabic, but not Bengali, Czech, or Vietnamese), it does not seem straightforward to go beyond the listed languages. For automated requests, we need to set up our own instance, though this seems easy (a minimal usage sketch follows below). Interestingly, the wikicheck tool used to be a deployed instance running LanguageTool for Wikipedia articles. The official repo still contains code to run LanguageTool on a Wikipedia dump or on a page fetched from the Wikimedia API (github-link).
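A minimal sketch using the language_tool_python wrapper, which is one way to run a local instance (it downloads and starts a LanguageTool server in the background):

```python
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")
matches = tool.check("This are a example of an sentence with error .")
for m in matches:
    # each match carries the triggering rule, a message, and suggestions
    print(m.ruleId, "|", m.message, "|", m.replacements[:3])
tool.close()
```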

There are also machine-learning tools involving some form of training data. These models promise to take into account word surroundings and thus capture more complex spelling, grammar, or tone mistakes. However, there are major limitations: i) they are limited to few languages (often only English), ii) it is not transparent what they do, iii) they are often commercial products. Some examples:

  • Grammarly. Not transparent what it does. Only English, with no realistic plans to extend to other languages. The Growth team played around with it and has some notes.
  • JamSpell. “A Modern spellchecker which takes into account word surroundings”. It supports a few languages (en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no) but doesn't seem easy to use.
  • Other tools (not free/open):
    • Trinka. Focuses on academic writing; not open; only English.
    • Reverso. Spellchecker, but only for English and French. Not open.
    • Ginger. Grammar checker. Not open, not free. Only English.
    • Writefull. Aimed at students/researchers. Only English.
    • Whitesmoke. Only English; not free/open.

Summary: There seem to be three types of available tools:

  • Basic spellcheckers, which focus on simple errors and heuristics but have the advantage that they are easy to scale to many (if not all) languages.
  • LanguageTool, a grammar-checker that goes beyond simple spell-checking. It seems fairly simple to set up and use. It supports 30 languages or so, but it is not trivial to go beyond that.
  • Grammarly et al.: these and similar tools based on fancier technologies (neural networks etc.) are able to capture more complex aspects of copyediting. However, it is not clear whether this capacity is actually needed. There are major downsides, such as lack of transparency, and it does not seem easy to scale to many languages beyond English (if any).

Thus my recommendation would be to use simple spellcheckers, as they can work in most of the languages we want and should be able to generate enough samples to correct. In case more complex errors are needed, we could still augment some of the languages with LanguageTool.

NLP/ML Research on copyediting

Summary:

  • State-of-the-art models are not well-studied beyond English
  • There is generally a lack of training data (even for well-resourced languages such as English) to take full advantage of ML-based models


Grammatical Error Correction (GEC)

One of the main approaches to copyediting in NLP research is the task of grammatical error correction, which covers spelling, punctuation, grammatical, and word choice errors. Often this includes the correction itself (detection alone is a subtask called grammatical error detection). A good overview can be obtained from papers-with-code and NLP-progress. The task is often framed as a translation problem: translating an incorrect sentence into a correct one. Some review papers go into more depth about recent advances, in particular using neural networks [3][4][5].

Data. The task is evaluated on benchmark datasets (shared tasks) such as the BEA-2019 shared task [6] or the earlier CoNLL-2014 shared task [7]. One of the biggest challenges for GEC models, however, is data sparsity. Unlike other NLP tasks, such as speech recognition and machine translation, there is very limited training data available for GEC, even for high-resource languages like English. As a solution, researchers have created datasets from Wikipedia’s revision history, from GitHub, or by generating synthetic data (introducing errors via some model), though these are custom solutions not covering many languages. For English, the tool ERRANT allows annotating errors with one of 25 different error classes.

List of datasets (only English if not stated otherwise):

  • BEA-2019 shared task. Contains a few thousand sentences with errors from writers of different proficiency levels, e.g. English learners. Standardizes previous datasets and annotates error classes with ERRANT [6].
  • Wikipedia revision history. These papers generate large datasets from Wikipedia’s revision history using a range of heuristics (e.g. edits are small in Levenshtein distance; see the toy sketch after this list). Grundkiewicz, R., & Junczys-Dowmunt, M. (2014) [9] introduce the WikEd corpus with 12M sentences. The wikiedits code to generate the WikEd corpus currently supports English, Polish, and German. There are similar approaches for English [10], German (code) [11], and Japanese [12] (there are probably more). In principle, the approach could be applied to any language, though in practice it will probably require some fine-tuning in each case.
  • GitHub Typo Corpus [13]. This constructs a large corpus of typos/misspellings from GitHub commits. It contains data for 15 languages but mostly English, Japanese, and Chinese. The authors compare different spellcheckers (Enchant and Aspell) on these errors, focusing on different error classes (from ERRANT). For the SPELL type they report around precision = 0.55 and recall = 0.65.
  • cLang-8 [14]. English, German, and Russian training data based on the NAIST Lang-8 Learner Corpora.
  • C4_200M synthetic dataset. A new synthetic training dataset for GEC (only English).
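A toy sketch of the revision-history heuristic behind the WikEd-style corpora; the threshold and the difflib-based similarity are stand-ins for the Levenshtein criteria used in the papers:

```python
import difflib

def candidate_correction(old_sentence, new_sentence, min_ratio=0.9):
    """Treat a small, localized change between two revisions as a likely fix."""
    ratio = difflib.SequenceMatcher(None, old_sentence, new_sentence).ratio()
    return old_sentence != new_sentence and ratio >= min_ratio

print(candidate_correction("He recieved the prize.", "He received the prize."))  # True
print(candidate_correction("Old text.", "A completely rewritten paragraph."))    # False
```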


Models. Here I describe some of the recent models which were highlighted as yielding state-of-the-art performance for grammatical error correction. The main limitation, of course, is that they have mostly been tested only on English, and it remains unclear to which degree they can be extended to other languages. One bottleneck is already the training of these models, which requires a lot of data that is hard to come by even for English (and even harder for other languages).

List of models:

  • GECToR [15]. This model was developed by the Grammarly Research Team (see the blogpost). It uses state-of-the-art language models (BERT etc.) to improve performance on GEC benchmarks. The authors admit that “these systems are better suited to academic research than to real-world applications”. It also seems to be exclusively limited to English.
  • NeuSpell [16] is an open-source toolkit for context-sensitive spelling correction in English. It also uses state-of-the-art language models (BERT etc.). While performance for specific tasks in GEC might be improved, there is currently only support for English. Entry in huggingface.
  • gT5 [17]. Model for GEC developed as part of the cLang-8 data, though there is no code for the trained model.
  • GECwBert. A description of how to use language models to do GEC.
  • Textly-drf-API. All-in-one API for grammar correction, spell check, etc. No evaluation.
  • Gramformer. Gramformer is a library that exposes three separate interfaces to a family of algorithms to detect, highlight, and correct grammar errors. Entry in huggingface.


Grammatical Error Detection

This is a subtask of GEC where one is only interested in detecting errors, not in providing the correction (overview). For example, Bell, S., Yannakoudakis, H., & Rei, M. (2019) [18] use BERT-style models for GED by labeling each token (code); a schematic sketch follows below.
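Schematically, GED as token labeling looks as follows with the Hugging Face transformers API; the model path is a placeholder (Bell et al. train their own models), so this is a structural sketch rather than a working recipe:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL = "path/to/your-finetuned-ged-model"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

inputs = tokenizer("This are a sentence .", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, seq_len, 2)
labels = logits.argmax(-1)[0]        # label 1 = token flagged as erroneous
```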


Spelling correction.

This can be considered a subtask of GEC by only considering the specific error type of spelling. For example, using the ERRANT tool one can restrict to the SPELL type, as in the sketch below.
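Building on the ERRANT sketch above, the restriction is a simple filter on the edit type:

```python
import errant

annotator = errant.load("en")
orig = annotator.parse("A sentense with a speling error .")
cor = annotator.parse("A sentence with a spelling error .")

# ERRANT types look like "R:SPELL"; keep only the spelling edits
spelling_edits = [e for e in annotator.annotate(orig, cor)
                  if e.type.endswith("SPELL")]
print([(e.o_str, e.c_str) for e in spelling_edits])
```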

Data.

  • GEC-benchmark data with error-type=SPELL (e.g. from the ERRANT tool).
  • Spelling error corpora (usually list of misspelled words)
    • http://aspell.net/test/
    • Roger Mitton's Birkbeck spelling error corpus
    • Corpora of misspellings for download

Models.  

  • Off-the-shelf spellcheckers such as Enchant.
  • NeuSpell [16]. Test data from GEC focusing only on the SPELL type. Only English; however: “Following usage above, one can now seamlessly utilize multilingual models such as xlm-roberta-base, bert-base-multilingual-cased and distilbert-base-multilingual-cased on a non-English script.” A minimal usage sketch follows this list.
  • Whitelaw, C., Hutchinson, B., Chung, G., & Ellis, G. (2009) [19] use language models to do spellchecking, with a comparison against Aspell in English and German. The bottleneck is the construction of human evaluation data.
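Basic NeuSpell usage, a sketch following its README (the checkpoint download happens in from_pretrained):

```python
from neuspell import BertChecker

checker = BertChecker()
checker.from_pretrained()
print(checker.correct("I luk foward to receving your reply"))
# e.g. "I look forward to receiving your reply"
```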


Misspellings in Word embeddings

Here the task is not correcting the sentence itself, but making embeddings (vector representations of the words/sentences) resilient to misspellings. For example, Edizel et al. (2019) [20] correct sentence/word embeddings for misspellings (in order to avoid out-of-vocabulary words). Thus, this is only indirectly related to copyediting in the structured-task framework.

References

  1. Kukich, K. (1992). Techniques for automatically correcting words in text. In ACM Computing Surveys (Vol. 24, Issue 4, pp. 377–439). https://doi.org/10.1145/146370.146380 pdf
  2. Bryant, C. (2019). Automatic annotation of error types for grammatical error correction. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html
  3. Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2014). Automated Grammatical Error Detection for Language Learners, Second Edition. In Synthesis Lectures on Human Language Technologies (Vol. 7, Issue 1, pp. 1–170). https://doi.org/10.2200/s00562ed1v01y201401hlt025
  4. Naghshnejad, M., Joshi, T., & Nair, V. N. (2020). Recent Trends in the Use of Deep Learning Models for Grammar Error Handling. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2009.02358
  5. Wang, Y., Wang, Y., Liu, J., & Liu, Z. (2020). A Comprehensive Survey of Grammar Error Correction. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2005.06600
  6. Bryant, C., Felice, M., Andersen, Ø. E., & Briscoe, T. (2019). The BEA-2019 Shared Task on Grammatical Error Correction. Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 52–75. https://doi.org/10.18653/v1/W19-4406
  7. Ng, H. T., Wu, S. M., Briscoe, T., Hadiwinoto, C., Susanto, R. H., & Bryant, C. (2014). The CoNLL-2014 shared task on grammatical error correction. Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, Baltimore, Maryland. https://doi.org/10.3115/v1/w14-1701
  8. Napoles, C., Sakaguchi, K., & Tetreault, J. (2017). JFLEG: A fluency corpus and benchmark for grammatical error correction. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain. https://doi.org/10.18653/v1/e17-2037
  9. Grundkiewicz, R., & Junczys-Dowmunt, M. (2014). The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction. In Advances in Natural Language Processing (pp. 478–490). https://doi.org/10.1007/978-3-319-10888-9_47
  10. Lichtarge, J., Alberti, C., Kumar, S., Shazeer, N., Parmar, N., & Tong, S. (2019). Corpora Generation for Grammatical Error Correction. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1904.05780
  11. Boyd, A. (2018). Using Wikipedia edits in low resource grammatical error correction. Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-Generated Text, 79–84. https://www.aclweb.org/anthology/W18-6111.pdf
  12. Tanaka, Y., Murawaki, Y., Kawahara, D., & Kurohashi, S. (2020). Building a Japanese Typo Dataset from Wikipedia’s Revision History. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 230–236. https://doi.org/10.18653/v1/2020.acl-srw.31
  13. Hagiwara, M., & Mita, M. (2019). GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1911.12893
  14. Rothe, S., Mallinson, J., Malmi, E., Krause, S., & Severyn, A. (2021). A Simple Recipe for Multilingual Grammatical Error Correction. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2106.03830
  15. Omelianchuk, K., Atrasevych, V., Chernodub, A., & Skurzhanskyi, O. (2020). GECToR -- Grammatical Error Correction: Tag, Not Rewrite. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2005.12592
  16. Jayanthi, S. M., Pruthi, D., & Neubig, G. (2020). NeuSpell: A Neural Spelling Correction Toolkit. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online. https://doi.org/10.18653/v1/2020.emnlp-demos.21
  17. Rothe, S., Mallinson, J., Malmi, E., Krause, S., & Severyn, A. (2021). A Simple Recipe for Multilingual Grammatical Error Correction. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2106.03830
  18. Bell, S., Yannakoudakis, H., & Rei, M. (2019). Context is Key: Grammatical Error Detection with Contextual Word Representations. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1906.06593
  19. Whitelaw, C., Hutchinson, B., Chung, G., & Ellis, G. (2009). Using the web for language independent spellchecking and autocorrection. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 890–899. https://www.aclweb.org/anthology/D09-1093.pdf
  20. Edizel, B., Piktus, A., Bojanowski, P., Ferreira, R., Grave, E., & Silvestri, F. (2019). Misspelling Oblivious Word Embeddings. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1905.09755