Learning patterns/Proofreading large amounts of text
What problem does this solve?
Catalan Wikipedia needed a thorough linguistic review (spelling and grammar), which was a daunting task. With the help of the proofreading software LanguageTool, we have made some progress. The most important lesson we have learned in the process may seem trivial, but the more strictly you follow it, the better.
What is the solution?
The errors should be classified depending on the kind of supervision they need.
- Errors that can always be corrected automatically. Of course, you must be absolutely sure that applying the correction is always safe. This means you must not change words in other languages (in Catalan, we have to take special care with words in Spanish, Portuguese, French and Italian) or in non-standard language (old or dialectal).
- Errors that need supervision. Looking at a few words around the error is enough to know whether the correction is appropriate.
- Errors that need very careful supervision. You probably need to read the whole paragraph or even the whole article. For example, in Catalan, "hivernar/hibernar".
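The three classes above can be sketched as a small triage step that runs on the checker's output. This is only an illustration under stated assumptions: the `Match` structure, the rule identifiers in `RULE_TIERS`, and the function names are hypothetical and do not come from LanguageTool or from the scripts used for Catalan Wikipedia.

```python
from dataclasses import dataclass
from enum import Enum


class Supervision(Enum):
    AUTOMATIC = 1  # always safe to apply without review
    CONTEXT = 2    # a reviewer checks a few surrounding words
    CAREFUL = 3    # a reviewer reads the paragraph or the article


@dataclass
class Match:
    rule_id: str      # hypothetical rule identifier
    offset: int       # start of the flagged span in the text
    length: int       # length of the flagged span
    replacement: str  # suggested correction


# Hypothetical rule-to-class table; real rule IDs would come from the checker.
RULE_TIERS = {
    "DOUBLE_SPACE": Supervision.AUTOMATIC,
    "ACCENT_CHOICE": Supervision.CONTEXT,
    "HIVERNAR_HIBERNAR": Supervision.CAREFUL,
}


def classify(match):
    # Unknown rules fall into the strictest class, so nothing is
    # auto-corrected unless it was explicitly marked as safe.
    return RULE_TIERS.get(match.rule_id, Supervision.CAREFUL)


def triage(matches):
    # Split matches into one review queue per supervision class.
    queues = {tier: [] for tier in Supervision}
    for m in matches:
        queues[classify(m)].append(m)
    return queues


def apply_automatic(text, matches):
    # Apply only the AUTOMATIC corrections, from right to left so that
    # earlier offsets remain valid after each replacement.
    auto = [m for m in matches if classify(m) is Supervision.AUTOMATIC]
    for m in sorted(auto, key=lambda m: m.offset, reverse=True):
        text = text[:m.offset] + m.replacement + text[m.offset + m.length:]
    return text
```

Defaulting unknown rules to the strictest class follows the advice above: a correction is applied automatically only when it has been explicitly judged always safe.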
Note also that some simple errors can be found directly in the online Wikipedia, whereas errors that need a full morphosyntactic analysis have to be looked for in the Wikipedia dump.
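As a rough illustration of working from the dump, the sketch below streams pages out of a MediaWiki XML export so that each article's text can be handed to a checker. It assumes the usual export layout (`page` elements containing a `title` and a `revision`/`text`); the function names and the inline sample are hypothetical, and a real dump would be opened from a file rather than from bytes in memory.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop the "{namespace}" prefix that ElementTree adds to tag names.
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    # Yield (title, wikitext) pairs one page at a time, without loading
    # the whole dump into memory.
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title or "", text or ""
            title, text = None, None
            elem.clear()  # release the finished page's subtree


# Tiny made-up sample standing in for a real dump file.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Gat</title><revision><text>El gat dorm.</text></revision></page>"
    b"<page><title>Gos</title><revision><text>El gos corre.</text></revision></page>"
    b"</mediawiki>"
)

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Each yielded text could then be passed to the checker and the resulting matches sorted by the supervision they need.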
Things to consider
When to use
- IEG grant: Proofreading semiautomatically the Catalan Wikipedia with LanguageTool
- Some scripts (with documentation) used for proofreading the Catalan Wikipedia.