Grants:IEG/Proofreading semiautomatically the Catalan Wikipedia with LanguageTool/Timeline

Timeline for Proofreading semiautomatically the Catalan Wikipedia with LanguageTool

Timeline | Date
Filter sentences so that text in other languages (quotations, titles, bibliography...) or with too many errors (old or non-standard language) is automatically discarded, so that human supervision of the results can be sped up significantly. | 31 March 2016
Document the proofreading process clearly, in plain words and with examples, so that other people can use it in other languages. | 30 May 2016
Find someone to help test the project in another language. | 30 May 2016
Test on another Wikipedia. | 30 June 2016
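
The "too many errors" filter in the first milestone can be pictured as a threshold on error density. The following is only a minimal sketch, not the project's actual script: the function names, the 0.25 threshold, and the assumption that the number of rule matches per sentence is already known from the analysis are all hypothetical.

```python
import re


def error_density(sentence: str, n_matches: int) -> float:
    """Return rule matches per word; 0.0 for a sentence with no words."""
    words = re.findall(r"\w+", sentence)
    return n_matches / len(words) if words else 0.0


def keep_for_review(sentence: str, n_matches: int, max_density: float = 0.25) -> bool:
    """Keep a sentence for human supervision only if it has at least one
    match and its error density does not exceed max_density.

    max_density is an illustrative value, not one taken from the project.
    """
    return n_matches > 0 and error_density(sentence, n_matches) <= max_density
```

A sentence with one match in eight words would be kept for review, while a four-word sentence with three matches (typical of old or non-standard spelling) would be discarded without human inspection.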

Monthly updates


Tasks accomplished:

  • Opened an account in Wikimedia Tool Labs.
  • Made a fork of LanguageTool code for experimenting with improvements.
  • Made a first analysis of the whole Catalan Wikipedia with LanguageTool. It takes around 30 hours in Wikimedia Tool Labs, which is good enough for the purpose of this project.
  • Ported some scripts from Perl to Pywikibot.
  • Made around 100,000 edits using part of the results. (A better way of monitoring the statistics is needed.)


Tasks accomplished:

  • Around 50,000 edits done.
  • More scripts have been translated from Perl to Python.


Tasks accomplished:

  • A new analysis of the whole Catalan Wikipedia has been made. The servers in Wikimedia Tool Labs were very unstable for this task, so I used an AWS server for a few hours.
  • Around 50,000 edits done.
  • Some experiments have been done on discarding sentences in other languages. So far the results are not helpful. In the Catalan Wikipedia there are many quotations and bibliography entries in Spanish, and it is very costly to distinguish automatically between the two languages.
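
To illustrate why separating Catalan from Spanish is hard, here is a naive function-word heuristic. It is a sketch only and not the method tried in the project: the word lists and the `guess_language` function are assumptions chosen for illustration.

```python
import re

# Small, hypothetical lists of function words that differ between the languages.
CATALAN_HINTS = {"i", "amb", "els", "les", "és", "dels", "però", "aquest"}
SPANISH_HINTS = {"y", "con", "los", "las", "pero", "muy", "este", "fue"}


def guess_language(sentence: str) -> str:
    """Return "ca", "es", or "unknown" by counting distinctive function words."""
    words = re.findall(r"\w+", sentence.lower())
    ca = sum(w in CATALAN_HINTS for w in words)
    es = sum(w in SPANISH_HINTS for w in words)
    if ca > es:
        return "ca"
    if es > ca:
        return "es"
    return "unknown"
```

Full sentences usually contain enough function words to be classified, but short titles and bibliography entries often contain none, so the heuristic returns "unknown" for exactly the fragments that most need filtering.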


Tasks accomplished:

  • Third analysis of the whole Catalan Wikipedia, run on AWS. The decrease in detected linguistic errors can be seen in the summary of the results. The second analysis showed no decrease because the first analysis (done in Tool Labs) turned out to have been truncated.
  • Around 40,000 edits done.


  • Created documentation.
  • The scripts have been adapted so they can be used in other languages.
  • A test has been run on the Spanish Wikipedia. About 1,000 edits have been made in Spanish, and a bot flag has been requested on the Spanish Wikipedia. This is an example of a supervision file in Spanish.
