Grants:IEG/Proofreading semiautomatically the Catalan Wikipedia with LanguageTool/Midpoint
Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.
- The project code has been rewritten and improved.
- The process of supervising corrections is making good progress.
- New ideas have arisen that can be tested in the coming months.
Methods and activities
An account in Wikimedia Tool Labs was opened and these programmes have been installed:
- Pywikibot library
- A modified branch of LanguageTool (GitHub repository)
- A collection of scripts (GitHub repository)
The process involves these steps:
- Analyze the whole Catalan Wikipedia with LanguageTool.
- With the help of some scripts, filter and sort the results and present them in a readable form.
- Manually supervise the results and apply the corrections to Wikipedia with a bot.
- Try to improve every step so that it can be done faster next time.
- Around 250,000 edits have been made since January 1, 2016.
- Most of the scripts have been rewritten and improved in Python.
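As an illustration of the "filter and sort" step in the process above, here is a minimal Python sketch. It is not the project's actual script: the record layout (`rule_id`, `title`, `sentence`) and the function name `group_by_rule` are assumptions made for this example, and the sample records are placeholders rather than real LanguageTool output. The idea is to group matches by rule, so that one kind of error can be supervised at a time, and to show the most frequent rules first.

```python
from collections import defaultdict

# Hypothetical match records; the field names are assumptions for this sketch,
# not the real output format of LanguageTool or of the project's scripts.
matches = [
    {"rule_id": "RULE_A", "title": "Article 1", "sentence": "sentence one"},
    {"rule_id": "RULE_B", "title": "Article 2", "sentence": "sentence two"},
    {"rule_id": "RULE_A", "title": "Article 3", "sentence": "sentence three"},
]


def group_by_rule(matches):
    """Group matches by rule, most frequent rule first."""
    by_rule = defaultdict(list)
    for match in matches:
        by_rule[match["rule_id"]].append(match)
    # Sort by descending group size; ties are broken by rule name.
    return sorted(by_rule.items(), key=lambda item: (-len(item[1]), item[0]))


# Present the results rule by rule, with articles in alphabetical order.
for rule_id, group in group_by_rule(matches):
    print(f"{rule_id}: {len(group)} matches")
    for match in sorted(group, key=lambda m: m["title"]):
        print(f"  {match['title']}: {match['sentence']}")
```

Grouping by rule means a supervisor can review many similar corrections in a row, which is usually faster than reviewing them article by article.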
There have been some minor expenses. I used a few hours of AWS server time because I found the Wikimedia servers very unstable for some of the required tasks. This may happen again.
What are the challenges
- The main obstacle to speeding up the proofreading is the presence of sentences (quotations, bibliography, titles...) in other languages or in non-standard language (old or dialectal). These sentences should not be edited and should be discarded as automatically as possible. Blacklists of sentences and of whole articles are a useful tool. However, the large number of Spanish sentences in the Catalan Wikipedia is proving to be a really difficult challenge: the two languages are very close and therefore hard to tell apart. Better methods of language detection are needed.
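The following Python sketch shows one simple way the two ideas above (blacklists and language detection) could be combined. It is only an illustration under stated assumptions, not the method used in the project: the function names, the blacklist arguments and the two small word lists are all invented for this example. The word-list heuristic is deliberately naive, and it fails on exactly the short or ambiguous sentences that make Catalan and Spanish hard to tell apart.

```python
import re

# Tiny illustrative word lists (an assumption of this sketch): frequent words
# that are typical of one language and not of the other.
CATALAN_HINTS = {"i", "amb", "els", "és", "són", "però", "molt", "més"}
SPANISH_HINTS = {"y", "con", "los", "las", "pero", "muy", "más", "fue"}


def guess_language(sentence):
    """Return "ca", "es" or "unknown" by counting hint words."""
    words = re.findall(r"\w+", sentence.lower())
    catalan = sum(word in CATALAN_HINTS for word in words)
    spanish = sum(word in SPANISH_HINTS for word in words)
    if catalan > spanish:
        return "ca"
    if spanish > catalan:
        return "es"
    return "unknown"


def should_skip(sentence, title, sentence_blacklist, article_blacklist):
    """Skip blacklisted articles and sentences, and sentences that look Spanish."""
    if title in article_blacklist or sentence in sentence_blacklist:
        return True
    return guess_language(sentence) == "es"


print(guess_language("El gat i el gos són molt amics"))    # Catalan sentence
print(guess_language("El gato y el perro son muy amigos"))  # Spanish sentence
```

Sentences classified as "unknown" are not skipped here, so they would still reach manual supervision. Whether to skip or keep them is a design choice that trades supervision time against missed corrections.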
What is working well
The best strategy is to classify, as early as possible, the errors we want to correct according to the kind of supervision they need, as explained in this learning pattern:
Next steps and opportunities
- Keep up the pace of corrections so that the bulk of the errors in the Catalan Wikipedia are fixed.
- Try more ways to speed up the supervision: blacklists, better language detection...
The project has been difficult and time-consuming at some points. Most of the time I don't know whether it is going to work as expected in the end. Major and minor improvements along the way give me hope, and good feedback and words of thanks from other Wikipedians are encouraging.