Update 28 November: This entire proposal has been updated because I found a better way to search for typos: fuzzy search. The new idea is to write two standalone programs instead of one AWB plugin.
What is the problem you're trying to solve?
Everyone makes typos. This is inevitable. Bots are unable to fix typos without human assistance. Finding and fixing typos is a task that is boring and repetitive. There are many developers who have made anti-vandalism tools, but there aren't many people who have developed software that helps users to find and fix typos. So I did.
I have a list of the 10,000 most frequently used words.
Most typos fall in the following categories:
I want to make typofixing a lot quicker and a bit less boring.
What is your solution?
I am writing software that enables AWB users to find and fix typos quickly and easily.
Lists vs RegExp vs Bruteforcing vs Fuzzy search
It is possible to make a list like en:Wikipedia:Lists_of_common_misspellings/For_machines but as you can imagine this isn't a very effective approach because there are so many different words and each word can be misspelled in many different ways.
The current typofixing tools are usually RegExp-based. I like RegExps, but they are not intended for this kind of task. In certain cases RegExps can be a good way of detecting typos, but as far as I know the 4 categories above are the most common causes of typos, and for these categories fuzzy searching is far more effective than using RegExps. RegExps are also quite difficult to write.
A long time ago (April 2012) there was a project called en:Wikipedia:WikiProject_TypoScan, which was RegExp-based. It is no longer active, and it does not work anymore.
Here is an example of what my contributions look like while I am bruteforcing typos. I added the words "current" and "currently" to my script. I find and fix typos like currenlty, currnet, curerntly, crrently, curent, Curently, Currenty, urrently, curreently, currenttly, Currentlly, currentl & Currrently.
Bruteforcing typos is very effective but I've discovered that fuzzy searching is even better.
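To illustrate what bruteforcing means here, the following is a minimal Python sketch, not the script actually used for the edits above. It assumes three kinds of single-character mistakes inferred from the examples (a dropped letter, a doubled letter, and two swapped neighbouring letters); the function name `typo_variants` is hypothetical.

```python
def typo_variants(word):
    """Return plausible misspellings of `word` (assumed mistake types, see above)."""
    variants = set()
    for i in range(len(word)):
        # Omission: drop one character ("currently" -> "currenty").
        variants.add(word[:i] + word[i + 1:])
        # Duplication: double one character ("currently" -> "currenttly").
        variants.add(word[:i] + word[i] + word[i:])
        # Transposition: swap two adjacent characters ("currently" -> "currenlty").
        if i + 1 < len(word):
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # Swapping two identical letters (the "rr") gives back the original word.
    variants.discard(word)
    return sorted(variants)


if __name__ == "__main__":
    for base in ("current", "currently"):
        print(base, "->", ", ".join(typo_variants(base)))
```

Each generated variant can then be searched for on the wiki. The number of variants grows with word length, and short words tend to produce variants that are real words, which fits the preference for longer words.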
Fuzzy Search Example
If we, for example, do a fuzzy search for the word "previously" and look inside the span class="searchmatch" tag for the first 50 results, then we find many typos:
List of potential typos
prevoiusly prviously previosly previosly previouslye previouse previoulsy Previosuly previouse previouse previosuly proviously previosly prviously preciously previosuly prevciously previosuly previousy Previosly previouslu previosly Previoulsly previosuly preivously preciously proviously previoiusly previosuly Previosly prebviously previousiy prviously prviously previouse proviously prеviously previousle previosuly previoulsy prevously preciously preciously Previosuly Previouse Proviously priviously preciously previousy previoulsy
Many of the typos occur more than once in the top 50 search results:
List of potential typos, sorted by number of occurrences
Occurrences | Potential typo
6 | previosuly
5 | preciously
4 | previosly
4 | previouse
4 | prviously
3 | previoulsy
3 | proviously
2 | Previosly
2 | Previosuly
2 | previousy
1 | prebviously
1 | preivously
1 | prevciously
1 | previoiusly
1 | Previoulsly
1 | Previouse
1 | previousiy
1 | previousle
1 | previouslu
1 | previouslye
1 | prevoiusly
1 | prevously
1 | priviously
1 | Proviously
1 | prеviously
The task of the human is to distinguish between typos and correctly spelled words that happen to be similar, like preciously.
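The extraction and counting described above can be sketched in a few lines of Python. This is only an illustration: it assumes the search result snippets are already available as HTML strings that mark hits with span class="searchmatch", and the names `SEARCHMATCH` and `count_candidates` are hypothetical. How the snippets are fetched is left out.

```python
import re
from collections import Counter

# Matches the text inside <span class="searchmatch">...</span> in a snippet.
SEARCHMATCH = re.compile(r'<span class="searchmatch">(.*?)</span>')


def count_candidates(snippets, target):
    """Count highlighted words that differ from `target`, most frequent first."""
    counts = Counter()
    for snippet in snippets:
        for match in SEARCHMATCH.findall(snippet):
            # The correctly spelled word is not a candidate.
            if match.lower() != target.lower():
                counts[match] += 1
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Made-up snippets, for illustration only.
    sample = [
        'He had <span class="searchmatch">previosuly</span> played for the club.',
        'A <span class="searchmatch">preciously</span> guarded secret.',
        'It was <span class="searchmatch">previosuly</span> known as ...',
    ]
    for word, count in count_candidates(sample, "previously"):
        print(count, "|", word)
```

Capitalised and lower-case spellings are counted separately, as in the list above. A word like "preciously" still ends up in the output; deciding that it is not a typo remains the human's job.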
New software for fuzzy searching
I am writing new software that turns that list of potential typos into a to-do list for AWB, makes it easy to exclude words that aren't typos, and stores those decisions. When this project is finished it will be released as a free open-source project (it will be standalone, not an AWB plugin, but I hope it will be possible to distribute it together with AWB).
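One possible way to store those decisions is a small JSON file that maps each candidate word to true (a typo) or false (not a typo). The sketch below is an assumption about how this could work, not the design of the actual software; the file layout and all function names are hypothetical.

```python
import json
from pathlib import Path


def load_decisions(path):
    """Load earlier decisions; an absent file means nothing has been decided yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_decisions(decisions, path):
    """Write the decisions back so the next session can reuse them."""
    text = json.dumps(decisions, ensure_ascii=False, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")


def record_decision(decisions, candidate, is_typo):
    """Remember whether the user judged `candidate` to be a typo."""
    decisions[candidate] = bool(is_typo)


def split_candidates(candidates, decisions):
    """Return (confirmed typos, candidates that still need a human decision)."""
    confirmed = [c for c in candidates if decisions.get(c) is True]
    pending = [c for c in candidates if c not in decisions]
    return confirmed, pending
```

With this layout a word such as "preciously" only has to be rejected once. In later runs it appears in neither the confirmed list nor the pending list.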
If you host a copy of a recent dump on your local machine you can make as many requests as you like. But using a fuzzy search tool like agrep on a local file is probably a lot quicker.
I am researching how to:
- strip the dump file; once the parts I don't need are removed, searching will be quicker
- search through a dump file with agrep
- turn the results into a to-do list for AWB
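The last two steps can be illustrated with a small Python sketch. It does not call agrep. Instead it uses a plain Levenshtein edit distance as a stand-in for approximate matching, and it assumes the stripped dump is already available as (title, text) pairs. The names `edit_distance`, `find_candidates` and `max_errors` are hypothetical.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Runs of letters, including non-ASCII letters.
WORD = re.compile(r"[^\W\d_]+")


def find_candidates(pages, target, max_errors=2):
    """Map each near-miss spelling of `target` to the page titles that contain it."""
    todo = {}
    wanted = target.lower()
    for title, text in pages:
        for word in set(WORD.findall(text)):
            lowered = word.lower()
            if lowered == wanted:
                continue  # correctly spelled
            if abs(len(lowered) - len(wanted)) > max_errors:
                continue  # cheap length filter before the full comparison
            if edit_distance(lowered, wanted) <= max_errors:
                todo.setdefault(word, []).append(title)
    return todo


if __name__ == "__main__":
    # Made-up pages, for illustration only.
    pages = [
        ("Article A", "It was previosuly known."),
        ("Article B", "A preciously kept item, previously sold."),
    ]
    for word, titles in sorted(find_candidates(pages, "previously").items()):
        print(word, "->", ", ".join(titles))
```

The page titles collected per candidate are the raw material for the AWB to-do list. Note that this distance counts two swapped letters as two errors, and that a limit of two errors also lets through real words such as "previous", so the results still need human review.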
Why use AWB instead of creating a standalone tool?
I do not want to reinvent the wheel. AWB has a lot of the functionality that I need already built in.
- I don't have to create a website/webpage where it can be downloaded, I don't have to spend a lot of time promoting it, and I can use the existing infrastructure for bug reports, feature requests, etc.
- In order to get permission to use AWB you have to show that you are (somewhat) competent and you have to be able to make a few hundred edits.
- AWB is already able to authenticate users and make edits.
- AWB can preparse.
- AWB purposely avoids fixing typos in certain areas of the wiki-text. Typo fixing is prevented within: image names, template names and parameters, wikilink targets, text in quotations and italics, and any text that follows a colon or asterisk.
- AWB understands templates like notatypo and sic
Why do you care?
Typos affect our perception of reliability.
"Typographical errors and broken links hurt a site's credibility more than most people imagine." (Stanford Web Credibility Research, stanford.edu)
"Each misspelled word, bad apostrophe, garbled grammatical construction, weird cutline [photo description], and mislabeled map erodes public confidence in a newspaper's ability to get anything right", noted the report from the American Society of Newspaper Editors. (Regret the Error: How Media Mistakes Pollute the Press and Imperil Free Speech, Craig Silverman & Jeff Jarvis)
My goal is to enable all AWB users to find and fix typos very quickly. This makes typohunting a bit less boring and a lot more effective.
- Done Get a good idea in the middle of the night (and write it down before falling asleep again)
- Done Test alternatives, list pros and cons, learn from mistakes made by other people
- Done Make a list of possible features following the MoSCoW method
- Done Create interface
- Done Create proof of concept that works. Use it for a while.
- Done Rich Farmbrough helped me by making a list of the most frequently used words on Wikipedia. It is more efficient to start with frequently used words, because they are more frequently misspelled. I start with words that contain more than 6 characters because bruteforcing typos is more effective for longer words.
- Done Figure out how to write to AWB's XML file and how to give AWB a list of typos to search for. (Wiki search (text) -- "Typo1 OR Typo2" etc.)
- Get a grant
- Translate script to VB.NET with the help of a good programmer. Use the grant as a reward.
- AutoWikiBrowser developer will check the code and include the plugin.
- Write documentation
- optional: Promote usage by spamming talkpages of active Wikipedians who have permission to use AWB
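The search string mentioned in the plan above ("Typo1 OR Typo2" etc.) can be built from a list of typos with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the `max_length` default of 300 characters is an assumption about how long a single search query may be, so it should be adjusted to whatever limit applies on the wiki in question.

```python
def build_search_queries(typos, max_length=300):
    """Join typos with " OR ", starting a new query when max_length would be exceeded."""
    queries = []
    current = ""
    for typo in typos:
        candidate = typo if not current else current + " OR " + typo
        if current and len(candidate) > max_length:
            # The current query is full; start a new one with this typo.
            queries.append(current)
            current = typo
        else:
            current = candidate
    if current:
        queries.append(current)
    return queries


if __name__ == "__main__":
    for query in build_search_queries(["currenlty", "currnet", "curerntly"]):
        print(query)
```

Each resulting string can be used as one wiki search (text) query, so a long list of typos turns into a handful of searches instead of one search per typo.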
I selected VB.NET partly because many people are able to understand that language, so we do not have to rely on a single developer.
The finished product will be available to all users of AutoWikiBrowser. If we exclude bots then that is 4021 users on just the English version of Wikipedia! It is possible to use this software for all projects in all languages. The software is not limited to Wikimedia projects; all MediaWiki users will benefit. For example, you can use it on wikia.com or on your own self-hosted MediaWiki installation. If we ignore sites where AWB has logged fewer than 50 edits then we are still talking about more than 4000 websites in 60+ languages (nota bene: over 800 million edits!).
Theoretically I can contact every active editor that uses AWB (on all projects in all languages). Starting with the Typo Team on en.wiki is probably a good idea.
It will be completely free and open source, so anyone who wants to can use it and develop improvements. On the English Wikipedia you need permission to use AWB, but in many other places you don't. I have a long list of feature requests, but I hate feature creep and I love the KISS principle. Changes to MediaWiki won't really affect this plugin; as long as AWB works, the plugin will work.
It is possible to add stuff like blacklists, and collect the most frequently used words on all projects and in all languages, and use a computer that creates a big to-do list, like I described in the Why API calls?-section.
The best way to ensure people will keep using it is an opt-out (or even opt-in) public online log and leaderboards to encourage editcountitis. Something like the List of Wikipedians by number of edits and List of Wikipedians by article count...
Measures of success
I can easily fix more than a thousand typos per day with the script I made (while watching movies/TV on my secondary monitor). Once the new software is finished it will be much quicker and more convenient than my old script.
It is probably possible to log how many people have used the plugin, but it may be difficult to log how many typos they've fixed and how many proposed fixes they've skipped because that happens in AWB, outside of the plugin.
Maybe the AWB developers are willing to log this stuff.
- The Quixotic Potato. I am a nerd with 12 years’ experience working in IT. I do not like bragging; on my (en.wiki) userpage I claim to be a potato...
- I found a VB.NET programmer who is willing to help me. I know him IRL; he is not a Wikipedia user. He also has 10+ years' experience with many programming languages.
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.
- See talkpage
Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).
- Sounds like a great tool and typos are and always will be a quality issue involving a large editor workload. Any help is a good thing. Jason Quinn (talk) 18:05, 29 September 2015 (UTC)