Learning patterns/Cleaning the Augean stables: DIY discovery of fermented vandalisms, copyvios and other unwelcome stuff
What problem does this solve?
We have all heard of the twelve labours of Hercules (or if not, we have Wikipedia to tell us the story :) ). The least attractive among Hercules' heroic adventures was the fifth one: he had to clean the filth from the stables of Augeas in a single day. Augeas' livestock were divinely healthy (and immortal) and therefore produced an enormous quantity of dung. The stables themselves had not been cleaned in over 30 years, and over 1,000 cattle lived there.
Starting from this myth, consider the analogy: a wiki that has not been "cleaned" (monitored for vandalism, spam, copyvio, etc.) for quite a long time, while being regularly sullied in the meantime. This may be especially true of smaller languages, where the active editing community is concentrated in Wikipedia and the other sister projects, though they exist, do not enjoy regular attention and maintenance. Such projects may even lack local administrators who speak the language, and so are served only by the Small Wiki Monitoring Team and global administrators.
Some observations and working solutions are shared here, based on experience gained with Bulgarian Wikisource and Bulgarian Wiktionary.
What is the solution?
After a period of inactivity, a local admin (i.e. one who has command of the local language and can evaluate the adequacy of the content) may want to do some housekeeping and clean-up. Or this may be the first locally elected admin since the project went through the Admin activity review and any inactive local administrators/bureaucrats were de-admined by stewards. Either way, in the meantime (both during the period of admin inactivity and during the period with no local admin whatsoever) many pages have been created as vandalism, copyright violations, advertisement/spam, test pages, or simply with mistaken page names, broken/double redirects, etc.
- Recent changes and New pages
However, there is a problem. Simply checking Special:Recentchanges or Special:Newpages will probably return nothing, or only a tiny fraction of all the stale unwelcome stuff: only edits made or pages created within the last 30 days. This is definitely not enough.
So how do you find more problems in the wiki's database, especially when you do not operate a bot? Some special pages, like Special:BrokenRedirects, will immediately provide you with pages worth deleting. Others, however, may return results where content for deletion is mixed with valid content. Some special pages list their results alphabetically; others use another specific order (chronological, or by page size), which can prove especially helpful for detecting things for deletion.
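For those who prefer to pull such a report as data rather than browse it, the MediaWiki Action API exposes these special pages through `list=querypage`. The following is only a minimal sketch: the endpoint URL is an assumption (swap in your own wiki's `api.php`), the helper names are hypothetical, and the response layout shown is what the API is expected to return for `format=json`.

```python
from urllib.parse import urlencode

# Assumed endpoint: replace with the api.php of the wiki being cleaned.
API_URL = "https://bg.wikisource.org/w/api.php"


def querypage_url(report, limit=50, api_url=API_URL):
    """Build a request URL for one special-page report.

    `report` is the report name as the API knows it, e.g.
    "BrokenRedirects", "Shortpages", "Longpages" or "Lonelypages".
    """
    params = {
        "action": "query",
        "list": "querypage",
        "qppage": report,
        "qplimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def titles_from_response(data):
    """Extract page titles from a decoded querypage JSON response."""
    results = data.get("query", {}).get("querypage", {}).get("results", [])
    return [row["title"] for row in results]
```

Fetching the URL (in a browser or with any HTTP client) and passing the decoded JSON to `titles_from_response` gives a plain list of titles to review by hand; nothing here deletes or edits anything.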
- Short pages
Depending on the particular sister project, very short pages with valid content may or may not be expected. For a project like Wiktionary it can go either way; for a project like Wikisource, however, the natural expectation is for lengthier pages. Observation and experience will quickly give you a feeling for how short the legitimately short pages in your project are. In my experience, though, up to a certain size the pages listed in Special:ShortPages usually contain nothing but a simple obscenity or an out-of-sandbox experiment. After all, you cannot reasonably expect useful, well-written and referenced content for any Wikimedia project in less than, say, 100 bytes.
Among the shortest short pages, of course, there might be some disambiguation pages, but they may easily have problems of their own: no disambiguation template, no category, no formatting. Don't forget that each of these useful and widely adopted page elements also consumes bytes. So even if a short page does not deserve to be deleted, its small size may well indicate that it needs repair.
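The size reasoning above can be tried out on a handful of pages. This is a rough sketch, not a tool: the `pages` mapping and both function names are hypothetical, and the 100-byte cut-off is just the ballpark figure mentioned above. It also shows why byte counts matter for non-Latin scripts: page length is measured in bytes, and in UTF-8 each Cyrillic letter takes two.

```python
def byte_size(wikitext):
    """Length of the wikitext in bytes when encoded as UTF-8."""
    return len(wikitext.encode("utf-8"))


def short_page_candidates(pages, max_bytes=100):
    """Return (title, size) pairs smaller than max_bytes, shortest first.

    `pages` is a hypothetical mapping of page title -> wikitext,
    e.g. collected by hand or exported from the wiki.
    """
    sized = [(title, byte_size(text)) for title, text in pages.items()]
    return sorted((p for p in sized if p[1] < max_bytes), key=lambda p: p[1])


sample = {
    "дума": "тест",  # four Cyrillic letters, eight bytes
    "Disambig": "{{disambig}}\n* [[A]]\n* [[B]]\n[[Category:Disambiguation]]",
    "Long entry": "x" * 500,
}

for title, size in short_page_candidates(sample):
    print(f"{size:>5}  {title}")
```

Both short entries are flagged, although only the first is junk; the list is a starting point for review, not a deletion queue.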
- Long pages
Again depending on the scope of the project, Special:LongPages may or may not be of any help in housekeeping. Wikisource, for instance, may contain enormous pages of valid content, so this special page may prove useless there. However, my experience has shown that the long pages in Wiktionary also include many candidates for deletion. These are usually not profanities, but copyright violations or other content copy-pasted from elsewhere on the Internet, including Wikipedia itself. Such content may or may not fall within the scope of the project; it may even be freely licensed content copied from Wikipedia to Wiktionary, presumably in good faith to "fill in a gap". Copyright violations are easily detected here: large chunks of unformatted text, a writing style typical of non-encyclopedic sources (e.g., popular science or yellow press), and text that immediately appears in Google results when pieces of it are searched for.
Always check such content against a search engine. Even if you consider it worth transferring to Wikipedia (on the assumption that the user wanted to contribute to the encyclopedia but somehow ended up in a sister project), doing so is rarely advisable.
- Orphaned (Lonely) pages
Orphaned pages are pages which do not receive links from any other page in the wiki: no page links TO an orphaned page. Thus, the special page Special:LonelyPages may prove helpful in detecting content which is not naturally anticipated to appear in the project. This can be anything from original research and advertisements/spam/promotion of non-notable people, products or enterprises to invented/meaningless words and profanities. In addition, even if the content of an orphaned page is valid per se, its appearance in this special page can again indicate another problem (like a vandalised/broken template).
- Other special pages (for further elaboration)
Things to consider
- Check page history
- Sometimes, content pages appear in the special pages (like Special:ShortPages, Special:LongPages, Special:DeadendPages) as a result of an unwelcome edit (blanking, vandalism, copyvio, etc.), whereas their history shows that they used to have valid content before that. Checking the page history definitely adds to the time spent on housekeeping, but it can help you rediscover and keep more of the project's valid content.
- Check talk pages
- Sometimes, vandals edit not only pages in the main namespace, but also their associated talk pages. Keep in mind that talk pages are not listed in (all) special pages, so vandalism there is harder to discover.
- Check for signatures within the main namespace pages
- No matter the project (Wikipedia, Wiktionary, Wikisource, etc.), signatures are generally unwelcome in main namespace pages, and their presence can often indicate vandalism or out-of-sandbox experiments. An easy way to discover signatures is to search for the names of the months, since every signature timestamp contains one.
- Check for placeholders and interface messages
- Check for the local equivalents (translations) of some messages like:
- "From Wikipedia, the free encyclopedia"
- "Heading text", "Bulleted list item", "Numbered list item", "Bold text", "Italic text", "Insert non-formatted text here", etc.
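The last two checks (signatures and leftover placeholders) lend themselves to a simple text scan. The sketch below is illustrative only and rests on several assumptions: the month names and placeholder strings are the English ones and would need replacing with their local translations, the timestamp pattern follows the default English signature layout (`12:34, 5 January 2015 (UTC)`), which may differ on your wiki, and `find_red_flags` is a hypothetical helper name.

```python
import re

# Assumption: English month names; substitute the local-language names.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Assumption: English interface strings; substitute local translations.
PLACEHOLDERS = [
    "From Wikipedia, the free encyclopedia",
    "Heading text",
    "Bulleted list item",
    "Numbered list item",
    "Bold text",
    "Italic text",
    "Insert non-formatted text here",
]

# Matches a timestamp such as "12:34, 5 January 2015".
SIGNATURE_RE = re.compile(
    r"\b\d{1,2}:\d{2}, \d{1,2} (?:" + "|".join(MONTHS) + r") \d{4}\b"
)


def find_red_flags(wikitext):
    """Return a list of reasons why a main-namespace page deserves a look."""
    flags = []
    if SIGNATURE_RE.search(wikitext):
        flags.append("signature")
    lowered = wikitext.lower()
    for phrase in PLACEHOLDERS:
        if phrase.lower() in lowered:
            flags.append("placeholder: " + phrase)
    return flags
```

A page that returns a non-empty list is not necessarily vandalism (a dictionary entry may legitimately quote a date), so treat the result as a prompt to open the page and its history.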