Spamfilter

This is an essay. It expresses the opinions and ideas of some Wikimedians but may not have wide support. This is not policy on Meta, but it may be a policy or guideline on other Wikimedia projects. Feel free to update this page as needed, or use the discussion page to propose major changes.
A proposal to move this page to MediaWiki.org was rejected.
Because the Template:MoveToMediaWiki tag was on the page for a year without any MediaWiki.org importers seeing fit to transwiki it, the move proposal was regarded as rejected by the MediaWiki.org community.

I have recently been thinking again about how well my Bayesian spam filter, implemented with Spambayes, works at filtering my e-mail. For an explanation of Bayesian spam filtering, see the Spambayes homepage. I was wondering whether something similar could be done for Newpages. It could reduce human work and might prove a very interesting experiment as well.

The bot I am thinking of would follow Newpages live, fetch each page, and check it against its database. In outline (a rough code sketch follows below):
* If the page is classified as ham, continue.
* If it is classified as unsure, ask the user whether it is {{delete}}-material: if yes, train it as spam and prepend {{delete}} to the article; if no, train it as ham. The bot could add a comment to the article or a message to the talk page: <!-- classified by ... as ... with score ... -->
* If it is classified as spam, show the user (part of) the content to confirm that it really is spam (if not, treat it as unsure ham). If the page already contains {{delete}}, train it as spam and continue.
* When no user is running the program, build a stack of articles to work through once a user starts it again.
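
To make this concrete, the loop might look roughly like the Python below. The classifier is assumed to offer classify(text) and train(text, is_spam); pages, ask_user and tag_page are made-up placeholder hooks, not real Pywikipediabot or Spambayes calls.

<pre>
# Hypothetical sketch of the triage loop above; names are illustrative only.
DELETE_TAG = "{{delete}}"

def triage(pages, classifier, ask_user, tag_page):
    backlog = []                        # pages queued while nobody is online
    for page in pages:
        label, score = classifier.classify(page.text)
        note = "<!-- classified by bot as %s with score %.2f -->" % (label, score)

        if label == "ham":
            continue                    # nothing to do

        if DELETE_TAG in page.text:     # already tagged by a human: learn from it
            classifier.train(page.text, is_spam=True)
            continue

        if ask_user is None:            # no user attached: stack for later
            backlog.append(page)
            continue

        # both "unsure" and "spam" go to a human for confirmation
        if ask_user(page, label, score):
            classifier.train(page.text, is_spam=True)
            tag_page(page, DELETE_TAG + "\n" + note)
        else:
            classifier.train(page.text, is_spam=False)   # treat as (unsure) ham
    return backlog
</pre>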

This would be implemented using an enhanced Pywikipediabot and the library that comes with Spambayes. I foresee some problems. For example, each user would have their own 'hammy.db'. As we are all working on the same thing, we would want a central hammy.db, probably one per language, hosted on a central server (it need not be a Wikipedia server: I volunteer my own server for this task). Initially it would be a command-line tool, although a web interface might prove very useful as well.
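
As an illustration of what such a shared database could amount to, here is a toy token-counting classifier that keeps its statistics in a shelve file. It is emphatically not the real Spambayes classifier (Spambayes combines word probabilities in a more robust way); the point is only that every user could read and train the same file on one server per language.

<pre>
# Toy stand-in for a shared "hammy.db", NOT the real Spambayes implementation.
import math
import re
import shelve

TOKEN_RE = re.compile(r"\w+")

class TinyBayes:
    def __init__(self, path="hammy.db"):
        self.db = shelve.open(path)     # token -> (ham_count, spam_count)
        self.nham = self.db.get("__nham__", 0)
        self.nspam = self.db.get("__nspam__", 0)

    def train(self, text, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for tok in set(TOKEN_RE.findall(text.lower())):
            ham, spam = self.db.get(tok, (0, 0))
            self.db[tok] = (ham + (not is_spam), spam + bool(is_spam))
        self.db["__nham__"], self.db["__nspam__"] = self.nham, self.nspam
        self.db.sync()

    def classify(self, text, lo=0.2, hi=0.8):
        # naive log-odds over the tokens seen so far
        logodds = 0.0
        for tok in set(TOKEN_RE.findall(text.lower())):
            ham, spam = self.db.get(tok, (0, 0))
            p_spam = (spam + 1.0) / (self.nspam + 2.0)
            p_ham = (ham + 1.0) / (self.nham + 2.0)
            logodds += math.log(p_spam / p_ham)
        logodds = max(-500.0, min(500.0, logodds))
        score = 1.0 / (1.0 + math.exp(-logodds))
        label = "spam" if score > hi else ("ham" if score < lo else "unsure")
        return label, score
</pre>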

In addition to the contents of the page, clues can also come from the contributing user: whether the user is logged in or anonymous, the range of the IP, the name of the page and, why not, the time of day, although the last of these probably carries less weight than the others.
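
Such clues could be fed to the classifier as extra tokens alongside the page text. A possible sketch (the argument names are made up for illustration, not actual Pywikipediabot fields):

<pre>
# Sketch of turning edit metadata into extra tokens for the same classifier.
def metadata_tokens(title, username, is_anonymous, timestamp):
    tokens = ["title:" + w.lower() for w in title.split()]
    tokens.append("user:" + username.lower())
    tokens.append("anon:yes" if is_anonymous else "anon:no")
    if is_anonymous:
        # anonymous edits are credited to the IP; keep only a coarse range
        tokens.append("iprange:" + ".".join(username.split(".")[:2]))
    tokens.append("hour:%02d" % timestamp.hour)   # probably the weakest clue
    return tokens
</pre>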

Perhaps the same could also be done for RecentChanges; the classifier would then be fed the diffs. This would require a lot more work, because there is a major difference between removing a line and adding a line (in fact, where one would be a spam hint, the inverse would be a ham hint with clue 1 − other). This is much more difficult and I do not have the knowledge to write such a thing. It does not seem impossible, though.
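
One way it might look: prefix every token with whether its line was added or removed, so that the two cases train separately. A rough sketch using difflib:

<pre>
# Sketch of feeding diffs instead of whole pages: "added:x" and "removed:x"
# become distinct (and roughly opposite) clues.
import difflib

def diff_tokens(old_text, new_text):
    tokens = []
    for line in difflib.unified_diff(old_text.splitlines(),
                                     new_text.splitlines(), lineterm=""):
        if line.startswith("+") and not line.startswith("+++"):
            prefix = "added:"
        elif line.startswith("-") and not line.startswith("---"):
            prefix = "removed:"
        else:
            continue
        tokens.extend(prefix + w.lower() for w in line[1:].split())
    return tokens
</pre>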

Comments welcome.

Gerrit 11:43, 22 Nov 2004 (UTC)