Research:Identifying bot accounts

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

This page in a nutshell: This project will explore strategies for identifying bot accounts in Wikipedia projects.

Robotic editing represents a major class of automated editing within Wikipedia projects. While English Wikipedia has a strong norm for flagging bot users via the user_groups table, it's unclear whether other language editions follow a similar pattern. For this reason, some alternative strategies have been explored to detect robots broadly. In this report, we'll analyze and discuss the differences between a few of the most common bot identification strategies.

Strategies

Bot flag

Many wikis use a user group to flag and track the activities of bot accounts, e.g. user_groups.ug_group = "bot". If Wikimedia communities use this flag consistently, then one could use the user_groups table to efficiently identify bot accounts.
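As a minimal sketch of what such a lookup could look like, the snippet below runs the query against an in-memory SQLite database standing in for a wiki's database. The table and column names (user, user_id, user_name, user_groups, ug_user, ug_group) follow the MediaWiki schema; the sample rows and the function name flagged_bots are made up for illustration.

```python
import sqlite3

def flagged_bots(conn):
    """Return the names of all users that belong to the 'bot' user group."""
    rows = conn.execute(
        """
        SELECT user_name
        FROM user
        JOIN user_groups ON ug_user = user_id
        WHERE ug_group = 'bot'
        ORDER BY user_name
        """
    )
    return [name for (name,) in rows]

# Toy stand-in for a wiki database (hypothetical accounts).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
    INSERT INTO user VALUES (1, 'ExampleBot'), (2, 'ExampleHuman'), (3, 'UnflaggedBot');
    INSERT INTO user_groups VALUES (1, 'bot'), (2, 'sysop');
    """
)
print(flagged_bots(conn))
```

In this toy data, the hypothetical account UnflaggedBot is not returned, which is the kind of miss the other strategies try to catch.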

Username regex

Example
en:User:HBC_AIV_helperbot5
Counter-example
en:User:I_Jethrobot

Curated lists

For example: en:Wikipedia:Bots/Status and "bots.csv" from the Wikistats CSV files.

Hybrid strategies

Wikistats
  1. Is there a bot flag in user group table?
  2. Does it sound like a bot? (Nowadays, on many wikis, such names are only allowed for bots.)
    • Perl: if (($user =~ /bot\b/i) || ($user =~ /_bot_/i))
  3. Is it known to be an unregistered bot? (Wikipedia has a list of false negatives at [1].)
  4. If a name is flagged as a bot on at least 10 wikis, then treat it as a bot on every wiki within the project.
  5. Three names that sound like bots are hard-coded exceptions (people who wrote to ErikZ to tell him they are human): Paucabot, Niabot, and Marbot.
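The checks above can be sketched in Python as follows. The regex is a direct translation of the Perl one-liner and the three exception names come from the list. Everything else is an assumption made for illustration: the function names, the parameters, and the choice to combine the checks with a simple "any check passes" rule. This is not the actual Wikistats code.

```python
import re

# Step 2: translation of the Perl test /bot\b/i || /_bot_/i.
BOT_NAME = re.compile(r"bot\b|_bot_", re.IGNORECASE)

# Step 5: names that match the regex but belong to humans.
HUMAN_EXCEPTIONS = {"Paucabot", "Niabot", "Marbot"}

def sounds_like_bot(name):
    """True if the username matches the Wikistats-style bot pattern."""
    return BOT_NAME.search(name) is not None

def is_bot(name, wiki, flagged_wikis, known_unflagged=frozenset(), min_wikis=10):
    """Apply the hybrid checks to one account on one wiki.

    flagged_wikis: set of wikis where `name` carries the bot flag.
    known_unflagged: curated set of bot names lacking the flag (step 3).
    """
    if wiki in flagged_wikis:                                    # step 1
        return True
    if sounds_like_bot(name) and name not in HUMAN_EXCEPTIONS:   # steps 2 and 5
        return True
    if name in known_unflagged:                                  # step 3
        return True
    return len(flagged_wikis) >= min_wikis                       # step 4

print(sounds_like_bot("I_Jethrobot"))         # True
print(sounds_like_bot("HBC_AIV_helperbot5"))  # False
```

Because the pattern requires a word boundary right after "bot", a name such as HBC_AIV_helperbot5 does not match it and would only be caught by the flag, list, or cross-wiki checks.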

Bots don't sleep

In some recent work [1], I found many users that appeared to be bots but whose edits did not have the bot flag set. My approach was to exclude users who never had a break of more than 6 hours between edits over the entire month I was studying. I was interested in users who had multiple edit sessions in the month, so I went with a straight threshold. A way to keep users with only one editing session would be to exclude users who have no break longer than X hours in an edit session lasting at least Y hours (e.g., a user who doesn't break for more than 6 hours in 5-6 days is probably not human).

[1] Multilinguals and Wikipedia Editing http://www.scotthale.net/pubs/?websci2014
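A rough sketch of the "X hours in Y hours" variant is shown below. It assumes that breaks are measured only between consecutive edits and treats a user's whole span of activity as one session. The function names and the default thresholds (6 hours, 5 days) are illustrative choices based on the example in the text, not the code used in the cited work.

```python
from datetime import datetime, timedelta

def longest_break(timestamps):
    """Return the longest gap between consecutive edits as a timedelta."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))

def never_sleeps(timestamps, max_break=timedelta(hours=6), min_span=timedelta(days=5)):
    """True if the edits span at least min_span with no break longer than max_break."""
    if len(timestamps) < 2:
        return False
    ordered = sorted(timestamps)
    span = ordered[-1] - ordered[0]
    return span >= min_span and longest_break(ordered) <= max_break

start = datetime(2014, 9, 1)
# Hypothetical account editing every 4 hours for 6 days.
steady = [start + timedelta(hours=4 * i) for i in range(37)]
# Hypothetical account editing once a day for 6 days.
daily = [start + timedelta(days=i) for i in range(7)]

print(never_sleeps(steady))  # True
print(never_sleeps(daily))   # False
```

Setting min_span to the length of the study period gives something close to the straight monthly threshold described above.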

Analysis

Username matching vs. the bot flag

Bot counts by matching strategy. Count of bots by matching strategy registered between Sept. 2013 and Sept. 2014 for the top 25 wikis by count of regex-matched bot users.

The figure "Bot counts by matching strategy" plots the count of bots by detection method for the top 25 wikis, ranked by the most inclusive regex matching strategy. The plot makes two things salient: (1) All of the top wikis saw user accounts registered that were given the bot flag, so presumably the bot flag is being used. (2) There are often an order of magnitude more active, non-blocked user accounts that fit the regex criteria than accounts that carry the bot flag.


Questions

  • How are bot accounts registered?
    • Can we filter for non-bot accounts via the logging table if we look at log_action="create" and log_type="newusers"?
    • It seems more likely that bot accounts would be registered by proxy (e.g. newusers/create2).
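To make the logging-table question concrete, here is a small sketch that selects self-created accounts (log_type "newusers", log_action "create") from an in-memory SQLite stand-in. The column names follow the MediaWiki logging table; the sample rows and the function name are hypothetical. Whether this filter actually separates humans from bots is exactly the open question.

```python
import sqlite3

def self_created_accounts(conn):
    """Return account names whose newusers log entry has action 'create'."""
    rows = conn.execute(
        "SELECT log_title FROM logging "
        "WHERE log_type = 'newusers' AND log_action = 'create' "
        "ORDER BY log_title"
    )
    return [title for (title,) in rows]

# Toy stand-in for the logging table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE logging (log_type TEXT, log_action TEXT, log_title TEXT);
    INSERT INTO logging VALUES
        ('newusers', 'create',  'ExampleHuman'),
        ('newusers', 'create2', 'ExampleBot'),
        ('newusers', 'create',  'SelfMadeBot'),
        ('delete',   'delete',  'Some_page');
    """
)
print(self_created_accounts(conn))
```

In this toy data, the hypothetical SelfMadeBot still passes the filter, so the approach would only help if bot accounts are in practice registered by proxy.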