Research talk:Identifying bot accounts

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search

Work log[edit]

Archive

It's been a few months. I've finally had a chance to sit down with this again. TL;DR: The bot regex works very badly unless we filter to users that make at least one edit -- then it works OK, but we still catch more regular users than we do flagged bots.

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT
    ->     LEFT(user_registration, 6) AS month,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )") AS name_match,
    ->     SUM(ug_group IS NOT NULL) AS group_match,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" AND
    ->         ug_group IS NOT NULL) AS both_match
    -> FROM user
    -> LEFT JOIN user_groups ON
    ->     user_id = ug_user AND
    ->     ug_group = "bot"
    -> WHERE
    ->     user_registration BETWEEN "20140101" AND "20140901"
    -> GROUP BY 1;
+--------+------------+-------------+------------+
| month  | name_match | group_match | both_match |
+--------+------------+-------------+------------+
| 201401 |        100 |           5 |          4 |
| 201402 |         88 |           0 |          0 |
| 201403 |         82 |           0 |          0 |
| 201404 |         73 |           0 |          0 |
| 201405 |         92 |           3 |          3 |
| 201406 |         70 |           1 |          1 |
| 201407 |         73 |           1 |          1 |
| 201408 |        121 |           0 |          0 |
+--------+------------+-------------+------------+
8 rows in set (20.54 sec)

So the regular expression above matches "bot" when it appears at the end of a username, before a space or before some other type of word boundary. This roughly corresponds to Erik's "sounds like a bot" regular expressions. You'll note that we see about two orders of magnitude more bot names than we see bot flags. A quick query suggests, qualitatively, that names captured by this regex do, in fact, sound like a bot.

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT
    ->     user_id,
    ->     user_name,
    ->     ug_group
    -> FROM user
    -> LEFT JOIN user_groups ON
    ->     user_id = ug_user AND
    ->     ug_group = "bot"
    -> WHERE
    ->     user_registration > "20140101" AND
    ->     user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )"
    -> LIMIT 100;
+----------+--------------------------------+----------+
| user_id  | user_name                      | ug_group |
+----------+--------------------------------+----------+
| 20483999 | Monkbot                        | bot      |
| 20499663 | BabyBruiserBot                 | NULL     |
| 20508333 | AnomieBOT III                  | bot      |
| 20611691 | Filedelinkerbot                | bot      |
| 20648241 | OrginBot                       | NULL     |
| 20692668 | Stanfordpandabot               | NULL     |
| 20760835 | ArteBot                        | NULL     |
| 20791540 | Guardianbot                    | NULL     |
| 20792760 | FragBot                        | NULL     |
| 20811948 | Trellbot                       | NULL     |
| 20827123 | FeltonBot                      | NULL     |
<snip>

So I was wondering if maybe there were a lot of bot-like names registered that never made any edits. So I filtered the query to those users who had saved at least 1 edit.

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT
    ->     LEFT(user_registration, 6) AS month,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )") AS name_match,
    ->     SUM(ug_group IS NOT NULL) AS group_match,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" AND
    ->         ug_group IS NOT NULL) AS both_match
    -> FROM user
    -> LEFT JOIN user_groups ON
    ->     user_id = ug_user AND
    ->     ug_group = "bot"
    -> WHERE
    ->     user_registration BETWEEN "20140101" AND "20140901" AND
    ->     user_editcount > 1
    -> GROUP BY 1;
+--------+------------+-------------+------------+
| month  | name_match | group_match | both_match |
+--------+------------+-------------+------------+
| 201401 |         17 |           4 |          3 |
| 201402 |         19 |           0 |          0 |
| 201403 |         18 |           0 |          0 |
| 201404 |         14 |           0 |          0 |
| 201405 |         28 |           3 |          3 |
| 201406 |         16 |           1 |          1 |
| 201407 |         18 |           1 |          1 |
| 201408 |         21 |           0 |          0 |
+--------+------------+-------------+------------+
8 rows in set (10.99 sec)

Now that's much more reasonable. We're still looking at an order of magnitude difference. Let's look at a few qualitatively.

  • User:EraserBot -- blocked
  • User:Gloombot -- Regular user
  • User:Coldwarrobot -- Regular user
  • User:Nixonbot -- blocked
  • User:BabyBruiserBot -- Regular user
  • User:WhiteBOT II -- blocked
  • User:S$annerbot -- Regular user
  • User:GunBot -- blocked

So, it looks about 50/50 regular users and blocked users. --Halfak (WMF) (talk) 19:01, 5 September 2014 (UTC)


DAndreescu had a good suggestion. We can use the ipblocks table to filter out users who were blocked. He proposed the following query:

SELECT LEFT(user_registration, 6) AS month,
        SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )") as regex,
        SUM(ug_group IS NOT NULL)  AS group_match,
        SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )" AND
            ug_group IS NOT NULL)  AS group_and_regex,
        SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )" AND
            ipb_user IS NULL)      AS not_blocked_and_regex
   FROM user
                LEFT JOIN
        user_groups     ON user_id = ug_user
                        AND ug_group = "bot"
                LEFT JOIN
        ipblocks        ON user_id = ipb_user
  WHERE user_registration BETWEEN "20140101" AND "20140901"
    AND user_editcount > 1
  GROUP BY 1;

month	regex	group_match	group_and_regex	not_blocked_and_regex
201401	17	4		3		8
201402	19	0		0		9
201403	18	0		0		12
201404	14	0		0		9
201405	28	3		3		19
201406	16	1		1		8
201407	18	1		1		3
201408	21	0		0		10

So, I took this query, ran it for ALL THE WIKIS for the last year (Sept 2013 - Sept 2014). #Bot counts by matching strategy plots the top 25 wikis by number of regex matched users.

Count of bots by matching strategy registered between Sept. 2013 and Sept. 2014 for the top 25 wikis by count of regex-matched bot users.
Bot counts by matching strategy. Count of bots by matching strategy registered between Sept. 2013 and Sept. 2014 for the top 25 wikis by count of regex-matched bot users.

We can see two things here. All of the top wikis saw user accounts registered that were given the bot flag -- so presumably the bot flag is being used. We can also see that there's often an order of magnitude more active, non-blocked user accounts that fit the regex criteria. --Halfak (WMF) (talk) 19:13, 5 September 2014 (UTC)

Discussion[edit]

Sessions[edit]

The work you did on editing sessions may aid this effort too: if we know how a "human" editing session looks like, we may also identify non-human ones. Also useful to exclude, say, the two thousands edits User:Nemo_bis once made in an afternoon with Widar on Wikidata. --Nemo 18:29, 19 May 2014 (UTC)

Hi Nemo. This is a fine idea. We should add it to the list of strategies and explore it. In the edit sessions work, I found a lot of bots that were unflagged by looking for super-human activities -- like multi-day sessions. The longest single-human edit session that I found was a little over 21 hours (!!!) long -- which is already a little super-human. Most top out at 8 hours. Anyway, could you link me to the edits that you made on Wikidata (at least the date ranges) so they could be used for comparison? And just for clarity, were they automated or manual? --Halfak (WMF) (talk) 19:52, 19 May 2014 (UTC)
I'm not sure it's helpful, but as you asked: definitely automated, 599 page creations on 2014-04-25; then hundreds of edits on 2014-05-01 automatically executed but requiring hours of manual "instructing", a sort of middle way; finally some dozens on 2014-05-15 confirmed manually one by one. All this variance in edits performed with one single OAuth tool, just to demonstrate the difficulty of distinction... --Nemo 13:12, 20 May 2014 (UTC)