Research talk:Are the bots really fighting/Work log/2017-03-21

From Meta, a Wikimedia project coordination wiki

Tuesday, March 21, 2017

Staeiou

I updated the comment parser in this Jupyter notebook. It now uses the heuristic that a comment containing wiki language codes between punctuation marks indicates an interwiki link update, but it labels those cases "interwiki link cleanup -- suspected" rather than treating them as confirmed. I think it will take looking at the content diffs to be confident that these are actually interwiki link actions.

Comment parsing is a good way to find new kinds of potentially interesting cases. I've found a few more and added them to the parser. However, I still think I'm missing some, so I want to manually look up bot-vs-bot reverts I remember from various BAG/ANI/etc. threads and see how their diffs appear in the dataset.

I ran the notebook on Halfak's updated bot2bot dataset, which is based on this Quarry query that joins to get rev_comments. This gives the following breakdown:

All namespaces

type count percent
interwiki link cleanup 180293 37.54%
fixing double redirect 90013 18.74%
AIV helperbot 77390 16.11%
interwiki link cleanup -- suspected 58577 12.2%
other w/ per justification 19761 4.11%
deleted revision 16046 3.34%
other 10414 2.17%
archiving 8268 1.72%
clearing sandbox 5080 1.06%
other w/ revert in comment 3992 0.83%
moving category 3302 0.69%
protection template cleanup 2819 0.59%
category redirect cleanup 1517 0.32%
orphan template cleanup 1028 0.21%
mathbot mathlist updates 519 0.11%
other redirect 352 0.07%
botfight: reverting CommonsDelinker 318 0.07%
botfight: 718bot vs ImageRemovalBot 173 0.04%
redirect tagging/sorting 163 0.03%
link syntax fixing 111 0.02%
botfight: infoboxneeded 96 0.02%
template cleanup 68 0.01%
template tagging 24 0.0%
commons image migration 5 0.0%

ns0 only

type count percent
interwiki link cleanup 82244 38.3%
fixing double redirect 81907 38.14%
interwiki link cleanup -- suspected 36265 16.89%
deleted revision 3545 1.65%
protection template cleanup 2631 1.23%
moving category 1987 0.93%
other 1622 0.76%
orphan template cleanup 1020 0.48%
category redirect cleanup 977 0.45%
other w/ revert in comment 519 0.24%
mathbot mathlist updates 515 0.24%
other w/ per justification 480 0.22%
botfight: reverting CommonsDelinker 222 0.1%
other redirect 183 0.09%
botfight: 718bot vs ImageRemovalBot 170 0.08%
redirect tagging/sorting 163 0.08%
botfight: infoboxneeded 96 0.04%
link syntax fixing 85 0.04%
template cleanup 68 0.03%
template tagging 24 0.01%
commons image migration 3 0.0%
clearing sandbox 1 0.0%
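
For reference, a breakdown of this kind (count and percent per type, for all namespaces or for ns0 only) could be produced along these lines. This is a minimal sketch, not the notebook's actual code; the row layout and the field names `namespace` and `type` are assumptions.

```python
from collections import Counter

def breakdown(rows, namespace=None):
    """Return (type, count, percent) tuples, most frequent type first.

    rows: iterable of dicts with hypothetical keys "namespace" and "type".
    namespace: if given, only rows from that namespace are counted.
    """
    if namespace is not None:
        rows = [r for r in rows if r["namespace"] == namespace]
    counts = Counter(r["type"] for r in rows)
    total = sum(counts.values())
    # Percentages are relative to the rows that remain after filtering.
    return [(t, n, round(100 * n / total, 2)) for t, n in counts.most_common()]
```

Calling `breakdown(rows)` would correspond to the "All namespaces" table and `breakdown(rows, namespace=0)` to the "ns0 only" table. The percentages in the second case are relative to the ns0 total, which is why the same type can have different shares in the two tables.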

Staeiou (talk) 00:30, 21 March 2017 (UTC)

Update: using lang codes might have an issue

I just realized that the language codes include very common two-letter English words: it, or, an, is. That could make them problematic as a heuristic. I also included commons, meta, and simple, so there are probably a lot of false positives. Possible fixes: narrow the set of punctuation that counts as a valid bordering character, or use the heuristic only to filter candidates that are then examined more closely with diffs. It might also make sense to call that function only for bots approved for interwiki tasks. Staeiou (talk) 00:51, 21 March 2017 (UTC)
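
One way these ideas could fit together is sketched below. It is only an illustration under stated assumptions: the code subset, the bot name `ExampleInterwikiBot`, and the function `suspected_interwiki` are hypothetical, and treating commons, meta, and simple as ambiguous alongside the two-letter words is a guess rather than a settled choice.

```python
import re

# Illustrative subset of codes, including the ones mentioned above.
LANG_CODES = {"de", "fr", "nl", "it", "or", "an", "is", "commons", "meta", "simple"}

# Codes that are also ordinary English words (assumed grouping).
AMBIGUOUS = {"it", "or", "an", "is", "commons", "meta", "simple"}

# Hypothetical allow-list of bots approved for interwiki tasks.
INTERWIKI_BOTS = {"ExampleInterwikiBot"}

# Narrow border: a code directly inside link syntax, e.g. "[[it:".
LINK_PREFIX = re.compile(r"\[\[\s*([a-z-]+)\s*:")

# Wider border: a code as a whole item in a comma- or colon-separated list.
LIST_ITEM = re.compile(r"(?:^|[,:;])\s*([a-z-]+)(?=\s*(?:[,;]|$))")

def suspected_interwiki(comment, user, approved_bots=INTERWIKI_BOTS):
    """Flag a comment as a candidate for closer inspection with diffs."""
    if user not in approved_bots:
        return False
    text = comment.lower()
    # Any known code in link syntax is accepted, even an ambiguous one.
    if any(code in LANG_CODES for code in LINK_PREFIX.findall(text)):
        return True
    # Outside link syntax, only unambiguous codes are accepted.
    return any(
        item in LANG_CODES and item not in AMBIGUOUS
        for item in LIST_ITEM.findall(text)
    )
```

With this scoping, a comment like `it is an archive, or a redirect` is not flagged, while `robot Adding: [[it:Roma]]` still is. The trade-off is that a bare list such as `Fixing: it, or` is missed, so the result is better treated as a candidate filter than as a final classification.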