Research talk:Revision scoring as a service/Work log/2016-01-30

From Meta, a Wikimedia project coordination wiki

Saturday, January 30, 2016

Working on figuring out the lay of "edits worth reviewing" in wikidata today. I wrote the following query to help me break down a random sample of "human" edits to wikidata from 2015:

SELECT
  rev_id,
  rev_user = 0 AS anon_user,
  trusted.ug_user IS NOT NULL AS trusted_user,
  user.user_editcount IS NOT NULL AND user.user_editcount >= 1000 AS trusted_edits,
  rev_comment RLIKE '/\* clientsitelink-(remove|update):' AS client_edit,
  rev_comment RLIKE '/\* wbmergeitems-(to|from):' AS merge_edit
FROM revision
LEFT JOIN user ON rev_user = user_id
LEFT JOIN user_groups trusted ON
  trusted.ug_user = rev_user AND
  trusted.ug_group IN (
    'bureaucrat', 'checkuser', 'flood', 'ipblock-exempt',
    'oversight', 'property-creator', 'rollbacker', 'steward',
    'sysop', 'translationadmin', 'wikidata-staff'
  )
LEFT JOIN user_groups bot ON
  bot.ug_user = rev_user AND
  bot.ug_group = 'bot'
WHERE
  rev_timestamp BETWEEN "2015" AND "2016" AND
  bot.ug_user IS NULL
ORDER BY RAND()
LIMIT 1000000;
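The two RLIKE clauses above flag edits by matching Wikidata's auto-generated edit summaries. The same classification can be sketched in Python for a quick sanity check (the sample summaries below are hypothetical, made up to mimic the auto-generated format):

```python
import re

# Same patterns as the RLIKE clauses in the query above; MySQL RLIKE uses
# POSIX-style regexes, which Python's re handles identically for these.
CLIENT_EDIT = re.compile(r'/\* clientsitelink-(remove|update):')
MERGE_EDIT = re.compile(r'/\* wbmergeitems-(to|from):')

def classify_summary(rev_comment):
    """Flag an edit as a client-side sitelink change and/or an item merge."""
    return {
        'client_edit': bool(CLIENT_EDIT.search(rev_comment)),
        'merge_edit': bool(MERGE_EDIT.search(rev_comment)),
    }

# Hypothetical summaries shaped like the auto-generated ones:
print(classify_summary('/* clientsitelink-update:0| */ page moved'))
print(classify_summary('/* wbmergeitems-from:0| */ merged item'))
print(classify_summary('/* wbsetlabel-add:1|en */ added label'))
```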

A quick analysis shows that most edits are performed by users in trusted groups ("trusted_user") or by users with a trustworthy number of edits (≥ 1,000, "trusted_edits"). Anons perform a very small fraction of the edits.

> select anon_user, trusted_user, trusted_edits, client_edit, merge_edit, COUNT(*) FROM wikidata_nonbot_sample GROUP BY 1,2,3,4,5;
+-----------+--------------+---------------+-------------+------------+----------+
| anon_user | trusted_user | trusted_edits | client_edit | merge_edit | COUNT(*) |
+-----------+--------------+---------------+-------------+------------+----------+
|         0 |            0 |             0 |           0 |          0 |    31924 |
|         0 |            0 |             0 |           0 |          1 |     1757 |
|         0 |            0 |             0 |           1 |          0 |     8122 |
|         0 |            0 |             1 |           0 |          0 |   517100 |
|         0 |            0 |             1 |           0 |          1 |     8119 |
|         0 |            0 |             1 |           1 |          0 |    11466 |
|         0 |            1 |             0 |           0 |          0 |       90 |
|         0 |            1 |             0 |           0 |          1 |        4 |
|         0 |            1 |             0 |           1 |          0 |       17 |
|         0 |            1 |             1 |           0 |          0 |   399397 |
|         0 |            1 |             1 |           0 |          1 |     6978 |
|         0 |            1 |             1 |           1 |          0 |      848 |
|         1 |            0 |             0 |           0 |          0 |    14136 |
|         1 |            0 |             0 |           0 |          1 |       42 |
+-----------+--------------+---------------+-------------+------------+----------+
14 rows in set (0.49 sec)
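Tallying the rows above makes the proportions concrete. A minimal sketch, with the counts copied from the query result:

```python
# Rows from the GROUP BY result above:
# (anon_user, trusted_user, trusted_edits, client_edit, merge_edit, count)
rows = [
    (0, 0, 0, 0, 0, 31924),
    (0, 0, 0, 0, 1, 1757),
    (0, 0, 0, 1, 0, 8122),
    (0, 0, 1, 0, 0, 517100),
    (0, 0, 1, 0, 1, 8119),
    (0, 0, 1, 1, 0, 11466),
    (0, 1, 0, 0, 0, 90),
    (0, 1, 0, 0, 1, 4),
    (0, 1, 0, 1, 0, 17),
    (0, 1, 1, 0, 0, 399397),
    (0, 1, 1, 0, 1, 6978),
    (0, 1, 1, 1, 0, 848),
    (1, 0, 0, 0, 0, 14136),
    (1, 0, 0, 0, 1, 42),
]

total = sum(n for *_, n in rows)
anon = sum(n for a, t, e, c, m, n in rows if a)
trusted = sum(n for a, t, e, c, m, n in rows if t or e)

print(total)                            # 1000000 -- the full sample
print(round(anon / total * 100, 1))     # 1.4  (% anon)
print(round(trusted / total * 100, 1))  # 94.4 (% trusted group or >= 1000 edits)
```

So anons account for ~1.4% of the sample, while trusted users (by group or edit count) account for ~94.4%.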

Next step is labeling these edits as reverted and seeing how those breakdowns work out. --EpochFail (talk) 19:00, 30 January 2016 (UTC)

Literature Review

heindorf2015towards

Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis (poster)

  • They checked 24M edits automatically and reported 100K cases of vandalism.
  • Answers to up to 48% of web queries can be found in knowledge bases (cited from another article).
  • Two prior studies have investigated vandalism in knowledge bases: (1) Neis et al., in OSM, checked whether the editor is blocked (so an edit is not considered vandalism if its editor was never banned); (2) Tan et al., in Freebase, built a test where an edit deleted within 4 weeks of submission is considered low quality.
  • They used a dump of Wikidata from November 14, 2014. It consists of 167M edits, 24M of them human.
  • In order to classify the 24M human edits, they used rollback vs. restore actions.
  • Based on their detection, there are 103,205 rollbacks; and since they use edit summaries to catch restores, there are at least 64,820 restores. They didn't consider restores as vandalism!
  • They manually checked 1K rollbacks, 1K restores, and 1K other edits.
  • Based on this check, they found that 86%±3% (at the 95% confidence level) of rollbacks are vandalism, versus 62%±3% of restores and 4% of other edits.
  • They think 86% is good enough for training SVMs.
  • Their analysis of vandalism: they categorized the top 1K items that got vandalized (most-vandalized items). The biggest category is places at 31%, then people and nature (note: these categories are too broad to be useful). Also [I think] there was a big skew regarding India: 11% of vandalism, while only 0.5% of items are related to India.
  • 57% of vandalism happens in the textual part of an item (label, description, aliases) while 40% happens in the structural part (sitelinks and statements). I think this is biased: they analyzed all Wikidata edits over two years, but Wikidata didn't have statement support in the first six months, which means no vandalism could be done in that part. Also, they weren't able to determine the type of 2% of vandalism cases; they noted some of these were merges.
  • 86% of vandalism in Wikidata is done by IPs. 41% of them were IPs that had already vandalized before; on average, 1.7 vandalism edits per IP.
  • IPs tend to vandalize the textual part more, while vandalizing registered users tend to target the structural part.
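The ±3% margins they report for their 1K manual samples can be sanity-checked with a simple normal-approximation (Wald) interval; this is my assumption about how they computed it (the poster doesn't say). Under that assumption, the 62% restore figure comes out at about ±3.0%, while the 86% rollback figure comes out closer to ±2.2%:

```python
from math import sqrt

def wald_margin(p, n, z=1.96):
    """Normal-approximation (Wald) 95% margin of error for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# Reported point estimates from the 1K manually checked edits of each kind
for label, p in [('rollback', 0.86), ('restore', 0.62), ('other', 0.04)]:
    print(f'{label}: \u00b1{wald_margin(p, 1000) * 100:.1f}%')
# rollback: ±2.2%
# restore: ±3.0%
# other: ±1.2%
```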