WSoR datasets/revert

From Meta, a Wikimedia project coordination wiki

The revert dataset contains information about reverting edits and the revision that was reverted back to. Reverts are identified by looking for revisions with identical content in the history of a page.

Location[edit]

db42:halfak.revert_20110115

Fields[edit]

halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN revert_20110115;SELECT * FROM revert_20110115 LIMIT 3" halfak
+---------------+------------+------+-----+---------+-------+
| Field         | Type       | Null | Key | Default | Extra |
+---------------+------------+------+-----+---------+-------+
| rev_id        | int(11)    | YES  | UNI | NULL    |       |
| rvtto_id      | int(11)    | YES  |     | NULL    |       |
| is_vandalism  | tinyint(1) | YES  | MUL | NULL    |       |
| revs_reverted | int(11)    | YES  |     | NULL    |       |
+---------------+------------+------+-----+---------+-------+
+--------+----------+--------------+---------------+
| rev_id | rvtto_id | is_vandalism | revs_reverted |
+--------+----------+--------------+---------------+
|     60 |       43 |            0 |             1 |
|    309 |      308 |            0 |             2 |
|    323 |      318 |            0 |             1 |
+--------+----------+--------------+---------------+

There is a row in this table for every reverting revision in a page that was included in the January, 2011 database dump of enwiki.

  • rev_id: The reverting revision's identifier
  • rvtto_id: The identifier of the revision that was reverted to
  • is_vandalism: True when the D_STRICT/D_LOOSE regexp matched the reverting revision's comment, False otherwise.
  • revs_reverted: The number of revisions between the reverting revision and the reverted to revision. This represents the number of revisions that were reverted. Note that this is limited to < 15 to remove noise.

Reproduction[edit]

To reproduce this dataset, install Wikimedia Utilities and run this script.

For example:

$ python reverts.py --output_prefix=test1/ enwiki.*.7z
Aug-29 16:51:36 INFO     Starting test1/ run...
Aug-29 16:51:36 INFO     Creating output file: test1/revert.tsv
Aug-29 16:51:36 INFO     Creating output file: test1/reverted.tsv
Aug-29 16:51:36 INFO     Prcoessing...
Aug-29 16:51:36 INFO     Processing dump file enwiki-20110829-pages-meta-history.sample.xml.
|.||||..|.|....|......|........|..........|...........|.|.|.|.|...|.|||...|.|.|

This will produce two output files that can be loaded into the database test1/revert.tsv and test1/reverted.tsv. Create the table to load into:

CREATE TABLE revert (
	rev_id        INT UNSIGNED,
	rvtto_id      INT UNSIGNED,
	is_vandalism  BOOL,
	revs_reverted INT(2)
);

Then load the output file:

$ mysqlimport --local --skip-opt halfak test1/revert.tsv

This gets the base dataset into the MySQL so it can be denormalized. To denormalize, run the following command:

CREATE TABLE revert_denorm
SELECT
        rvtg.rev_id AS reverting_id,
        rvtg.rev_user_text AS reverting_username,
        rvtg.rev_user AS reverting_user_id,
        rvtg.rev_comment AS reverting_comment,
        rvtto.rev_id AS revertto_id,
        rvtto.rev_user_text AS revertto_username,
        rvtto.rev_user AS revertto_user_id,
        rvtto.rev_comment AS revertto_comment,
        revert.revs_reverted
FROM  revert
LEFT JOIN revision rvtg
        ON revert.rev_id = rvtg.rev_id
LEFT JOIN revision rvtto
        ON revert.rvtto_id = rvtto.rev_id;

Finally, move the denormalized table over the original.

DROP TABLE revert;
ALTER TABLE revert_denorm RENAME TO revert;


Enjoy!

Notes[edit]

The current dataset is complete up to the January 2011 dump.