WSoR datasets/reverted

From Meta, a Wikimedia project coordination wiki

The reverted dataset contains information about reverted revisions, who they were reverted by, which revision was reverted back to and whether the revert was for vandalism.

Location[edit]

db42:halfak.reverted_20110115

Fields[edit]

halfak@internproxy:~/Sandbox/wsor$ mysql -h db42 -e "EXPLAIN reverted_20110115;SELECT * FROM reverted_20110115 LIMIT 3" halfak
+----------------+----------------+------+-----+---------+-------+
| Field          | Type           | Null | Key | Default | Extra |
+----------------+----------------+------+-----+---------+-------+
| rev_id         | int(11)        | YES  | MUL | NULL    |       |
| username       | varbinary(255) | YES  |     | NULL    |       |
| user_id        | int(11)        | YES  |     | NULL    |       |
| comment        | varbinary(255) | YES  |     | NULL    |       |
| rvtg_id        | int(11)        | YES  | MUL | NULL    |       |
| rvtg_username  | varbinary(255) | YES  |     | NULL    |       |
| rvtg_user_id   | int(11)        | YES  |     | NULL    |       |
| rvtg_comment   | varbinary(255) | YES  |     | NULL    |       |
| rvtto_id       | int(11)        | YES  |     | NULL    |       |
| rvtto_username | varbinary(255) | YES  |     | NULL    |       |
| rvtto_user_id  | int(11)        | YES  |     | NULL    |       |
| rvtto_comment  | varbinary(255) | YES  |     | NULL    |       |
| is_vandalism   | tinyint(1)     | YES  | MUL | NULL    |       |
| revs_reverted  | int(11)        | YES  |     | NULL    |       |
+----------------+----------------+------+-----+---------+-------+
+-------------+----------------+---------+-----------------------------------------------------+-----------+---------------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+---------------+-----------------------------------------------------+--------------+---------------+
| rev_id      | username       | user_id | comment                                             | rvtg_id   | rvtg_username | rvtg_user_id | rvtg_comment                                                                                                                                     | rvtto_id  | rvtto_username | rvtto_user_id | rvtto_comment                                       | is_vandalism | revs_reverted |
+-------------+----------------+---------+-----------------------------------------------------+-----------+---------------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+---------------+-----------------------------------------------------+--------------+---------------+
|   261295963 | 88.109.188.145 |       0 |                                                     | 261309052 | Numyht        |      4660442 | [[WP:UNDO|Undid]] revision 261295963 by [[Special:Contributions/88.109.188.145|88.109.188.145]] ([[User talk:88.109.188.145|talk]]) rv unsourced | 260061081 | Red Sismey     |       1991048 | /* Acting */                                        |            0 |             1 |
|   269434849 | 75.53.209.45   |       0 | /* Total Nonstop Action Wrestling (2008-present) */ | 269442144 | JakeDHS07     |       743183 | /* Total Nonstop Action Wrestling (2008-present) */ week by week as well as incorrect                                                            | 269376850 | PCE            |       2202642 | /* Total Nonstop Action Wrestling (2008-present) */ |            0 |             1 |
|   280667690 | 86.44.89.61    |       0 |                                                     | 280684923 | TheFBH        |      5122755 | [[WP:UNDO|Undid]] revision 280667690 by [[Special:Contributions/86.44.89.61|86.44.89.61]] ([[User talk:86.44.89.61|talk]])                       | 279423089 | TheFBH         |       5122755 | /* In wrestling */                                  |            0 |             1 |
+-------------+----------------+---------+-----------------------------------------------------+-----------+---------------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+---------------+-----------------------------------------------------+--------------+---------------+

Each row represents a reverted revision.

  • Reverted Revision - One of the reverted revisions
    • rev_id: The identitifier of the reverted revision PRIMARY KEY
    • username: The username of the editor who created the reverted revision (same as revision.rev_user_text)
    • user_id: The identifier of the editor who created the reverted revision (same as revision.rev_user)
    • comment: The comment left by the reverted editor (same as revision.rev_comment)
  • Reverting Revision - The revision that performed the revert.
    • rvtg_id: The identitifier of the reverting revision
    • rvtg_username: The username of the editor who performed the revert (same as revision.rev_user_text)
    • rvtg_user_id: The identifier of the editor who performed the revert (same as revision.rev_user_text)
    • rvtg_comment: The comment left by the reverting editor
  • Reverted to Revision - The revision that was re-instated by the reverting edit.
    • rvtto_id: The identitifier of the revision that was reverted back to
    • rvtto_username: The username of the editor who created the reverted to revision
    • rvtto_user_id: The identifier of the editor who created the reverted to revision
    • rvtto_comment: The comment left by the editor who created the reverted to revision
  • Meta
    • is_vandalism: True when rvtg_comment matched D_LOOSE/D_STRICT regexp
    • revs_reverted: The total number of revisions that were reverted in this action.

Reproduction[edit]

To reproduce this dataset, install Wikimedia Utilities and run this script.

For example:

$ python reverts.py --output_prefix=test1/ enwiki.*.7z
Aug-29 16:51:36 INFO     Starting test1/ run...
Aug-29 16:51:36 INFO     Creating output file: test1/revert.tsv
Aug-29 16:51:36 INFO     Creating output file: test1/reverted.tsv
Aug-29 16:51:36 INFO     Prcoessing...
Aug-29 16:51:36 INFO     Processing dump file enwiki-20110829-pages-meta-history.sample.xml.
|.||||..|.|....|......|........|..........|...........|.|.|.|.|...|.|||...|.|.|

This will produce two output files that can be loaded into the database test1/revert.tsv and test1/reverted.tsv. Create the table to load into:

CREATE TABLE reverted (
	rev_id        INT UNSIGNED,
	rvtg_id       INT UNSIGNED,
	rvtto_id      INT UNSIGNED,
	is_vandalism  BOOL,
	revs_reverted INT(2)
);

This gets the base dataset into the MySQL so it can be denormalized. To denormalize, run the following command:

CREATE TABLE reverted_denorm
SELECT
	rvtd.rev_id,
	rvtd.rev_user_text as username,
	rvtd.rev_user as user_id,
	rvtd.rev_comment as comment,
	rvtg.rev_id as rvtg_id,
	rvtg.rev_user_text as rvtg_username,
	rvtg.rev_user as rvtg_user_id,
	rvtg.rev_comment as rvtg_comment,
	rvtto.rev_id as rvtto_id,
	rvtto.rev_user_text as rvtto_username,
	rvtto.rev_user as rvtto_user_id,
	rvtto.rev_comment as rvtto_comment,
	reverted.is_vandalism,
	reverted.revs_reverted
FROM reverted
LEFT JOIN revision rvtd
	ON reverted.rev_id = rvtd.rev_id
LEFT JOIN revision rvtg
	ON reverted.rvtg_id = rvtg.rev_id
LEFT JOIN revision rvtto
	ON reverted.rvtto_id = rvtto.rev_id;

Finally, move the denormalized table over the original.

DROP TABLE reverted;
ALTER TABLE reverted_denorm RENAME TO reverted;

Notes[edit]

The current dataset is complete up to the January 2011 dump.