WSoR datasets/reverted
From Meta, a Wikimedia project coordination wiki
The reverted dataset contains information about reverted revisions, who they were reverted by, which revision was reverted back to and whether the revert was for vandalism.
Contents |
Location [edit]
db42:halfak.reverted_20110115
Fields [edit]
halfak@internproxy:~/Sandbox/wsor$ mysql -h db42 -e "EXPLAIN reverted_20110115;SELECT * FROM reverted_20110115 LIMIT 3" halfak +----------------+----------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------------+----------------+------+-----+---------+-------+ | rev_id | int(11) | YES | MUL | NULL | | | username | varbinary(255) | YES | | NULL | | | user_id | int(11) | YES | | NULL | | | comment | varbinary(255) | YES | | NULL | | | rvtg_id | int(11) | YES | MUL | NULL | | | rvtg_username | varbinary(255) | YES | | NULL | | | rvtg_user_id | int(11) | YES | | NULL | | | rvtg_comment | varbinary(255) | YES | | NULL | | | rvtto_id | int(11) | YES | | NULL | | | rvtto_username | varbinary(255) | YES | | NULL | | | rvtto_user_id | int(11) | YES | | NULL | | | rvtto_comment | varbinary(255) | YES | | NULL | | | is_vandalism | tinyint(1) | YES | MUL | NULL | | | revs_reverted | int(11) | YES | | NULL | | +----------------+----------------+------+-----+---------+-------+ +-------------+----------------+---------+-----------------------------------------------------+-----------+---------------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+---------------+-----------------------------------------------------+--------------+---------------+ | rev_id | username | user_id | comment | rvtg_id | rvtg_username | rvtg_user_id | rvtg_comment | rvtto_id | rvtto_username | rvtto_user_id | rvtto_comment | is_vandalism | revs_reverted | +-------------+----------------+---------+-----------------------------------------------------+-----------+---------------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+---------------+-----------------------------------------------------+--------------+---------------+ | 261295963 | 88.109.188.145 | 0 | | 261309052 | Numyht | 4660442 | [[WP:UNDO|Undid]] revision 261295963 by [[Special:Contributions/88.109.188.145|88.109.188.145]] ([[User talk:88.109.188.145|talk]]) rv unsourced | 260061081 | Red Sismey | 1991048 | /* Acting */ | 0 | 1 | | 269434849 | 75.53.209.45 | 0 | /* Total Nonstop Action Wrestling (2008-present) */ | 269442144 | JakeDHS07 | 743183 | /* Total Nonstop Action Wrestling (2008-present) */ week by week as well as incorrect | 269376850 | PCE | 2202642 | /* Total Nonstop Action Wrestling (2008-present) */ | 0 | 1 | | 280667690 | 86.44.89.61 | 0 | | 280684923 | TheFBH | 5122755 | [[WP:UNDO|Undid]] revision 280667690 by [[Special:Contributions/86.44.89.61|86.44.89.61]] ([[User talk:86.44.89.61|talk]]) | 279423089 | TheFBH | 5122755 | /* In wrestling */ | 0 | 1 | +-------------+----------------+---------+-----------------------------------------------------+-----------+---------------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+---------------+-----------------------------------------------------+--------------+---------------+
Each row represents a reverted revision.
- Reverted Revision - One of the reverted revisions
rev_id: The identitifier of the reverted revision PRIMARY KEYusername: The username of the editor who created the reverted revision (same as revision.rev_user_text)user_id: The identifier of the editor who created the reverted revision (same as revision.rev_user)comment: The comment left by the reverted editor (same as revision.rev_comment)
- Reverting Revision - The revision that performed the revert.
rvtg_id: The identitifier of the reverting revisionrvtg_username: The username of the editor who performed the revert (same as revision.rev_user_text)rvtg_user_id: The identifier of the editor who performed the revert (same as revision.rev_user_text)rvtg_comment: The comment left by the reverting editor
- Reverted to Revision - The revision that was re-instated by the reverting edit.
rvtto_id: The identitifier of the revision that was reverted back torvtto_username: The username of the editor who created the reverted to revisionrvtto_user_id: The identifier of the editor who created the reverted to revisionrvtto_comment: The comment left by the editor who created the reverted to revision
- Meta
is_vandalism: True whenrvtg_commentmatched D_LOOSE/D_STRICT regexprevs_reverted: The total number of revisions that were reverted in this action.
Reproduction [edit]
To reproduce this dataset, install Wikimedia Utilities and run this script.
For example:
$ python reverts.py --output_prefix=test1/ enwiki.*.7z Aug-29 16:51:36 INFO Starting test1/ run... Aug-29 16:51:36 INFO Creating output file: test1/revert.tsv Aug-29 16:51:36 INFO Creating output file: test1/reverted.tsv Aug-29 16:51:36 INFO Prcoessing... Aug-29 16:51:36 INFO Processing dump file enwiki-20110829-pages-meta-history.sample.xml. |.||||..|.|....|......|........|..........|...........|.|.|.|.|...|.|||...|.|.|
This will produce two output files that can be loaded into the database test1/revert.tsv and test1/reverted.tsv. Create the table to load into:
CREATE TABLE reverted ( rev_id INT UNSIGNED, rvtg_id INT UNSIGNED, rvtto_id INT UNSIGNED, is_vandalism BOOL, revs_reverted INT(2) );
This gets the base dataset into the MySQL so it can be denormalized. To denormalize, run the following command:
CREATE TABLE reverted_denorm SELECT rvtd.rev_id, rvtd.rev_user_text AS username, rvtd.rev_user AS user_id, rvtd.rev_comment AS comment, rvtg.rev_id AS rvtg_id, rvtg.rev_user_text AS rvtg_username, rvtg.rev_user AS rvtg_user_id, rvtg.rev_comment AS rvtg_comment, rvtto.rev_id AS rvtto_id, rvtto.rev_user_text AS rvtto_username, rvtto.rev_user AS rvtto_user_id, rvtto.rev_comment AS rvtto_comment, reverted.is_vandalism, reverted.revs_reverted FROM reverted LEFT JOIN revision rvtd ON reverted.rev_id = rvtd.rev_id LEFT JOIN revision rvtg ON reverted.rvtg_id = rvtg.rev_id LEFT JOIN revision rvtto ON reverted.rvtto_id = rvtto.rev_id;
Finally, move the denormalized table over the original.
DROP TABLE reverted; ALTER TABLE reverted_denorm RENAME TO reverted;
Notes [edit]
The current dataset is complete up to the January 2011 dump.