WSoR datasets/rev len changed
The rev_len_changed dataset was generated by matching each revision with its preceding revision and tracking the change in the rev_len
column in the database. The rev_len
column contains the length of a revision text in UTF8 encoded bytes. len_change could be interpreted as the number of "bytes changed" for the total length of a revision.
It would be more appropriate to perform a real diff of article text between revisions to detect content added and removed during the same edit. This approach offers a rough approximation under the assumption that the vast majority of edits either add or remove content.
Location
[edit]db42:halfak.rev_len_changed
halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN rev_len_changed;SELECT * FROM rev_len_changed LIMIT 3" halfak
+---------------+-----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------+------+-----+---------+-------+
| rev_id | int(8) unsigned | NO | PRI | 0 | |
| rev_timestamp | varbinary(14) | NO | | | |
| rev_year | int(4) | YES | | NULL | |
| rev_month | int(2) | YES | | NULL | |
| rev_day | int(2) | YES | | NULL | |
| rev_len | int(8) unsigned | YES | | NULL | |
| user_id | int(5) unsigned | NO | MUL | 0 | |
| user_text | varbinary(255) | NO | | | |
| page_id | int(8) unsigned | NO | | 0 | |
| namespace | int(11) | NO | | 0 | |
| parent_id | int(8) unsigned | YES | | NULL | |
| len_change | decimal(12,0) | YES | | NULL | |
+---------------+-----------------+------+-----+---------+-------+
+----------+----------------+----------+-----------+---------+---------+---------+---------------+---------+-----------+-----------+------------+
| rev_id | rev_timestamp | rev_year | rev_month | rev_day | rev_len | user_id | user_text | page_id | namespace | parent_id | len_change |
+----------+----------------+----------+-----------+---------+---------+---------+---------------+---------+-----------+-----------+------------+
| 62681568 | 20060708052419 | 2006 | 7 | 8 | 32 | 294714 | Kyorosuke | 5878274 | 0 | 0 | 32 |
| 62681934 | 20060708052829 | 2006 | 7 | 8 | 30 | 294714 | Kyorosuke | 5878274 | 0 | 62681568 | -2 |
| 77574939 | 20060924184839 | 2006 | 9 | 24 | 62 | 0 | 82.33.170.190 | 5878274 | 0 | 62681934 | 32 |
+----------+----------------+----------+-----------+---------+---------+---------+---------------+---------+-----------+-----------+------------+
Fields
[edit]Each row represents a revision and the change in article length from the edit that created it.
rev_id
: Row identifier. Same as revision.rev_idrev_timestamp
: Same as revision.rev_timestamprev_year
: Extracted from rev_timestamp to be indexed individually.rev_month
: Extracted from rev_timestamp to be indexed individually.rev_day
: Extracted from rev_timestamp to be indexed individually.rev_len
: Same as revision.rev_lenuser_id
: Same as revision.rev_useruser_text
: Same as revision.rev_user_textpage_id
: Same as revision.rev_pagenamespace
: Namespace of page at time of query.parent_id
: Same as revision.parent_idlen_change
: Change in total length since last revision of page.
Reproduction
[edit]Run this query. If things are indexed well, it shouldn't even need to create a temporary table. However, creating this table took 3+ days on a server by itself.
CREATE TABLE halfak.rev_len_changed
SELECT
c.rev_id,
c.rev_timestamp,
YEAR(c.rev_timestamp) AS rev_year,
MONTH(c.rev_timestamp) AS rev_month,
DAY(c.rev_timestamp) AS rev_day,
c.rev_len,
c.rev_user AS user_id,
c.rev_user_text AS user_text,
c.rev_page AS page_id,
cp.page_namespace AS namespace,
c.rev_parent_id AS parent_id,
c.rev_len - IFNULL(p.rev_len, 0) AS len_change
FROM revision c
LEFT JOIN revision p
ON c.rev_parent_id = p.rev_id
INNER JOIN page cp
ON c.rev_page = cp.page_id;
The table can be (much more quickly) updated with the following query:
INSERT INTO halfak.rev_len_changed
SELECT
c.rev_id,
c.rev_timestamp,
YEAR(c.rev_timestamp) AS rev_year,
MONTH(c.rev_timestamp) AS rev_month,
DAY(c.rev_timestamp) AS rev_day,
c.rev_len,
c.rev_user AS user_id,
c.rev_user_text AS user_text,
c.rev_page AS page_id,
cp.page_namespace AS namespace,
c.rev_parent_id AS parent_id,
c.rev_len - IFNULL(p.rev_len, 0) AS len_change
FROM revision c
LEFT JOIN revision p
ON c.rev_parent_id = p.rev_id
INNER JOIN page cp
ON c.rev_page = cp.page_id
WHERE c.rev_id > (SELECT MAX(rev_id) FROM halfak.rev_len_changed);
Notes
[edit]This table was generated from July 27th, 2011 to Aug. 2nd, 2011, so some pages will have revisions up until July 27th, while others may have revisions up until Aug. 2nd. To negate this problem, simply limit rev_timestamp <= "20110727000000"
.