Jump to content

WSoR datasets/rev len changed

From Meta, a Wikimedia project coordination wiki

The rev_len_changed dataset was generated by matching each revision with its preceding revision and tracking the change in the rev_len column in the database. The rev_len column contains the length of a revision text in UTF8 encoded bytes. len_change could be interpreted as the number of "bytes changed" for the total length of a revision.

It would be more appropriate to perform a real diff of article text between revisions to detect content added and removed during the same edit. This approach offers a rough approximation under the assumption that the vast majority of edits either add or remove content.

Location

[edit]

db42:halfak.rev_len_changed

halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN rev_len_changed;SELECT * FROM rev_len_changed LIMIT 3" halfak
+---------------+-----------------+------+-----+---------+-------+
| Field         | Type            | Null | Key | Default | Extra |
+---------------+-----------------+------+-----+---------+-------+
| rev_id        | int(8) unsigned | NO   | PRI | 0       |       |
| rev_timestamp | varbinary(14)   | NO   |     |         |       |
| rev_year      | int(4)          | YES  |     | NULL    |       |
| rev_month     | int(2)          | YES  |     | NULL    |       |
| rev_day       | int(2)          | YES  |     | NULL    |       |
| rev_len       | int(8) unsigned | YES  |     | NULL    |       |
| user_id       | int(5) unsigned | NO   | MUL | 0       |       |
| user_text     | varbinary(255)  | NO   |     |         |       |
| page_id       | int(8) unsigned | NO   |     | 0       |       |
| namespace     | int(11)         | NO   |     | 0       |       |
| parent_id     | int(8) unsigned | YES  |     | NULL    |       |
| len_change    | decimal(12,0)   | YES  |     | NULL    |       |
+---------------+-----------------+------+-----+---------+-------+
+----------+----------------+----------+-----------+---------+---------+---------+---------------+---------+-----------+-----------+------------+
| rev_id   | rev_timestamp  | rev_year | rev_month | rev_day | rev_len | user_id | user_text     | page_id | namespace | parent_id | len_change |
+----------+----------------+----------+-----------+---------+---------+---------+---------------+---------+-----------+-----------+------------+
| 62681568 | 20060708052419 |     2006 |         7 |       8 |      32 |  294714 | Kyorosuke     | 5878274 |         0 |         0 |         32 |
| 62681934 | 20060708052829 |     2006 |         7 |       8 |      30 |  294714 | Kyorosuke     | 5878274 |         0 |  62681568 |         -2 |
| 77574939 | 20060924184839 |     2006 |         9 |      24 |      62 |       0 | 82.33.170.190 | 5878274 |         0 |  62681934 |         32 |
+----------+----------------+----------+-----------+---------+---------+---------+---------------+---------+-----------+-----------+------------+

Fields

[edit]

Each row represents a revision and the change in article length from the edit that created it.

  • rev_id: Row identifier. Same as revision.rev_id
  • rev_timestamp: Same as revision.rev_timestamp
  • rev_year: Extracted from rev_timestamp to be indexed individually.
  • rev_month: Extracted from rev_timestamp to be indexed individually.
  • rev_day: Extracted from rev_timestamp to be indexed individually.
  • rev_len: Same as revision.rev_len
  • user_id: Same as revision.rev_user
  • user_text: Same as revision.rev_user_text
  • page_id: Same as revision.rev_page
  • namespace: Namespace of page at time of query.
  • parent_id: Same as revision.parent_id
  • len_change: Change in total length since last revision of page.

Reproduction

[edit]

Run this query. If things are indexed well, it shouldn't even need to create a temporary table. However, creating this table took 3+ days on a server by itself.

CREATE TABLE halfak.rev_len_changed
SELECT
	c.rev_id,
	c.rev_timestamp,
	YEAR(c.rev_timestamp)             AS rev_year,
	MONTH(c.rev_timestamp)            AS rev_month,
	DAY(c.rev_timestamp)              AS rev_day,
	c.rev_len,
	c.rev_user                        AS user_id,
	c.rev_user_text                   AS user_text,
	c.rev_page                        AS page_id,
	cp.page_namespace                 AS namespace,
	c.rev_parent_id                   AS parent_id,
	c.rev_len - IFNULL(p.rev_len, 0)  AS len_change
FROM revision c
LEFT JOIN revision p
	ON c.rev_parent_id = p.rev_id
INNER JOIN page cp
	ON c.rev_page = cp.page_id;

The table can be (much more quickly) updated with the following query:

INSERT INTO halfak.rev_len_changed
SELECT
	c.rev_id,
	c.rev_timestamp,
	YEAR(c.rev_timestamp)             AS rev_year,
	MONTH(c.rev_timestamp)            AS rev_month,
	DAY(c.rev_timestamp)              AS rev_day,
	c.rev_len,
	c.rev_user                        AS user_id,
	c.rev_user_text                   AS user_text,
	c.rev_page                        AS page_id,
	cp.page_namespace                 AS namespace,
	c.rev_parent_id                   AS parent_id,
	c.rev_len - IFNULL(p.rev_len, 0)  AS len_change
FROM revision c
LEFT JOIN revision p
	ON c.rev_parent_id = p.rev_id
INNER JOIN page cp
	ON c.rev_page = cp.page_id
WHERE c.rev_id > (SELECT MAX(rev_id) FROM halfak.rev_len_changed);

Notes

[edit]

This table was generated from July 27th, 2011 to Aug. 2nd, 2011, so some pages will have revisions up until July 27th, while others may have revisions up until Aug. 2nd. To negate this problem, simply limit rev_timestamp <= "20110727000000".