WSoR datasets/user cohort

From Meta, a Wikimedia project coordination wiki

The user_cohort dataset was generated to allow faster selection of editors by the year and month in which they started editing.

Location[edit]

db42:halfak.user_cohort

halfak@internproxy:~/Sandbox/wsor/scripts$ mysql -h db42 -e "EXPLAIN halfak.user_cohort;SELECT * FROM halfak.user_cohort LIMIT 3;"
+------------------+-----------------+------+-----+---------+-------+
| Field            | Type            | Null | Key | Default | Extra |
+------------------+-----------------+------+-----+---------+-------+
| user_id          | int(5) unsigned | NO   | PRI | 0       |       |
| user_name        | varbinary(255)  | NO   |     |         |       |
| first_edit       | varbinary(14)   | YES  | MUL | NULL    |       |
| first_edit_year  | int(4)          | YES  | MUL | NULL    |       |
| first_edit_month | int(2)          | YES  |     | NULL    |       |
| last_edit        | varbinary(14)   | YES  |     | NULL    |       |
+------------------+-----------------+------+-----+---------+-------+
+---------+------------------+----------------+-----------------+------------------+----------------+
| user_id | user_name        | first_edit     | first_edit_year | first_edit_month | last_edit      |
+---------+------------------+----------------+-----------------+------------------+----------------+
|       1 | Damian Yerrick   | 20010929004320 |            2001 |                9 | 20110715131605 |
|       2 | AxelBoldt        | 20010726145009 |            2001 |                7 | 20110715175901 |
|       3 | Tobias Hoevekamp | 20010326202105 |            2001 |                3 | 20040329205621 |
+---------+------------------+----------------+-----------------+------------------+----------------+

Fields[edit]

Each row represents a user. There should be a row in this table for every user who made at least one edit (even if it was in a deleted page).

  • user_id: The identifier of a row. The user identifier from user.user_id.
  • user_name: The user's registered name.
  • first_edit: The date the user made their first edit.
  • first_edit_year: The year the user made their first edit.
  • first_edit_month: The month the user made their first edit.
  • last_edit: The date the user made their last recorded edit.

Reproduction[edit]

Run this query. It should be very fast because it only has to look at the user_id_timestamp index on the revision and archive tables.

CREATE TABLE halfak.user_cohort
SELECT
	user_id,
	user_name,
	MIN(first_edit)         AS first_edit,
	YEAR(MIN(first_edit))   AS first_edit_year,
	MONTH(MIN(first_edit))  AS first_edit_month,
	MAX(first_edit)         AS last_edit
FROM
(
SELECT
	user_id,
	user_name,
	MIN(rev_timestamp)         AS first_edit,
	YEAR(MIN(rev_timestamp))   AS first_edit_year,
	MONTH(MIN(rev_timestamp))  AS first_edit_month,
	MAX(rev_timestamp)         AS last_edit
FROM revision r
INNER JOIN user u
	ON u.user_id = r.rev_user
GROUP BY user_id
UNION
SELECT
	user_id,
	user_name,
	MIN(ar_timestamp)         AS first_edit,
	YEAR(MIN(ar_timestamp))   AS first_edit_year,
	MONTH(MIN(ar_timestamp))  AS first_edit_month,
	MAX(ar_timestamp)         AS last_edit
FROM archive a
INNER JOIN user u
	ON u.user_id = a.ar_user
GROUP BY user_id
) AS whocares_doesntmatter
GROUP BY user_id, user_name;

Notes[edit]

This table was first generated in late July of 2011 and until someone remembers to update this line to say they regenerated it, you should assume that it hasn't been updated since.