Jump to content

WSoR datasets/user approx registration

From Meta, a Wikimedia project coordination wiki

The user_approx_registration dataset was generated to resolve a problem with old user data. User who registered before 2006 were not given a registration date. When registration date was introduced, a script was run that back-populated registration dates based on first edit (incorrect, but pretty close). Some users never made a first edit though so their registration date stayed NULL. This table fills in the gaps by approximating the registration dates of users based on the ordering of user_id.

For example, imagine we have three users who registered in order: A, B, and C. A and C made first edits, but B didn't. Therefor, A and C have registration dates in the tables that closely reflect when they actually acquired an account. Since we knot that A, B, and C registered in that order, we can infer that B's actual registration date is between A and C. This table takes advantage of that and simply guesses that B registered an account when C did. Since new registrations come in pretty quick, this ends up being a close-enough approximation for most analyses.

Location

[edit]

db42:halfak.user_approx_registration

halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN halfak.user_approx_registration;SELECT * FROM halfak.user_approx_registration LIMIT 3;"
+-------------------+-----------------+------+-----+---------+-------+
| Field             | Type            | Null | Key | Default | Extra |
+-------------------+-----------------+------+-----+---------+-------+
| user_id           | int(5) unsigned | YES  |     | NULL    |       |
| user_registration | varbinary(14)   | YES  |     | NULL    |       |
+-------------------+-----------------+------+-----+---------+-------+
+---------+-------------------+
| user_id | user_registration |
+---------+-------------------+
|  723304 | 20051225171019    |
|  723284 | 20051225165527    |
|  723260 | 20051225164334    |
+---------+-------------------+

Fields

[edit]

Each row represents a user without a registration date and the

  • user_id: Row identifier. Same as user.user_id
  • user_registration: The approximated registration date.

Reproduction

[edit]

Determine the when users were first given real registration dates. This should ever change.

(env)halfak@internproxy:~$ mysql -h db42 -e "SELECT user_registration FROM enwiki.user WHERE user_id = (SELECT MAX(user_id)+1 FROM enwiki.user WHERE user_registration IS NULL);"
+-------------------+
| user_registration |
+-------------------+
| 20051225171019    |
+-------------------+

Run this script using the registration date as a cutoff. It doesn't hurt to go a little bit after the above cutoff to make sure you start with sane data. When the script prints a ".", it found a user with a registration date. When it prints a "!", it found a user without one and approximated.

halfak@internproxy:~/Sandbox/wsor/scripts$ python user_approx_registration.py -s db42 20060101000000 > user_approx_registration.tsv
Aug-17 01:33:39 INFO     Connecting to db42:enwiki using /home/halfak/.my.cnf.
.!..................!.......................!.........................!....!.
....!..................!.....................................................
.......!...........................................................!.........
..........!..........................................!.......................
.............................................................................
.............................................!.........!.....................
................................................................!............
..........................!..!...............................................
...........................................................!.................
.............................................................................
.......................................................................!.....
...!.........................................................................
...........!..!...........................................!..................
............................!..................................

Use mysqlimport to load the output file into the database

mysqlimport --local -h db42 --skip-opt halfak user_approx_registration.tsv

Notes

[edit]

This table should never need to be updated unless someone trips over a the cord that powers the machine that keeps track of user registration dates. If that day ever comes, just do what it says above.