WSoR datasets/user approx registration
The user_approx_registration dataset was generated to resolve a problem with old user data. User who registered before 2006 were not given a registration date. When registration date was introduced, a script was run that back-populated registration dates based on first edit (incorrect, but pretty close). Some users never made a first edit though so their registration date stayed NULL. This table fills in the gaps by approximating the registration dates of users based on the ordering of user_id.
For example, imagine we have three users who registered in order: A, B, and C. A and C made first edits, but B didn't. Therefor, A and C have registration dates in the tables that closely reflect when they actually acquired an account. Since we knot that A, B, and C registered in that order, we can infer that B's actual registration date is between A and C. This table takes advantage of that and simply guesses that B registered an account when C did. Since new registrations come in pretty quick, this ends up being a close-enough approximation for most analyses.
Location
[edit]db42:halfak.user_approx_registration
halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN halfak.user_approx_registration;SELECT * FROM halfak.user_approx_registration LIMIT 3;" +-------------------+-----------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------------+-----------------+------+-----+---------+-------+ | user_id | int(5) unsigned | YES | | NULL | | | user_registration | varbinary(14) | YES | | NULL | | +-------------------+-----------------+------+-----+---------+-------+ +---------+-------------------+ | user_id | user_registration | +---------+-------------------+ | 723304 | 20051225171019 | | 723284 | 20051225165527 | | 723260 | 20051225164334 | +---------+-------------------+
Fields
[edit]Each row represents a user without a registration date and the
user_id
: Row identifier. Same as user.user_iduser_registration
: The approximated registration date.
Reproduction
[edit]Determine the when users were first given real registration dates. This should ever change.
(env)halfak@internproxy:~$ mysql -h db42 -e "SELECT user_registration FROM enwiki.user WHERE user_id = (SELECT MAX(user_id)+1 FROM enwiki.user WHERE user_registration IS NULL);" +-------------------+ | user_registration | +-------------------+ | 20051225171019 | +-------------------+
Run this script using the registration date as a cutoff. It doesn't hurt to go a little bit after the above cutoff to make sure you start with sane data. When the script prints a ".", it found a user with a registration date. When it prints a "!", it found a user without one and approximated.
halfak@internproxy:~/Sandbox/wsor/scripts$ python user_approx_registration.py -s db42 20060101000000 > user_approx_registration.tsv Aug-17 01:33:39 INFO Connecting to db42:enwiki using /home/halfak/.my.cnf. .!..................!.......................!.........................!....!. ....!..................!..................................................... .......!...........................................................!......... ..........!..........................................!....................... ............................................................................. .............................................!.........!..................... ................................................................!............ ..........................!..!............................................... ...........................................................!................. ............................................................................. .......................................................................!..... ...!......................................................................... ...........!..!...........................................!.................. ............................!..................................
Use mysqlimport to load the output file into the database
mysqlimport --local -h db42 --skip-opt halfak user_approx_registration.tsv
Notes
[edit]This table should never need to be updated unless someone trips over a the cord that powers the machine that keeps track of user registration dates. If that day ever comes, just do what it says above.