User:EpochFail/Journal/2011-10-11

From Meta, a Wikimedia project coordination wiki

Tuesday, Oct 11th[edit]

I'm gathering huggle data for the last experiment. First up is matching the huggle message events with message reading events.

CREATE TABLE halfak.huggle_posting_mk2
SELECT
	posted.user_id          AS user_id,
	posted.user_text        AS user_text,
	posted.TIMESTAMP        AS time_posted,
	MIN(consumed.TIMESTAMP) AS time_consumed
FROM halfak.huggle_posting_action_mk2 posted
LEFT JOIN halfak.huggle_posting_action_mk2 consumed
	ON posted.user_text = consumed.user_text
	AND posted.TIMESTAMP <= consumed.TIMESTAMP
	AND consumed.action = "read"
WHERE posted.action = "received"
GROUP BY posted.user_id, posted.user_text, posted.TIMESTAMP;

--mysql> select time_consumed IS NULL, count(*) from halfak.huggle_posting_mk2 group by 1;
--+-----------------------+----------+
--| time_consumed IS NULL | count(*) |
--+-----------------------+----------+
--|                     0 |     7842 |
--|                     1 |     6206 |
--+-----------------------+----------+
--2 rows in set (0.03 sec)

CREATE INDEX user_text_timestamp_idx ON halfak.huggle_posting_mk2 (user_text, time_posted);

DROP TABLE halfak.huggling_agg_mk2;
CREATE TABLE halfak.huggling_agg_mk2
SELECT
	ha2.rev_id,
	def,
	personal,
	nodirectives,
	exp_case != 'none' as experimental,
	reverted_id,
	ha2.rev_page,
	ha2.rev_user,
	ha2.rev_user_text,
	ha2.rev_timestamp,
	page_id,
	warned_user,
	IF(hp2.time_posted IS NULL, ha2.rev_timestamp, MIN(hp2.time_consumed)) AS message_consumed,
	COUNT(a.ar_rev_id) > 0 AS deleted_revs
FROM staeiou.huggling_agg_mk2 ha2
LEFT JOIN halfak.huggle_posting_mk2 hp2
	ON ha2.warned_user = hp2.user_text
	AND hp2.time_posted >= ha2.rev_timestamp
LEFT JOIN enwiki.archive a
	ON ha2.warned_user = a.ar_user_text
GROUP BY ha2.rev_id;


SELECT 
	experimental,
	deleted_revs,
	message_consumed IS NOT NULL as message_consumed,
	warned_user NOT RLIKE "[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}" as registered, 
	count(*) 
FROM halfak.huggling_agg_mk2 
GROUP BY 1, 2, 3
WITH ROLLUP;
--+--------------+--------------+------------------+------------+----------+
--| experimental | deleted_revs | message_consumed | registered | count(*) |
--+--------------+--------------+------------------+------------+----------+
--|            0 |            0 |                0 |          0 |     1488 |
--|            0 |            0 |                1 |          0 |     2060 |
--|            0 |            0 |             NULL |          0 |     3548 |
--|            0 |            1 |                0 |          0 |       60 |
--|            0 |            1 |                1 |          1 |      366 |
--|            0 |            1 |             NULL |          1 |      426 |
--|            0 |         NULL |             NULL |          1 |     3974 |
--|            1 |            0 |                0 |          0 |     1917 |
--|            1 |            0 |                1 |          1 |     1984 |
--|            1 |            0 |             NULL |          1 |     3901 |
--|            1 |            1 |                0 |          1 |      107 |
--|            1 |            1 |                1 |          0 |      504 |
--|            1 |            1 |             NULL |          0 |      611 |
--|            1 |         NULL |             NULL |          0 |     4512 |
--|         NULL |         NULL |             NULL |          0 |     8486 |
--+--------------+--------------+------------------+------------+----------+
--15 rows in set (0.05 sec)

Everything looks good. I spot checked a couple and they look to be correct. It also appears that registered users are more likely to read their messages which makes a lot of sense.

Now it is time to generate the HTML files for Jonathan.

20:28, 11 October 2011 (UTC)

I actually just re-worked the above to include a field for editors with deleted revisions. It looks like we have 504 instances of an editors who read their message, but have deleted revisions. That's a lot more than I thought, but I'm trudging forward. We'll deal with them later.  :\

20:56, 11 October 2011 (UTC)

SELECT
	warned_user NOT RLIKE "[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}" as registered, 
	count(*) 
FROM halfak.huggling_agg_mk2 
WHERE exp_case != "none"
AND NOT deleted_revs
AND message_consumed IS NOT NULL
GROUP BY 1
WITH ROLLUP;
--+------------+----------+
--| registered | count(*) |
--+------------+----------+
--|          0 |     1793 |
--|          1 |      191 |
--|       NULL |     1984 |
--+------------+----------+
--3 rows in set (0.06 sec)

It looks like we are going to have 1984 users to code with our volunteers (along with the 504 noted above).

21:02, 11 October 2011 (UTC)

halfak@internproxy:~/data/newbie_warnings/mk2$ mysql -hdb1047 -e "SELECT warned_user, message_consumed FROM huggling_agg_mk2 WHERE NOT deleted_revs AND experimental AND message_consumed IS NOT NULL" -N halfak > hugglings.message_consumed.no_deleted.tsv
halfak@internproxy:~/data/newbie_warnings/mk2$ wc hugglings.message_consumed.no_deleted.tsv 
 1984  3992 57118 hugglings.message_consumed.no_deleted.tsv

I better limit my after-coding to only those editors who performed work afterward.

DROP TABLE halfak.huggling_agg_revs_after;
CREATE TABLE halfak.huggling_agg_revs_after
SELECT 
	ha2.rev_id, 
	ha2.warned_user, 
	ha2.message_consumed, 
	COUNT(DISTINCT r.rev_id)+COUNT(DISTINCT a.ar_rev_id) AS revs_after
FROM halfak.huggling_agg_mk2 ha2
LEFT JOIN enwiki.revision r
	ON r.rev_user_text = warned_user
	AND r.rev_timestamp > ha2.message_consumed
LEFT JOIN enwiki.archive a
	ON a.ar_user_text = warned_user
	AND a.ar_timestamp > ha2.message_consumed
GROUP BY ha2.rev_id;
CREATE INDEX rev_idx ON halfak.huggling_agg_revs_after (rev_id);

ALTER TABLE halfak.huggling_agg_mk2 ADD COLUMN revs_after INT;
UPDATE halfak.huggling_agg_mk2 ha2, halfak.huggling_agg_revs_after hara
SET ha2.revs_after = hara.revs_after
WHERE ha2.rev_id = hara.rev_id;
halfak@internproxy:~/data/newbie_warnings/mk2$ mysql -hdb1047 -e "SELECT warned_user, message_consumed FROM huggling_agg_mk2 WHERE NOT deleted_revs AND experimental AND message_consumed IS NOT NULL AND revs_after > 0" -N halfak > hugglings.message_consumed.no_deleted.revs_after.tsv

OK... Well this is weird, but I can't seem to write a file to /a on internproxy. I'll move on to script writing in the meantime.

00:06, 12 October 2011 (UTC)

I wrote a script to take advantage of Stu's script for gathering contrib histories for users in HTML files. It is working and I've done a couple of test runs, but now I'm stuck on internproxy again. I have Asher working on the problem and he said it doesn't look good. Well... we've got our datasets and I have my code committed so hopefully we can catch up in the morning.

Singing off. --EpochFail 00:06, 12 October 2011 (UTC)