Research talk:Active editor spike 2015/Work log/2015-03-15

From Meta, a Wikimedia project coordination wiki

Sunday, March 15, 2015[edit]

Today, I'm going to try to kick off a job to re-generate modeling monthly active editors for the new timespan.

First I want to check how far my old data goes.

> select month, sum(revisions) from editor_month GROUP BY month order by month DESC;
+--------+----------------+
| month  | sum(revisions) |
+--------+----------------+
| 202505 |              5 |
| 201406 |        8906963 |
| 201405 |       26546076 |
| 201404 |       21148315 |
| 201403 |       23712258 |
| 201402 |       21989246 |

So it looks like I want to trim off 201406 and re-generate that data.

> select count(*) from editor_month where month >= "201406";
+----------+
| count(*) |
+----------+
|   190671 |
+----------+
1 row in set (13.18 sec)

It looks like that will get quite a few rows.

> delete from editor_month where month >= "201406";
Query OK, 190671 rows affected (27.89 sec)

Done OK. Now to update my data gathering query so that I can add only the missing data since may, 2014.

SELECT
    wiki,
    month,
    user_id,
    user_name,
    user_registration,
    SUM(revisions * archived) AS archived,
    SUM(revisions) AS revisions
FROM (
    SELECT
        LEFT(rev_timestamp, 6) AS month,
        DATABASE() AS wiki,
        rev_user AS user_id,
        FALSE AS archived,
        COUNT(*) AS revisions
    FROM revision
    WHERE rev_timestamp >= "201406"
    GROUP BY LEFT(rev_timestamp, 6), rev_user

    UNION ALL

    SELECT
        LEFT(ar_timestamp, 6) AS month,
        DATABASE() AS wiki,
        ar_user AS user_id,
        TRUE AS archived,
        COUNT(*) AS revisions
    FROM archive
    WHERE ar_timestamp >= "201406"
    GROUP BY LEFT(ar_timestamp, 6), ar_user
) AS editor_months
INNER JOIN user USING (user_id)
GROUP BY wiki, month, user_id
ORDER BY wiki, month;

... And that's going to take a bit. --Halfak (WMF) (talk) 18:57, 15 March 2015 (UTC)[reply]