Research talk:Onboarding new Wikipedians/Rollout/Work log/2014-03-07

From Meta, a Wikimedia project coordination wiki

Friday, March 7th[edit]

The daily proportion of newly registered users who registered on a non-mobile device with a recorded Schema:GettingStartedRedirectImpression is plotted with a dashed vertical line representing the deployment datetime.
GettingStarted deployments. The daily proportion of newly registered users who registered on a non-mobile device with a recorded Schema:GettingStartedRedirectImpression is plotted with a dashed vertical line representing the deployment datetime.

Today, I'm working on trying to reason about the impact of GettingStarted across the wikis where it was deployed. In Enwiki, we ran a series of control experiments to test and refine GettingStarted there, but for the rest of the wikis, we just deployed our best version of the system.

So, in order to reason about the impact that GettingStarted had on these other wikis, I'd like to assume that the deployment represents a en:natural experiment -- and that I can reason about the effect of GettingStarted by comparing the state of the wikis before and after deployment. First things first, I want to check on the deployments to make sure that users were not seeing GettingStarted before they were supposed to and that users were seeing GettingStarted after they were supposed to.

Figure #GettingStarted deployments (to the right) plots the proportion of users eligible to receive a GettingStarted impression that also had one recorded. The vertical line represents the deployment date. It looks clear from this figure that the deployment assumptions hold. However, there are a couple of wikis that don't have enough daily registrations to plot a proportion: astwiki and glwiki. I'll have to figure out what to do with those. We may not be able to get enough observations from them to reason about GS's effectiveness there. --Halfak (WMF) (talk) 16:42, 7 March 2014 (UTC)[reply]


Time to pull a sample of users registered before and after deployments. My scripts for generating R:Productive new editor and other metrics run kind of slow because they need to detect reverts, so I don't want to process that many. It would be nice if I could process more than 5k, but less than 10k. Since all but enwiki and jawiki had deployments on 2014-02-11 18:13:00, I'll just check around that since I'm ballparking anyway.

SELECT COUNT(*) FROM (
SELECT wiki, event_userId
FROM ServerSideAccountCreation_5487345 
WHERE 
    timestamp BETWEEN 
        DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL -14 DAY), "%Y%m%d%H%i%S") AND 
        DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL 14 DAY), "%Y%m%d%H%i%S") AND
    event_isSelfMade AND 
    NOT event_displayMobile AND
    wiki IN (
        "dewiki", "elwiki", "eswiki", "fawiki", "frwiki", "fowiki", "glwiki", 
        "hewiki", "huwiki", "iswiki", "itwiki", "kowiki", "lbwiki", "mkwiki", 
        "mlwiki", "nlwiki", "plwiki", "ptwiki", "ruwiki", "simplewiki", 
        "svwiki", "viwiki", "ukwiki", "zhwiki"
    )
GROUP BY 1,2) AS sampled_users
+----------+
| COUNT(*) |
+----------+
|    92245 |
+----------+
1 row in set (8.42 sec)

Well... that's too many. It will probably make sense for me to down-sample. I could get some solid results if I look at 500 users from each wiki before and after the deployment. I have 24 (non-en) wikis. There are a few that I know I'm not going to get 500 newly registered non-mobile users from, but assuming I did, I'd need to be able to process 24000 users. That's going to take a little while, but it might be necessary. Let's see how many I'll actually get.

SELECT wiki, COUNT(*)  
FROM (
    SELECT wiki, event_userId
    FROM ServerSideAccountCreation_5487345 
    WHERE 
        timestamp BETWEEN 
            DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL -14 DAY), "%Y%m%d%H%i%S") AND 
            DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL 14 DAY), "%Y%m%d%H%i%S") AND
        event_isSelfMade AND 
        NOT event_displayMobile AND
        wiki IN (
            "dewiki", "elwiki", "eswiki", "fawiki", "frwiki", "fowiki", "glwiki", 
            "hewiki", "huwiki", "iswiki", "itwiki", "kowiki", "lbwiki", "mkwiki", 
            "mlwiki", "nlwiki", "plwiki", "ptwiki", "ruwiki", "simplewiki", 
            "svwiki", "viwiki", "ukwiki", "zhwiki"
        )
    GROUP BY 1,2
) AS sampled_users
GROUP BY wiki
ORDER BY COUNT(*) DESC;
+------------+----------+
| wiki       | COUNT(*) |
+------------+----------+
| eswiki     |    21871 |
| frwiki     |    11919 |
| zhwiki     |    10613 |
| ruwiki     |    10439 |
| dewiki     |     7599 |
| ptwiki     |     7333 |
| itwiki     |     4564 |
| fawiki     |     3392 |
| nlwiki     |     2307 |
| plwiki     |     2231 |
| viwiki     |     2159 |
| svwiki     |     1592 |
| ukwiki     |     1239 |
| kowiki     |     1196 |
| huwiki     |     1163 |
| hewiki     |     1126 |
| elwiki     |      627 |
| simplewiki |      471 |
| mlwiki     |      228 |
| mkwiki     |       72 |
| iswiki     |       69 |
| glwiki     |       25 |
| fowiki     |        5 |
| lbwiki     |        5 |
+------------+----------+
24 rows in set (8.15 sec)

So, 8 wikis don't have enough users. That leaves 16. 16000 should be OK. Time to build some SQL to sample from each of the wikis. --Halfak (WMF) (talk) 21:53, 7 March 2014 (UTC)[reply]


So, I'll be running the following query for each Wiki:

SET @first_deploy = "20140211181300";
SET @second_deploy = "20140227001800";
(
    SELECT DISTINCT wiki AS wiki, event_userId AS user_id, timestamp AS user_registration, "before deployment" AS bucket
    FROM ServerSideAccountCreation_5487345
    WHERE wiki = "dewiki" AND timestamp BETWEEN
            DATE_FORMAT(DATE_ADD(@first_deploy, INTERVAL -14 DAY), "%Y%m%d%H%i%S") AND 
            DATE_FORMAT(DATE_ADD(@first_deploy, INTERVAL -1 SECOND), "%Y%m%d%H%i%S")
    ORDER BY RAND()
    LIMIT 500
)
UNION
(
    SELECT DISTINCT wiki AS wiki, event_userId AS user_id, timestamp AS user_registration, "after deployment" AS bucket
    FROM ServerSideAccountCreation_5487345
    WHERE wiki = "dewiki" AND timestamp BETWEEN @first_deploy AND
            DATE_FORMAT(DATE_ADD(@first_deploy, INTERVAL 14 DAY), "%Y%m%d%H%i%S")
    ORDER BY RAND()
    LIMIT 500
)

--Halfak (WMF) (talk) 22:45, 7 March 2014 (UTC)[reply]


Bah! The above queries don't filter out non-self-created users and mobile users. Just fixed that and started it again.

$ wc datasets/sample_users.tsv 
 16001  80004 750038 datasets/sample_users.tsv

Alright! Sample complete. I think that's it for today. Stay tuned for building statistics about these users and looking for trends that might invalidate the experiment. --Halfak (WMF) (talk) 23:07, 7 March 2014 (UTC)[reply]