Research talk:VisualEditor's effect on newly registered editors/Work log/2015-04-10

From Meta, a Wikimedia project coordination wiki

Friday, April 10, 2015

OK. Looks like we're doing this experiment again. It's been a while, but we're doing roughly the same thing as last time, so we should be able to re-use some old work. From an experimental point of view, the only thing that went wrong with the last study was that logging was broken, so we were not able to look at edit success rates.

So, I'm going to focus on logging this time. I had a meeting with User:DAndreescu to get a brain dump of what he knew about Schema:Edit based on the funnel analysis work that he did for the mw:Editing team. It turns out that there are a whole host of issues with the schema. But I'm primarily interested in being able to extract the following class of measurements:

  • edit time: Time taken to complete edit (save time - page load time)
  • edit completion rate: Proportion of edits completed
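
As a rough illustration of the two measurements above, here is a minimal sketch in Python. The event layout (dicts with `session`, `action`, and `ts` keys) and the `"init"` action name are assumptions made for the example and may not match the real Schema:Edit fields; only `saveSuccess` is taken from the logged data.

```python
def session_metrics(events):
    """Return (edit_times, completion_rate) for a list of event dicts.

    Each event is assumed to look like
        {"session": str, "action": str, "ts": float}   # hypothetical layout
    where "init" (an assumed name) marks page load and "saveSuccess"
    marks a completed save.
    """
    starts = {}  # session -> earliest page load timestamp
    saves = {}   # session -> earliest successful save timestamp
    for e in events:
        sid, ts = e["session"], e["ts"]
        if e["action"] == "init":
            starts[sid] = min(ts, starts.get(sid, ts))
        elif e["action"] == "saveSuccess":
            saves[sid] = min(ts, saves.get(sid, ts))

    # edit time = save time - page load time, for sessions that have both
    edit_times = {s: saves[s] - starts[s] for s in starts if s in saves}
    # completion rate = completed sessions / started sessions
    completion_rate = len(edit_times) / len(starts) if starts else 0.0
    return edit_times, completion_rate


example_events = [
    {"session": "a", "action": "init", "ts": 0.0},
    {"session": "a", "action": "saveSuccess", "ts": 42.5},
    {"session": "b", "action": "init", "ts": 10.0},
    {"session": "b", "action": "abort", "ts": 15.0},
]
times, rate = session_metrics(example_events)
```

Sessions that never log a start event are ignored here, which is exactly why a reliable "start of edit session" event matters.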

So, really, I need to reliably get an event that represents the start of an "edit session" (note that the term here is overloaded to refer to the process of performing an edit -- not the collection of edits discussed in R:edit session).

In a perfect world, I'd look for sessions in Schema:Edit that do not abort with action.abort.type == "nochange" (since leaving the editor without touching anything is a common action that isn't necessarily a failure).
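
The filtering described above could look something like the following sketch. The per-session summary dicts and their keys (`outcome`, `abort_type`) are hypothetical stand-ins for whatever actually gets extracted from Schema:Edit, and `"abandon"` is a made-up abort type used only for the example; only `"nochange"` comes from the schema discussion.

```python
def completion_rate_excluding_nochange(sessions):
    """Proportion of sessions that saved, ignoring 'nochange' aborts.

    sessions: iterable of dicts with hypothetical keys
        "outcome":    "saved" or "abort"
        "abort_type": e.g. "nochange" (None or missing when saved)
    Returns None when no sessions remain after filtering.
    """
    considered = [
        s for s in sessions
        if not (s["outcome"] == "abort" and s.get("abort_type") == "nochange")
    ]
    if not considered:
        return None  # nothing left to measure
    saved = sum(1 for s in considered if s["outcome"] == "saved")
    return saved / len(considered)


example_sessions = [
    {"outcome": "saved", "abort_type": None},
    {"outcome": "abort", "abort_type": "nochange"},  # dropped from the denominator
    {"outcome": "abort", "abort_type": "abandon"},   # hypothetical abort type
]
example_rate = completion_rate_excluding_nochange(example_sessions)
```

Dropping "nochange" aborts from the denominator, rather than counting them as failures, keeps people who opened the editor and left without touching anything from dragging the rate down.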

However, it seems that there are potential show-stopper bugs.

user.class is sometimes wrong

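I just checked the rates at which we see weirdness. It looks like about 0.8% (628 of 82,642) of edits saved by registered editors have saveSuccess events associated with user.class = "IP". Conversely, 99.5% (27,063 of 27,198) of revisions saved by logged-out users do have user.class = "IP". (In the output below, the "rev_user = 0" column is 1 for logged-out saves and 0 for registered ones.)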

    > select rev_user = 0, sum(`event_user.id` = "IP"), COUNT(*) from Edit_11448630 INNER JOIN enwiki.revision ON rev_id = `event_page.revid` WHERE wiki = 'enwiki' AND timestamp BETWEEN "20150401" AND "20150402" AND event_action = "saveSuccess" GROUP BY 1;
    +--------------+-----------------------------+----------+
    | rev_user = 0 | sum(`event_user.id` = "IP") | COUNT(*) |
    +--------------+-----------------------------+----------+
    |            0 |                         628 |    82642 |
    |            1 |                       27063 |    27198 |
    +--------------+-----------------------------+----------+
    2 rows in set, 1 warning (4 min 39.80 sec)

I also noticed that the rev_user stored in the revision table sometimes doesn't match the user.id in the schema. This happens 1.5% of the time (1,244 of 82,642) when the saved edit has rev_user != 0. It happens less often, about 0.5% of the time (135 of 27,198), when the user who saved the edit was logged out.

    > select rev_user = 0, sum(rev_user != `event_user.id`), COUNT(*) from Edit_11448630 INNER JOIN enwiki.revision ON rev_id = `event_page.revid` WHERE wiki = 'enwiki' AND timestamp BETWEEN "20150401" AND "20150402" AND event_action = "saveSuccess" GROUP BY 1;
    +--------------+----------------------------------+----------+
    | rev_user = 0 | sum(rev_user != `event_user.id`) | COUNT(*) |
    +--------------+----------------------------------+----------+
    |            0 |                             1244 |    82642 |
    |            1 |                              135 |    27198 |
    +--------------+----------------------------------+----------+
    2 rows in set (5 min 28.93 sec)
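
For reference, the percentages can be re-derived from the raw counts in the two result tables. This is just arithmetic on the numbers shown above; the dictionary names and the `pct` helper are my own for the sketch.

```python
# Counts copied from the two query results above, as
# (flagged rows, total saveSuccess rows) per group.
ip_flag = {"registered": (628, 82642), "logged_out": (27063, 27198)}
id_mismatch = {"registered": (1244, 82642), "logged_out": (135, 27198)}


def pct(flagged, total):
    """Percentage of rows flagged, rounded to two decimals."""
    return round(100.0 * flagged / total, 2)


for name, table in (("IP flag", ip_flag), ("id mismatch", id_mismatch)):
    for group, (flagged, total) in table.items():
        print(f"{name:12s} {group:11s} {pct(flagged, total):6.2f}%")
```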

My conclusion is that these rates are small enough that they'll probably be a non-issue. --Halfak (WMF) (talk) 21:36, 10 April 2015 (UTC)