Research talk:VisualEditor's effect on newly registered editors/Work log/2015-04-29

From Meta, a Wikimedia project coordination wiki

Wednesday, April 29, 2015[edit]

Hey folks. I did some work with User:DAndreescu last night to dig into one of the problems that he saw in the data. Specifically that a very large proportion of edits through VE come from anonymous editors. First we confirmed what he saw in Event Logging and then we checked it with change tags. Here's my SQL log.

> select event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp > "20150428" GROUP BY 1,2;
+--------------+------------------+----------+
| event_action | event_user.class | count(*) |
+--------------+------------------+----------+
| abort        | NULL             |    26272 |
| abort        | IP               |   270049 |
| init         | NULL             |    77003 |
| init         | IP               |   581612 |
| ready        | NULL             |   139378 |
| ready        | IP               |   323338 |
| saveAttempt  | NULL             |    64121 |
| saveAttempt  | IP               |    33596 |
| saveFailure  | NULL             |     3537 |
| saveFailure  | IP               |    12602 |
| saveIntent   | NULL             |     5316 |
| saveIntent   | IP               |     8372 |
| saveSuccess  | NULL             |    60661 |
| saveSuccess  | IP               |    20898 |
+--------------+------------------+----------+
14 rows in set (8.81 sec)

Things look reasonable when we group VE and Wikitext together.

> select event_editor, event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp > "20150428" GROUP BY 1,2,3;
+--------------+--------------+------------------+----------+
| event_editor | event_action | event_user.class | count(*) |
+--------------+--------------+------------------+----------+
| visualeditor | abort        | NULL             |     3574 |
| visualeditor | abort        | IP               |    31082 |
| visualeditor | init         | NULL             |     8720 |
| visualeditor | init         | IP               |    41415 |
| visualeditor | ready        | NULL             |     8649 |
| visualeditor | ready        | IP               |    39927 |
| visualeditor | saveAttempt  | NULL             |     5311 |
| visualeditor | saveAttempt  | IP               |     8493 |
| visualeditor | saveFailure  | NULL             |      687 |
| visualeditor | saveFailure  | IP               |     2681 |
| visualeditor | saveIntent   | NULL             |     5318 |
| visualeditor | saveIntent   | IP               |     8377 |
| visualeditor | saveSuccess  | NULL             |     4775 |
| visualeditor | saveSuccess  | IP               |     5759 |
| wikitext     | abort        | NULL             |    22727 |
| wikitext     | abort        | IP               |   239153 |
| wikitext     | init         | NULL             |    68347 |
| wikitext     | init         | IP               |   540630 |
| wikitext     | ready        | NULL             |   130835 |
| wikitext     | ready        | IP               |   283641 |
| wikitext     | saveAttempt  | NULL             |    58856 |
| wikitext     | saveAttempt  | IP               |    25127 |
| wikitext     | saveFailure  | NULL             |     2853 |
| wikitext     | saveFailure  | IP               |     9927 |
| wikitext     | saveSuccess  | NULL             |    55925 |
| wikitext     | saveSuccess  | IP               |    15157 |
+--------------+--------------+------------------+----------+
26 rows in set (9.61 sec)

But when we split Wikitext and VE, we see that, for VE, there are more IP-labeled edits than NULL (registered).

> select event_editor, event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp > "20150428" AND wiki = "enwiki" GROUP BY 1,2,3;
+--------------+--------------+------------------+----------+
| event_editor | event_action | event_user.class | count(*) |
+--------------+--------------+------------------+----------+
| visualeditor | abort        | NULL             |      339 |
| visualeditor | abort        | IP               |       34 |
| visualeditor | init         | NULL             |     1109 |
| visualeditor | init         | IP               |       45 |
| visualeditor | ready        | NULL             |     1095 |
| visualeditor | ready        | IP               |       44 |
| visualeditor | saveAttempt  | NULL             |      780 |
| visualeditor | saveAttempt  | IP               |        5 |
| visualeditor | saveFailure  | NULL             |       69 |
| visualeditor | saveFailure  | IP               |        5 |
| visualeditor | saveIntent   | NULL             |      786 |
| visualeditor | saveIntent   | IP               |        6 |
| visualeditor | saveSuccess  | NULL             |      729 |
| wikitext     | abort        | NULL             |     7991 |
| wikitext     | abort        | IP               |    15795 |
| wikitext     | init         | NULL             |    22660 |
| wikitext     | init         | IP               |    99031 |
| wikitext     | ready        | NULL             |    44353 |
| wikitext     | ready        | IP               |    27382 |
| wikitext     | saveAttempt  | NULL             |    20215 |
| wikitext     | saveAttempt  | IP               |     9397 |
| wikitext     | saveFailure  | NULL             |     1222 |
| wikitext     | saveFailure  | IP               |     3154 |
| wikitext     | saveSuccess  | NULL             |    18956 |
| wikitext     | saveSuccess  | IP               |     6225 |
+--------------+--------------+------------------+----------+
25 rows in set (10.02 sec)

Things look as expected when I limit my analysis to enwiki. See Research:Anonymous_editor_acquisition/Volume_and_impact. I also checked main namespace only to be sure.

> select event_editor, event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp > "20150428" AND wiki = "enwiki" AND `event_page.ns` = 0 GROUP BY 1,2,3;
+--------------+--------------+------------------+----------+
| event_editor | event_action | event_user.class | count(*) |
+--------------+--------------+------------------+----------+
| visualeditor | abort        | NULL             |      266 |
| visualeditor | abort        | IP               |       33 |
| visualeditor | init         | NULL             |      942 |
| visualeditor | init         | IP               |       44 |
| visualeditor | ready        | NULL             |      929 |
| visualeditor | ready        | IP               |       43 |
| visualeditor | saveAttempt  | NULL             |      672 |
| visualeditor | saveAttempt  | IP               |        5 |
| visualeditor | saveFailure  | NULL             |       39 |
| visualeditor | saveFailure  | IP               |        5 |
| visualeditor | saveIntent   | NULL             |      691 |
| visualeditor | saveIntent   | IP               |        6 |
| visualeditor | saveSuccess  | NULL             |      647 |
| wikitext     | abort        | NULL             |     5239 |
| wikitext     | abort        | IP               |    14475 |
| wikitext     | init         | NULL             |    14900 |
| wikitext     | init         | IP               |    88553 |
| wikitext     | ready        | NULL             |    29872 |
| wikitext     | ready        | IP               |    24940 |
| wikitext     | saveAttempt  | NULL             |    13678 |
| wikitext     | saveAttempt  | IP               |     8186 |
| wikitext     | saveFailure  | NULL             |      786 |
| wikitext     | saveFailure  | IP               |     2393 |
| wikitext     | saveSuccess  | NULL             |    12860 |
| wikitext     | saveSuccess  | IP               |     5777 |
+--------------+--------------+------------------+----------+
25 rows in set (4.79 sec)

Next, we check a big wiki with VE enabled.

> select event_editor, event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp > "20150428" AND wiki = "frwiki" AND `event_page.ns` = 0 GROUP BY 1,2,3;
+--------------+--------------+------------------+----------+
| event_editor | event_action | event_user.class | count(*) |
+--------------+--------------+------------------+----------+
| visualeditor | abort        | NULL             |      346 |
| visualeditor | abort        | IP               |     4195 |
| visualeditor | init         | NULL             |      967 |
| visualeditor | init         | IP               |     5870 |
| visualeditor | ready        | NULL             |      960 |
| visualeditor | ready        | IP               |     5648 |
| visualeditor | saveAttempt  | NULL             |      632 |
| visualeditor | saveAttempt  | IP               |     1459 |
| visualeditor | saveFailure  | NULL             |       63 |
| visualeditor | saveFailure  | IP               |      361 |
| visualeditor | saveIntent   | NULL             |      709 |
| visualeditor | saveIntent   | IP               |     1573 |
| visualeditor | saveSuccess  | NULL             |      595 |
| visualeditor | saveSuccess  | IP               |     1090 |
| wikitext     | abort        | NULL             |      809 |
| wikitext     | abort        | IP               |    13262 |
| wikitext     | init         | NULL             |     3056 |
| wikitext     | init         | IP               |    20227 |
| wikitext     | ready        | NULL             |     5800 |
| wikitext     | ready        | IP               |    15383 |
| wikitext     | saveAttempt  | NULL             |     3019 |
| wikitext     | saveAttempt  | IP               |      716 |
| wikitext     | saveFailure  | NULL             |       43 |
| wikitext     | saveFailure  | IP               |      159 |
| wikitext     | saveSuccess  | NULL             |     2973 |
| wikitext     | saveSuccess  | IP               |      556 |
+--------------+--------------+------------------+----------+
26 rows in set (1.00 sec)

OK. Here, we see a huge ratio. Of VE edits, they are mostly from IP editors. (Nearly 2:1) Let's check the change_tag table. First I need to know the exact name of the tag.

> select distinct ct_tag from change_tag;
+--------------------------------------------------------+
| ct_tag                                                 |
+--------------------------------------------------------+
| Ajout d'adresse mail                                   |
| Ajout de pronom personnel                              |
<... snip ...>
| visualeditor                                           |
| visualeditor-needcheck                                 |
| visualeditor-switched                                  |
| vocabulaire potentiellement non neutre                 |
| âge hors modèle                                        |
+--------------------------------------------------------+
87 rows in set (0.54 sec)

OK. I need to use "visualeditor".

mysql:research@analytics-store.eqiad.wmnet [frwiki]> select rev_user = 0 AS IP, count(*) from revision inner join change_tag on ct_rev_id = rev_id and ct_tag = "visualeditor" and rev_timestamp > "20150428" group by 1; 
+----+----------+
| IP | count(*) |
+----+----------+
|  0 |     1233 |
|  1 |     1549 |
+----+----------+
2 rows in set (2 min 4.13 sec)

I realized here that the count of edits between the revision table and the EL table did not match. It turns out that this is due to a big chunk of data loss, so I switched to the previous day.

> select event_editor, event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp BETWEEN "20150427" AND "20150428" AND wiki = "frwiki" GROUP BY 1,2,3;
+--------------+--------------+------------------+----------+
| event_editor | event_action | event_user.class | count(*) |
+--------------+--------------+------------------+----------+
| visualeditor | abort        | NULL             |      635 |
| visualeditor | abort        | IP               |     5271 |
| visualeditor | init         | NULL             |     1840 |
| visualeditor | init         | IP               |     7470 |
| visualeditor | ready        | NULL             |     1814 |
| visualeditor | ready        | IP               |     7222 |
| visualeditor | saveAttempt  | NULL             |     1233 |
| visualeditor | saveAttempt  | IP               |     1908 |
| visualeditor | saveFailure  | NULL             |      142 |
| visualeditor | saveFailure  | IP               |      443 |
| visualeditor | saveIntent   | NULL             |     1236 |
| visualeditor | saveIntent   | IP               |     1991 |
| visualeditor | saveSuccess  | NULL             |     1134 |
| visualeditor | saveSuccess  | IP               |     1460 |
| wikitext     | abort        | NULL             |     1659 |
| wikitext     | abort        | IP               |    18936 |
| wikitext     | init         | NULL             |     5607 |
| wikitext     | init         | IP               |    29728 |
| wikitext     | ready        | NULL             |    10301 |
| wikitext     | ready        | IP               |    21898 |
| wikitext     | saveAttempt  | NULL             |     5347 |
| wikitext     | saveAttempt  | IP               |     1112 |
| wikitext     | saveFailure  | NULL             |      101 |
| wikitext     | saveFailure  | IP               |      405 |
| wikitext     | saveSuccess  | NULL             |     5244 |
| wikitext     | saveSuccess  | IP               |      706 |
+--------------+--------------+------------------+----------+
26 rows in set (9.97 sec)

Here, we see 1460 anon edits and 1134 registered edits through VE.

> select rev_user = 0 AS IP, count(*) from revision inner join change_tag on ct_rev_id = rev_id and ct_tag = "visualeditor" and rev_timestamp BETWEEN "20150427" AND "20150428" group by 1; 
+----+----------+
| IP | count(*) |
+----+----------+
|  0 |     1282 |
|  1 |     1658 |
+----+----------+
2 rows in set (1 min 35.21 sec)

So, we're definitely losing events. If these queries are right, we're missing data for 148 (11.5%) registered edits and 198 (11.9%) anon edits from the log.

(This concludes my work from last night. It was late. Now I'm picking up again from today.) --Halfak (WMF) (talk) 14:37, 29 April 2015 (UTC)[reply]

OK. So I want to know if there was a substantial data loss during that day that could explain the 12% of missing events.

> SELECT left(timestamp, 10) AS hour, count(*) from Edit_11448630 WHERE timestamp BETWEEN "20150427" AND "20150428" GROUP BY 1;
+------------+----------+
| hour       | count(*) |
+------------+----------+
| 2015042700 |    64167 |
| 2015042701 |    62939 |
| 2015042702 |    60735 |
| 2015042703 |    59658 |
| 2015042704 |    57441 |
| 2015042705 |    62807 |
| 2015042706 |    78073 |
| 2015042707 |    90961 |
| 2015042708 |   102515 |
| 2015042709 |   102862 |
| 2015042710 |   106522 |
| 2015042711 |    96756 |
| 2015042712 |   107464 |
| 2015042713 |   108489 |
| 2015042714 |   122455 |
| 2015042715 |   101626 |
| 2015042716 |   120032 |
| 2015042717 |    73147 |
| 2015042718 |   105266 |
| 2015042719 |    86674 |
| 2015042720 |    80608 |
| 2015042721 |    77749 |
| 2015042722 |    70496 |
| 2015042723 |    70002 |
+------------+----------+
24 rows in set (1.98 sec)

I'm a little bit skeptical of the 17th hour. It seems incredibly low compared to the neighboring hours. Let's try the same day last week.

> SELECT left(timestamp, 10) AS hour, count(*) from Edit_11448630 WHERE timestamp BETWEEN "20150420" AND "20150421" GROUP BY 1;
+------------+----------+
| hour       | count(*) |
+------------+----------+
| 2015042000 |    60499 |
| 2015042001 |    59375 |
| 2015042002 |    58921 |
| 2015042003 |    55978 |
| 2015042004 |    55174 |
| 2015042005 |    61330 |
| 2015042006 |    77320 |
| 2015042007 |    92542 |
| 2015042008 |    99475 |
| 2015042009 |   103273 |
| 2015042010 |   102306 |
| 2015042011 |   109020 |
| 2015042012 |   119052 |
| 2015042013 |   129933 |
| 2015042014 |   129694 |
| 2015042015 |   124571 |
| 2015042016 |   119745 |
| 2015042017 |   116431 |
| 2015042018 |   115538 |
| 2015042019 |   106446 |
| 2015042020 |    99518 |
| 2015042021 |    86276 |
| 2015042022 |    69657 |
| 2015042023 |    60680 |
+------------+----------+
24 rows in set (1.93 sec)

That looks more reasonable. Let's use this day to compare EL to change_tags.

> select event_editor, event_action, `event_user.class`, count(*) from Edit_11448630 WHERE timestamp BETWEEN "20150420" AND "20150421" AND wiki = "frwiki" AND event_action = "saveSuccess" AND event_editor = "visualeditor" GROUP BY 1,2,3;
+--------------+--------------+------------------+----------+
| event_editor | event_action | event_user.class | count(*) |
+--------------+--------------+------------------+----------+
| visualeditor | saveSuccess  | NULL             |     1005 |
| visualeditor | saveSuccess  | IP               |     1662 |
+--------------+--------------+------------------+----------+
2 rows in set (23.62 sec)
> select rev_user = 0 AS IP, count(*) from revision inner join change_tag on ct_rev_id = rev_id and ct_tag = "visualeditor" and rev_timestamp BETWEEN "20150420" AND "20150421" group by 1;
+----+----------+
| IP | count(*) |
+----+----------+
|  0 |      970 |
|  1 |     1642 |
+----+----------+
2 rows in set (2 min 16.76 sec)

Here, it looks like we are missing 35/1005=3.5% and 20/1642=1.2% which seems much more reasonable. EL is a lossy logging system, so losses like this are something we must live with and account for. --Halfak (WMF) (talk) 14:57, 29 April 2015 (UTC)[reply]

Ahh! Wait a tick. I looked at that in reverse. It looks like EL reports *more* events than we see in frwiki. I bet this is related to deleted edits. If I look in the archive, I might see these missing events.

> select ar_user = 0 AS IP, count(*) from archive inner join change_tag on ct_rev_id = ar_rev_id and ct_tag = "visualeditor" and ar_timestamp BETWEEN "20150420" AND "20150421" group by 1;
+----+----------+
| IP | count(*) |
+----+----------+
|  0 |       34 |
|  1 |       45 |
+----+----------+
2 rows in set (45.53 sec)

That makes 'em basically match up. :) --Halfak (WMF) (talk) 15:03, 29 April 2015 (UTC)[reply]

jawiki discrepancy[edit]

I noticed a weird thing, so I wanted to document it quick. When I was looking at edit success rates in jawiki, I noticed that the proportion of anon vs registered sessions was way out of proportion. It looks like the registered editor sessions with a saveSuccess event are about twice as numerous than those from anons.

> SELECT event_editor, `event_user.class`, COUNT(*) AS sessions, SUM(readies > 0) AS ready_sessions, SUM(attempts > 0) AS attempt_sessions, SUM(successes > 0) AS success_sessions FROM (select event_editor, `event_user.class`, event_editingSessionId, SUM(event_action = "ready") as readies, SUM(event_action = "saveAttempt") as attempts, SUM(event_action = "saveSuccess") as successes from Edit_11448630 WHERE timestamp BETWEEN "20150420" AND "20150421" AND wiki = "jawiki" GROUP BY 1,2,3) AS foo GROUP BY 1,2;
+--------------+------------------+----------+----------------+------------------+------------------+
| event_editor | event_user.class | sessions | ready_sessions | attempt_sessions | success_sessions |
+--------------+------------------+----------+----------------+------------------+------------------+
| visualeditor | NULL             |       33 |             30 |                9 |                9 |
| wikitext     | NULL             |     3787 |           3340 |             2478 |             2473 |
| wikitext     | IP               |    43766 |          32877 |             1244 |             1212 |
+--------------+------------------+----------+----------------+------------------+------------------+

But when I look at the revision table directly, it looks like there's we should expect more than an order of magnitude more saveSuccess events for registered editors.

> SELECT rev_user = 0, count(*) from revision where rev_timestamp BETWEEN "20150420" AND "20150421" GROUP BY 1;
+--------------+----------+
| rev_user = 0 | count(*) |
+--------------+----------+
|            0 |    59434 |
|            1 |     2517 |
+--------------+----------+
2 rows in set (43.02 sec)

Could it be that some of the saveSuccess events aren't getting logged. I'll be digging into that later. --Halfak (WMF) (talk) 02:00, 30 April 2015 (UTC)[reply]