Research talk:VisualEditor's effect on newly registered editors/June 2013 study

From Meta, a Wikimedia project coordination wiki

Removal of results related to EventLogging instrumentation

A set of anomalies related to event logging for this experiment was discovered and could not be reconciled, so a substantial portion of this analysis was removed from the "Editing ease" section. We are not able to confidently report figures related to:

  • Edit save time: the time between when a newcomer was presented with the edit pane and when he/she clicked "save", i.e. the time between when a user begins and completes an edit.
  • Edit completion proportion: the proportion of users that complete an edit after viewing the edit pane.
  • Edit funnel: the proportion of users who cross thresholds related to editing.

Note that this data has not been used in any other part of the analysis or in any of the public dashboards, so it shouldn't affect the other results in any way. --EpochFail (talk) 21:36, 18 July 2013 (UTC)

Distinction between the two groups is understated

If 27.5% of editors were able to successfully make an edit with the wikitext editor and only 15.7% were able to make an edit with VE, that's a 43% difference. The formula is 100*((27.5-15.7)/27.5), not 27.5-15.7. Kww (talk) 03:09, 20 July 2013 (UTC)

When we are talking about proportions (which have an upper bound), I prefer to reference the difference in proportion (which is what's measured by a χ² test). I find it to be less confusing overall than the math that you've just demonstrated. I'm glad that you understood me well enough to do the math yourself though. --EpochFail (talk) 19:00, 23 July 2013 (UTC)
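To make the two readings of the same numbers concrete, here is a minimal sketch that computes both the absolute difference (in percentage points) and the relative difference, using the 27.5% and 15.7% figures quoted above. The function names are hypothetical and chosen only for illustration.

```python
def absolute_difference(control_pct: float, test_pct: float) -> float:
    """Difference in percentage points (control minus test)."""
    return control_pct - test_pct


def relative_difference(control_pct: float, test_pct: float) -> float:
    """Difference expressed as a percentage of the control value."""
    return 100.0 * (control_pct - test_pct) / control_pct


wikitext_pct = 27.5  # figure quoted in the thread
ve_pct = 15.7        # figure quoted in the thread

print(f"absolute: {absolute_difference(wikitext_pct, ve_pct):.1f} percentage points")
print(f"relative: {relative_difference(wikitext_pct, ve_pct):.1f}%")
```

Both numbers describe the same gap; which one is clearer depends on whether the reader cares about the share of all newcomers affected or about the change relative to the wikitext baseline.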

Re: Editing ease

Is it possible that we just have more people curious to see what is this "new Visual Editor thing" and after trying it, go away because they were not interested in changing the article (just to test the tool)?

Also, what if people make some changes to an article and then after clicking on "Save page" they wanted to click on "Review your changes" and then were scared by the two walls of wikitext which were displayed in the diff (which being a source code diff, don't really look similar to what they were changing visually in the article)? Helder 16:52, 20 July 2013 (UTC)

Nice idea, but people didn't choose to be given the visual editor. If there were any people who made such experimental edits they are vastly outnumbered by the people who would have edited if they'd had the normal editor but didn't because of V/E. WereSpielChequers (talk) 01:42, 21 July 2013 (UTC)
Not sure anyone will read this, but I wonder if the lower success rate doesn't have something to do with the fact that to save the changes, the user has to click twice on a "Save changes" button versus once with the wikitext editor. This isn't actually very intuitive and some users may have thought that the changes were already saved after the first click. Maybe adding "..." after the first "Save changes" button would make it clearer. The RedBurn (talk) 23:01, 10 February 2017 (UTC)
Excellent comment, hope @Whatamidoing (WMF): is still involved in V/E. WereSpielChequers (talk) 16:38, 11 February 2017 (UTC)
The RedBurn, this idea was discussed a couple of years ago, but in-depth user testing did not identify that as a problem in practice.
On Helder's original comment above, about the wikitext diffs, it will be possible in a few months to review some diffs visually, with a toggle switch that flips back and forth between the two-column wikitext diff and the newer visual diff, which looks more like a word processing document with "Show changes" enabled. Whatamidoing (WMF) (talk) 22:16, 15 February 2017 (UTC)
On the other hand, the action=edit save button is often not even visible above the fold. People have to know it exists, scroll down to find it and be careful enough to identify it in the middle of random stuff. (Unless they're tech-savvy enough to use tab.) I often conduct informal usability tests and I regularly see people fail at this simple step.
As for VisualEditor, nowadays the biggest showstopper is likely to be the huge scary recurring splash screen which is shown when people try to edit (see phabricator:T136076 for a partial list of issues). This affects all users no matter what they use, though. Nemo 10:53, 16 February 2017 (UTC)

Slowness

One of the reasons why those of us who tested it thought it wasn't ready for this sort of deployment was that it is so much slower - several times slower in my experience. It won't be the only reason why the current slow buggy version of v/e is so much less popular with new editors than the existing software. But it is something to test for. Would it be possible to reanalyse your test groups by geography? Ideally we'd do it by connection speed, but I'm assuming you don't have that data and that geography would have to do as a substitute. One of the risks of v/e is that slowing down the software is much more noticeable to those on slow connections than to people on fast connections, but the developers will mostly have commercial grade US Internet connections. My fear is that even if enough bugs were fixed that the average level of editing was to get back to previous levels, if we could analyse by edits from the global south v the parts of the world with fast internet connections we'd see that slowing down the editing experience undermines our efforts to globalise the movement. WereSpielChequers (talk) 01:42, 21 July 2013 (UTC)

The slowness of the VisualEditor is primarily due to CPU limitations, not bandwidth. Due to some logging anomalies, we're not entirely sure what logging data we can trust. We're sifting through that at the moment. If it looks like we can take reliable measurements, I'll post more about load time. --EpochFail (talk) 15:16, 23 July 2013 (UTC)
Thanks for the feedback, in some respects that could be good news - I've got more confidence in the speed with which Moore's law and built in obsolescence will deal with CPU limitations than with bandwidth ones. More importantly my home setup is pretty fast cable internet but with computers that were nothing special four or more years ago and today are way behind the curve. So if I'm personally experiencing V/E as excruciatingly slow then if it was a bandwidth issue most people would be having a worse experience with v/e than me, but if it is more CPU limited then I suspect a lot of people, maybe a majority will have less of a problem than me. If we were a standard Internet organisation, most interested in the 20% who account for 80% of the spending power then it would be OK to implement something that excluded people with old computers. But our remit is global, we should be at least as concerned with the proverbial wikimedian in an internet cafe in a favela as we are with people who could afford to upgrade their CPU every five years or so. WereSpielChequers (talk) 18:11, 23 July 2013 (UTC)
You're right. We're making updates to the logging setup so, even if we don't have data for the experimental period, we should be able to show VE's performance more recently. Stay tuned. --EpochFail (talk) 18:23, 23 July 2013 (UTC)
Does that mean you think V/E is fixable short term? I was assuming it would get pulled and maybe come back in some future guise like Flow as a new attempt at liquid threads. WereSpielChequers (talk) 22:06, 23 July 2013 (UTC)
I can't be sure, but I'll agree that it is my suspicion. We've recently deployed an update to the logging system, so I should be able to do better than speculation soon. --EpochFail (talk) 21:54, 26 July 2013 (UTC)
The slowness of the VisualEditor is neither due to CPU limitations nor to bandwidth, it's the slow javascript technique or simply bad programming. I don't know if you ever tried to edit large articles like en:World War II with VE, it's horrible!--Sinuhe20 (talk) 10:18, 28 July 2013 (UTC)
Ahem... JavaScript runs on the CPU. --EpochFail (talk) 14:49, 1 August 2013 (UTC)
Yes, but it's the programmer's fault to waste the user CPU. ;) Or in other words: why is editing a Word-Document with 100 pages on a 10 year old PC much faster than using the VE on an actual high-end machine? --Sinuhe20 (talk) 19:59, 1 August 2013 (UTC)
Does the current version of MS Word even run on a ten-year-old PC? And how long does it take you to open Word on that machine (not counting opening any document: just to open Word and to get a blank page so you could start typing something)? Whatamidoing (WMF) (talk) 15:57, 5 August 2013 (UTC)
There are a couple of aspects to this slowness issue that will affect newbies particularly. Firstly the lack of section editing, I consider that we should simply disable the edit button on large articles, or have some dropdown button on it to make people choose a section. Editing a large article has to involve many times the I/O of editing a section, if the V/E means that you are editing the whole article when you try to edit a section then bandwidth again becomes a big issue. But more importantly is that slowness gets newbies into what is almost certainly a bigger deterrent to new editors than slowness or markup, it gets them into more edit conflicts and ensures that as the losing editor in an edit conflict they are the one who is bitten and loses their edit. This is particularly an issue at newpage patrol, and the easy fix, resolving edit conflicts so that adding a category at the end or a template at either end is not deemed a conflict with a change in between, is unachievable because it involves a tweak to mediawiki code. Unfortunately the devs just aren't available to do something that would uncontentiously increase editing and make Wikipedia a less bitey place. WereSpielChequers (talk) 07:31, 13 August 2013 (UTC)

Blocking rates

If I read this correctly, the editors using the visual editor on average made 43% fewer edits, but somehow the same proportion of those editors got blocked. Which on the surface of it implies that the 43% loss in editing came entirely from goodfaith newbies and any vandals or spammers who were driven away by v/e's slowness and bugs were entirely offset by goodfaith newbies being blocked for edits that looked like vandalism but actually were just v/e bugs. If so, unless I've misread that it sounds like a pretty good proof of the theory that extra barriers to editing are a deterrent to goodfaith editors and a challenge to badfaith ones. After all the vandals expect us to be a little challenging for them to vandalise. The goodfaith editors are offering to help - if we make it difficult for them then they will go away.

The upside of this if true is that if we can get v/e right then the additional edits that would bring would be disproportionately goodfaith ones. But unless it goes back into opt in Beta test mode very soon then I suspect it will go the way of the image filter. Remember the objective of WYSIWYG software was to increase editing levels, if it is doing the opposite then it has to be considered a failure. WereSpielChequers (talk) 02:11, 21 July 2013 (UTC)

Raw data

Probably I am missing something obvious, but where can I download the raw data of this study? Thanks. --Cyclopia (talk) 16:45, 21 July 2013 (UTC)

That's a good question. All public data used in the experiment will be released with the final report. Since we're still adding to the report, it's difficult to say exactly when that will be. --EpochFail (talk) 18:27, 23 July 2013 (UTC)
Cyclopia, see [1] for our sample of newcomers used in this analysis. --EpochFail (talk) 21:56, 26 July 2013 (UTC)
Thanks. I notice live graphs have been added, and that data can be downloaded apparently. That's good. --Cyclopia (talk) 11:48, 30 July 2013 (UTC)

Major update to the results

Hey folks, I made two major updates tonight.

Removal of "autocreated" users
Using the logging table, I was able to identify which "newcomer" accounts were created by users who already had created an account on another language wiki. This did not have a substantial effect on any of the reported results.
Update to figure 12
It turns out that Figure 12 was based on some faulty data from the EventLogging system. I updated it based on data from the revision table in MediaWiki's database. The significant difference between the control and test conditions was lost, but the relationship between the measured quantities remains.

I'm still going through and making updates. Thanks for your patience. --EpochFail (talk) 02:35, 25 July 2013 (UTC)

You have to admit that suddenly discovering VisualEditor is better than it was thought to have been looks a little suspicious. Could you explain a bit more? How was the data faulty? How do you know it was faulty? What evidence is there that the revision table is better? Adam Cuerden (talk) 08:09, 26 July 2013 (UTC)
Sadly, the logging data we had for the timespan of the split test had some serious issues, so we decided not to use it. For example, we recorded substantially more impressions of the edit pane than we did edit link clicks, which should be impossible (see [2] for a visualization of the discrepancy). After looking through the code and a history of the log events around the experiment, it wasn't clear exactly what caused the issue or how we might be able to work around it, so we opted to discard that datasource completely. However, I mistakenly left Figure 12, which was, until recently, based on log data, untouched. I made this mistake because the statistics in Figure 12 did not need log data. When I noticed this error, I immediately updated the figure and posted here so that you'd know that I'd changed something.
As for the revision table, if the edit doesn't show up there, it doesn't exist in the wiki. This is a characteristic of the MediaWiki software. It uses the revision table to track all revisions to pages. When you view the history tab, MediaWiki queries the same revision table that I do in order to show you which edits were performed by whom and when. --EpochFail (talk) 21:27, 26 July 2013 (UTC)
That's fair enough, then. Thank you. And I hope that I didn't come off as too aggressive - I did think it was likely fine, but when major changes like this are made to a study, documenting what went wrong and why it should be better after the change is a good policy. =) Adam Cuerden (talk) 06:41, 27 July 2013 (UTC)
It may actually be possible to see more hits on the edit pane than clicks of the edit button. If a new user hits the "sandbox" link, for example, they'll leap straight to an edit window; ditto (probably) any redlinks. Or does event-logging count any load of a URL with &action=edit as an edit link click? Andrew Gray (talk) 16:33, 27 July 2013 (UTC)
Indeed. This was my original thought on the subject and why I originally included the graph. However, it seems that the logging errors were systematically biased against visual editor (even when the recorded events should not have been affected by the editor at all [e.g. edit link clicks]). Also, the logging does not correspond to the data from the revision table. This is killer, because when it comes to what happened in MediaWiki, the revision table in the database is truth. --EpochFail (talk) 14:47, 1 August 2013 (UTC)
Clicking on redlinks takes me to the old editor, as does clicking on my sandbox. I suppose you could hand-code the URL, but that's not at all likely for new users. Whatamidoing (WMF) (talk) 16:48, 5 August 2013 (UTC)

One more caveat

One key point to note, I think, is that the test took place during late June - so the new users were in a situation where all help pages, etc, didn't mention the existence of VisualEditor or any interface changes, because it was presumed to be voluntary-beta-only. Failure to make edits may in part be due to the confusing help environment for these users, rather than direct effects from the VE itself. Andrew Gray (talk) 12:21, 25 July 2013 (UTC)

Comparing distributions in Figures 4-6

Fascinating preliminary findings! In reading them I wondered whether the test/control distributions that you have smoothed & plotted in Figures 4-6 were different from each other? A handy, and reasonably straightforward non-parametric statistic for this sort of thing is the two-sample Kolmogorov-Smirnov test, which was designed for comparing arbitrary cumulative distribution functions. I think the results of a KS-test would probably complement (and echo) the findings you've already plotted in Figures 1-3, but it still couldn't hurt to see what you find! Let me know if you'd like any help running the test or thinking about how to report/interpret the results. Aaronshaw (talk) 16:29, 26 July 2013 (UTC)

Aaronshaw, time and resources for this analysis are limited and already far over budget (at least when it comes to my hours), so I'm not too excited about performing another analysis that we both expect will simply confirm our findings. However, I really appreciate your suggestion and intend to incorporate it in future work. If you'd like to run the test yourself, I'd be happy to supply you with a dataset that's appropriately structured. Just let me know. --EpochFail (talk) 14:44, 1 August 2013 (UTC)
Thanks EpochFail! Re: the time/budget constraints, that makes total sense. Since it's the kind of thing that would only add nuance to the findings, it's probably only worth pursuing with the current dataset if you plan to publish something from it. If we have time at Wikisym or Wikimania, maybe that would be a good place to talk more about this? I'm totally happy to run the test. 203.218.48.88 04:12, 3 August 2013 (UTC)
Hmm, thought I was logged in for that last one...apparently not. Aaronshaw (talk) 04:13, 3 August 2013 (UTC)
Totally. :) --EpochFail (talk) 10:15, 4 August 2013 (UTC)
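For readers unfamiliar with the two-sample Kolmogorov-Smirnov test suggested above, here is a minimal sketch of its test statistic using only the standard library. The sample values are invented for illustration and are not data from the study; the function name is hypothetical. It computes only the statistic D (the largest gap between the two empirical cumulative distribution functions), not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical cumulative distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Hypothetical edit counts per newcomer, invented purely for illustration.
control = [0, 0, 1, 2, 2, 3, 5, 8]
test = [0, 0, 0, 0, 1, 1, 2, 4]
print(ks_statistic(control, test))
```

In practice a statistics library would usually be used instead, since it also reports a p-value; SciPy's `scipy.stats.ks_2samp` is one such option.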

Updates on VE data analysis

I posted an update on VE data analysis and created a top-level page linking to various reports and dashboards.--DarTar (talk) 22:47, 26 July 2013 (UTC)

Groups relevance

Hi,

When I read the article, it seems to me that an important point is missing in the analysis. The test group is composed of users with VE enabled, but the analysis seems to consider that it's the group using VE, which is quite different. I believe that the analysis should take into account that the test group is composed of several populations:

  • People using an unsupported browser, who should behave close to users in the control group because they only see wikitext editor (it's an approximation)
  • People using a supported browser and really editing with VE
  • People using a supported browser and not editing with VE, who should behave close to users in the control group (also an approximation)
  • People belonging to several groups (several browsers, mixed edits, ...)

The analysis seems to consider that the test group is only the second category. But if you take into account the diversity, you should end up with differences being more important between using wikitext and using VE.

Can someone explain if I misunderstood the analysis ? --NicoV (talk) 12:06, 27 July 2013 (UTC)

NicoV, limiting the analysis to those who opted (and were able) to use VE vs. wikitext would introduce a systematic bias into the analysis. It seems likely that users with unsupported browsers and knowledge of preferences (to opt-out) would behave differently from other newcomers regardless of the availability of VE. Unless there is some way for me to figure out which users in the control condition would have used VE if it were available, I can't limit the analysis in the way you suggest and preserve the validity of the experiment. --EpochFail (talk) 14:41, 1 August 2013 (UTC)
I understand, but the current study doesn't even limit to users that have really used VE. Given the current % for VE use on enwiki, it's quite possible that 2/3 of the test population simply used wikitext instead of using VE, which is quite a major factor and the analysis simply seems to overlook that. --NicoV (talk) 16:10, 5 August 2013 (UTC)
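The dilution argument in this thread can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that a known share of the test group actually edited with VE and that everyone else in the test group behaved exactly like the control group. As noted in the reply above, that second assumption is unlikely to hold in practice because users self-select, so this is not a substitute for the randomized comparison. The function name and all numbers are hypothetical.

```python
def implied_ve_user_rate(control_rate, test_rate, ve_share):
    """Solve test_rate = ve_share * r + (1 - ve_share) * control_rate for r,
    the success rate among test-group users who actually used VE.

    Assumes non-VE users in the test group match the control group exactly.
    """
    if not 0 < ve_share <= 1:
        raise ValueError("ve_share must be in (0, 1]")
    return (test_rate - (1 - ve_share) * control_rate) / ve_share


# Hypothetical inputs: 30% success in control, 24% in the test group overall,
# and half of the test group actually using VE.
print(implied_ve_user_rate(control_rate=30.0, test_rate=24.0, ve_share=0.5))
```

Under these made-up inputs, the rate among actual VE users would have to be lower than the test group's overall rate, which is the direction of the effect described in the thread.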

Observational study with haphazard non-random assignment to VE or Wikitext

The researchers should have generated a list of random assignments, which would have been sequentially assigned to new editors along with their user id.

Instead, the parity of the user id (either odd or even) determined the exposure to VE or Wikitext. While this association was haphazard and without obvious bias (imho), it was not under the control of the researchers and was not randomized.

You should consult with a competent statistician with experience in experiments. Kiefer.Wolfowitz (talk) 17:49, 21 June 2014 (UTC)

I retract my statements "observational" and "not under the control of the researchers", the better to focus on the lack of randomization and since the study does seem to have satisfied conventional definitions of controlled experiments. Kiefer.Wolfowitz (talk) 17:18, 27 June 2014 (UTC)
Reverted your edits. As you admit, there's no apparent bias in the bucketing strategy and therefore no apparent threat to the statistical validity of the study. Your changes to the copy were less than helpful. If you would like to bring forward substantial concerns about the potential for bias in the described study, I welcome such a discussion. --Halfak (WMF) (talk) 19:15, 23 June 2014 (UTC)
@Halfak (WMF):
To clarify:
  1. There was no randomized assignment of users to treatments. The study relied on the haphazard arrival of users and their labeling to determine exposure.
  2. You assert that an observational study using haphazard non-random exposures has "statistical validity".
  3. You are putting words in my mouth and, I'd like to think, misreading what I wrote. That I cannot name a bias in the haphazard exposure does not mean that none exist. Guarding against unforeseen biases is one of the reasons competent experimenters randomize when possible.
Please revert your reversion.
Kiefer.Wolfowitz (talk) 22:05, 25 June 2014 (UTC)
Hey Kiefer.Wolfowitz. Assignment in this test was not haphazard; as the text suggests, a round-robin bucketing strategy was used. This provides for unbiased and even assignment to the experimental conditions. I leave it up to you to explain how this assignment strategy might bias the results. At the very least, I would appreciate if you could cite something suggesting that round-robin assignment is problematic. --Halfak (WMF) (talk) 00:33, 26 June 2014 (UTC)
Halfak (WMF)
You did not use randomization. Would you please explain any sense that your "unbiased" have?
In randomized samples and experiments, the objective randomization specified in the protocol induces an objective probability distribution on the outcomes, of which one is observed in a particular study. With respect to this distribution, unbiased estimates of population parameters and of treatment effects have been studied.
In particular, my and your opinion of the subjective likelihood of a bias due to your failure to randomize is irrelevant. There is no objective basis for your "unbiased" claim. Textbooks on randomized experiments typically discuss hidden biases that became apparent upon autopsy of a failed experiment, in cases where the experiment was actually of interest to other researchers, rather than Potemkin science for write-only conferences. One example I recall was an animal experiment in which there was a shelf effect on litters (each kept on cages), which was only discovered after a failed experiment. I suppose the International Cancer Society's 3rd volume on long-term animal experiments has a discussion of the importance of randomization, even within a cage. Assigning treatments to mice based on the haphazard order an assistant pulls them out of cages is not recommended.
In observational studies, a probability model may also be conjectured. A parametric model may be taken seriously if a model has been previously validated, although such examples are rare. If somebody pretends that the data be a random sample from a distribution (with a subjective probability model), then some unbiased estimates may also be available. (However, such subjective methods are weaker, as students are warned in basic courses.) So, are you pretending that the editors were a random sample from a distribution? Parametric?
You might look at David A. Freedman et alia's Statistics for warnings about non-randomized studies. Kempthorne and Hinkelmann's experimental design book is thorough. Speaking with authority, John Tukey advised experimenters "to randomize unless you are a damned fool".
Exposing editors to a predictor (wikitext or VE) based on their id being even or odd is trivial. There are non-trivial round-robin designs discussed e.g. in Ian Anderson's book on combinatorial designs.
Kiefer.Wolfowitz (talk) 11:53, 26 June 2014 (UTC)
Kiefer.Wolfowitz, this conversation is not getting us anywhere. I'd appreciate if you could give me full citations and phrase your questions in ways that are answerable. I don't wish to attack your intentions, but it looks to me that you are simply mixing in statistical terms. I'd like you to define "objective probability distribution" as it doesn't seem to be a thing. Also, what were you asking when you asked "Parametric?" I'd also like you to consider the definition of en:observational study before we continue. If, in the end, you are speaking sense and I'm missing something, I'd appreciate a description of a "non-trivial" round-robin "design" so that we could contrast it with the round-robin bucketing strategy used for this controlled experiment. --Halfak (WMF) (talk) 12:58, 26 June 2014 (UTC)
Halfak (WMF)
I wrote the following question, which I repeat:
You did not use randomization. Would you please explain any sense that your "unbiased" have?
I had trouble finding a discussion of your so-called "round-robin" design, which seemed to have been alternating exposures. You have not objected to my previous assertions that it was alternating based on the parity (odd/even); would you confirm that this was the method, please?
  • Jerzy Neyman essentially killed such designs in survey sampling with his paper on the so-called "representative method", since such purposive designs fail to generate a probability distribution and lack a theoretical basis for estimation (e.g. being unbiased or of low variance) or confidence intervals. In design of experiments, non-randomized designs have the same problems---lack of a theoretical basis for estimation and confidence intervals. In practice, the estimates from such systematic methods seem to have greater biases and often systematic biases, in comparison to properly randomized studies. These points are discussed in a first course of statistics.
  • An example where your odd--even alternating design induced a bias was discussed in David R. Cox's book on Experimental Designs (c. 1956---not his book with Nancy Reid, c. 1994), in the chapter on randomization, in the subsection on systematic non-randomized designs.
Regarding bibliographic details. For Cox's book like the previous books I mentioned (Kempthorne, Hinkelmann; Freedman), you may consult the appropriate articles on English Wikipedia (and if stumped try the search function). You can try to Google(books) Ian Anderson+combinatorial design+tournament for non-trivial round-robin designs; the book was an Oxford UP (Clarendon?) monograph.
Your tone is unsurprising coming from Sue Gardner's WMF, but your suggestion that I consult an article to which I've contributed is particularly WMF-ed up.
Kiefer.Wolfowitz (talk) 16:09, 27 June 2014 (UTC)

┌─────────────────────────────────┘
"explain any sense that your "unbiased" have?" sigh OK fine. It seems trivial to me, but since you won't tell me what's wrong with it, I'll start with the fundamental bits. Given that the user_id field auto-increments, the order of signup determines whether you'll end up with an odd or even user_id. The only way to determine which bucket you'll end up in is to observe the last registered user's id before registering -- and even then, registrations happen so quickly that you're likely to miss your opportunity while submitting your registration to the server. Here's a query that shows the rate of registration for the first 24 hours of the experiment:

> select LEFT(user_registration, 10), count(*) from user where user_registration between 20130625070000 AND 20130626070000 group by 1;
+-----------------------------+----------+
| LEFT(user_registration, 10) | count(*) |
+-----------------------------+----------+
| 2013062507                  |      220 |
| 2013062508                  |      216 |
| 2013062509                  |      237 |
| 2013062510                  |      258 |
| 2013062511                  |      234 |
| 2013062512                  |      286 |
| 2013062513                  |      303 |
| 2013062514                  |      324 |
| 2013062515                  |      299 |
| 2013062516                  |      311 |
| 2013062517                  |      268 |
| 2013062518                  |      279 |
| 2013062519                  |      262 |
| 2013062520                  |      223 |
| 2013062521                  |      185 |
| 2013062522                  |      163 |
| 2013062523                  |      166 |
| 2013062600                  |      170 |
| 2013062601                  |      179 |
| 2013062602                  |      154 |
| 2013062603                  |      178 |
| 2013062604                  |      172 |
| 2013062605                  |      181 |
| 2013062606                  |      209 |
+-----------------------------+----------+
24 rows in set (7.05 sec)

This data suggests that somewhere between 3 and 6 accounts are registered every minute. This is cool because it also means that I'll have users placed into both buckets during all times of the day. E.g. during an hour when 208 accounts are registered, I can guarantee that 104 were put into control and 104 were put into test -- yet neither I nor the registering user can practically predict which bucket they will fall into at the time of registration.

Seriously Kiefer.Wolfowitz, your attack on the bucketing strategy used in this experiment is absurd and you've provided no reasoning to suggest that it introduced bias into the study. Further, your attacks on "Sue's WMF" suggest you have an agenda that I don't want to be involved in. If it is that agenda that has brought you here, I hope that you'll quit wasting my time and let me get back to trying to build theory about how we can make Wikipedia better. --Halfak (WMF) (talk) 17:00, 27 June 2014 (UTC)
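As a rough illustration of the parity-based bucketing described above, the sketch below assigns a bucket from an auto-incrementing user id and counts the split over a run of consecutive ids. Which parity maps to which condition is an assumption made here, as are the function names and the starting id; none of them come from the study itself.

```python
from collections import Counter


def bucket_for(user_id: int) -> str:
    """Assign a bucket from the parity of an auto-incrementing user id.
    Which parity maps to which condition is an assumption made here."""
    return "test" if user_id % 2 == 0 else "control"


def bucket_counts(first_id: int, n_registrations: int) -> Counter:
    """Count buckets for a run of consecutive user ids."""
    ids = range(first_id, first_id + n_registrations)
    return Counter(bucket_for(uid) for uid in ids)


# 208 consecutive registrations, as in the hourly example above.
# The starting id is made up for illustration.
counts = bucket_counts(first_id=19_000_001, n_registrations=208)
print(counts["control"], counts["test"])
```

The exact even split relies on the ids in the window being consecutive. With an odd number of registrations, or with gaps in the sequence (for example, accounts excluded from the sample), the two buckets can differ in size.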

Halfak (WMF)
Again, Cox discusses a case where deterministic parity alternation introduced a bias into a study, contrary to your "provided no reasoning". (On the same page, Cox notes that such a non-randomized alternation does not provide a basis to evaluate mean bias (or median bias, I'll note) or confidence interval, which is consistent with what I wrote earlier.)
Anybody familiar with the community's rejection of Visual Editor and complaints about WMF staff behavior, upon reading your comments, would understand my concerns about your tone.
I shall reply below, specifically about checking for balance among covariates.
Kiefer.Wolfowitz (talk) 19:59, 27 June 2014 (UTC)

Suggested heuristic checks on systematic biases

You should have randomized the study. One way to check rather than assume that associating editors with id numbers by parity was not terrible would be to look at covariates, e.g. browser type, etc. Do the distributions of covariates agree for the two exposure groups? Another method would be to partition the user-groups by id number modulo 4, creating two odd (1,3) and two even (0,2) groups. Do you get roughly the same results? Passing these heuristic checks might reduce worries that the design was terrible in practice as it was in theory. Kiefer.Wolfowitz (talk) 16:20, 27 June 2014 (UTC)
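The modulo-4 check suggested above can be sketched as follows. User ids modulo 4 take the values 0 to 3, so residues 0 and 2 split the even-id group and residues 1 and 3 split the odd-id group; if parity assignment behaved well, the two halves of each group should show roughly the same outcome rate. The records below are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def rates_by_residue(records, modulus=4):
    """Return the share of users with a successful outcome per id residue.

    `records` is an iterable of (user_id, succeeded) pairs.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for user_id, succeeded in records:
        residue = user_id % modulus
        totals[residue] += 1
        successes[residue] += int(bool(succeeded))
    return {r: successes[r] / totals[r] for r in sorted(totals)}


# Invented records, for illustration only: (user_id, made_an_edit)
records = [
    (100, False), (101, True), (102, False), (103, True),
    (104, True), (105, True), (106, False), (107, False),
]
for residue, rate in rates_by_residue(records).items():
    parity = "even" if residue % 2 == 0 else "odd"
    print(f"id % 4 == {residue} ({parity}): {rate:.2f}")
```

With real data, each residue class would contain thousands of users, and the comparison of interest is between the two classes that share a parity.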

I'm really reticent to waste time on this, but I hope that it would put an end to this discussion. I don't have the browser that users used during the experiment handy, but I do happen to have the returnTo. Here's a count of registrations based on the first character in the returnTo title. Given that we know the returnTo is associated with a new user's likelihood of becoming a successful new editor and is determined before the user completes their registration and obtains an ID, I hope that you'll be satisfied.
> SELECT returnto, evens, odds, evens/odds
  FROM (
      SELECT LEFT(event_returnTo, 1) AS returnto, COUNT(*) AS evens
      FROM ServerSideAccountCreation_5487345
      WHERE timestamp BETWEEN "20130625070000" AND "20130626070000"
        AND event_userId % 2 = 0
      GROUP BY 1
  ) AS evens
  INNER JOIN (
      SELECT LEFT(event_returnTo, 1) AS returnto, COUNT(*) AS odds
      FROM ServerSideAccountCreation_5487345
      WHERE timestamp BETWEEN "20130625070000" AND "20130626070000"
        AND event_userId % 2 = 1
      GROUP BY 1
  ) AS odds USING(returnto);
+----------+-------+------+------------+
| returnto | evens | odds | evens/odds |
+----------+-------+------+------------+
| .        |     1 |    1 |     1.0000 |
| 1        |     6 |    6 |     1.0000 |
| 2        |    10 |   14 |     0.7143 |
| 3        |     1 |    3 |     0.3333 |
| 5        |     1 |    1 |     1.0000 |
| 9        |     1 |    1 |     1.0000 |
| ?        |   629 |  594 |     1.0589 |
| A        |   188 |  203 |     0.9261 |
| B        |   101 |  102 |     0.9902 |
| C        |   198 |  183 |     1.0820 |
| D        |   108 |  107 |     1.0093 |
| E        |   137 |  127 |     1.0787 |
| F        |   141 |  141 |     1.0000 |
| G        |    93 |   96 |     0.9688 |
| H        |   118 |  116 |     1.0172 |
| I        |    89 |   72 |     1.2361 |
| J        |    50 |   55 |     0.9091 |
| K        |    64 |   70 |     0.9143 |
| L        |   118 |  124 |     0.9516 |
| M        |   539 |  540 |     0.9981 |
| N        |    65 |   49 |     1.3265 |
| O        |    38 |   43 |     0.8837 |
| P        |   184 |  201 |     0.9154 |
| Q        |    12 |   10 |     1.2000 |
| R        |    82 |   85 |     0.9647 |
| S        |   457 |  413 |     1.1065 |
| T        |   168 |  176 |     0.9545 |
| U        |    93 |   92 |     1.0109 |
| V        |    51 |   44 |     1.1591 |
| W        |   635 |  643 |     0.9876 |
| X        |     3 |    5 |     0.6000 |
| Y        |    27 |   24 |     1.1250 |
| Z        |    17 |    6 |     2.8333 |
+----------+-------+------+------------+
33 rows in set (0.19 sec)
Note that the only major deviations occur where there are few observations. I picked the returnTo values starting with "S" because they had a lot of observations for the amount of deviation from 50/50, and performed a binomial test on them.
> binom.test(457, (457+413))

	Exact binomial test

data:  457 and (457 + 413)
number of successes = 457, number of trials = 870, p-value = 0.1448
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4914927 0.5589102
sample estimates:
probability of success 
             0.5252874
I rest my case. I hope you'll rest yours. --Halfak (WMF) (talk) 17:27, 27 June 2014 (UTC)
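For readers without R at hand, the exact binomial test above can be approximated with a short Python sketch. It assumes a null probability of exactly 0.5, which makes the distribution symmetric so that the two-sided p-value is twice the upper tail; the function name `binom_test_two_sided_half` is made up for this illustration and is not part of any library. Under that assumption it should agree with the p-value in the R output for 457 evens out of 870.

```python
from math import comb

def binom_test_two_sided_half(successes, trials):
    """Exact two-sided binomial p-value against a null probability of 0.5.

    Only valid for p = 0.5, where both tails are mirror images of each other.
    """
    # Work with whichever count is further into the upper tail.
    k = max(successes, trials - successes)
    # Number of outcomes with at least k successes (exact integer arithmetic).
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1))
    # Double the tail probability; cap at 1 for perfectly balanced counts.
    return min(1.0, 2 * upper_tail / 2 ** trials)

# Counts for returnTo values starting with "S" from the table above.
p_value = binom_test_two_sided_half(457, 457 + 413)
print(f"p-value = {p_value:.4f}")
```

Note that testing only the one letter picked after looking at the table is a fairly informal check; a test across all rows would be a stricter one.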
Halfak (WMF) (talk · contribs)
I am glad that you examined groups for balance with respect to one covariate. Was this the first time you examined the covariates? Have you examined others?
In the future, studies should check for covariate balance, as part of the study protocol....
As I mentioned above, you could also do a subpartition of the users by id (modulo 4) and examine the 2 control groups. Having multiple control groups is discussed in the book on observational studies by Paul R. Rosenbaum and in the book on quasi-experimental studies by Cook and Campbell.
The study would have been strengthened by having been randomized, e.g., by preparing random assignments in advance, perhaps in blocks of say 30, which could have been popped as new editors appeared. This would have been trivial to implement.
Kiefer.Wolfowitz (talk) 20:12, 27 June 2014 (UTC)
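The blocked randomization described above could look something like the following minimal sketch. The arm labels, the function name `make_assignment_blocks` and the seed are all hypothetical choices for illustration; the block size of 30 is the figure suggested in the comment. Each block contains an equal number of each arm in shuffled order, so the groups stay balanced after every complete block while individual assignments remain unpredictable from the user id.

```python
import random
from collections import deque

def make_assignment_blocks(n_blocks, block_size=30,
                           arms=("control", "test"), seed=None):
    """Prepare a queue of arm labels using permuted-block randomization."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    queue = deque()
    for _ in range(n_blocks):
        # Equal numbers of each arm within a block, in random order.
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        queue.extend(block)
    return queue

# Prepared in advance; one label is popped as each new editor registers.
assignments = make_assignment_blocks(n_blocks=4, seed=2013)
snapshot = list(assignments)
first_editor_arm = assignments.popleft()
print(first_editor_arm, len(assignments))
```

In a live system the queue would need to be stored somewhere shared and popped atomically, which this sketch does not attempt.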