Research talk:VisualEditor's effect on newly registered editors

"Wikipedia's current quality control mechanisms (ClueBot, Huggle, STiki, XLinkBot, etc.)"[edit]

Rectius: English Wikipedia's etc. (the others don't have those bots, Huggle is used on few wikis and STiki even less). Does "Wikipedia" mean "English Wikipedia" across the whole page, or just for H2.3? It's always useful to use correct terminology so that readers are less confused, by the way. --Nemo 07:21, 2 July 2013 (UTC)[reply]

I believe that this particular study is only happening at the English Wikipedia. Whatamidoing (WMF) (talk) 08:29, 2 July 2013 (UTC)

Bucketing

Doesn't pushing new editors off the wikitext default mean that all possible long-term data is now moot, due to corruption of the control group? (One would expect a high user loss just from the change in editor).

Also, it's now been two weeks since the test ended. Can we get preliminary results? I don't imagine they're very meaningful, given the VisualEditor was (and is still) extremely buggy, but if they don't show a benefit for VisualEditor, the continued rollout isn't a very good idea without further development and testing. Adam Cuerden (talk) 01:51, 13 July 2013 (UTC)[reply]

Yes it does corrupt the control group, so we won't be measuring behavior after that point. A results write-up will be coming today. Stay tuned. --EpochFail (talk) 14:05, 16 July 2013 (UTC)[reply]

Block rate

It is unclear to me why there is an expectation that block rate of new editors will increase with VE. The only reason I can see for that assumption is that VE edits might be more closely scrutinized during the period of A/B testing than non-VE edits; however, that would be an artifact. Why do you expect an increase in block rate? Risker (talk) 03:19, 13 July 2013 (UTC)[reply]

Not an official answer, as I am not involved in this research, but one could imagine that if VE is perceived as making vandalism easier then there might be more vandals. That's not necessarily an accurate theory, but it is a hypothesis one might consider. Personally I would guess that VE will have a roughly proportional impact on people wanting to make good edits and people wanting to make bad edits, so that the fraction of edits that are vandalism won't change much (and hence the block rate wouldn't change much). But it would be worth testing. Dragons flight (talk) 20:02, 13 July 2013 (UTC)[reply]

Oh! That's almost funny. VE doesn't make vandalism any easier or more difficult, and I have no idea why anyone would think that. I'd suggest that a more reasonable metric would be the percentage of blocks where vandalism is the reason, since vandalism isn't responsible for most of the blocks on enwiki nowadays, and hasn't been for years. (Spam accounts are way ahead, if I remember correctly.) Risker (talk) 23:57, 13 July 2013 (UTC)[reply]

Even though you believe it's a silly concern, it's a real one for some editors. Several people have asserted that VE was a bad idea specifically because making editing easier would encourage more inappropriate editing. The general line of reasoning is that if editing is somewhat difficult, then only smart adults will do it. The unspoken and invalid assumption is that smart adults are never spammers, POV pushers, self-promoters, etc. You can see one example of this at en:Wikipedia talk:VisualEditor/Archive 2#Attracting the wrong type of editors. WhatamIdoing (talk) 16:42, 14 July 2013 (UTC)[reply]

The problem is in the inherent bias of the hypothesis, which is that the block rate will increase for VE users. The hypothesis should be that the block rate will not be affected by VE (i.e., that there will be no change). And what should be measured is whether there is a change in the block rate specific to content modification (i.e., vandalism), rather than an overall change in the block rate. The fact that VE editors are under the microscope right now means that there should be a higher rate of other kinds of blocks too. There's also an inherent bias in the study itself: we know VE edits are being reviewed at almost 100%; we also know that is not the case for non-VE edits. This will by itself affect the block rate. Risker (talk) 17:16, 14 July 2013 (UTC)

A hypothesis is supposed to have a bias. That's how science works. You form a theory, test it, and prove or disprove it. Whether the theory is that VE will or will not affect the block rate, either is a hypothesis which can be answered with objective data. You're right about there being a problem with the test in that VE edits are tagged publicly and subject to more scrutiny. But if it's known we can note it in any interpretation of the results, since the data doesn't really tell us why an effect happens, just whether it does. Steven Walling (WMF) • talk 19:06, 14 July 2013 (UTC)[reply]

I think that Risker is correct, though, in stating that the usual method is to formally hypothesize that there will be no change, and then to try to disprove that statement. Formally speaking, one doesn't prove a hypothesis to be correct; one proves its opposite to be wrong.

As for vandalism-specific blocks, that might be interesting, but the overall rate is probably more relevant, because the overall rate is what affects the community the most. If VE produced 1 extra vandal, but removed 3 spammers, it would be a net improvement, even though the vandalism-specific blocks had increased. WhatamIdoing (talk) 20:23, 14 July 2013 (UTC)[reply]

Perhaps before making that statement, some basic research on pre-VE blocking patterns might be useful. A very significant percentage of blocks include: accounts that are blocked because they're identified through cross-wiki approaches, and may never have edited on enwiki; blocked proxies; schoolblocks and IP range blocks (which can't really be reflected well here, as it will not be obvious how many editors are affected); username blocks; and general behavioural blocks. There is no point in measuring only registered accounts; after all, the original thesis is that it will be easier to vandalize, and at all times in the project's history the majority of vandalism was done by IPs. So when IP editing is given the go-ahead (I'm hoping it hasn't yet, since there are so many usability bugs at this point), only then is it reasonable to look at block rate. Risker (talk) 22:41, 14 July 2013 (UTC)[reply]

Hey guys. I'm sorry for the late response. I've been traveling recently. A hypothesis of no change is called an en:null hypothesis and it is not an interesting hypothesis in science. A good hypothesis is one that makes a testable assertion that's built on some theory about how the world works. For example, "The sun is made of Hydrogen and Helium" is a classical example of a good hypothesis statement. Note the non-neutrality.

The only things I am looking at are username blocks of editors who have saved at least a single edit in English Wikipedia. When we are discussing the burden that new editors represent on Wikipedians, I don't suspect that any block is free whether it is for vandalism or spam (arguably a class of vandalism). Certainly this sounds like something that would be good to measure. It doesn't appear that anyone in this conversation disagrees with that, so I'll keep moving forward. My dataset will be published after I've completed the analysis. You should feel free to pick it apart.

Now finally, the discussion of measuring VE's effect on anons. I totally agree. I've been overruled. It's sad. Here we are. --EpochFail (talk) 14:04, 16 July 2013 (UTC)[reply]

"edit time: Time taken to complete edit (save time - page load time)"[edit]

VE had a very long load time (it's been reduced now, but is still relatively long), so this could be a potential bias issue, as it's not quite certain if the delay will decrease time (by giving more time to think), increase time (user doesn't notice page has finally loaded for some time, due to VE's loading bar being fairly faint), or do nothing, on average. Adam Cuerden (talk) 09:26, 16 July 2013 (UTC)

Of course, the biggest source of bias is one that works against VE: the VE at the time was very, very buggy and flawed. Adam Cuerden (talk) 11:54, 16 July 2013 (UTC)[reply]

I'm sorry to leave you hanging. It looks like we might be seeing the effects of buggy/slow behavior in VE in the results. However, this wouldn't represent a statistical bias. The metric you describe is actually intended to measure *how* slow VE actually is. --EpochFail (talk) 13:51, 16 July 2013 (UTC)[reply]

"page load time" is subtracted - or did this not include the additional time for VE to load? Adam Cuerden (talk) 19:12, 16 July 2013 (UTC)[reply]

I'm sorry, but I don't understand your question. --EpochFail (talk) 19:16, 16 July 2013 (UTC)

Once the page loads, VisualEditor starts processing the page, which takes about 5 seconds for me at present. That could be considered part of the page loading, but is VE-only overhead. Adam Cuerden (talk) 10:45, 17 July 2013 (UTC)[reply]

I don't understand what you're talking about. You seem to have divided it up like this (with arbitrary numbers):

1. Load page (to read it): three seconds
2. Click the 'Edit' button: one second
3. Load VisualEditor: eight seconds
4. 'Process' the page: five seconds
5. Start typing.

I don't think that anyone else is splitting up items 4 and 5. As I understand it, everything between the point at which the person clicks on the edit button and the point at which the person is able to start typing is "page load time". Whatamidoing (WMF) (talk) 17:05, 5 August 2013 (UTC)[reply]
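To make the two readings of "page load time" in this thread concrete, here is a minimal sketch. It assumes the metric in the heading is a difference of two timestamps in seconds (the moment the editor became usable, and the moment of save) and reuses the arbitrary durations from the list above. The function name, the variable names, and the 60-second typing interval are hypothetical and are not taken from the study's code.

```python
def edit_time(save_time, page_load_time):
    """Seconds between the editor becoming usable and the save."""
    return save_time - page_load_time


# Arbitrary durations from the breakdown above, in seconds,
# measured from the click on the 'Edit' button.
load_visualeditor = 8.0
process_page = 5.0
typing_and_saving = 60.0  # made-up value, for illustration only

# Reading A: "page load" ends once VE has loaded AND processed the page.
ready_full = load_visualeditor + process_page
# Reading B: "page load" ends before the VE-only processing step.
ready_partial = load_visualeditor

save = ready_full + typing_and_saving

print(edit_time(save, ready_full))     # processing counted as load time
print(edit_time(save, ready_partial))  # processing counted as edit time
```

Under reading B the VE-only processing overhead ends up inside the measured edit time, which is the concern raised above. Under reading A it does not.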

Results write-up taking longer than expected -- Preview available at Research:VisualEditor's_effect_on_newly_registered_editors/Results

Sadly, I wasn't able to complete the whole write-up today. Happily, it's because I added a substantial amount of new analysis. Anyone who is impatient can follow the development of my draft here: Research:VisualEditor's_effect_on_newly_registered_editors/Results. --EpochFail (talk) 00:32, 17 July 2013 (UTC)[reply]

Updates on VE data analysis

I posted an update on VE data analysis and created a top-level page linking to various reports and dashboards. --DarTar (talk) 22:47, 26 July 2013 (UTC)

New study

User:Halfak (WMF): is the rationale behind H2.1 the same which is behind H2.2? Also, if the study were about a Wikipedia which is not enwiki, I'm assuming we'd know if conditions for H2.3 actually exist? Thanks. --Elitre (WMF) (talk) 20:23, 16 April 2015 (UTC)[reply]

I'm also adding a direct link to a thread about success criteria on mediawiki.org. --Elitre (WMF) (talk) 20:45, 16 April 2015 (UTC)

These are good questions. Since I originally wrote that hypothesis, I've taken to including rationales with them. E.g. Research:Asking_anonymous_editors_to_register/Study_1#Hypotheses. Re. H2.1 & H2.2, the rationale is different. For block rates, the hypothesis is that VE makes it easier to do vandalism (which is hard to justify IMO), but still a fine hypothesis to check. For revert rates, a rise could be due to vandalism or good-faith mistakes. For H2.3, it depends on which wikis we run the experiment on. I'd like to have someone local for non-enwiki wikis to help me understand how quality control work is done in that wiki, but I can learn a lot from patterns in the data about what tools are used (so long as they leave me some hints like structured comments). --Halfak (WMF) (talk) 21:52, 16 April 2015 (UTC)[reply]

Work log

Hey folks. I just wanted to post to point out that I have added a work log to this research project. You can access my entries (or even add your own!) by interacting with the "work log" widget in the upper-right of this page. I use work logs like this as a sort of lab journal. I'll do my best to record my analysis work there. --Halfak (WMF) (talk) 16:22, 17 April 2015 (UTC)[reply]

Halfak (WMF), it was interesting reading the work log. Could you try looking at the effect of usage of Visual Editor? I realize the complication that usage of Visual Editor would be X% of edits for each individual, but I suspect individuals will probably be bi-modal high% or low%. Does usage of VE vs Wikitext correlate with retention and total productive edit activity? (I'd especially like to see this after at least a month.) There is little value in a handful of inexperienced pot-luck edits from a new account that quickly quits. Someone who learns our policies and learns editing skills and sticks around for substantial long term editing is a major asset. Alsee (talk) 17:11, 5 June 2015 (UTC)

Hi Alsee. There's two ways I can read your request.

Let's compare newcomers who use VisualEditor with those who do not.: Regretfully, this requires some messy statistics around en:propensity score matching to do well. That means it would be a substantial time investment and it's not clear whether we'd be able to believe the results (requires a solid propensity prediction). I certainly would support someone else digging into this, but I think the answer would be more academic (and political) than practical. What I think we really want to know is "What would happen if we enabled VisualEditor for newly registered users?"
Let's look at the types of newcomers who choose to use visualeditor.: That's a good idea. I think that we can learn a lot by looking at what type of newcomers choose to use VE. Since software changes like these tend to have small effects (2-5%) on robust measures, we shouldn't see a lot of population-level shifts. If that's true, we can examine what cross-sections of the population of newcomers are likely to choose to use VE/not to use WikiText when given the option. I think this might be an interesting question to answer for the design of VE and to address the priorities given to the improvement of the Wikitext editor.

There's another bit that you've brought up before about looking at long-term survival. I'd be happy to look at that -- in a couple of months. We'll see better measurements the longer we wait, but we'll get good signal in short time periods too. In the meantime, I'm curious if you have a hypothesis for how enabling VisualEditor could affect short and long-term survival measures differently. We might be able to look for some short-term effects of the hypothesized underlying cause. :) --Halfak (WMF) (talk)

Questions

Aaron, What does "ve.k" mean? Used VisualEditor? Opted in? Opted in and VisualEditor works in that browser? Opened VisualEditor?

Also, "It looks like 34% of edits sessions were VE": is that 34% of all editing sessions or 34% of the sessions in that bucket?

If I open the wikitext editor and play around for a bit, and then immediately open VisualEditor and play around for a bit, is that one editing session or two? Whatamidoing (WMF) (talk) 18:23, 15 June 2015 (UTC)[reply]

Hi Whatamidoing (WMF)!

"ve.k" represents the number of edit sessions where editor = "visualeditor". I've used a more explicit label in the write-up so that should be more clear.

The 34% number represents the proportion of edit sessions where editor = "visualeditor" within the experimental bucket. It turns out that there are some very high activity users who chose to use "wikitext". If we look at the proportion of editors in the experimental bucket who primarily used VE, it's more like 57.5%. So, less than half of newcomers in the experimental condition couldn't use VE or chose not to.

If you open an editor, make a change and don't switch immediately to the other editor, I'll count that as an aborted session. However, if you open the wikitext editor just to copy-paste from it into another editor tab (a common pattern in Wikitext), I'll filter that session out with Schema:Edit's action.abort.type = "nochange". --Halfak (WMF) (talk) 22:31, 15 June 2015 (UTC)[reply]

Next question: "I limited the sampled sessions to maximum of 5 per user": First five, last five, random? Whatamidoing (WMF) (talk) 00:40, 16 June 2015 (UTC)[reply]

First five. Updated in the text. --Halfak (WMF) (talk) 16:04, 16 June 2015 (UTC)
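The two sampling rules described in this thread (filtering out sessions aborted with no change, and keeping only each user's first five sessions) can be sketched as follows. This is an illustration under assumptions, not the study's actual code: the record fields `user_id`, `start`, and `abort_type` are hypothetical stand-ins for the Schema:Edit data, and applying the filter before the five-session cap is a guess, since the order is not stated here.

```python
from collections import defaultdict


def sample_sessions(sessions, max_per_user=5):
    """Drop 'nochange' aborts, then keep each user's earliest sessions."""
    by_user = defaultdict(list)
    for session in sessions:
        # Hypothetical field standing in for action.abort.type.
        if session.get("abort_type") == "nochange":
            continue
        by_user[session["user_id"]].append(session)

    sampled = []
    for user_id in sorted(by_user):
        ordered = sorted(by_user[user_id], key=lambda s: s["start"])
        sampled.extend(ordered[:max_per_user])  # first N, not last or random
    return sampled


# Made-up records: user 1 has seven sessions, user 2 has one
# no-change abort followed by one real session.
sessions = [{"user_id": 1, "start": t, "abort_type": None} for t in range(7)]
sessions.append({"user_id": 2, "start": 0, "abort_type": "nochange"})
sessions.append({"user_id": 2, "start": 1, "abort_type": None})

picked = sample_sessions(sessions)
print(len(picked))
```

Capping per user keeps a handful of very active editors from dominating session-level statistics.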

Would it be possible to get a large handful of examples of the diffs that were being studied in the "Time to completion" section? I understand that looking at single instances risks jumping to extrapolated-conclusions, but I'm curious as to whether there are any noticeable patterns in the content of the edits themselves, such as more usage of templates, or text formatting, by the editors using VE. Thanks. Quiddity (WMF) (talk) 15:59, 16 June 2015 (UTC)

Yes! That's a great idea. I can load a sample of diffs into Wiki labels (see the enwiki project page). That way we can satisfy people's curiosity and potentially learn some cool things about the types of edits. I can load a custom form into Wiki labels along with the sample so that we can ask people to answer questions about the diffs they see and aggregate the results. What questions do you think would be interesting to ask people while they review the diffs? --Halfak (WMF) (talk) 16:11, 16 June 2015 (UTC)

Can we ask things like, "Did the edit add a citation?" Or a template that is not part of a citation? Or would it make more sense to do that through analyzing the text automatically?

How about a new sentence or new paragraph? Whatamidoing (WMF) (talk) 17:21, 16 June 2015 (UTC)[reply]

Woops. Looks like I missed this question. Sorry to leave you hanging, Whatamidoing (WMF). I think that a lot of that we can do with text analysis. But we can learn other things about the edits too by looking at them -- some things that we hadn't thought about until we see them. I just started a page for organizing the Wiki labels campaign. See en:Wikipedia:Labels/VE experiment edits. I figure that we can use the interface to review edits and share hypotheses. If we come up with something that would lend itself to a quantitative analysis, then we can dig into that too. This will help me prioritize follow-ups so that we don't go off measuring ALL THE THINGS and Halfak doesn't get to do any other science again.

Right now, we need to figure out what kinds of questions we might like to ask people while they review edits that we can't answer quantitatively (at least easily). These are usually subjective (e.g. "was this edit any good?"). I started a short list on the campaign talk page. --Halfak (WMF) (talk) 22:01, 26 June 2015 (UTC)[reply]

Scientist on vacation until June 24th

Mosquito. The primary predator of the hapless camper.

Hey folks, I want to let you know that I have some vacation coming up right after I complete the May 2015 study report. I expect there will be questions and ideas for follow-up analyses. Please don't hold back. :) But know that I won't respond until I get back on Wednesday, June 24th. In the meantime, I'll ask Dario, Whatamidoing and Elitre to address what comments they can.

In case you're curious, I'll be traveling deep into the Boundary Waters Canoe Area Wilderness, portaging some fur trading routes used by the Voyageurs. --Halfak (WMF) (talk) 13:38, 16 June 2015 (UTC)[reply]

Sunrise's questions

Hey folks, Sunrise asked some questions at the English Wikipedia Village Pump that are probably best addressed here. So, I'll copy-paste them here and answer them as best I can. Here comes the paste.

How many editors in the experimental group disabled VE? There appears to be data on editors in the control group that enabled it, but not vice versa. (And if there's a difference, is there any asymmetry, like editors in the control having a link encouraging them to try VE but not vice versa?)

When I make comparisons, I compare entire groups regardless of what type of editor was used. For stats on the proportion of new editors who left VE enabled or disabled it, see the table in the Newcomer productivity and survival subsection of the results. TL;DR: 0.4% of experimental users disabled VE and 1.1% of control users enabled VE. This observation was taken as a snapshot one week after the experimental bucketing period.

What are the absolute numbers for numbers of reverted edits (the metric for which VE performed better)? I can't see any information here, only the information for number of editors blocked. Also, why was this comparison only analyzed with Wilcoxon given that most of the others used both Wilcoxon and chi-squared?

"What are the absolute number of reverted edits" -- do you want a count of reverted edits per editor? That's the statistic I tested and requires raw data. Look at the experimental_user_metrics.tsv here.

The en:Wilcoxon signed-rank test is a test for paired data - how and when was the pairing done, and shouldn't the control and experimental groups have the same number of editors if the data were paired? (I haven't used this test much, so apologies if I'm missing anything basic.)

While you can do a pairwise wilcoxon rank test, you don't need to do it pairwise. I did not use the pairwise wilcoxon rank since no pairing could cleanly be done.
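The unpaired form of the test is commonly called the Wilcoxon rank-sum or Mann-Whitney U test, and it does not require the two groups to be the same size. As a minimal sketch of the idea (not the study's code), the U statistic can be computed by comparing every value in one group with every value in the other. The function name and the toy per-editor revert counts below are made up for illustration.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: each pair with x > y counts 1, each tie 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Made-up reverted-edit counts per editor; note the unequal group sizes.
control = [0, 1, 3, 4]
experimental = [0, 0, 2]

u_control = mann_whitney_u(control, experimental)
u_experimental = mann_whitney_u(experimental, control)

# The two U values always sum to len(xs) * len(ys).
print(u_control, u_experimental)
```

No observation in one group is matched to a specific observation in the other, which is why no pairing step is needed.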

What correction was used for multiple hypothesis testing?

None. Given the massive number of observations and the extreme levels of significance, I don't suspect it would make any difference for the small number of statistical tests we performed.

Where can we find the raw data, e.g. in CSV format?

Thanks, Sunrise (talk) 02:01, 21 June 2015 (UTC)

[1]. And thank you for your questions. Usually I go about this work without much fanfare. It's rare when science gets to be on the front line of a socio-political issue like VE in Wikipedia. Thanks for taking the time to read through the study and asking questions. --Halfak (WMF) (talk) 14:11, 24 June 2015 (UTC)

Ooops. I just saw that there was another one.

One thing that concerns me is that according to the work logs for revert rate here, Halfak seems quite doubtful that an effect exists, but on the summary page and in the above discussion, it seems to be presented as a solid result.

That's a good point and I'm sorry the change was not clear. This is an example of where, as I was writing up the report, I started to convince myself that the effect was real. As I was writing up that section of the report, I realized that I had measured the number of reverted edits per newly registered user -- which includes all users who sign up for an account but never save a single edit. This means the dataset was full of a bunch of zeros and that could have a large, normalizing (read p-value lowering) effect on the test. In the write-up, I limited the observations to new editors -- which only includes editors who edit something -- and the test showed strong significance. I noted this more appropriate denominator in the write-up too.

From Research:VisualEditor's_effect_on_newly_registered_editors/May_2015_study#Burden_on_current_Wikipedians:

To examine revert rates, we compared the raw counts of reverted edits per new editor using a wilcoxon test ...

There also seems to have been a translation from "no difference found," which is what this type of analysis tells us, to "definitely no difference" which is a very different statement that a p-value analysis can't tell us about.

Fair critique. However, I must write so that my audience understands what I think we have learned from the data. When I say, "No difference", I mean "No significant difference was detected and we had the statistical power to detect a substantial difference if it did exist". I believe that you will only find the "no difference" wording in summaries. If not, I'm happy to make some fixes if you could direct me to more specific wording issues. --Halfak (WMF) (talk) 14:25, 24 June 2015 (UTC)

Thanks for your answers Halfak - I hope you had a good vacation. Following up in the same order:

1. Sounds good. I’m interested to know - is that difference significant? (again, assuming the situation is symmetrical).
2. Yes, that’s what I was looking for. It seemed surprising to me that the specific counts weren’t reported, since they were for many (most?) of the other statistics that were tested. Could I recommend that you add it in? A table like the one for blocked editors would be ideal. Also, could you please answer the second part of my question? :-)
3. Thanks! I seem to have been misled when I used the enWP page for a quick refresher on the test.
4. Sorry, when you say “extreme levels of significance” are you referring to the result of p = 0.007 for number of reversions? (My calibrations for “extreme” may differ from yours, as I’ve had some experience with GWAS papers where p<10^-6 isn’t unreasonable.)
5. Thanks – where can I find the column definitions? Also, welcome to the world of fanfare. :-)
6. The selection of only new editors seems reasonable. That said, I have to say that “I started to convince myself that the effect was real” sends up warning lights for me (the easiest person to fool is yourself, and all that). Some of my concern would go away if e.g. there’s a plausible a priori reason to think that VE might affect this metric and not the others.
7. I have no objection to the abbreviation “no difference” as you’ve described it – I was primarily thinking of the comment which opened the enWP RfC, which says e.g. “New editors who use VisualEditor create no additional burden on existing community members. They are no more likely to be reverted or blocked…” I could abbreviate this in a similar way, but to me it feels much more definitive than I would have written based on these results.

Thanks, Sunrise (talk) 06:28, 25 June 2015 (UTC)

We're getting a big list here. I like the use of numbered lists, so I'll keep it up.

1. A two-tailed X^2 test suggests that the difference in the proportion of users who change their preference is significant. (X^2 = 37.5461, p-value = 8.928e-10)

2. I don't think that a table would work for reporting the statistics around reverted edits. Oh! And re. using a X^2 on revert, I would need to set a threshold and it doesn't seem like a threshold makes sense. This isn't the only place where I choose to use just one test. I didn't think that using a count of the # of blocks made sense either so I only used a threshold proportion there (>= 1). I could run a test of the proportion of editors who had at least one reverted edit. X^2 = 4.474, p-value = 0.03441

bucket	reverted.k	reverted.p	new editors
control	1067	0.3172762	3363
experimental	993	0.2932664	3386

4. Yeah, it might not be fair to call p = 0.007 "extreme", but it is small enough to not worry me too much given that (1) blocks props are in the same direction, (2) you can take 7 times more draw from 0.007 as you can from 0.05, and (3) there's hard-to-reconcile differences in the data -- e.g. the prop test above shows the same direction, productivity (non-reverted article edits) is not lower, the direction of effect doesn't change when I include users who made zero edits, etc. I'd be much more concerned if our p-value was closer to 0.05.

5. Good Q. I don't have time to write something substantial now, but you can review my code here: https://github.com/halfak/mwmetrics/blob/master/mwmetrics/utilities/new_users.py

user_id -- User identifier
user_registration -- Timestamp of user registration
day_revisions -- # of revisions saved (1st 24 hours)
day_reverted_main_revisions -- # of revisions made to Articles that were R:reverted (1st 24 hours)
day_main_revisions -- # of revisions made to Articles (1st 24 hours)
day_wp_revisions -- # of revisions made to Wikipedia and Wikipedia_talk (1st 24 hours)
day_talk_revisions -- # of revisions made to Talk (1st 24 hours)
day_user_revisions -- # of revisions made to User and User_talk (1st 24 hours)
week_revisions -- # of revisions saved (whole week)
week_reverted_main_revisions -- # of revisions made to Articles that were R:reverted (whole week)
week_main_revisions -- # of revisions made to Articles (whole week)
week_wp_revisions -- # of revisions made to Wikipedia and Wikipedia_talk (whole week)
week_talk_revisions -- # of revisions made to Talk (whole week)
week_user_revisions -- # of revisions made to User and User_talk (whole week)
surviving -- Boolean. Made an edit between 3 and 4 days after registering their account. (super short-term measure)
sessions -- Number of R:edit sessions (in the whole week)
time_spent_editing -- Approximated R:session duration (over the whole week)

6. As you must know, the practice of science involves skepticism and challenging that skepticism. If you're reading bias into "started to convince myself that it is real" please first observe that my initial conclusion was that "either there is no real effect or the effect is small" and it was only with follow-up tests that we agree make more sense that I switched to "there is a small, real effect". I'm not here to sing some party line on VE. I'm here to come to know things. I get no benefit from either direction of this finding. FWIW, I'm not on the Editing team; I'm on the Research team.

Now that that is out of the way, I think the a priori reasoning is that VE helps editors make fewer revert-worthy test edits or mistakes. I'm working on developing a sample to run through Wiki labels so that we can have human eyes look at the edits in order to test this hypothesis -- and others. Let's not get hung up on some tentative conclusion. Let's instead drive forward to know better. I think there is plenty that we don't know and this result -- if it really is real -- is just a teaser that suggests where we might look.

7. I think the critique is fair, but please consider how many varying tests I performed to look for differences in productivity. The p-value can't tell us conclusively that there is no difference, but we can use our own judgement to identify that there is unlikely to be any substantial difference. 20k is a lot of observations. We have large statistical power. If you are someone who is not familiar with the difference between "significant" and "substantial", then I think a solid take-away is that we found no evidence of a difference where we ought to have found a difference if there is one.

--Halfak (WMF) (talk) 16:01, 25 June 2015 (UTC)
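For readers who want to check the proportion test on the "at least one reverted edit" table in point 2, here is a minimal sketch using only the Python standard library. It computes a 2x2 chi-squared statistic from the counts in that table (1067 of 3363 control, 993 of 3386 experimental). The Yates continuity correction is an assumption about how the reported figure was produced, and the function name is hypothetical.

```python
import math


def chi2_two_proportions(k1, n1, k2, n2, yates=True):
    """Chi-squared test (1 df) that two proportions k1/n1 and k2/n2 are equal."""
    observed = [[k1, n1 - k1], [k2, n2 - k2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [k1 + k2, total - (k1 + k2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed[i][j] - expected)
            if yates:
                # Continuity correction (assumed, not confirmed by the source).
                diff = max(diff - 0.5, 0.0)
            chi2 += diff * diff / expected

    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts from the table in point 2.
chi2, p = chi2_two_proportions(1067, 3363, 993, 3386)
print(round(chi2, 3), round(p, 4))
```

With the correction applied, the statistic and p-value come out close to the X^2 = 4.474 and p-value = 0.03441 quoted in point 2. Without it the statistic is slightly larger.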

On point #7: Sunrise has quoted a summary written by the product manager at w:en:Wikipedia:Village pump (proposals)#Research results, not something that you wrote. Whatamidoing (WMF) (talk) 19:25, 25 June 2015 (UTC)[reply]

1. That sounds like it could be a good result in itself!

2. Yes, “at least one” is the threshold I had in mind. That table looks good (it could just be presented as “31.7%” and “29.3%,” maybe with some confidence intervals).

4. Okay, I don’t think this is good. In the summary, there were 9 tests reported under “productivity and survival,” 2 reported under “burden,” and a few more under “ease of editing.” Even if we round that down to 10 tests (and we’re treating the pairs of tests in “productivity and survival” as single tests), that gives a Bonferroni threshold of p<0.005, for which that result is nonsignificant. I’m aware that other corrections aren’t as strict, but this is also with a couple of favorable assumptions. So my own takeaway would be that at minimum it’s much closer to the significance threshold. You don’t have to consider my judgement though – if you haven’t done so, I strongly encourage that you check your study design with a professional statistician.

6. On your first paragraph, I wouldn’t assume differently, and my interest comes from the same sentiment. I also have some interest in the integrity of the VE enabling process since the WMF has lost a lot of trust at enWP. Of course I know you’re not involved in that yourself, and if I decide to oppose the proposal it won’t be a reflection on you in any way. I’m also here because I enjoy analyzing data (though, FYI, it means I tend to focus on criticism, so it’s not intended if I seem overly negative). On the explanation, I agree that that’s reasonable, so I withdraw my criticism in this category. I agree with the sentiment in the second paragraph as well, and I look forward to hearing about the Wiki labels data.

7. I agree with you – I wouldn’t make that critique for the conclusions that you wrote yourself, the points you’ve mentioned being among the reasons. As WhatamIdoing was referring to, this statement was directed primarily at the RfC summary (which e.g. describes the conclusions as independently finding no effect for multiple metrics).

8. To summarize, I’m still concluding that the revert effect probably doesn’t exist, largely because of the multiple testing issue. The other observation that plays a role is that when including both of the tests that were being used in the analysis, one of them is considerably less significant (unless there’s some issue I’m not aware of, like χ² tests being inherently biased towards failing to reject H₀. I know some tests work that way.) Of course, I’m open to being convinced, either now or after more data has been gathered.

9. Separately, since you’ve mentioned improvements that trend in the same direction even if they don’t reach significance, I would have been interested in a multivariate regression onto all the variables – i.e. LR test a model containing all (independent) variables relative to null (or an analogous test where correlation between variables is allowed and accounted for). And of course, one useful advantage would be the lack of vulnerability to the multiple testing issue.

Thanks again for the dialogue. Sunrise (talk) 22:26, 25 June 2015 (UTC)
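
The Bonferroni arithmetic in point 4 above (a family-wise level of 0.05 spread over roughly 10 tests) can be sketched as follows. This is only a minimal illustration: the function names are made up for this example, and the p-values passed in at the end are hypothetical placeholders, not results from the study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if p_value clears the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Family-wise alpha of 0.05 over 10 tests, as in the discussion above.
threshold = bonferroni_threshold(0.05, 10)
print(f"per-test threshold: {threshold:.3f}")

# Hypothetical p-values, chosen only to show both outcomes.
for p in (0.001, 0.03):
    print(p, is_significant(p, alpha=0.05, n_tests=10))
```

A result that is significant at the uncorrected 0.05 level can therefore fail the corrected threshold, which is the situation the comment describes. Less strict corrections exist, as the comment also notes; this sketch covers only the plain Bonferroni case.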

┌─────────────────────────────────┘
Hi Sunrise. This is great fun and I appreciate your critique. Without such critique, Science would be doomed to become elaborate bias confirmation. It's good to reason these things out. It seems that we're in clear agreement about this.

8. OK, I'm going to start with this. This is a weak result. It's small. We have a lot of observations and it's not particularly significant and it could hardly be called substantial. It's also something that we don't understand and it warrants further study -- to say the least. So, we're really nitpicking some details of interpretation that we shouldn't be taking so seriously. Now, back to addressing the critique. I don't think it is fair to count up all of the statistical tests that are discussed on the study page. If you were to do that, why stop at this study page? I ran a lot more statistical tests in the work log. I also ran a lot of statistical tests on related projects. FWIW (which may not be much), I'm relatively well published in the scholarly space of behavioral science and information technologies[2] and I have never seen a paper do such corrections unless a single meaningful hypothesis is being addressed by several independent tests. In this case, this hypothesis is being addressed by two tests that measure burdensome newcomers in two different ways. If this is not "good statistics" then I think you have the whole field to take on. I recommend starting with CHI as I've seen a few "best papers" there with dubious methodologies.

Now, re. the prop test showing lower significance. When we flatten the data to arrive at a proportion, we are not using our available signal to look for a difference. In order to perform the prop test, I flatten a positive integer (the number of reverted edits) into a simple boolean (>= 1 reverted edits?), so I'm not surprised that the statistical confidence in the difference went down. Now, did the p-value change significantly? I dunno. That's where we might need someone with a doctorate in statistics (since I am some version of a professional statistician and I *am* a professional experimenter & scientist). One thing that we can know -- if you believe the p-value of the prop.test -- is that at least some of the effect on overall reverted edit counts is attributable to making just one edit that needs to be reverted. Alternatively, the effect could have been solely due to a decrease in the count of reverted edits for those editors who were already getting reverted.

Regardless, I appreciate your skepticism as I see it as the more appropriate of two extremes (absolute belief and absolute disbelief). What I might suggest is that you consider approximate answers, possible beliefs and differing degrees of certainty as a more powerful way to view the outputs of knowledge production. I have a weak, possible belief that something happened to new editors in the experimental condition that caused them to be reverted less often and I'm excited by the potential of learning what it was -- and that it might be something very interesting.

9. How, exactly, do you propose that we do a multivariate regression with different dependent variables (e.g. productivity, blocked status, reverted edit count), but the same boolean predictor (control or experimental)? If we can find a way to spec out the independent and dependent variables, I think that, for the scalar outcomes, we'd probably look at a negative binomial regression and for boolean outcomes, we'd want a logistic regression.

--Halfak (WMF) (talk) 00:50, 26 June 2015 (UTC)
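
The flattening step described in the reply to point 8 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the per-editor counts are invented, the helper names are hypothetical, and a pooled two-proportion z-test stands in for R's `prop.test` (which uses a chi-squared statistic with a continuity correction). It is not the analysis code used in the study.

```python
import math


def flatten_to_boolean(reverted_counts):
    """Collapse per-editor reverted-edit counts into 'at least one revert' flags."""
    return [count >= 1 for count in reverted_counts]


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; test is undefined")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Invented reverted-edit counts per editor (illustration only, far too few
# editors for the normal approximation to be trustworthy).
control = [0, 0, 3, 1, 0, 5, 0, 2]
experimental = [0, 0, 1, 1, 0, 1, 0, 1]

control_flags = flatten_to_boolean(control)
experimental_flags = flatten_to_boolean(experimental)

z, p_value = two_proportion_z_test(
    sum(control_flags), len(control_flags),
    sum(experimental_flags), len(experimental_flags),
)

print("total reverted edits:", sum(control), "vs", sum(experimental))
print("editors with >= 1 revert:", sum(control_flags), "vs", sum(experimental_flags))
print("z =", z, "p =", p_value)
```

In this invented data the two groups differ in total reverted edits but have the same share of editors with at least one revert, so the proportion test sees no difference at all. That corresponds to the second scenario in the comment, where the effect sits entirely among editors who were already being reverted.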

I’m glad to hear you’re having fun also. :-) I’ll subdivide my response further since we’re focusing on point 8.

8a. I entirely agree that it isn’t something that should be taken too definitively – although it is being treated as such at enWP right now! You’re right that my idea is to include all the tests that you performed in the correction – ideally, you would specify ahead of time exactly which comparisons you planned to make. (That said, I’m aware that’s not possible when the analysis is partly exploratory, which I understand is one of the challenges for such analyses.) On fields as a whole, I’ve been told a number of times (and agree from my own experiences) that the majority of statistical analyses published in biomedicine can’t be trusted - you have to reach quite a high standard of quality. So I wouldn’t be surprised if this were a valid critique of this field as well, though that’s not something I could make a judgement on myself.

8b. You make a good point on the reduction of signal for the proportions test, and I’m not really sure how to address that. (For example, one thought is that as you’ve pointed out, it’s a large dataset, so we should presumably expect it not to matter much, but on the other hand, if the difference isn’t significant then the result might also not be significantly different from 0.05.) Either way though, that point is less important.

8c. By “professional statistician,” I meant to refer to (PhD level) statistical consultants, and/or professors at major research universities. (My apologies!) The idea would be that at the beginning of any project, or for a day every few months, they would come in to chat about your work and offer outside criticisms. If you don’t have something like this as standard practice, I’d appreciate if you could pass on my recommendation or let me know who to give it to.

8d. We actually already agree on thinking in terms of degrees of belief. My preferred approach is to (aspire to) think in terms of Bayesian updating, so in this case, I would ask: how likely is this result given true H₀, relative to how likely this result is given false H₀? My prior expectation is that for every 20 independent comparisons, on average one of them will be significant by chance. When I look at these results, I see that both models explain the data equally well, so I don’t update. Or to be more specific, there will be a slight rearrangement of probabilities towards the significant metric, conditional on there being an effect – but the overall predicted probability that there is any effect will remain the same, or even decrease.

8e. Also! - A couple of quotes from your video reminded me of the Symphony of Science series, though I don't recall which ones.

9. It would be the other way around – the outcome would be VE status, and the question would be whether it’s possible to predict from the variables (in aggregate) whether someone was experimental or control. So it would be a logistic regression, assuming we’re still not considering any repeated measures information. I’m not sure this is actually valid though, e.g. I could see it running into some reverse causality issues. Sunrise (talk) 08:09, 27 June 2015 (UTC)
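
The reasoning in point 8d can be made concrete with a short sketch, assuming independent tests at a 0.05 level. The function names and the prior of 0.3 are hypothetical and chosen only for illustration; nothing here is taken from the study data.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n independent true-null tests is significant."""
    return 1 - (1 - alpha) ** n_tests


def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Bayesian update on the odds scale: posterior odds = prior odds * LR."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Twenty independent comparisons at alpha = 0.05.
print(expected_false_positives(20, 0.05))
print(round(prob_at_least_one_false_positive(20, 0.05), 4))

# If the data are equally likely under H0 and its alternative (LR = 1),
# the belief in an effect does not move. The prior of 0.3 is arbitrary.
print(round(posterior_probability(0.3, 1.0), 4))
```

With twenty independent comparisons the expected number of chance findings is one, and the chance of seeing at least one is roughly 64%. A likelihood ratio of 1 leaves the posterior equal to the prior, which is the "I don't update" position described in 8d.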

Hey Sunrise, just got back from the holiday weekend.

8a. Points taken. I still maintain my conclusion that Bonferroni correction is not appropriate here and that your plan for applying it to every statistical test I've performed in my search for real effects is poorly defined at best. Your point about not trusting analyses I'm sure is true in every field. Trust is a weird thing in scientific inquiry that I would not ask for. Personally, I find that the recent papers published about how "all science is bunk" or whatever miss the mark in that they assume that everyone interprets significant p-values as truth and all other evidence as secondary. IMO, that's not a very good way of knowing. The critiques I do find useful are about selective reporting of tests -- something that was obviously not done here.

8c. Well, I hold a PhD in CS, but most of my work is in applied statistics. I have people who hold a PhD in statistics review my work on a regular basis (peer review), but not within the WMF. While people seem to regularly call for a professional statistician to review my work (believe it or not, you are not the first), no one ever seems to worry about a professional behavioral scientist -- which I am. I would argue that professional status in experimentation work is infinitely more important than one particular analysis methodology. If we had the resources to hire another scientist at the Wikimedia Foundation, I'd much rather hire another behavioral scientist since there's far too much work to do and I disagree about the amount of tutelage a well published applied statistician needs in the practical application of statistics. If perfect models created perfect knowledge, I might feel differently. IMO, it's better to perform a follow-up study to learn if past results hold up and why than waste any more time on refining a model.

8d. I think we're on the same page re. degrees of belief. This is why I don't feel very strongly about the result. It has not shifted my degree of belief that strongly. Further, until we have an explanation for why, I'm going to be looking for opportunities to explain it away as something trivial.

8e +1 for the Symphony of Science series. Feynman was one of the first public scholars whose lectures I found engaging. I highly recommend anything you can get from him.

9. Hmm... Now here's where we might need someone who works as a professional statistician. I've never seen a reverse prediction like this justified in a published study. I have seen them in papers under review in the past, but those papers were rejected given that the authors couldn't justify it as an appropriate methodology. Yet, I can't say exactly if or why it would be wrong. Worse, I don't know how to expect the statistics to behave or under what circumstances the model might tell us wrong things. It seems that someone wanting to do some background reading on this could start with something like this chapter from a textbook. For now, I must focus on new studies. --Halfak (WMF) (talk) 16:47, 6 July 2015 (UTC)