Research talk:Autoconfirmed article creation trial

From Meta, a Wikimedia project coordination wiki

Work log


Discussion

Experiment length

Thanks for Work log/2017-06-30. Please make sure in the configuration that the test is actually such, i.e. that the PHP code automatically disables the configuration after the expected end time without a new commit. If the correct timescale is one-two weeks, I think that's enough to see effects on casual users, but some behavioural adaptations by stubborn users may take longer. So after we measure the negative consequences we can conclude they'd be the same or greater if the restriction became permanent. (While for the positive consequences on content, these are mostly qualitative, I guess?) --Nemo 10:55, 1 July 2017 (UTC)

Projected overall impact over the longer term

My team is responsible for increasing editor participation. Our interest in assessing the impact of restricting new article creation has to do with gauging its effect on the long-term health of the wikis. What my team needs to know is, how important is new article creation to recruitment and retention of editors? What measures here speak to that? Do we know such basic facts as how many editors who create new articles as their first action go on to become active editors long term? Is there a way to project, for example, what the impact will be—in editors lost, in edits not performed, in articles not created—over a 5- or even 1-year period? JMatazzoni (WMF) (talk) 16:12, 20 July 2017 (UTC)

Hi JMatazzoni (WMF), thanks for bringing this up! When it comes to new editor retention, I would argue that hypotheses 1–7 cover that concept fairly well. That list of hypotheses includes the newly added H5 which is specifically about retention (surviving new editors), a hypothesis that should've been there to begin with, apologies for omitting that.
At the moment I do not think we know much about how creating articles affects editor retention, and I do not know of any research that looks at projecting long-term editor survival. Given that we have hypotheses that should cover the former, I'll be working on setting up data pipelines for that, and I'm interested in knowing more about it historically as well as what happens during the trial. When it comes to the latter, that looks to me like a research project that might take a few months to complete. While it is related to this trial in the sense that it would be interesting to know how the trial affects those projections, I see that as a potential project that can be done post-trial so as to not delay the trial's execution. Cheers, Nettrom (talk) 17:49, 24 July 2017 (UTC)
I'd like to suggest rechecking the data at 3 or 6 months (or more?) to see if anything shows up in the longer term. The WMF tends to pay a lot of attention to initial edits, but initial edits have a low average value. Experienced edits are far more valuable, and they're what keep the system working. Alsee (talk) 19:43, 3 August 2017 (UTC)

What is the survival rate of non-autoconfirmed article creators?

We know that the desire to create an article is what attracts a significant portion of new users to the wikis. My sense is that the questions posed in the research so far should tell us what portion of these new article creators we stand to lose if we shut the door on them. But I'm not sure that the proposed questions give us the means to understand the actual value of the editors we might lose.

To understand the value of these lost editors (assuming we do lose some), it would be very useful to know the following:

  • Currently, among non-autoconfirmed users who create articles, what percentage will still be active editors after one and two months? (We might call this the conversion rate of said editors.)

I think that unless we understand this, we won't understand how important new article creation is to recruitment and retention. It may be that many or even most of these new article creators are just spammers and vandals. I've certainly heard that said. But some of us know from experience or have heard anecdotes about the power publishing an article has to inspire and excite new contributors. I think it's important we understand how important that experience is.

Or am I missing something? JMatazzoni (WMF) (talk) 22:56, 21 July 2017 (UTC)
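A rough sketch of how the conversion rate asked about above might be computed. Everything here is an assumption made for illustration: the function name, the sample accounts, and in particular the definition of "still active" as at least one edit in a window counted in days since registration (days 30-59 for "after one month", days 60-89 for "after two months"). The trial may well settle on a different definition.

```python
from datetime import date, timedelta


def conversion_rate(creators, edits_by_user, start_day, end_day):
    """Share of article creators with at least one edit between start_day and
    end_day (inclusive), counted in days since registration.

    creators: dict mapping user name -> registration date
    edits_by_user: dict mapping user name -> list of edit dates
    """
    if not creators:
        return 0.0
    converted = 0
    for user, registered in creators.items():
        window_start = registered + timedelta(days=start_day)
        window_end = registered + timedelta(days=end_day)
        edits = edits_by_user.get(user, [])
        if any(window_start <= day <= window_end for day in edits):
            converted += 1
    return converted / len(creators)


# Hypothetical sample: four non-autoconfirmed accounts that created an article.
creators = {
    "UserA": date(2017, 1, 1),
    "UserB": date(2017, 1, 5),
    "UserC": date(2017, 1, 9),
    "UserD": date(2017, 1, 20),
}
edits_by_user = {
    "UserA": [date(2017, 1, 1), date(2017, 2, 10)],  # edits on day 0 and day 40
    "UserB": [date(2017, 1, 5)],                     # only the initial edit
    "UserC": [date(2017, 1, 9), date(2017, 3, 20)],  # edits on day 0 and day 70
}

one_month = conversion_rate(creators, edits_by_user, 30, 59)
two_month = conversion_rate(creators, edits_by_user, 60, 89)
```

Running the same function over historical cohorts and over the trial cohort would give the two figures to compare.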

Hi again JMatazzoni (WMF)! I don't think you're missing something. As mentioned above, I added H5 to specifically address the retention question, which we can further split up by whether the new editor creates an article or not.
When you talk about value of the editors that we might lose, I interpret that to also mean we should investigate the total value of the contributions these newly registered accounts make, not just the quality of their first article (covered by H17). That is different from just measuring the number of edits they might make, as the content of those edits might not stick. Am I interpreting you correctly here, or is there an aspect of this that I'm missing?
I noticed you added a part to the "New accounts" research question about the "quality of the wikis". What do you mean by "quality" in this context? Is it something that is not covered by the "content quality" research question? Cheers, Nettrom (talk) 17:58, 24 July 2017 (UTC)
This will be a useful metric (if measured both before and at the conclusion of the trial). I would suggest that we should look at 30, 60 and 90 days and also look at whether those editors edit any articles other than the ones that they have created. MrX (talk) 19:26, 24 July 2017 (UTC)
Hi @Nettrom:. Thanks for adding H5. Yes, this sounds like it speaks to my concerns. But I have a question about it that relates to your question of me. According to H5, we will figure out whether the trial causes an overall—and, I suppose, immediate—decline in the number of "returning new editors." This is good, but the overall figure for surviving new editors is just that; it's surviving new editors overall. Since non-autoconfirmed article creators are bound to be only a portion of total new users each month, losing them is liable to have a relatively small impact on the total number of returning new editors (which must total about 4-5k/mo, according to this graph). And our trial is brief. So, as another way of getting at this issue, it would be useful it seems to me to understand the base level for the percentage of non-autoconfirmed users creating articles who have, historically, qualified as "returning new editors". Then, if our other analyses tell us that shutting the door is, indeed, causing some of these editors to never create their articles in the first place, we'll know what that is costing us. Or perhaps it's better to say we'll have a different way of looking at the cost. Since H5 sets out to determine whether the proportion of surviving new editors is "unchanged," I assume you will, in fact, be calculating the historical rate, for comparison purposes?
Understanding the historical rate at which non-autoconfirmed article creators convert to "returning new editors" would be an example of what I mean when I say I'd like to understand the "value" of these new users to the wikis. Are they just hit-and-run drivers? Or does new article creation play an important part in conversion and retention? JMatazzoni (WMF) (talk) 23:08, 31 July 2017 (UTC)
Hi JMatazzoni (WMF), thanks for your comments on this! I think we're on the same page when it comes to what we'll learn by studying "surviving new editors". I do plan to gather historical data for most (if not all) of these measurements, partly to understand what has happened in the past, but also because understanding the effect of the trial means comparing what happens during the trial to what we expect would happen without it.
I've added a note to H5 about wanting to know more about the retention of newly registered accounts for those who create new articles versus those who edit existing ones. It should be useful knowledge, and I want to make sure I don't forget about it. BTW, thanks also for the link to the stats. I reran the query for only the English Wikipedia, and it's about half of the total (here's the result for reference). Cheers, Nettrom (talk) 17:50, 2 August 2017 (UTC)
Hi @Nettrom:. Thanks for your response. Looking at the data you linked to reminds me of another issue we can foresee with this short trial: the retention rate fluctuates month to month considerably, bouncing in recent months from around 2000 second-month users down to about 1400 repeatedly. So if we see a decrease—or an increase for that matter—after ACTRIAL, how are we to know whether that is the result of the policy or normal fluctuation? This is why the historical baseline might give us a better picture in some ways. If we can gauge how many users are lost because of the no-article-creation rule, and if we know the historical retention level (the "value") of those new users, it might give us a better picture of what the effect of ACTRIAL really is. JMatazzoni (WMF) (talk) 18:24, 2 August 2017 (UTC)
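One crude way to frame the "policy or normal fluctuation" question is to compare the value observed during the trial with the spread of the historical monthly values. The sketch below is only an illustration: the monthly counts are made up (loosely in the 1400-2000 range mentioned above), the function name is hypothetical, and the two-standard-deviation threshold is an arbitrary assumption rather than the method planned for the trial. A proper analysis would more likely use a time-series model that accounts for trend and seasonality.

```python
from statistics import mean, stdev


def outside_normal_range(history, observed, k=2.0):
    """Return True if `observed` lies more than k sample standard deviations
    away from the mean of the historical values."""
    centre = mean(history)
    spread = stdev(history)
    return abs(observed - centre) > k * spread


# Hypothetical monthly counts of second-month users (not real data).
history = [2000, 1400, 1900, 1450, 2000, 1400, 1950, 1500]

# A drop to 1300 stays inside the historical spread; a drop to 1000 does not.
small_drop_is_unusual = outside_normal_range(history, 1300)
large_drop_is_unusual = outside_normal_range(history, 1000)
```

With fluctuation this large, only a sizeable change would stand out, which is the concern raised above about a short trial.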


Effectiveness of the NPP

I'm thinking about other hypotheses we can use to measure the effectiveness of the NPP during the trial period. H9 says "The workload of New Page Patrollers is reduced," which is measured by the number of new articles coming into the queue. But I'm not sure "workload" is the right way to think about it -- ideally, patrollers will be able to be even more productive, because they won't be overwhelmed and disheartened by the non-AC-created pages.

MusikAnimal wrote this query to keep track of how many unique patrollers are working each month, and how many review actions they make. I'd like to know what the impact is on those measures -- are patrollers getting more done because they're not turned off by the size of the backlog? Are they less likely to patrol pages, because there aren't as many easy pages to nuke, so patrolling isn't as much fun? There's a leaderboard of who's done the most review actions -- are the most active patrollers even more active? What happens to the "long-tail" less-active patrollers?

MusikAnimal also wrote this query to see how many articles in the queue were edited by multiple reviewers before they were marked as reviewed -- an indication that the page may be a "Time-Consuming Judgment Call", which are hard to bring up to the NPP standard. I'd like to track that during the trial; I think it'll help us see if the decrease in non-AC-created pages makes it more worthwhile to spend time on the AC-created pages. Measuring the amount of time that a page spends in the backlog would be interesting, too.

I just threw a dozen possibilities at the wall; I'm not sure which ones would be the most effective. What do other people think? Pinging Nettrom, Kaldari, MusikAnimal (WMF), Kudpung, TonyBallioni, anyone else who cares. :) -- DannyH (WMF) (talk) 22:33, 2 August 2017 (UTC)

I like the idea of measuring the average number of reviews performed per day during ACTRIAL and prior to or after ACTRIAL. I believe we should be able to get that data from the logging table. Musikanimal would know more details. Kaldari (talk) 22:59, 2 August 2017 (UTC)
  • I think it'd be an interesting metric. I suspect that the activity will likely go down: you'll lose the people who sit at the front of the feed tagging articles like "Chet was a great guy. We snuck cigs from his dad when we were kids. Man, I miss those days" for deletion. The more interesting metric to me would be the ratio of active patrollers to articles created and the ratio of the average number of reviews per day to number of articles created. A decrease in gross activity isn't necessarily a bad thing if the net product is a decreased backlog and a higher quality of article that enters into the main space. TonyBallioni (talk) 23:22, 2 August 2017 (UTC)
    • Also, if there are any replies to this thread that I'd be interested in/for some reason you want my overly verbose thoughts, please ping me. I don't come on meta that often. TonyBallioni (talk) 23:50, 2 August 2017 (UTC)
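The two ratios suggested above can be sketched as follows. The function name, the parameter names and the sample figures are all hypothetical; in practice the inputs would presumably come from the logging table mentioned earlier in this thread.

```python
def patrol_ratios(active_patrollers, review_actions, articles_created, days):
    """Compute the two ratios for one period:
    - active patrollers per created article
    - average daily review actions per created article
    """
    if articles_created <= 0 or days <= 0:
        raise ValueError("articles_created and days must be positive")
    reviews_per_day = review_actions / days
    return {
        "patrollers_per_article": active_patrollers / articles_created,
        "daily_reviews_per_article": reviews_per_day / articles_created,
    }


# Hypothetical figures for a 30-day period before the trial and one during it.
before = patrol_ratios(active_patrollers=300, review_actions=36000,
                       articles_created=24000, days=30)
during = patrol_ratios(active_patrollers=240, review_actions=30000,
                       articles_created=16000, days=30)
```

In this made-up example gross activity falls (fewer patrollers, fewer reviews) while both ratios rise, which is the situation described above where a decrease in activity is not necessarily a bad thing.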
Thanks for bringing this up, DannyH (WMF)! When I was working on instrumentation yesterday and reached H9, I found that I was unhappy with the term "workload". It suggests we're measuring how much time NPPers spend patrolling, which I am not sure we're able to determine, nor am I certain we want to. When working on the hypotheses I stayed away from hypothesizing about the number of patrol actions done each day because I was concerned that it's too easy to game. Feel free to correct me if that concern is unwarranted. Measuring the number of reviewers would be more appealing to me, and TonyBallioni's idea of measuring ratios sounds good as well. Understanding more about what happens to TCJCs is perhaps something we can add as a part of H10, since it concerns the backlog (maybe we should just refer to it as a "queue" rather than a backlog, those are different things to me)? Having an understanding of the distribution of review actions per reviewer is also useful, and it also ties this trial with previous work.
A couple of the questions require surveying NPPers. Numbers can't tell us their opinions about the backlog or whether patrolling is less fun, unfortunately. I'm not against doing that, but wanted to point out the methodological difference. Cheers, Nettrom (talk) 19:17, 3 August 2017 (UTC)
I think it's possible that a New Page Patroller might try to game things by changing their reviewing behavior during the trial, but reviewing is hard work, and I don't think many people would be able to artificially inflate their reviewing work for very long. But that's just my opinion. I agree about Tony's ratios idea.
For the opinion survey-type questions, I was just using phrases like "not as much fun" to generate ideas for what we could possibly measure. Sorry that didn't come across well. :) -- DannyH (WMF) (talk) 19:54, 3 August 2017 (UTC)
For a mature adult of average intelligence, patrolling new pages is never fun. Not only is it not fun, it's a drudge and it's depressing. It's interesting to others who enjoy button mashing on web sites or playing MMORPGs, but even they tire of it after a while. I think measuring the average number of reviews performed per day during ACTRIAL and prior to or after ACTRIAL is an essential metric towards building an overview. In the various discussions on these issues, the people who think what we are planning is simply to stop non-confirmed users from creating articles are not aware of the most important issue, and that is how and by whom and by what standards we judge whether or not a new page is to be kept.
I personally don't think reviewers will even think of altering their MO during the trial, but that's just my opinion. It will certainly be interesting to see and we should preferably be geared up to measure the activity profile of individual reviewers. Slightly off topic perhaps, but there is a table somewhere (I believe made by MusikAnimal) that shows that while the number of active patrollers has halved since we created the New Page Reviewer group, the number of pages reviewed has not, so one could investigate whether or not the bulk of the work is still being done by the same people as before.
One of the hurdles to good reviewing is that non qualified users are still able to work the queue of new pages and tag them even if they can't mark them as patrolled. That's arguably what loses us the potential new authors but at the October 2016 time of my RfC it's what the community wanted. It might also be possible to examine the ratio of non qualified tagging to qualified tagging. I'm sure that ACTRIAL will reveal a whole bunch of things we did not anticipate and which will need to be addressed and improved. Kudpung (talk) 21:36, 3 August 2017 (UTC)
No problem, DannyH (WMF)! I appreciate the ideas and like a lot of them (see suggestions below), just wanted to make sure we were on the same page with regards to how we'd go about answering them. :)
Based on the discussion here, I propose that we replace H9 with a set of new hypotheses about patrol work. H9 is poorly worded, and if we keep it together with a set of measures it's elevated to a research question, which I think will be more confusing since we don't have any other RQs/hypotheses like that. I based the new set of hypotheses on Danny's questions, and tried to incorporate the comments from TonyBallioni and Kudpung. Not certain how some of these things will play out, so further feedback here is most welcome! Note that I've just numbered these in order, the numbering of the other hypotheses we have will of course have to be updated.
H9 Number of review actions will decrease.
Because the number of created articles will go down.
Related: Ratio of review actions to created articles is unchanged, because reviewers will notice that there's less work to do and adjust their efforts accordingly. (This one I'm unsure about, we might also hypothesize that they'll spend some effort chipping off older articles if that's what they're used to doing.)
H10 Number of active patrollers will decrease.
This is connected to the previous hypothesis, will reviewers notice that there's less work to do and stop reviewing, or will they continue? Again, I'm not sure here.
Related: The ratio of active patrollers to created articles is unchanged. This one will just have to be logically consistent with H10.
H11 The distribution of patrolling activity flattens.
Because there is less work to do, there is less heavy lifting that's needed to be done by the most prolific patrollers, meaning they end up doing less and evening out the distribution.
H12 The number of Time-Consuming Judgement Calls will decrease.
Looking at the graph of the backlog in en:Wikipedia:New pages patrol/Analysis and proposal#Non-autoconfirmed contributors, most of it consists of articles created by autoconfirmed users. It is reasonable to assert that most of the backlog is TCJCs, because if they weren't, they would've been marked as reviewed? Since articles created by non-autoconfirmed users are 15% of the backlog, we can hypothesize that TCJCs will decrease. The influx of TCJCs by autoconfirmed users will be stable.
Curious to know if these sound reasonable, or if there's something about NPP that I've missed. Cheers, Nettrom (talk) 16:55, 4 August 2017 (UTC)
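For the hypothesis that the distribution of patrolling activity flattens, one simple measure is the share of all review actions done by the most active fraction of patrollers, compared before and during the trial. This is only a sketch under assumptions: the function name and the per-patroller counts are made up, and the trial might use a different concentration measure entirely.

```python
import math


def top_share(review_counts, top_fraction):
    """Share of all review actions performed by the most active
    `top_fraction` of patrollers (at least one patroller is always counted)."""
    if not review_counts:
        return 0.0
    ordered = sorted(review_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    n_top = max(1, math.ceil(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / total


# Hypothetical review actions per patroller in two periods of equal length.
before = [500, 300, 50, 40, 30, 30, 25, 25]
during = [200, 150, 50, 40, 30, 30, 25, 25]

share_before = top_share(before, 0.25)
share_during = top_share(during, 0.25)
flattened = share_during < share_before
```

A lower top share during the trial would be consistent with the most prolific patrollers doing less of the heavy lifting, while an unchanged share would suggest the same people still do the bulk of the work.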

Comments from Kudpung

With all due respect, what Joe Matazzoni ("We know that the desire to create an article is what attracts a significant portion of new users to the wikis.") is missing is that where Wikipedia is the only major website, blog, or forum in the world with such a low level of control at registration and on adding content, and such open access for everyone to effect maintenance tasks that nevertheless require a knowledge of the Wiki and a set of other skills, it is an absolute magnet to inexperienced, sometimes very young users, and of course trolls (a lot of them). In 2011 89% of the patrols were done by 25% of the patrollers. To date, no recent statistics have been offered unless they have been prepared in the meantime by MusikAnimal (WMF) and I am unaware of it. I would assume however that today it's more like 90/10 and I do a lot of it myself.

In a survey I conducted on NPP in November 2011, the WMF withheld the data for a full five months until they had produced their own 'sanitised' (sic) report, which in some respects was not even close to the reality. Nearly 4,000 invitations were sent to users who had made patrols during the sample period. Of the 1,255 responses, after removing the unusable submissions, which appeared to be mainly trolling, only 309 could be used, making the exercise practically worthless. For historical purposes, the 'sanitised' results are here.

I am of course very cautious about this experiment being decided by Foundation employees who have little or no first-hand knowledge of what the current challenges to the encyclopedia actually are, and have resisted the community's pleas for a long time for something to be done. I think therefore that it would be unwise to continue to discount the extremely important empirical experience of some volunteers who have been intensely occupied with these issues for a very long time - to the extent that some of them, including admins, have given up hoping, and have even retired from Wikipedia.

I have gone through the list and these are my comments.

Hypotheses

Rather than being neutral, these all appear to be expressed as a worst case scenario by those who are already biased in favour of the misused and outdated 'anyone can edit' mantra. Today it has to be: 'Anyone can edit - if they follow the rules'.
It assumes that most users have honourable intentions when they register.
It assumes that the users with honourable intentions will be lost.

In contrast however, the research team do not appear to want new users to be offered any help or advice at their point of entry, which is contrary to the spirit with which the overwhelming consensus was achieved in 2011, and which leaves me somewhat baffled.

I will reiterate here that the consensus was not for a trial per se, but was in favour of rolling out the editing restriction forthwith. The idea to implement it as a six month trial was in deference to a not entirely insignificant number of supporters who suggested that to run it as a trial first might be more appropriate.

H1: Number of accounts registered per day will not be affected.

Safe to assume

H2: Proportion of newly registered accounts with non-zero edits in the first 30 days is reduced.

'We expect…' Yes this is the outcome desired by the community

H3: Proportion of accounts reaching autoconfirmed status within the first 30 days since account creation is unchanged.

We expect a small increase in the number of users requesting premature confirmation at PERM. We expect well over 95% of these to be denied (in the same relation as current requests for 'confirmed' which are mostly based on a perceived 'need' to upload media).

H4: The median time to reach autoconfirmed status within the first 30 days is unchanged.

Safe to assume, but what is this 30-day criterion based on (have I missed something?)

H5: The proportion of surviving new editors who make an edit in their fifth week is unchanged.

Possibly. It would be wrong to draw any premature conclusions.
The 'Further segmentation' will provide an important breakdown. History has shown that a large number of newly registered accounts do not begin editing for a very long time - often years before contributing either with edits or with new articles. It would also be interesting to see the number of new accounts whose first edits are in maintenance areas that are neither concerned with adding content nor creating articles. At NPP for example, we are aware that a large number of users begin their Wiki careers by reverting vandalism and/or tagging new pages.

H6: The diversity of participation done by accounts that reach autoconfirmed status in the first 30 days is unchanged.

This is the current ruling. It is often abused and some editors have suggested that this should be limited to 10 mainspace contributions.

H7: The average number of edits in the first 30 days since registering is reduced.

Another hypothesis; it would be wrong to draw any premature conclusions.

H8: is basically a repeat of H3. Note however that 'some newly registered accounts have a single purpose' is not a hypothesis, it is a well established fact.

RQ-Quality Assurance

H9: The workload of New Page Patrollers is reduced.

Safe to assume, but up to date stats are called for.
Recent information demonstrates that the 2011 proportions are probably still reasonably accurate. The trial will reduce the number of new pages to be patrolled, but this should not necessarily be seen as affecting the 'workload' of the reviewers - volunteers will do as much or as little as they wish and since the WMF's IEP debacle they will no longer be pushed like galley slaves. It is expected however that the reviews will be more thorough and hence expose irregularities that are still going largely undetected.

H10: The size of the backlog of articles in the New Page Patrol queue will decrease faster than expected

Not necessarily. It would be wrong to draw any premature conclusions. There is no indication that reviewers are consciously focused on reducing the backlog. There are also no coordinated efforts. Any coordination of anything at all at NPP stopped when I retired from my 6-year ex officio/de facto role of NPP micromanager.

H11: The survival rate of newly created articles by autoconfirmed users will remain stable.

It is not clear from where or by whom this 30-day cut-off was established. Is it based on the number of days it takes to nominate and delete an article at AfD? If so, OK.

H12: The rate of article growth will be reduced.

The researchers again express this as a negative hypothesis, the community sees this effect as a plus. Quite apart from the very high number of new articles that are totally unsuitable, a very large number of the few remaining titles that might possibly have some potential are dumped into the encyclopedia corpus with no intention whatsoever on the part of the creator to provide even a minimum of compliance with MoS, anything longer than ten words, or any reliable sources. These users generally do not return. It's a scandal to expect the volunteers to turn these things into respectable articles - and we rarely do. Thus such pages will remain perma-tagged perma-stubs for years, and usually do.

H13: The rate of new submissions at AfC will indeed increase

but by far less than any negative hypothesis will suggest. AfC however has its own serious problems of quality reviewing and measures really need to be introduced before AfC can be considered a core, essential process such as NPP. There are strong voices for merging AfC and NPP into the same interface while retaining their own special characteristics.

H14: The backlog of articles in the AfC queue will increase faster than expected.

Safe to assume, but 'faster' is an unfounded presumption.

H15: The reasons for deleting articles will remain stable.

There is no reason to posit an Hypothesis - this result cannot be surmised in advance, but it is very possible that the opposite may be revealed to be true.

H16: The reasons for deleting non-article pages will change towards those previously used for deletion of articles created by non-autoconfirmed users.

There is no reason to posit an hypothesis - this result cannot be assumed in advance. If the trial is conducted in the manner required by the community’s consensus, those who can’t wait for their account to be confirmed will use the Wizard and the article will be created in the Draft namespace.

RQ-Content quality

H17: The quality of articles entering the NPP queue will increase.

Safe to assume.

H18: The quality of newly created articles after 30 days will be unchanged.

Safe to assume.

H19: The quality of articles entering the AfC queue will be unchanged.

Safe to assume, but as AfC has never been more than a volunteer project (as opposed to NPP), it has never been the subject of any formal research.

Thank you for reading. As in the old days with the VCEO and C-level staff, I'm available any time for a multi-way video conference. As a normal volunteer I can't afford to travel halfway round the planet to discuss these issues at Wikimania as has been suggested to me. Kudpung (talk) 03:32, 3 August 2017 (UTC)

You might consider reviewing the meaning of hypothesis. Essentially, it's only a "we think this is how this works" statement, which is to be tested (review section "working hypothesis"). That means, for example with H15, that it's fine if the hypothesis "fails"--that's the point of positing the hypothesis. "Safe to assume" statements may not be borne out, but you can't know whether they are unless you have the hypothesis to begin with. --Izno (talk) 15:25, 3 August 2017 (UTC)
Which, perhaps with "safe to assume" you could say "evidenced in researched/anecdotal qualitative data" if that is what you mean. --Izno (talk)
Thank you for the lecture Izno. I think we can dispense with the petty nitpicking. The objective is to move this research forward and to help people like the statisticians, developers, and you who have never even done any new page patrolling better understand what they are supposed to be researching. 20:59, 3 August 2017 (UTC)
@Kudpung: Enough with the barbed comments--this is not the first. My experience in NPP is irrelevant to my prior comment, which was in good faith. It was a reasonable presumption, based on the tone of your comment to which I responded, that you did not understand what a hypothesis is. --Izno (talk) 01:29, 4 August 2017 (UTC)
Hi Kudpung, thanks for chiming in here! Based on your comments, I get the sense that we largely agree on what the effects will be, although we might reach them for different reasons because we approach this from different angles. At the same time, there were a few places where we seem to disagree, and I'll try to clarify how I approach this trial and the hypotheses that go with it.
You wrote: "Rather than being neutral, these all appear to be expressed as a worst case scenario…" When generating these hypotheses, I do my best to lean on what previous research and/or community insights suggest will happen. In this case, some of the hypotheses related to behaviour of newly registered accounts are related to a study by Drenner et al. I hadn't referenced that paper in the description of H2, which I've now fixed by rephrasing it to include that reference. They found that (unsurprisingly) increasing barriers to entry reduced the proportion of users who cross the barrier. Requiring autoconfirmed status to create articles is a barrier. Following the literature I then hypothesize that users who otherwise wouldn't cross the barrier will leave, and the remaining hypotheses should be logically consistent with that initial hypothesis.
I appreciate that you brought up the spirit with which the overwhelming consensus was achieved in 2011, I went back and read the closing of the initial proposal again and noticed the mentions of changes to the Article Wizard and AfC. While I don't disagree with that being a good idea, I also see two challenges. First, it would require substantial design and development resources, which would further delay the trial. Secondly, we would then be experimenting with multiple changes at the same time, complicating the process of analyzing what's going on to understand causes and effects. Since one of the purposes of the trial is to provide data for a community discussion post-trial, making fewer changes will help inform that discussion rather than make it more confusing. At this point, I also don't see further delaying the trial as something the community is interested in.
Regarding the 30 day limit for article survival in H11, that is based on the research paper by Schneider et al (referenced in the description of the hypothesis). The limit is based on the time it would take an article to go through AfD, yes. There also needs to be a limit to ensure we don't have an infinite study period. Thanks again for the comments! Cheers, Nettrom (talk) 22:17, 3 August 2017 (UTC)
Hi Nettrom. I think we all want this trial to begin as soon as possible. My comments regarding the Wizard may have been taken slightly out of context at some stage; for the purpose of ACTRIAL I have been simply working on significantly redacting the enormous walls of text (without losing their messages) that new users are expected to wade through and become Wikiexperts in order to create their first article, but I am not attempting any reprogramming of it. I haven't insisted that the excellent Article Creation Flow that Brandon Harris was working on five years ago should be completed in record time, because it will probably contain elements of, or even largely replace the Wizard. What I do firmly believe however is that right on the account registration page there should be a friendly message, something with a psychologically positive approach such as:
If you are going to create a new article, we have put in place some exciting features to help you through the process and it will be seen by visitors to Wikipedia as soon as it meets our minimum standards for display.
with Kaldari's start page appearing shortly afterwards. I read somewhere once that the vast majority of users who create an account never make an edit, and of course I do not know what percentage of those who do, create an article in mainspace or make simple edits. What is clear however, is that a large number who register do so with the express intent of posting hoaxes, attack pages, nonsense, and spam, and making simple vandalism edits.
I understand your point of view based on the 3rd party research. I also respect your leaning on 3rd party academic research. Empirical conclusions from getting our hands dirty are nevertheless so important they should be factored into the arguments. Therefore where Drenner et al may be perfectly right, their paper was not based on our specific challenges for which we do not yet have any results to analyse - and even then, we don't know and never will, what the articles would have been from the supposed 'lost' editors. I am of course hoping (having seen what arrives into the en.Wiki on a regular basis for years) that it will be no significant loss. I would thus mention a suggestion Jimbo Wales made some years ago that users should concentrate less on the number of articles and do more about the quality of the existing ones. However, in general, the WMF clutches at figures of raw growth as a sign of Wikipedia's esteem and success. By reducing the possibility for first-time users to create articles directly in mainspace we might indeed lose a tiny number of potential new, good faith editors, but we are on a warpath against a huge, currently insurmountable flood of rubbish and while there is going to be some minor collateral damage, ACTRIAL is not Wikipedia’s Hiroshima or Nagasaki. Kudpung (talk) 03:24, 4 August 2017 (UTC)

Secondary effects

Secondary effects may be hard to measure, and may take time to manifest. But I'd like to suggest a few:

  • Once the backlog is brought down, experienced editors will be freed up to do other work. This includes quality-article creation, general article improvement work, and other tasks.
  • A reduction in garbage-pages may improve general morale and reduce cynicism.
  • People who jump right into creating an article, with zero general editing experience, are likely to have a very bad experience. Not only are we likely to lose them permanently, it can feed bad word-of-mouth about Wikipedia. The best outcome is if we either funnel new users into general article edits first and/or funnel them into a better new-article process. Alsee (talk) 20:44, 3 August 2017 (UTC)
Hi Alsee, thanks for suggesting these! I'm unsure how we would go about studying the first two, but we are definitely looking at the third one. One of the hypotheses brought up is that not being able to create articles might lead to a better new user experience, because creating an article only to see it being deleted shortly thereafter is demotivating. We've proposed H5, looking at surviving new editors, as a way to get at whether the new user experience is improved. Cheers, Nettrom (talk) 22:51, 3 August 2017 (UTC)
@Alsee and Nettrom: We're already very concerned at the way absolutely anyone without the slightest knowledge or experience is still allowed to work through the new-article queue and tag articles quite wrongly thus biting not only creators who should be sent packing, but also good faith users who are not aware of doing anything wrong - and these are the ones we need but who won't come back. In the RfC I proposed in October last year, a few users were vehement almost to the point of personal attacks that we should allow the newbie maintenance workers to continue to patrol new articles. I do not know of a workaround, but I firmly believe that the experience of good faith new users will be greatly improved during ACTRIAL. The analysis will provide us with an important insight. Kudpung (talk) 01:33, 4 August 2017 (UTC)
  • The first two points are the most important reasons for the trial. This is what I meant in my note at the bottom of this page. The most important thing about this trial is NOT its effect on new editor retention. Please figure out a way to measure the first two things. Jytdog (talk) 02:25, 19 August 2017 (UTC)

Control group?

Will this change be implemented for a randomized set of contributors, i.e. will there be a control group? --The Cunctator (talk) 15:08, 4 August 2017 (UTC)

Hi The Cunctator, It's not a change per se. It's a trial of a fixed duration. Because it involves only new users, control groups are not possible. However, we will be making comparisons of a sample period before the trial, during the trial, and after the trial. The data that will be gathered and analysed is quite complex in order to get the best overview of the effect of the trial, and will also be of great help for many other aspects of the Wikipedia. Kudpung (talk) 15:48, 4 August 2017 (UTC)
Kudpung thanks for taking the time to reply. Just fyi this is a semantic argument and frankly not productive: "It's not a change per se. It's a trial of a fixed duration". This is false: "Because it involves only new users, control groups are not possible." This is irrelevant to my question: "The data that will be gathered and analysed is quite complex in order to get the best overview of the effect of the trial, and will also be of great help for many other aspects of the Wikipedia." When the answer to a question is "no", it's usually best to say "no." --The Cunctator (talk) 16:13, 8 August 2017 (UTC)
Hi The Cunctator, thanks for asking about this! In this case it's not really feasible to have a control group. Instead, we'll have to implement it for everyone, and plan to make extensive comparisons against historical data to understand what the effects are. Cheers, Nettrom (talk) 20:37, 4 August 2017 (UTC)
Why is it not feasible? --The Cunctator (talk) 16:13, 8 August 2017 (UTC)
Hi again The Cunctator, and thanks for asking more questions! I should've elaborated on that in my previous response, sorry. I see challenges to having a control group for technical, research-methodological, and community reasons. While I'm not deeply knowledgeable about the MediaWiki software, I do know that this trial affects several different pages on Wikipedia. Modifying the software so that it's possible to track which group the users are in and make sure that they get the appropriate user experience isn't necessarily straightforward (e.g. did we miss a page somewhere?). Most of the research I've seen that does that kind of A/B test usually modifies one or two elements on a single page in a fairly simple interaction (e.g. web search), whereas the interaction with Wikipedia is more complex. Research-wise, I'm concerned about how easy it would be to figure out what group a user is in, and I'd prefer if users get treated the same regardless of group. Also, I don't think we'll learn much more from having a control group since there's a lot of historical data to lean on. When it comes to the community, the proposal of the trial asks for it to affect everyone. Secondly, I think having a control group might also result in some very confused discussions, for example on the Teahouse ("why were they allowed to create an article and not I?"). Lastly, I think it would also make it more difficult to understand how the trial affects the Wikipedia community, which in turn affects the post-trial discussion of whether to implement this change permanently. Let me know if that doesn't answer your question and I'll try to explain more. Cheers, Nettrom (talk) 17:33, 8 August 2017 (UTC)
Thanks. I understand that there are plausibly profound technical challenges to implementing a control group during the trial, and I can see there being confused discussions on Teahouse. But I will push back strongly on the claim that there's no research cost to failing to have a control group because there's historical data. That's not how it works! Randomized trials are qualitatively more informative and reliable than the alternative. It would be great to have researchers recognize that there is a real cost to not running a randomized trial, and that a meaningful attempt is being made to minimize that cost. E.g. given the lack of a control group, what do we expect needs to be done to provide a reasonable comparison between the trial period and the historical data? There seems to be consideration of this, e.g. recognizing that there are seasonal waves of new edits, but it would be productive I think to be even more intentional.
Suggestion - one constrained version of control-group testing for ACTRIAL could be with the special case of editathons - where new users are trained and unleashed on Wikipedia. We could see what happens to editathon participants who are locked out of page creation and compare them to ones who are allowed page creation. Certainly an edge case but should be implementable. --The Cunctator (talk) 18:16, 8 August 2017 (UTC)
The Cunctator: there is discussion at en:Wikipedia:Village_pump_(proposals)#Allow_users_in_the_Account_Creator_user_group_to_add_users_to_the_Confirmed_user_group that would have made such a suggestion much easier. While I support it and there is plenty of time for the RfC to run, it is quite possible at this time that the community might reject the idea of editathon participants being regularly exempted from ACTRIAL. You might want to mention these concerns there if you haven't already weighed in. TonyBallioni (talk) 23:27, 8 August 2017 (UTC)
TonyBallioni: Oops, too late!--The Cunctator (talk) 14:12, 18 August 2017 (UTC)
The Cunctator: The en.Wiki receives around 1,000 new pages every 24 hours. There are perhaps 1,000 participants at editathons over any three months. Editathon participants (I've facilitated a few editathons) are genuinely motivated people who have not come along to learn how to spam or vandalise the encyclopedia. I therefore can't quite see how they could be a viable control group. I can't immediately see a way of randomly splitting the daily intake into A and B groups, but if you have an idea how this could in fact be technically implemented on the current Page Curation system, please let us know. Kudpung (talk) 09:46, 9 August 2017 (UTC)
Kudpung: As I said, it's an edge case. We'd only be directly comparing the experience of editathon participants with ACTRIAL. As you implied, they are extremely high-value new editors, and it's therefore reasonable to try to establish the best-quality research conditions for them. There are a number of the hypotheses listed on the main page that could be examined with this. Furthermore, I'm operating under the assumption you also recognize that there is some percentage of non-editathon participants who are not spammers or vandals. Thus the comparative experience of editathon participants who are blocked from creating new pages with those who are not should give us some baseline information about the good-faith new editors in the wild. --The Cunctator (talk) 14:11, 18 August 2017 (UTC)
The Cunctator, for a control group, the sample size of editathon participants would be far too small to produce any data that would impact the findings of the trial. The trial begins very soon and we are just waiting for some final technical tweaks. Kudpung (talk) 20:16, 18 August 2017 (UTC)
Since it's not clear, I've been talking about comparing sets of editathon participants against each other. There is no size problem in that case. Furthermore, my understanding is that WMF is engaging in the research and data analysis, not the advocates of limiting article creation to autoconfirmed users, so it would be great to have a WMF rep weigh in. --The Cunctator (talk) 00:36, 21 August 2017 (UTC)
If you wish to set up control groups among the editathons, I don't believe such an experiment is within the official scope of the upcoming trial. Probably the best thing would be to discuss it with the editathon organisers - the data may be useful. ACTRIAL is a community project. The Foundation has offered to pay a statistician to help analyse the results. The WMF is also doing the technical requirements for us as the volunteers do not generally have access to the software. There are also software security issues to be addressed. Kudpung (talk) 01:55, 21 August 2017 (UTC)

┌──────────────────────────┘
(Excuse me as I outdent this…) The Cunctator: Sorry for not responding to this earlier! I agree with your point that there are benefits to a randomized trial, and I was not trying to imply that the setup we have is equivalent. What we do have are a large number of constraints on how this is going to happen. Planning and executing an A/B test around edit-a-thons is unfortunately outside of the possibilities.

When it comes to the methodologies available here, I am interested in applying interrupted time series analysis on this, as well as forecasting models (e.g. ARIMA). In the former case I'll be looking into how to control for autocorrelation in the data whenever autocorrelation is an issue. In the latter case we can use the models for forecasting and see to what extent our measurements are outside the expected, and we can also build models on data from different years to understand if the trajectories are different. At the moment I'm focusing on getting our data pipelines set up and doing some analysis of historical data as it comes in. I hope that addresses some of your concerns! Cheers, Nettrom (talk) 00:42, 22 August 2017 (UTC)
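As a rough illustration of the autocorrelation concern mentioned above (and not the analysis used in the trial), here is a minimal Python sketch that estimates a before/after level change in a daily series and then checks the lag-1 autocorrelation of the residuals. The daily counts, the function names, and the breakpoint index are all made up for illustration; a real interrupted time series analysis or an ARIMA model would do considerably more than this.

```python
from statistics import mean


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a numeric sequence (0.0 if it is constant)."""
    m = mean(series)
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    den = sum((x - m) ** 2 for x in series)
    return num / den if den else 0.0


def level_change(series, start):
    """Mean of the series from index `start` onward minus the mean before it."""
    return mean(series[start:]) - mean(series[:start])


# Made-up daily counts; the hypothetical intervention begins at index 6.
counts = [100, 104, 98, 102, 101, 99, 60, 62, 58, 61, 59, 60]
start = 6

change = level_change(counts, start)

# Residuals after removing each segment's own mean level.
pre_mean, post_mean = mean(counts[:start]), mean(counts[start:])
residuals = [x - pre_mean for x in counts[:start]] + [
    x - post_mean for x in counts[start:]
]
rho = lag1_autocorrelation(residuals)

print(f"level change: {change:.1f} per day, lag-1 autocorrelation: {rho:.2f}")
```

A lag-1 value far from zero would suggest that a plain before/after comparison understates the uncertainty, which is the situation where the correction mentioned above becomes relevant.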

Something to think about

As far as I know, there is no group at WMF that is concerned with raising or maintaining article quality nor with retaining existing editors. These are two of the greatest concerns in the editing community, driving this trial. The key questions are - is less trash introduced to WP, and are editors who volunteer for NPP less burned out and disgusted by the way people abuse WP by dumping garbage into it?

User:JMatazzoni (WMF) and others keep describing this trial in the language of threat with regard to raw new editor retention, and I just don't understand this focus, unless.... that is a metric by which some folks' job performance is measured. Is there some kind of role-based conflict of interest here? That is a real question.

I also strongly object to language like "shutting the door" -- there is no sane organization that actually lets new people just do whatever they want; companies have orientation, and people go through drivers ed before they start driving... it could very easily become very normal, very fast, that new editors put articles through AfC at first. Totally. Normal. No "door closed" at all - just a sanely guided process that remains open to everybody. The "shutting the door" framing is as unfortunate as "death tax".

The most important outcome measure of this trial is not its effect on new editor retention, but it is sure looking that way based on the discussion above.

I am concerned about the importance of the "new editor retention" concern, and the framing of this trial as a risk to that, becoming too great an influence as the experimental design is worked through.

Thoughts? Jytdog (talk) 02:22, 19 August 2017 (UTC)

The concern that I have voiced many, many times over the past years is that the WMF is afraid of anything that would lower the raw number of creations and registrations. They believe in proudly reporting growth in numbers, rather than in growth of quality which is not so easily measured. They, and many users, misuse the mantra 'The encyclopedia anyone can edit' to mean no rules or restrictions can or should be applied. It doesn't. Kudpung (talk) 03:22, 19 August 2017 (UTC)
I'll just add what I said a couple of weeks ago further up: By reducing the possibility for first-time users to create articles directly in mainspace we might indeed lose a tiny number of potential new, good faith editors, but we are on a warpath against a huge, currently insurmountable flood of rubbish and while there is going to be some minor collateral damage, ACTRIAL is not Wikipedia’s Hiroshima or Nagasaki. Kudpung (talk) 12:56, 19 August 2017 (UTC)
From Community Tech's point of view, the purpose of the trial is to see what the impact is in multiple areas, as you can see on the Research page: new editors, quality assurance, and content quality. We want to get a holistic view of what happens when non-autoconfirmed editors are redirected to Article Wizard, so that we can all talk about the impact and make decisions about what to do.
Jytdog, you're right that there isn't a product team right now that's working on either raising quality or onboarding and retaining new editors, and that's actually a source of frustration for me. I've had a lot of internal conversations about this stuff over the last six weeks, and one of the points I've been making is: if the new editor experience is important to the Foundation, then we need to invest more in improving that experience -- and that investment includes supporting the moderation and curation processes. I'm hoping that what we learn from this trial will help us to work on that.
I've actually asked Joe to stop commenting on these pages, because he's giving the wrong impression about the work that we're trying to do here. Joe isn't on the Community Tech team, and he's not working on this project. I'm sorry about the confusion. -- DannyH (WMF) (talk) 18:59, 21 August 2017 (UTC)
Thanks for your note - I feel like you were really trying to hear me. But please be aware that what I wrote was ... "nor with retaining existing editors". It is so odd that you walked away hearing "retaining new editors". I was talking about retaining experienced editors, who get burned out dealing with the torrent of garbage that gets added to WP every day, and from dealing over and over and over with the same horrible arguments to retain bad content from the associated torrent of new editors. Again with this focus on new editors. Ack. Jytdog (talk) 19:05, 21 August 2017 (UTC)
Jytdog: No, I was talking about that too. :) In this trial, the experienced editors are the ones in the "quality assurance" area. And that's what I mean about "supporting the moderation and curation processes" -- that too much work is being piled on experienced editors. I should have been more clear about that. -- DannyH (WMF) (talk) 20:41, 21 August 2017 (UTC)
Thank you for answering :) Jytdog (talk) 20:53, 21 August 2017 (UTC)
This may stem from the Foundation's little hands-on experience as volunteer editors and despite all the discussion, still not fully understanding the importance of patrolling new pages. It's therefore normal that comments such as '...and that investment includes supporting the moderation and curation processes' will continue to be met with scepticism by the community so long as the maintenance of such crucial software is relegated to a queue of wishes for convenience and comfort gadgets and gimmicks. And it will certainly be another year before anything is done - if at all, by which time any truly experienced patrollers will have completely lost interest; it's happening already. Kudpung (talk) 21:24, 21 August 2017 (UTC)

Effect on edit-a-thons…

I know the trial is going to start very soon, but I'm interested in making sure that Wikimedia/the Wiki community collects data about an issue that isn't mentioned here yet. Wikipedia Edit-A-Thons have a unique intersection with the Autoconfirmed article creation right because many attendees of these events (1) are new to Wikipedia, (2) get in-person training on Wikipedia policies, (3) immediately attempt to create new articles for Wikipedia. I want to make sure that the impact of the ACTRIAL on edit-a-thons is documented quantitatively. Specifically I would suggest maintaining quantitative data on:

  1. the number of new articles created and kept from edit-a-thons
  2. the number of new editors registered at edit-a-thons, and how many remain active
  3. the wait times involved in the AfC process for draft articles created at edit-a-thons

Separately, it may be of interest to know

  • some metric of the relative quality (i.e., ability to survive deletion) of edit-a-thon articles created by non-autoconfirmed users

During and after the ACTRIAL, we should also look qualitatively at how the trial impacts the process of running edit-a-thons, perhaps by surveying organizers or participants.--Carwil (talk) 16:38, 31 August 2017 (UTC)

Hi Carwil, there have been some discussions over the last month about ACTRIAL's potential impact on editathons that you might be interested in -- first, a discussion on the English WP ACTRIAL talk page, and then an RfC on en's Village pump (proposals) about creating a new user right for Event Coordinators.
I'm not sure if we can measure the impact on edit-a-thons, because (as far as I can think of) there isn't a way to tell which edits/contributors are connected to edit-a-thons. There are a couple tools that program organizers commonly use, but they're not used universally. But I might be wrong about that -- do you know of a way that we could reliably tell who's at an edit-a-thon? -- DannyH (WMF) (talk) 17:40, 31 August 2017 (UTC)

Effects on page deletion

I was surprised not to see any mention on this page about how the number of deletions will be affected. I hypothesise that it will have substantially decreased the number of deletions and that the proportion of A7 and G11 will have changed. AFAIK there are no good tools available for analysing the deletion logs, but I'm in the process of writing a program to scrape it, count the categories etc. and should be able to see what, if anything, has changed. If anyone knows of previous work done on the deletion log, please let me know. Smartse (talk) 23:37, 7 February 2018 (UTC)

@Nettrom: I spotted you've created deletionreasons.py which reads from an SQL database. Is this only accessible by WMF staff? Smartse (talk) 10:16, 8 February 2018 (UTC)
Smartse: en:WP:MANPP were the numbers we had beforehand. TonyBallioni (talk) 13:20, 8 February 2018 (UTC)
Hi Smartse, thanks for asking about this! We have two hypotheses about how deletions are affected by ACTRIAL. H18 concerns articles (the Main namespace), and H19 is about all other namespaces. When I worked on gathering data for this, I chose to focus on the User and Draft namespaces for H19, partly due to how we've seen an increase in Draft creations during ACTRIAL, and partly due to AfC submissions coming from those two namespaces. I documented this in the January 16 work log, then wrote the code you refer to, gathered data, and started analyzing it. In our case, we are interested in summary statistics and only store counts for each day and namespace; we do not store the actual log entries.
The databases used for getting the data and storing it are both on Toolforge and are accessible to anyone with a Toolforge account. If you have a Toolforge account and wish to access our data, connect to the tools.labsdb server and use the s53463__actrial_p database. The table you want to query is deletion_reasons. The database is updated daily around 01:00UTC. If you'd prefer a TSV file, I can add the dataset to our dataset page and set it up to be updated once a day as well.
I've analyzed the data on deletions in all three namespaces. The January 22 work log shows the results for H18 (Main). We compared the first two months of ACTRIAL against the same time period in 2012–2016. Overall, there's a significant reduction in deletions: the median number of deletions per day drops by about 220 pages. The table shows a breakdown of this, and A7 and G11 have the largest drops in number of pages/day. Most of the reduction in deletions comes through CSDs.
A similar analysis for the Draft namespace is in our January 19 work log with the breakdown table in the January 23 work log. We compared against 2015 and 2016 as the Draft namespace was created in late 2013 and deletion activity fluctuated a lot in 2014. There's a significant increase in deletions in the Draft namespace of about 30 pages/day. That increase comes through G11, G13, and "other", the latter being our catch-all category for deletions that don't appear to match any of our identifiers.
Lastly, our analysis of the User namespace is also in the January 23 work log. There we use only 2014–2016 due to the introduction of U5. We find a small and marginally significant increase of about 15 pages/day. Changes in reasons for deletions are more varied. U5 still shows a large increase, "other" as well. U1 and G13 are down, the latter most likely due to consistent usage in previous years.
I'm working on updating our research page with these types of results, and will make sure this gets on there before the end of the day. Feel free to ask questions about any of this! Cheers, Nettrom (talk) 18:12, 8 February 2018 (UTC)
Nettrom: does your statistical analysis for the draft namespace take into account that G13 was very recently expanded to include all drafts, not just AfC drafts? That could be a confounding variable. TonyBallioni (talk) 18:34, 8 February 2018 (UTC)
TonyBallioni: No, I was not aware of that. Do you happen to have a link to the discussion about it, or maybe an RfC? It would be great to have that referenced!
I am unsure to what extent the change in G13 affects our analysis. There are also fairly large increases in G11 and "other", and deletions increased in general, so we're likely to find a significant increase if we withheld G13. Looking at the breakdown graph in the work log suggests that G13-deletions happen at a rate comparable to that of 2015, while they happened less frequently in 2016. The bigger question is perhaps whether G13 deletions are up due to ACTRIAL causing more focus on the Draft namespace, or whether it's caused by the wider definition of G13. Our dataset of Draft creations and AfC submissions could help answer that, but I see that kind of a detailed analysis as outside the scope of the current work. Cheers, Nettrom (talk) 19:16, 8 February 2018 (UTC)
Sure, this is the RfC. I was unsure whether it would impact the research, but thought it worth pointing out since you mentioned an uptick in drafts. The G11 increase makes sense as well. Thanks for all your work. TonyBallioni (talk) 19:20, 8 February 2018 (UTC)
TonyBallioni: Awesome, thanks for grabbing that RfC! I added a note to the January 23 work log about this, so it's documented there as well. Cheers, Nettrom (talk) 22:57, 13 February 2018 (UTC)
Nettrom Ah ha! Thanks very much for that. Turns out I should have read the research page more closely. While I'm a little annoyed that you've already answered my questions, so I need to find a new project, those are some really interesting results! Fingers crossed that there will be no associated decrease in overall article creation. I've signed up to Toolforge so will hopefully be able to make my own queries soon. I'd been looking for somewhere to try out SQL in the real world (very much a DS noob).
I noticed here though that you're failing to categorise 10-20 % of mainspace deletions, with it doubling in the ACTRIAL dataset. These percentages seem pretty high to me, given that admins shouldn't be deleting articles without specifying a valid reason. Does your regex catch uncapitalised rationales like a7/g11? I know that 'spam' is given as a reason a fair amount of the time as well, rather than g11. Smartse (talk) 21:28, 8 February 2018 (UTC)
Oh one more - how do you deal with log entries where multiple rationales are given? A7 and G11 for example are often used together. Smartse (talk) 22:03, 8 February 2018 (UTC)
Smartse just a quick note, in case you're not aware: you can run SQL queries even if you don't have a ToolForge account, using Quarry. In fact, it's substantially easier (though very long-running queries may time out). You can see a somewhat random assortment of Quarry queries on my own Quarry profile, and on Nettrom's. Cheers, Jmorgan (WMF) (talk) 18:35, 9 February 2018 (UTC)
@Jmorgan (WMF): No I wasn't aware of that! I've made a little start over there. Thanks! Smartse (talk) 12:53, 10 February 2018 (UTC)
Smartse Thanks for asking questions about this! It's also great to see Jmorgan (WMF) point you to Quarry, a wonderful tool for sure.
You ask good questions that apply not just to this project, but to anyone doing data analysis. To what extent are we capturing the right data? I just updated the GitHub repository to make sure its version of deletionreasons.py matches the one we're using. The regexes used for catching CSD, PROD, and AfDs are in lines 109-111. As you can see, we're requiring a wikilink (to policy) for all three, meaning that we won't catch free-form text referring to a policy (e.g. "delete per g11"). We do match anywhere in the log comment, so we're not requiring the link to be at the start. There are a few reasons for doing it this way versus other ways. Requiring the link means that if we have a match, it's unlikely we caught something else. If we run a case-insensitive match on "A11" and someone deletes "User:FooBarya11/baz" with the comment "delete per request from User:FooBarya11", we'd be incorrectly labelling it (should be U1). Secondly, the link allows us to easily pick up the reason without requiring a set of options in the regex because the pattern in the link is very regular. Can we consistently match general reasons using a "G\d\d?" regex? I'm not so sure. Lastly, having just three regexes makes the logic of matching fairly straightforward. In order, we check for CSDs, PROD, and AfDs, and anything not matching those is "other". A more flexible setup will likely involve more regexes, which means we might have to start accounting for processing time, and perhaps more logic to decide their priority.
I looked at the code, and no, we don't catch multiple references in the same log comment (it's a single call to each regex's search() method). We don't have aliases for any of the references either. If you're looking for a data science project, feel free to look into the log comments that we fail to capture and see if there are good patterns in them that we could use to improve the data gathering. I'm all for having better data! Cheers, Nettrom (talk) 22:57, 13 February 2018 (UTC)
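For readers following along, the matching order described above (CSD, then PROD, then AfD, otherwise "other", with a wikilink required and a single search() call per regex) might look roughly like the Python sketch below. The three patterns and the classify() name are assumptions written for illustration; they are not copied from deletionreasons.py, whose actual regexes may well differ.

```python
import re

# Hypothetical patterns: each requires a wikilink to the relevant policy page.
CSD_RE = re.compile(r"\[\[(?:WP|Wikipedia):CSD#([A-Z]\d{1,2})\|", re.IGNORECASE)
PROD_RE = re.compile(
    r"\[\[(?:WP|Wikipedia):(?:PROD|Proposed deletion)[\]|#]", re.IGNORECASE
)
AFD_RE = re.compile(r"\[\[(?:WP|Wikipedia):Articles for deletion/", re.IGNORECASE)


def classify(comment):
    """Return one label for a deletion log comment, checking CSD, PROD, AfD in order."""
    match = CSD_RE.search(comment)  # single search: only the first CSD link is used
    if match:
        return match.group(1).upper()
    if PROD_RE.search(comment):
        return "PROD"
    if AFD_RE.search(comment):
        return "AFD"
    return "other"


examples = [
    "[[WP:CSD#A7|A7]]: No credible indication of importance",
    "[[WP:CSD#A7|A7]], [[WP:CSD#G11|G11]]",  # second criterion is not recorded
    "delete per g11",  # free-form text, no wikilink
    "delete per request from User:FooBarya11",
]
for comment in examples:
    print(f"{classify(comment):6} <- {comment}")
```

The sketch shows both trade-offs mentioned above: the username containing "a11" is not mislabelled, but "delete per g11" and the second criterion in a combined rationale are both missed.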
@Nettrom: Thanks. That's encouraging to hear. I did a very quick and dirty test at the weekend and was catching ~93% of the comments as having some criterion in, but that definitely needs more work and was quite a small sample. I'd realised about the problem of catching strings in comments. Working on the plain text as opposed to wikitext is a bit easier as searching for " a11" etc. would probably get around the problem of "User:FooBarya11/baz". Working out how to do SQL queries in python is beyond me at the moment, but I've adapted your query to get a dump of 2018 and will see what I can get out of that. One other thing I noticed having a quick look over it is that redirects are included, when they are not articles. At the moment, you'll be catching some of these as AFDs and others as "other". They could be excluded either with a limit on page size or just excluding any of the comments containing redirects. With double criteria I would also prefer to log each one as half an occurrence, because Twinkle will always put A7 before G11 so you will only record A7, but I can see why you've chosen a simple and robust approach. Smartse (talk) 11:13, 15 February 2018 (UTC)
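A minimal Python sketch of the two ideas above, matching criteria in plain text only when they are not glued to other letters or digits, and logging each criterion in a combined rationale as a fraction of an occurrence, could look like this. The pattern, the restricted set of criterion letters, and the fractional_counts() name are all hypothetical choices for illustration, not code from either repository.

```python
import re
from collections import Counter

# Hypothetical pattern: a criterion code (letter plus 1-2 digits) that is not
# directly preceded or followed by another letter or digit.
CODE_RE = re.compile(
    r"(?<![A-Za-z0-9])([AGRUX]\d{1,2})(?![A-Za-z0-9])", re.IGNORECASE
)


def fractional_counts(comments):
    """Count criteria per comment, splitting one occurrence evenly among them."""
    totals = Counter()
    for comment in comments:
        codes = sorted({code.upper() for code in CODE_RE.findall(comment)})
        for code in codes:
            totals[code] += 1 / len(codes)
    return totals


comments = [
    "A7, G11: promotional",  # two criteria, half an occurrence each
    "delete per g11",  # uncapitalised free-form rationale
    "delete per request from User:FooBarya11",  # "a11" inside a username
]
print(dict(fractional_counts(comments)))
```

This still allows false positives, for example a page title that happens to contain a standalone code, so it would need checking against real log comments before being relied on.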
See https://github.com/smartse88/actrial That's based off the logs this year so far (until a few days ago) so isn't directly comparable to your analysis, but I have classified c. 96 % of the entries into a category. There are some big differences, mainly in the AFD stats, which I think is due to the redirect issue I mentioned above. G5 is also a lot higher as I have included en:Special:Nuke in that category. How come G7 and G8 don't show up at all in your stats? I need to work out how to start plotting it by date and then analyse some pre-ACTRIAL data too. Smartse (talk) 23:45, 17 February 2018 (UTC)

┌───────────────────┘
I'll just outdent this a bit since we're already six indents down here. Thanks for doing this work! I've only had time to have a quick look at it so far, but it sounds like the main concern here is that we can do a better job of catching redirects, and that there are a few categories missing, am I reading that right? I went and looked at the R code I've written for doing the analysis and these two issues are connected. The Python code I linked to earlier categorizes and counts all G-reasons listed on WP:CSD, as well as R2, R3, and X1 (you can find them in the self.stats dictionary of the DataPoint class, lines 41–73). Because R2, R3, and X1 were used to delete redirects, I want to correctly categorize them and then ignore them in the analysis as I want to focus on pages that are more article-like. When I went and looked at some of the data, I also got the impression that G6 and G8 were often used to delete redirects, so I also chose to withhold those from the analysis (and wrote a note about this in the January 19 work log). Catching nuked pages as you mention could be useful. Do you have an idea to what extent Special:Nuke is mentioned in log comments but not caught by what I already have? I am also concerned about redirects being deleted as part of AfDs as you mention. I would expect something like "Redirects for Deletion" to exist for that, but maybe that's not the case? Do you happen to have some examples readily available that I can go look at? Cheers, Nettrom (talk) 16:18, 19 February 2018 (UTC)

No problem. It's an interesting exercise and I had wondered to myself about these kinds of stats for years, so I'm glad to be able to analyse it. Now just to figure out how to plot it! Thanks for the link to the earlier notes about how you're handling redirects, I hadn't seen that. The main problem with the redirects in yours is with entries like Delete redirect: [[Wikipedia:Articles for deletion/Miho Maeshima]] closed as delete ([[WP:XFDC|XFDcloser]]) which you will be categorising as an AFD. That's why I've searched for anything indicative of a redirect - I think the chance of false positives on that is very low (just realised I have R1 in mine, which doesn't exist). G6 and G8 will probably be catching some redirects, although G6 is mainly used for history merges, and you're correct that they shouldn't be counted as article deletions either. Nuke is used a surprising amount (I was logging it separately at first and IIRC it was >5%) and while G5 is mentioned in some of the entries, it is often just the standard text containing "mass deletion of pages". Smartse (talk) 00:23, 21 February 2018 (UTC)
@Smartse: Thanks so much for looking into this and helping identify where we can make some substantial improvements, this is great work! Catching the redirect case for AfDs is particularly useful, I hadn't thought of that. It took me a little digging to figure it out from your example, since the AfD doesn't talk about any redirects that pointed to the article that's proposed for deletion. Once I found that and the deletion log of the redirect, it all made sense. I've added a check for "redirect" to our code, which filters those out from PROD, AfD, and "other". I also added a check for Special:Nuke like you did, and those are added to G5 (unless there's already a reason listed); it looked like there are quite a few comments starting with "Mass deletion of pages" and an explanation. I deleted the old data and just finished gathering it all up to today. I will go through our analyses later today and check how they're affected.
Hope you're able to figure out how to get it plotted! I don't use Excel for this kind of work, instead I live in the R/RStudio world. It has a bit of a learning curve, so if you're comfortable with Excel I'd probably stay with that and google for some ideas. If you wish to dip your feet into the R world, I'm sure I can get you some code to plot a graph. Thanks again for your help with this! Cheers, Nettrom (talk) 18:14, 21 February 2018 (UTC)
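As a rough illustration of the order of checks described in this thread (an explicit CSD reason first, then Special:Nuke counted as G5 when no other reason is listed, then a "redirect" filter applied before PROD, AfD, and "other"), a minimal sketch might look like the following. All pattern and function names are hypothetical, the patterns are simplified assumptions, and this is not the actual data-gathering code.

```python
import re

# All patterns below are simplified assumptions for illustration only.
CSD = re.compile(r"(?<![A-Za-z0-9])([AGRX]\d{1,2})(?![A-Za-z0-9])")
NUKE = re.compile(r"Special:Nuke|mass deletion of pages", re.IGNORECASE)
REDIRECT = re.compile(r"redirect", re.IGNORECASE)
PROD = re.compile(r"\bPROD\b|proposed deletion", re.IGNORECASE)
AFD = re.compile(r"Wikipedia:Articles for deletion/", re.IGNORECASE)


def categorize(comment):
    """Return a single category label for one deletion log comment."""
    csd = CSD.search(comment)
    if csd:
        return csd.group(1)       # an explicit reason always wins
    if NUKE.search(comment):
        return "G5"               # nuked pages with no other reason listed
    if REDIRECT.search(comment):
        return "redirect"         # kept out of PROD, AfD, and "other"
    if PROD.search(comment):
        return "PROD"
    if AFD.search(comment):
        return "AfD"
    return "other"


example = (
    "Delete redirect: [[Wikipedia:Articles for deletion/Miho Maeshima]] "
    "closed as delete ([[WP:XFDC|XFDcloser]])"
)
label = categorize(example)  # "redirect" rather than "AfD"
```

Because the redirect check runs before the AfD check, the log entry quoted earlier in this thread is labelled as a redirect and can then be excluded from the article-deletion counts.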

H13

@Nettrom: Just a quick question on this in terms of why we aren't using more recent data? The current size of the new pages backlog as of my typing is 3602 unreviewed articles. It goes up and down, but the clear trend here is that it's gone down by ~11,000 pages since ACTRIAL started. It's up and down, but as written, the current finding suggests that ACTRIAL had no impact here, which is a concern to me. Thanks for all your work here. TonyBallioni (talk) 16:36, 19 February 2018 (UTC)

@TonyBallioni: Thanks for asking, I was expecting to see a question about the time span at some point. The main reason why we're using data up until mid-November is to have a consistent window across all the analyses we're doing. Some of the final analyses were started in December or January, and the 30-day window in some of them, for example, meant we couldn't use the most recent data at that time. As you've noted, this means we are not picking up the NPP backlog drive.
NPP backlog size
I agree with your conclusion that the backlog has been reduced during ACTRIAL, it's gone down from 14,353 articles to 3,748 as of 18:00 UTC today (about half an hour ago). Since we're discussing H13, keep in mind that it states that the size of the backlog will remain stable. Whether we look at the first two months or up until the present, H13 isn't supported since the backlog hasn't been stable.
With regards to ACTRIAL having an impact on the backlog, I'm struggling to see clear evidence that the trial is a driving factor. About 2,000 articles were removed from the backlog prior to the trial starting. The backlog is further reduced by 1,500 during the first two weeks of ACTRIAL. From then on until the first week of December, the backlog isn't substantially reduced. The analysis of H9 finds a reduction of about 300 patrol actions per day during the first two months of ACTRIAL, and the analysis of H18 finds an average reduction of around 200 speedy deletes per day. There's not a significant reduction in the number of active reviewers, meaning that the ratio of created articles to reviewers has gone down (H10 and its related measure, in the Feb 10 work log with H9). Combining the results for H9, H10, and H18 suggests there should be resources available to consistently reduce the backlog during ACTRIAL, but the major decrease doesn't start until mid-December. We know that patrolling continues to be unevenly distributed work (H11), and I noticed in the discussions on WP:NPPR's talk page and on the WP:ACTRIAL talk page that this is also brought up. After a drive to recruit new patrollers and an initiative to have a backlog drive in January, the major reduction in the backlog starts in early December and continues to the end of the backlog drive.
Combining all of this: there's evidence that suggests that ACTRIAL has freed up patroller resources, but there's not evidence that these resources were used to reduce the backlog. Or am I missing something here? Regards, Nettrom (talk) 19:17, 20 February 2018 (UTC)
I think you make good points on this, and there are probably too many confounding variables to determine what caused the drop (ACTRIAL, backlog drive, more patrollers, combination of all factors?). I suppose my question was in terms of presentation: the overall trend here has been down, so having it end when it goes back to stable levels isn't the full picture of the trial (it's rising to 4000 again, but as I've said on the NPP talk page, I expect fluctuations over time). Anyway, thanks for the well-thought-out response as always :) TonyBallioni (talk) 16:47, 25 February 2018 (UTC)
@TonyBallioni: Ah, it seems to me like we've been discussing two slightly different things. I ended up focusing on the hypothesis itself and the process of determining whether it was supported or not, and not necessarily the broader perspective. We are working on writing up a final report based on our analysis, and that report will definitely discuss the longer term changes such as the massive reduction in the NPP backlog, as it's an important part of what's been going on during the trial. I think it's unfortunate that our analysis window ends just before the backlog reduction started, and hopefully how we cover it in the report will address your concerns. Cheers, Nettrom (talk) 15:36, 2 March 2018 (UTC)

ACTRIAL ending

I just have a quick question. ACTRIAL is scheduled to end on 14 March 2018. After this date, will non-autoconfirmed users automatically be able to create in the mainspace again? Or is that just an informal date for when we plan to revisit the issue? Mz7 (talk) 01:30, 2 March 2018 (UTC)

Hi Mz7, thanks for asking! As far as I understand, once the trial ends non-autoconfirmed users will be allowed to create articles again. This is discussed in the trial duration proposal from 2011, and also mentioned on the ACTRIAL-page on enwiki. Regards, Nettrom (talk) 15:45, 2 March 2018 (UTC)

Request for H10 to be ignored or regarded as P-hacking

The workload for NPPers hasn't dropped, it's just that fewer patrollers are out in force. The main reason I know of is that a large recruiting effort for new patrollers created an uptick, but after a bit, particularly after the January drive ended, the luster may have worn off, and new NPPers just went back to whatever they were doing. The second reason, which is harder to track except by sending out a "why haven't you been patrolling" survey, is busyness. I for one was very busy IRL throughout February (see my en.wiki edit count compared to the rest of my tenure) and all my on-wiki activities suffered, from CVU to STiki to NPP and RCP and PCRing. Therefore I find H10 and its results to be misleading and/or useless in the macro. Thanks, and plz ping me, enL3X1 ¡‹delayed reaction›¡ 13:29, 2 March 2018 (UTC)

Whoops, for clarification, I am saying that the correlation between "workload decrease" and NPP work decrease is fabricated. enL3X1 ¡‹delayed reaction›¡ 13:30, 2 March 2018 (UTC)
@L3X1: Thanks for asking about this! I am unsure whether I understand exactly what your concerns are. H10 hypothesizes that participation in New Page Patrol is to some extent dependent on a perception for a need to review, and that perception is affected by the rate of article creation. We do our best at trying to describe what we would expect to see prior to the trial starting based on what we knew at that point. Then we set out to measure patroller participation and describe what we find. I read through our analysis of H10 again, and do not find that we make any particular claims about causes for the results that we're seeing. Similarly for our short description of the results for H10, it doesn't make claims about the causality.
I take the issue of P-hacking very seriously and am concerned about your labelling of what we're doing here as that. It's assuming bad faith on our part, and suggesting that our methods are questionable, at best. Our hypotheses were written well before ACTRIAL started, and we made several adjustments to them based on feedback from the community (H9–H12 were completely rewritten). We have a well-defined window of analysis, which is the first two months of ACTRIAL (meaning that February 2018 is not part of our analysis; see also the discussion about H13 on this talk page). I don't see that we slice and dice our data in order to find results that are (marginally) statistically significant. Lastly, we don't make causal claims from correlational data, as you point out we'd have to survey new page patrollers if we want to know why they do or do not patrol during ACTRIAL.
I hope I've been able to clarify the process behind H10 and why it's phrased the way it is, and how it relates to what our analysis found. If I misunderstood your concerns, I'm looking forward to discussing them with you so we can figure this out! Regards, Nettrom (talk) 17:06, 2 March 2018 (UTC)
Thanks for the reply, Nettrom. I am sorry if I came across as assuming bad faith/attacking, I did not mean to do that, and have struck the relevant parts. I am reading through the February 10 worklog, but am not sure I am convinced that ACTRIAL is the cause of fewer active patrollers. Thanks, enL3X1 ¡‹delayed reaction›¡ 17:23, 2 March 2018 (UTC)
For the correlation to be accurate, after ACTRIAL ends, the number of patrollers should rise. Will you have to wait 2 or more months in order to check? enL3X1 ¡‹delayed reaction›¡ 17:44, 2 March 2018 (UTC)
Looking at the charts to me it looks like every year there is a 4th quarter decrease, which I would attribute to holidays, vacations, and year end slowdowns. enL3X1 ¡‹delayed reaction›¡ 12:49, 3 March 2018 (UTC)

┌─────────────┘
Hi L3X1! First of all, apologies for not replying to this sooner. I wanted to be able to sit down and spend some time writing a solid reply, and the last week hasn't really had that available. Your comments and questions are most welcome here, so I'm not happy with leaving them unanswered. Secondly, thanks for striking out the P-hacking part, I appreciate that immensely!

Active patrollers per day
Active patrollers bimonthly

Your questions about causation and what's going on with the data for H10 are on point. Let me try to explain some of the things I've been seeing and how they relate to what we find. I'll be adding thumbnails of the various graphs along the way, hopefully that helps make things clearer. The first challenge with this data is the introduction of the reviewer right in late 2016. This right changes the model of reviewing from "anyone can review" to "only those with prior approval can review", and it means that we shouldn't compare older data with current data because the system is so different. In the graphs, we see the effect of this as a significant reduction in the number of active patrollers.

Patrol actions per day

The reviewer right also comes with a second challenge in that it possibly changed reviewer motivation prior to its introduction. Reviewers knew beforehand that the change was coming and that they could get the right through prior activity. We see in the data that the number of active patrollers is quite a lot higher in the fall of 2016 than in 2014 and 2015, perhaps more clearly in the bimonthly graph. If we look at the graph for number of patrol actions, there isn't a similar increase there. In other words, we have a higher number of active patrollers but they do not appear to be doing a lot more work. While we do not know their motivations for certain, we have further reasons to discard the 2016 data as different.

The third challenge is seasonal variations in the data. You correctly point out the fourth quarter decrease, activity on Wikipedia tends to slow down towards the end of the year, particularly in December. There also tends to be increases in the first and third quarters, the graph of active patrollers has this pattern in 2015. When I want to understand if there's a change during ACTRIAL, I want to take this pattern into account, and for many other analyses I do this by using the same date range in previous years as a comparison period. I chose differently for H10, let me try to explain why.
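The "same date range in previous years" approach mentioned here can be illustrated with a small sketch. The function name `comparison_windows` is hypothetical, and the dates below are only an illustrative two-month window, not necessarily the exact window used in the analyses.

```python
from datetime import date


def comparison_windows(start, end, years_back):
    """Return the same calendar window shifted back 1..years_back years.

    Note: date.replace() raises ValueError for 29 February in a
    non-leap target year, so such windows need special handling.
    """
    return [
        (start.replace(year=start.year - k), end.replace(year=end.year - k))
        for k in range(1, years_back + 1)
    ]


# Illustrative window only (an assumption for this example).
windows = comparison_windows(date(2017, 9, 14), date(2017, 11, 14), 3)
for window_start, window_end in windows:
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Averaging a measure over each of these earlier windows gives a seasonally matched baseline to compare against the trial period.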

Patrollers active per day 2016-2017

When we're talking about seasonal patterns, one thing worth noting is that 2017 appears to not have them. The daily patroller graph for the two most recent years makes this perhaps easier to see. There appears to be a slight increase in February and March, but otherwise there's a lot of stability in the trend. As mentioned, 2015 had strong seasonal patterns, and Wikipedia activity tends to have them. If we don't see them in 2017, that could indicate that the introduction of the reviewer right has significantly altered the reviewing system in such a way that the number of reviewers participating no longer follows the seasonal pattern of Wikipedia activity. If that's the case, then we would see a stable number of active reviewers in the fall. Going back to the 2016–2017 graph, that also seems to be the case, the number of daily reviewers is quite stable prior to ACTRIAL starting.

Combining all of this, when I'm trying to understand if there's a change in the number of active patrollers during ACTRIAL, I chose to use the first six months of 2017 as a comparison period because of the stability across that time. Comparing those two periods, we find a significant decrease in the average number of daily active patrollers from 76.7 to 57.1 (or 25.6%).
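For reference, the percentage quoted above follows directly from the two averages. The short sketch below only reproduces that arithmetic; the function name `relative_decrease` is a hypothetical helper, and no statistical test is implied.

```python
def relative_decrease(before, after):
    """Percentage drop from `before` to `after`, relative to `before`."""
    return (before - after) / before * 100


# Average daily active patrollers: comparison period vs. ACTRIAL.
drop = round(relative_decrease(76.7, 57.1), 1)
print(drop)  # 25.6
```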

Forecast graph

Secondly, I looked at using an ARIMA forecasting model, and when training that model I find that it should take seasonal variation into account (it has a yearly component). A challenge with that model is the introduction of the reviewer right, which breaks the time series. That break likely also shifts the magnitude of the variance in the data, and that is undesired when doing this type of model building. I looked into transforming the data, but could not find an approach that seemed to work. Looking at the forecast, we see a very large confidence interval, which I interpret as being caused by the break in the time series. It makes me question whether the model is appropriate for this analysis, so I lean on the first result instead. If I had more time available, I would look into how to handle those types of breaks in the time series for forecasting models.

Wrapping up, I also wanted to mention causality. I cannot conclude from this analysis that ACTRIAL caused a drop in the number of active patrollers. As you've pointed to earlier, we would have to survey the patrollers to understand if that's the case. Instead, I can only observe that the number of active patrollers appears to be lower during the trial than it was before.

I'm curious to see what happens when the trial ends. I'd probably give it a month before I try to draw any conclusions. Thanks again for asking about all of this, I hope I've been able to explain some of the reasoning behind how the results came about. Please do ask if something was unclear, happy to discuss this. Cheers, Nettrom (talk) 17:50, 12 March 2018 (UTC)