
Research talk:Template A/B testing

From Meta, a Wikimedia project coordination wiki

I've been working on this for quite some time, but the en.Wiki project seems to be largely dormant. I have all sorts of sandbox messages (in my user space and in template sandboxes) I use instead of the bitey, TLDR templates, most of which in my opinion concern CSD, PROD, COI, and warnings where users have in GF recreated articles that duplicate existing topics, redirects in mainspace, etc. Some of the current standard templates have nearly as much text as half a page of a paperback; editing them down is almost impossible for a non-programmer due to all the parser function calls (embedded transclusions) that can't be located. There are a lot of templates listed in Twinkle that are hardly used, making the list so long that it is often quicker to type a custom message. Templates need to be user-centric, but much of our template prose does not appear to have been composed with that in mind. It's worth considering that there are three distinct classes of user to be addressed: GF users who simply can't get through the maze of text walls of policies and guidelines, those who are not native English speakers, and those who have absolutely no interest whatsoever in submitting serious new articles and edits. --Kudpung 03:34, 17 October 2011 (UTC)


I don't understand: are you assuming that all templates are substituted? This isn't always the case. Does something change if they're not? (I don't understand exactly how this is going to work.) Nemo 21:20, 21 November 2011 (UTC)

So this method does require substitution if you want to randomly select a template for use in real time. However, there are some ways to A/B test things with transcluded templates; it just depends on what they are. For example: on the English Wikipedia we're working on testing things for shared/dynamic IP addresses, including the effect of regular archiving and of different header templates that denote what the shared address is. These are not substituted, but because there is a category for all these IPs, you can do a comparative test by just splitting them in half and applying different templates to each group. As long as you can figure out a way to generate two or more groups to compare in a controlled way, you can get reliable data. Steven Walling (WMF) • talk 21:45, 21 November 2011 (UTC)
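The split-in-half design described above can be sketched roughly as follows. This is a hypothetical illustration only, not code from any WMF tool; the function name and the `IP_*` page names are invented for the example, and a fixed seed is used so the group assignment is reproducible.

```python
import random

def split_into_arms(pages, seed=0):
    """Randomly split a list of pages (e.g. members of a shared-IP
    category) into two groups of near-equal size for an A/B comparison."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    shuffled = pages[:]        # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Example: split ten hypothetical shared-IP talk pages into two arms,
# then apply template A to one group and template B to the other.
control, treatment = split_into_arms([f"IP_{i}" for i in range(10)])
```

The key point, as noted above, is only that the assignment to groups is made in a controlled way before the templates are applied, so the two arms are comparable.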

False positives

I was surprised to see that 347, or nearly 15%, of warnings were for "goodfaith" edits. This struck me as too high for edits identified as vandalism, and I thought Huggle was only used to deal with vandalism (though I may be out of date there). If an individual rollbacker had a false positive rate as high as 15%, we'd take away rollback, so I don't see how the average could be that high or even close. So these could be the ones really worth further analysis. For example, if these are goodfaith editors making newbie mistakes such as changing between AD/BC and BC/BCE or between British and American spelling, then getting the right message to them could make a big difference. Better still, if we can quantify this and establish that these are big enough problems to be worth serious action, we could revive Strategy:Proposal:More multi dialect wikis. I'm still more than half convinced that it is our support of varieties of English other than American English that accounts for our relatively low US readership vis-à-vis the UK or Canada. Of course, nowadays such reversions may be caught by the edit filter and never even reach the wiki. WereSpielChequers 18:20, 13 January 2012 (UTC)

To clarify: the "goodfaith" group was not defined in our qualitative assessment as simply "should not have been reverted", so it's not correct to say they were false positives and that Hugglers did such a bad job. Everyone who did this coding was aware that while there may have been false positives, every editor has a different idea about what should be reverted or not for quality reasons, and there can be good faith incidents like spelling or date disagreements, as you mentioned. So what the "goodfaith" group does mean is that every contribution by those folks was very clearly at least trying to improve Wikipedia in an unbiased way, even if they arguably should have been reverted depending on the situation. Two good examples are: this and this. Steven Walling (WMF) • talk 19:47, 13 January 2012 (UTC)
I don't think that either of those should have been treated as vandalism. Can you tell me whether each was reverted using undo or rollback, and in either case what the edit summaries and any talkpage messages were? Also, can you confirm that these were Huggle reverts using rollback, or did you look at all reverts, including those that would count as "normal editing"? We currently do have editors who have given up on the {{fact}} template and will simply undo certain edits with an edit summary of "revert unsourced"; I'm fairly sympathetic to that if the reverter is reverting to something they know is sourced and different. In any event, I think these edits are worth focussing on, because this could well be where we find the editors we are driving away. WereSpielChequers 20:33, 13 January 2012 (UTC)
It'd take a little bit of detective work to find those exact revision ids again, but we can try if you really want. Either way, we see that kind of revert all the time in Huggle, especially if the author was an IP address. Steven Walling (WMF) • talk 21:35, 13 January 2012 (UTC)
I find that worrying; on the other hand, this is one area where we could potentially improve things. Efficiency gains in processing vandals are going to be minor, as we are already pretty efficient there, and no-one realistically plans to recruit former vandals in any numbers. But I would suggest we drill further into this subset. Of course, part of the risk is in looking at edits individually. If someone has been editing in bad faith, then after a certain point everything they do may get reverted, even if some edits look plausible at first glance. Another risk is that without the edit summary you don't know whether a superficially goodfaith edit was reverted with an edit summary of "Sorry, but that paragraph you keep adding is still a copyvio even if you change the spelling to American English". WereSpielChequers 22:58, 13 January 2012 (UTC)
Rather than dig through for the original and now stale ones, can you give me some recent ones, if you are still monitoring this sort of thing? I'd like to know whether people are misusing rollback, and also whether other edits put these into context. WereSpielChequers 18:12, 23 January 2012 (UTC)

Vandal testing

One variable you might find worth testing in the vandal warnings is grey hair, the theory being that few things are more off-putting to an adolescent than realising they are playing a game with a grandparent. RHaworth's userpage is a case in point. This may be one area where an icon or picture might pay dividends, at least in terms of deterring the vandal at an earlier stage. WereSpielChequers 15:10, 16 January 2012 (UTC)

Research: namespace?

Is there a reason for this page not to be in the Research: namespace? --Eloquence (talk) 18:59, 3 April 2012 (UTC)

We don't really think of it as a research project. Other than that, no reason in particular. Steven Walling (WMF) • talk 20:11, 3 April 2012 (UTC)
Then perhaps it shouldn't have "Wikimedia research projects" in the infobox. ;-) --Eloquence (talk) 20:38, 4 April 2012 (UTC)
Yeah, that's confusing. Nuked! Steven Walling (WMF) • talk 20:48, 4 April 2012 (UTC)
I second Eloquence, especially since this is now back in the E3 experiment list with its own template. How is A/B testing not research? --DarTar (talk) 16:57, 9 May 2012 (UTC)