Research talk:Trending articles and new editors
Tuesday, June 28th
- Planning to analyze the page view bursts and new editors within the range of 15-30 days.
- Even at this size, the transfer and processing will take half a day or more.
- Extracted bursts from the Wikistats logs for the first half of January 2011.
- Currently using the following heuristics to detect a burst in page views:
- the burst period must have 20 times more page views than the previous time range,
- the page views are smoothed by averaging each value with the values from the surrounding 13 hours.
--whym 04:23, 29 June 2011 (UTC)
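The two heuristics above could be sketched roughly as follows. This is only an illustration, not the actual detection script: the function names, the centered 13-hour window, and the choice to compare each hour with the same hour 24 hours earlier are all assumptions, since the log does not say how "the previous time range" is defined.

```python
def smooth(counts, window=13):
    """Centered moving average over `window` hours (truncated at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


def find_burst_hours(counts, ratio=20.0, lag=24, window=13):
    """Return the hour indices whose smoothed page views are at least
    `ratio` times the smoothed page views `lag` hours earlier.

    `lag=24` is an assumed stand-in for "the previous time range".
    Hours with a zero baseline are skipped (another assumption).
    """
    smoothed = smooth(counts, window)
    bursts = []
    for t in range(lag, len(smoothed)):
        baseline = smoothed[t - lag]
        if baseline > 0 and smoothed[t] >= ratio * baseline:
            bursts.append(t)
    return bursts


# Hypothetical hourly page views: two quiet days, then one very busy day.
hourly_views = [10] * 48 + [1000] * 24
burst_hours = find_burst_hours(hourly_views)
```

Because the smoothing is centered, the detected burst starts a few hours before the raw counts jump; a trailing window would avoid that at the cost of a delayed detection.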
Wednesday, June 29th
- Made minor fixes to the burst detection script; it now seems to work well.
- Because the dataset is huge, I am limiting the scope of the analysis to 16,000 randomly selected articles.
--whym 03:33, 30 June 2011 (UTC)
Thursday--Friday, June 30th--July 1st
- Found that my dataset contains a very small number of newbie edits to 'trending' articles (only 16 out of 600+ edits). Possibly because those trending articles were protected/semiprotected.
- In contrast I have seen many IP edits that seem to be made by newbies.
--whym 20:16, 5 July 2011 (UTC)
- It is difficult to be certain: as well as some longstanding IP editors, I've also known some burned-out Wikipedians who withdraw from wikispace but still do some IP editing. But yes, my impression is that a lot of goodfaith IP edits are by newbies. If you are interested, there are two related lines that you might want to investigate. One is an aspect of the pending changes "trial" on the English Language Wikipedia. During the process, which ran for much of 2010 and into 2011, there were hundreds of semiprotected articles that were opened up to IP and newbie editing but still monitored by the pending changes process. The trial is now over and those articles are now either unprotected or semiprotected, but it would be useful to have a neutral researcher look at what actually happened. To be fair, I should warn you that pending changes was a very contentious trial, and it is hard to know whether these articles were meaningfully more open when in some cases the notoriety that attracted vandals has long ebbed. A much less contentious area that I started last summer is the Death anomalies process. This is a tool and related informal project that looks at death anomalies between different language versions, with ten different projects now extracting reports. I sometimes deal with anomalies on the English language report, and it is certainly my experience that the death of a notable person often results in one or more newbies updating their article (typically this means looking at the article and seeing that since the death newbies and/or IPs have added the death date, but perhaps it needs a reliable source and some formatting, as well of course as changing the category from "living people" to "died in 2011").
So if by "trending articles and newbies" you mean "real world events that prompt new editors to come to Wikipedia" this would be a good place to look - and if you can tell me anything about how my baby is running in any of those other languages I'd be very interested. WereSpielChequers 10:13, 6 July 2011 (UTC)
Tuesday--Wednesday, July 5th--6th
- I have been trying to get a meaningfully large number of samples of newbie edits in trending articles.
- Also tried detecting spikes by least-squares fitting, and it looks better than using a simple relative difference.
- Trying to see if newbie edits are really so few in trending articles. With different detection mechanisms, different threshold values and time frames, it seems to be true. I'm worried that any tendency observed in so few newbies will not be statistically meaningful.
- A further breakdown of anonymous edits could be interesting to see. I'll be looking at the edit counts and the length of edit history of the anonymous editors who edit trending articles.
--whym 23:06, 7 July 2011 (UTC)
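One way to read "detecting spikes by least-squares fitting" is sketched below. The log does not say what was fitted, so this is an assumption throughout: a straight line is fitted to the preceding hours and an hour is flagged when it lies far above the line's extrapolation. The function names, the 24-hour history and the threshold are hypothetical.

```python
def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept). Needs at least two points."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def spikes_by_fit(counts, history=24, threshold=5.0):
    """Flag hour t when counts[t] exceeds the line fitted to the previous
    `history` hours by more than `threshold` times the RMS residual of
    that fit (floored at 1.0 so a perfectly flat history still works)."""
    spikes = []
    for t in range(history, len(counts)):
        past = counts[t - history:t]
        slope, intercept = fit_line(past)
        predicted = slope * history + intercept
        residuals = [y - (slope * x + intercept) for x, y in enumerate(past)]
        rms = (sum(r * r for r in residuals) / history) ** 0.5
        if counts[t] - predicted > threshold * max(rms, 1.0):
            spikes.append(t)
    return spikes


# Hypothetical hourly page views with a single spike at hour 40.
views = [100 + (i % 3) for i in range(48)]
views[40] = 1000
spike_hours = spikes_by_fit(views)
```

Compared with a fixed relative difference, a fit of this kind adapts the threshold to how noisy the recent history is, which may be why it looked better in practice.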
Further analysis and debug
- Found a bug in retrieving relevant revisions. The results as of today mistakenly include some older revisions from before the trending hours. I will be fixing this soon.
- Implementing User:WereSpielChequers and User:Buickmackane's suggestion to find whether the article was semi-protected at the time of each revision. This will also be included in the next update.
- Another interesting aspect is to find how many of the experienced registered users' edits were reverts (suggested by User:Drdee).
--whym 04:30, 15 July 2011 (UTC)
Late July--Early August
- To look into how people who contributed to trending articles continued editing, I added a column showing each editor's edit count between 120 days and 210 days after each edit. I also retrieved older page view counts and looked at the historical change. Taking both together, there seems to be a drop in retention between 2010 and 2011 (although it might be too early to judge, since I only have 3 data points: January 2009, January 2010 and January 2011). The absolute number of newbies editing trending articles is also much smaller in 2011, probably because semi-protection of trending articles was much more frequent. I will be updating the sprint page with new plots. --whym 03:49, 10 August 2011 (UTC)
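The retention column described above (edits made between 120 and 210 days after a given edit) might be computed along these lines. This is a sketch under assumptions, not the code actually used: the function name is hypothetical, the window is treated as inclusive at both ends, and the editor's full list of edit timestamps is assumed to be available and sorted.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def later_edit_count(edit_time, user_edit_times, start_days=120, end_days=210):
    """Count the editor's edits made between `start_days` and `end_days`
    after `edit_time` (both bounds inclusive).

    `user_edit_times` must be a sorted list of datetime objects.
    """
    window_start = edit_time + timedelta(days=start_days)
    window_end = edit_time + timedelta(days=end_days)
    first = bisect_left(user_edit_times, window_start)
    last = bisect_right(user_edit_times, window_end)
    return last - first


# Hypothetical editor: one edit to a trending article, then a few later edits.
trending_edit = datetime(2011, 1, 10, 12, 0)
history = sorted(
    trending_edit + timedelta(days=d) for d in (0, 100, 120, 150, 210, 211)
)
retained_edits = later_edit_count(trending_edit, history)
```

A count of zero for an editor would then suggest that they had stopped editing by that period, at least under that account.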
Before I was an admin, but not at the start of my wiki career, I remember being online for the evening when Sarah Palin was unveiled as John McCain's running mate. The article was incredibly busy: I think we hit 25 edits a minute at one stage, with several sections having edit wars. Then it went from semi-protected to protected and I could no longer edit it. This article went through all three common stages of becoming more difficult for newbies to edit:
- The sheer level of activity meant that edit conflicts were as common as successful edits, and this would differentially exclude newbies who hadn't yet learned that when an article involves currently breaking news you can't successfully edit the whole article, you can only edit individual sections. We could probably mitigate this with a function that disabled the Edit button and only allowed people to edit sections. But I would anticipate a very different result for articles which go completely off the scale in activity as opposed to the 99.99999% of articles. In other words, we have 3.6 million articles and usually none of them are quite that hyperactive, but the 1 in ten million scenario of a major breaking news story is a very different editing experience to the norm. Also, it may be 1 in 10 million articles, but in terms of readers and potential editors it is many orders of magnitude greater. I suspect that editors who try to edit a few times but keep getting edit conflicts may be amongst those who give up on us; this may be measurable.
- Semi-protection is a level of protection that stops editing by newbies and IP editors. It only applies to a tiny proportion of articles, but that includes a significant number of high profile public figures. The drawback of semi-protection is that it may be offputting to newbies who start editing an article related to a trending topic, then find themselves unable to edit even though they were editing in good faith. Again, this should be measurable and there is a potential mitigation: admins can grant autoconfirmed status to such goodfaith newbies. But I'm not aware that this is often considered in such circumstances.
- Full protection. This was the thing that eventually drove me from the Sarah Palin article that evening. Full protection restricts editing to admins, and this cuts out a range of editors, from fairly new editors who've just been autoconfirmed to some very longstanding editors who aren't admins.
I would anticipate that each of the above will result in a very different experience for the newbies concerned. A separate variable will be the degree of contention: a breaking news story about a tsunami is likely to be quite collaborative, http://en.wikipedia.org/wiki/Gaza_flotilla_raid and many other breaking news stories rather less so. My suspicion is that many editors start in the middle of battlegrounds such as http://en.wikipedia.org/wiki/Talk:Murder_of_Meredith_Kercher and find that Wikipedia is quite hierarchical: we admins with our NPOV and insistence on sourcing and a global view are a great restriction on their desire to get the truth onto the web. WereSpielChequers 14:57, 5 July 2011 (UTC)
Edits per hour
Re "trending articles get 7 edits per hour whereas non-trending articles get 8.5 edits per hour in average in my dataset". Across the whole of EN wiki we get ten million edits every 50 days or so, allowing for edits outside article space that gives us an average of about one edit per article per month, individual articles can be many times more active - en:Sarah Palin hit 25 edits a minute at one point. But a group of articles with 7 or 8.5 edits per hour are much more active than the average. Could these be pageviews or were these preselected unusually popular articles? WereSpielChequers 11:19, 11 July 2011 (UTC)
- I should have provided edits per article per hour instead of edits per hour. The strangely high number of edits in non-trending articles was due to the fact that the figures were not normalized by the number of articles. When normalized by the number of articles, it was 0.264 edits for trending articles, and 0.0132 edits for trending and non-trending articles on average. It might not be very accurate due to the skewed way I sampled the dataset. The dataset only contains articles that appear in the page view counts, i.e., articles with no page views are out of scope. Does this sound reasonable? --whym 19:04, 12 July 2011 (UTC)
- Very reasonable, but I think it would help to have a line explaining that. I'll put in my interpretation, but if you could correct me and maybe put some figures in it would be helpful. I suspect that you'd find that goodfaith newbie edits to the vast majority of articles that are not in this study are unlikely to be reverted. WereSpielChequers 21:58, 13 July 2011 (UTC)
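The normalization discussed in this thread can be illustrated with a short sketch. The function name is hypothetical. The site-wide figures are the rough ones quoted above (ten million edits per 50 days, 3.6 million articles), and since they count all edits rather than only article-space edits, the result is an upper bound on the article-space rate.

```python
def edits_per_article_per_hour(total_edits, n_articles, n_hours):
    """Normalize a raw edit count by both the number of articles
    and the length of the observation period in hours."""
    if n_articles <= 0 or n_hours <= 0:
        raise ValueError("n_articles and n_hours must be positive")
    return total_edits / (n_articles * n_hours)


# Rough site-wide rate from the figures quoted in this thread.
# All namespaces are included, so this overstates the article-space rate.
sitewide_rate = edits_per_article_per_hour(
    total_edits=10_000_000, n_articles=3_600_000, n_hours=50 * 24
)

# The same rate expressed per article per 30-day month.
sitewide_per_month = sitewide_rate * 24 * 30

print(f"{sitewide_rate:.5f} edits per article per hour")
print(f"{sitewide_per_month:.2f} edits per article per month")
```

On these figures the site-wide rate is well below the 0.0132 reported for the sampled dataset, which fits the point that the sample excludes articles with no page views.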