New project: Multifactor productivity estimation of Wikipedia growth

This project is mainly statistics gathering plus data processing and econometrics.

I'd like to find a coauthor who knows the Wikipedia data better than I do.

All measures may be available by month, enabling a monthly measure; if not, by-year is okay, and standard for MFP (multifactor productivity).

Estimate output in (say) page-views of all the Wikimedia sites -- a count, converted to an index. Alternative frame: Brynjolfsson and Oh value Internet usage by assigning a dollar value to people's time.^[1]

Construct indexes with estimates of the "labor" going in to this -- possibly just number-of-edits or bytes-edited on main content articles. It might be good to separate professional, paid, and volunteer if this is feasible.

Construct indexes with estimates of the "capital" going in to this. Most feasibly this would be intellectual capital only, as proxied by the bytes available to be served up as pages. In theory it would be good to get metrics of the servers and software, but these are hard to measure and movements in their flow may be closely correlated with the bytes measure.

Unlikely: Estimate of the energy, materials, and services going in -- energy costs? volumes? conferences? Doesn't seem feasible, yet. Just assert these things are highly correlated to the number of bytes/pages being offered.

The project is data-intense and not all the data exists straightforwardly.

The project would match these labor and capital inputs to an electronic/software/service output -- "Wikimedia services". Hopefully we'd show that the productivity of editors is on average going UP to an extent that more than compensates for declines in the number of editors, as the service continues to improve. It improves more than its labor increases mainly because its intellectual capital is improving; also because its physical hardware capital is improving; and because its workforce has better tools.

This project is production economics and estimation. It is speculative. Some estimates may be possible -- of growth accounted for by labor, growth accounted for by capital, and a residual of productivity growth.
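One conventional way to write the decomposition into labor, capital, and a productivity residual is the growth-accounting identity below. This is a sketch under assumptions: the symbols Y (output, e.g. page views), L (labor), K (capital), s (labor's weight), and A (MFP) are notation introduced here, not taken from the project notes. Because volunteer labor has no market price, the weight s would itself have to be assumed or imputed.

```latex
\Delta \ln A_t \;=\; \Delta \ln Y_t \;-\; \bar{s}_t \,\Delta \ln L_t \;-\; (1 - \bar{s}_t)\,\Delta \ln K_t,
\qquad
\bar{s}_t = \tfrac{1}{2}\,\bigl(s_t + s_{t-1}\bigr)
```

The first subtracted term is the growth accounted for by labor, the second the growth accounted for by capital, and the left-hand side is the residual.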

Associated subproject: What obsolescence rate or depreciation rate makes sense for text on Wikipedia sites? More exactly . . . how long is text expected to last before being changed? Or if this is too hard to calculate, what is the distribution of observed ages of text that is replaced? This sort of information will help construct an information-capital measure of the data now on the site.
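As a rough illustration of the second question (the distribution of observed ages of replaced text), the following minimal sketch summarizes a list of ages and converts the mean age into an implied annual depreciation rate. It assumes a constant replacement hazard (an exponential lifetime), and it ignores the fact that text still on the site is censored, so it would overstate depreciation. The function names and the sample ages are hypothetical.

```python
import math
import statistics

def age_summary(ages_days):
    """Quartiles and mean of observed ages (in days) of text when it was replaced."""
    q1, median, q3 = statistics.quantiles(ages_days, n=4)
    return {"q1": q1, "median": median, "q3": q3,
            "mean": statistics.fmean(ages_days)}

def implied_annual_depreciation(ages_days):
    """Annual depreciation rate under an ASSUMED constant daily hazard.

    With exponential lifetimes the hazard is 1 / mean age; the share of
    text replaced within a year is then 1 - exp(-365.25 * hazard).
    """
    daily_hazard = 1.0 / statistics.fmean(ages_days)
    return 1.0 - math.exp(-365.25 * daily_hazard)

# Made-up ages, for illustration only.
sample_ages = [30, 90, 200, 400, 800, 1500]
summary = age_summary(sample_ages)
delta = implied_annual_depreciation(sample_ages)
```

A rate like `delta` could then feed an information-capital measure, but only after checking whether a constant hazard is a reasonable description of the observed age distribution.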

Key next steps: gather data over time on inputs (effort/labor by staff and volunteers, computer services/availability or cost) and, critically, outputs, e.g. pages displayed to readers.
Methods such as the Tornqvist index and computation of log-growth rates are relatively standard and usable.
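The Tornqvist and log-growth calculations mentioned above can be sketched in a few lines. This is a minimal illustration, not the project's method: it assumes two inputs, a labor weight supplied by the analyst (there is no market wage for volunteers), and made-up index values. All function and variable names are hypothetical.

```python
import math

def log_growth(series):
    """Period-to-period log growth rates of a positive series."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]

def tornqvist_input_growth(labor, capital, labor_share):
    """Share-weighted log growth of combined inputs.

    Each period's weight is the average of the labor shares in the two
    adjacent periods, which is the Tornqvist convention.
    """
    g_l = log_growth(labor)
    g_k = log_growth(capital)
    growth = []
    for t in range(len(g_l)):
        s_l = 0.5 * (labor_share[t] + labor_share[t + 1])
        growth.append(s_l * g_l[t] + (1.0 - s_l) * g_k[t])
    return growth

def mfp_growth(output, labor, capital, labor_share):
    """Residual: output log growth minus Tornqvist input log growth."""
    g_y = log_growth(output)
    g_x = tornqvist_input_growth(labor, capital, labor_share)
    return [y - x for y, x in zip(g_y, g_x)]

# Illustrative, made-up index values (not real Wikimedia data).
views = [100.0, 110.0, 121.0]       # output index, e.g. page views
edits = [100.0, 100.0, 95.0]        # labor index, e.g. edits
stored_bytes = [100.0, 105.0, 110.0]  # capital index, e.g. bytes available
share = [0.6, 0.6, 0.6]             # ASSUMED labor weight

residual = mfp_growth(views, edits, stored_bytes, share)
```

With monthly data the same functions apply unchanged; only the length of the series differs.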

Sub-project supporting both projects above: Categorize editors into roles/occupations

Use data from Quarry or some random sample to classify the workers or labor effort going in to vandal-fighting, copy-editing, .... etc
Estimate occupations, hours, effort, edits, or something associated with these roles/jobs.
For productivity increases induced by platform-creation, can perhaps use micro model in Meyer (2007).
Can perhaps get session durations a la: Research:Measuring editor labor hours and/or Research:Activity session
https://en.wikipedia.org/wiki/Wikipedia_talk:Labels/Edit_types/Taxonomy
https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_types/Taxonomy
https://meta.wikimedia.org/wiki/Research:Automated_classification_of_edit_types/Taxonomy
https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Labels/Edit_types&curid=48415252&diff=699643375&oldid=699483962

Catch up on the Research Newsletter and the Wikipapers list
Contact Erik Zachte with update

On the importance and value of the question

Two major-league economic statisticians have recently mentioned the issue that free digital goods, including Wikipedia, are not measured in GDP:

Erica Groshen's presentation at the BEA Advisory Committee, May 2017, noted the GDP measurement problem of the absence of measures of free nonmarket services, notably Wikipedia. Erica L. Groshen; Brian C. Moyer; Ana M. Aizcorbe; Ralph Bradley; David Friedman. How Govt. Stats Adjust for Potential Biases from Quality Change and New Goods in an Age of Digital Technologies. Their related paper (in the Journal of Economic Perspectives, Spring 2017) refers to Wikipedia in the same way.
Hal Varian's presentation to FESAC, the advisory committee on economic statistics agencies: Measurement Challenges in High Tech, FESAC meeting at Census, 9 June 2017
Measures of productivity can get us toward sophisticated measures of output change from year to year

There is a substantive Wikipedia issue of interest also -- can we diagnose the degree to which it is a disabling problem that there is a declining number of editors on English Wikipedia?

Labor and bot edits

Source info on what bots do, how much they do, and their purposes.

R. Stuart Geiger; Aaron Halfaker. 2013. When the Levee Breaks: Without Bots, What Happens to Wikipedia’s Quality Control Processes? WikiSym '13: Proceedings of the 9th International Symposium on Open Collaboration, Aug 05–07, 2013, Hong Kong, China. Article No. 6. ACM, New York, NY, USA. ISBN: 978-1-4503-1852-5. doi:10.1145/2491055.2491061. On WikiPapers.
ooh! data from Erik Zachte: https://stats.wikimedia.org/EN/BotActivityMatrixCreates.htm
https://en.wikipedia.org/wiki/Wikipedia:Time_Between_Edits -- measures of edit pace, without weights on the edits
Labor: there are efforts to measure or impute the length of an editing session (based on when there is an observed change from the same IP address). Edit session length would be excellent. Follow up on Geiger and Halfaker's work. If not, at the very least we could do some sort of estimation based on page length (bytes -> words -> number of words/minute -> time).
We can use Quarry or other sources to get measures of labor like: how many edits were made in a month; how many bytes were changed, roughly; how many pages were changed; how many editors there were; how many edits were vandalism or responses to it; etc. Capital is harder.
Also saw a paper examining the decline in content additions to Wikipedia (by Jerry Kane at BC).
For labor, again a measure of time spent editing (from clicking edit to clicking submit)
On bots: Torsten Kleinz Guardians of Global Knowledge: How Automated Helpers Protect Wikipedia. Ethics of Algorithms site. 8 Nov 2017.
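The session-length idea in the labor notes above can be sketched as follows: group edit timestamps by editor, treat a long gap as the end of a session, and sum the within-session time. This is a minimal sketch under assumptions. The one-hour inactivity cutoff and the five-minute allowance for the work preceding a session's first save are illustrative parameters chosen here, not values taken from these notes, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=1)       # ASSUMED inactivity cutoff between sessions
FIRST_EDIT_PAD = timedelta(minutes=5)  # ASSUMED time spent before a session's first save

def session_hours(edits):
    """Impute labor hours per editor from (editor, timestamp) pairs."""
    by_editor = defaultdict(list)
    for editor, stamp in edits:
        by_editor[editor].append(stamp)

    hours = {}
    for editor, stamps in by_editor.items():
        stamps.sort()
        total = FIRST_EDIT_PAD  # allowance for the first session's first edit
        for prev, cur in zip(stamps, stamps[1:]):
            gap = cur - prev
            if gap <= SESSION_GAP:
                total += gap             # same session: count the elapsed time
            else:
                total += FIRST_EDIT_PAD  # new session: count only the allowance
        hours[editor] = total.total_seconds() / 3600.0
    return hours

# Made-up timestamps, for illustration only.
sample = [
    ("A", datetime(2017, 6, 1, 10, 0)),
    ("A", datetime(2017, 6, 1, 10, 20)),
    ("A", datetime(2017, 6, 1, 10, 50)),
    ("A", datetime(2017, 6, 1, 15, 0)),
    ("B", datetime(2017, 6, 1, 9, 0)),
]
hours = session_hours(sample)
```

Summing these imputed hours by month would give one candidate labor index; the bytes -> words -> minutes fallback would replace the gap-based time with a size-based estimate.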

Capital measures

Capital: idea: use estimates of the resources required to maintain a database. We can see how different that looks as a capital measure from a bytes-stored measure. We may be able to find some estimate of the resources required to maintain a database of X GB and then scale that up to the whole size of all their data.

Online output / value-added concepts and metrics

Sources:

https://meta.wikimedia.org/wiki/Research:Measuring_value-added
https://meta.wikimedia.org/wiki/Research:Productive_edit
https://meta.wikimedia.org/wiki/Research:WikiCredit
GDP effects of Facebook etc. "In this paper, Soloveichik and Nakamura introduce an experimental GDP methodology which includes advertising-supported entertainment like Facebook in final output as part of personal consumption expenditures. They then use that experimental methodology to recalculate measured GDP back to 1998. Including 'free' apps in measured GDP has almost no impact on recent growth rates. Between 1998 and 2012, real GDP growth rises by only 0.009% per year. The researchers then recalculate total factor productivity (TFP) growth when free apps are included as both final output and business inputs. For example, Google Maps would be counted as final output when it is used by a consumer to plan vacation driving routes. On the other hand, the same website would be counted as a business input when it is used by a pizza restaurant to plan delivery routes. Measured TFP changes for both media companies and the rest of the business sector. Internet publishing companies are producers of free apps, so including free apps in the input-output accounts raises their TFP growth by 1% per year. The rest of the business sector uses free apps, so including free apps lowers their TFP growth. The net impact is an increase in business sector TFP growth of only 0.004% per year." ^[2]
Relatedly, Byrne et al 2016.^[3]
What level are views tracked at (daily, monthly, etc.)?
Can weight pages by article length, presence of footnotes, pictures, good/featured article status, or links to other pages. Anything that gives us something beyond just page views will be helpful. What can we get from Quarry?
alternatives: operate at the individual article level, or at the more aggregate Wiki-wide level. Wiki-wide is more straightforward. Article level gives us more granularity – although I'm not sure how much more insight it gives us. For now, it probably makes sense to start with aggregate wiki-wide so we can get some calculations going.
I usually use Python for scraping, R for data manipulation and Stata for running regressions. If we start with some simple calculations at the wiki-wide level, then it may make sense to just pull together some spreadsheets (as you suggest) and use Stata. Either way: do a small sample first to make sure we know exactly what we want to capture.
Views as output: See Quarry. There are measurement problems; there was a big unexplained decline in the number of views a while back. It might have had something to do with caching (e.g., by Akamai), or as you mention the interceding by Google with a paragraph answering people’s questions, or something to do with the measurements by Alexa or someone. If so, that garbles one of our variables, and we don’t know the fix. If the easiest-to-get version of view counts is not quite what we want, the WMF is probably willing to help. I know some of the people who manage those data flows.
Quarry for data on edits: We get a LOT of access through their amazing “Quarry” search service described here: https://meta.wikimedia.org/wiki/Research:Quarry. It’s available here to anyone with a Wikimedia account: https://quarry.wmflabs.org/. The SQL queries go to giant databases of most past edits – different databases by language – with the edit sizes and content. The front end makes query strings used by OTHER people available, so one can copy queries that work. The underlying database is not very complicated, but SQL queries get messy. I’ve seen SQL front ends before and thought this one was quite wonderful, with these “social” shared aspects. We can work together to craft queries and rework and copy them online. And yes, it is conceivable we could get access to other data too. I’ve discussed this a couple of times with people at the WMF. First let’s see what we can get from this source before asking for anything unique. And there’s a big-data management problem here – samples could include millions of edits. (So we might make do with estimates from samples for a while.)
Time shift: Google’s decision to include Wikimedia blurbs. They include Wikipedia output, but aren’t served from/by Wikipedia. (Send Frank (a) my pdf presented to SPSQ, (b) ask about "individual article level", (c) point out error in blurbs re Richard Seaman)
Centrally we need output measures. We can get some version of monthly output views from here: https://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm It’s not easy to understand, but it looks like we could get MONTHLY views starting in March 2008. We can get that for English-only, and for all-languages together – two different output streams, for which overlapping capital and labor were used. It looks like there were sharp declines starting in May 2014 and April 2015. Maybe that’s the Google blurbs right there. Some of the data issues are discussed here: https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics
views as output; a second measure: views x content length
Google started showing Wikipedia blurbs (rather than just links) for many of its search results. The decision to do this (although not necessarily the process used to decide when to do it) was fairly public and happened over a finite period of time. Even if we don't know the process, we could correlate with Google Trends. This would lead to an exogenous shock in people clicking on Wikipedia links and therefore an exogenous shift in 1) demand and 2) the likelihood of edits (assuming more views lead to more edits, even if the conversion rate is small). I'm not sure we need to do this, but at the very least we could examine some subquestions that might be interesting (how much did Google change demand, and therefore productivity, of Wikipedia, etc.).
Google blurbs: note error in Richard Seamans case
Frame the question being answered as: "How do you measure productivity in crowdsourcing?" or "How to measure productivity for public digital goods?"
output streams for MFP estimation: [1] which cites Research:Characterizing Wikipedia Reader Behaviour which cites more. These are output streams which could be combined into an output index.
a productivity claim associated with translation tool: Wikimedia Blog/Drafts/Content Translation Tool Boosts WikiProject's Productivity 17%
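The two output measures noted above (views, and views x content length) can be turned into indexes with very little code. This is a minimal sketch with made-up numbers; the function names, the choice of the first period as base, and the use of bytes as the length weight are assumptions for illustration only.

```python
def to_index(series, base_period=0, base_value=100.0):
    """Rescale a positive series so that the base period equals base_value."""
    base = series[base_period]
    return [base_value * x / base for x in series]

def weighted_views(views_by_page, length_by_page):
    """Second output measure: views times content length, summed over pages."""
    return sum(views * length_by_page[page]
               for page, views in views_by_page.items())

# Made-up monthly view counts (not real Wikimedia data).
monthly_views = [200, 220, 210]
view_index = to_index(monthly_views)

# Made-up per-page views and lengths in bytes for a single month.
views = {"Article A": 10, "Article B": 5}
lengths = {"Article A": 1000, "Article B": 200}
weighted = weighted_views(views, lengths)
```

Computing `weighted_views` for each month and passing the results to `to_index` would give the second output index; other weights (footnotes, featured-article status) could be swapped in for length the same way.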

Info on admins and chapters

Wikimedia chapters conference

Effects of google blurbs

References

[1] Erik Brynjolfsson; JooHee Oh. The Attention Economy
[2] Leonard Nakamura; Rachel Soloveichik. Capturing the Productivity Impact of the ‘Free’ Apps and Other Online Media. February 18, 2016. Summary from NBER's list of presentations at March 4 2016 Economics of Digitization conference.
[3] David M. Byrne; John G. Fernald; Marshall B. Reinsdorf. Does the United States have a productivity slowdown or a measurement problem? Brookings Papers on Economic Activity | March 4, 2016. (short version, which links to full version)