Grants talk:IdeaLab/Wikipedia Metrics for Institutions

From Meta, a Wikimedia project coordination wiki

Merged proposal

I made a similar proposal and I am merging my content here and deleting that proposal.

Grants:IdeaLab/Audience metrics for small groups of articles
What is the problem you're trying to solve?

Organizations will not commit their resources to developing Wikimedia projects unless they have proof that Wikimedia projects help them meet their own goals. One very common goal is "dissemination of content", which Wikimedia projects also do, reaching a greater audience at lower cost than all other options when the dissemination is to happen online. However, many organizations fail to recognize that Wikimedia projects can be used in this way, because analytics reports covering Wikimedia projects as a whole are difficult to put into the context of any one entity's focused interest.

If there were some way to show an analytics report that was relevant to a particular organization, then that organization could more easily recognize that Wikimedia projects reach their audience, and they would be more persuaded to contribute.

What is your solution?

A tool should be created which does the following:

  1. A prompt encourages a user to make a list of Wikimedia content pages, starting with articles on English Wikipedia
  2. A prompt asks a user to give a time range
  3. Given that input, the tool checks http://stats.grok.se or wherever data is stored, and gets a pageview count for each of those articles within that time range
  4. The tool outputs the traffic for each article, summarized by month, along with a total for each article over the whole time range and a grand total for all articles over that range
  5. This data forms the basis of that organization's impact report on how Wikimedia projects helped them achieve their communication goals
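
Steps 3 and 4 could be sketched roughly as follows. This assumes the tool has already fetched daily counts for one article as a mapping of ISO date strings to integers; that input shape, the function name `summarize_by_month`, and the sample figures are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def summarize_by_month(daily_views, start, end):
    """Return ({'YYYY-MM': views}, total) for days from start to end inclusive.

    daily_views maps ISO date strings ('YYYY-MM-DD') to integer view counts.
    This input shape is an assumption, not a documented format.
    """
    monthly = defaultdict(int)
    for day, views in daily_views.items():
        # ISO dates sort lexicographically, so string comparison is safe here.
        if start.isoformat() <= day <= end.isoformat():
            monthly[day[:7]] += views
    return dict(monthly), sum(monthly.values())


# Made-up numbers, for illustration only.
sample = {"2014-04-30": 7, "2014-05-01": 10, "2014-05-02": 12, "2014-06-01": 5}
monthly, total = summarize_by_month(sample, date(2014, 5, 1), date(2014, 7, 31))
print(monthly, total)
```

Running the same function over every article in the list and adding up the per-article totals would give the sum for all articles.
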
Goals
  • Organizations use the tool to track the traffic to Wikipedia articles
    • Wikipedia content contribution and article development become a standard communications practice in organizations for sharing information aligned with Wikimedia community values
    • Organizations begin reporting in communications conferences that they use Wikimedia traffic reports in a way analogous to how they report Twitter Tweets, Facebook likes, and other kinds of pageviews
  • WikiProjects adopt these tools to build community enthusiasm for subsets of articles in the WikiProject
  • More people begin tracking trends based on their own personal interest, and monitor readings which are set up and displayed on their userpage

Blue Rasberry (talk) 16:16, 25 November 2014 (UTC)

I created another version of this page to outline a tool that might be of use to GLAM institutions. Grants talk:IdeaLab/Wikipedia Metrics for Institutions/GLAM. OR drohowa (talk) 18:53, 18 December 2014 (UTC)

Technical design

A tool is designed which has input fields including the following:

  • accepts the URL of a Wikipedia page
  • accepts a start date
  • accepts an end date

Given these three inputs, the tool checks the existing pageview counting tool at http://stats.grok.se and returns a single number, which is the sum of all pageviews of all Wikipedia articles listed in the submitted URL.
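
One way the tool might find the articles linked from the submitted URL is to ask the MediaWiki API for the links on that page (`action=query` with `prop=links`). The sketch below only builds the query parameters and parses a response of that shape. The helper names and the sample response are hypothetical, and a real tool would also need to follow the API's continuation values for long lists.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Extract a page title from a /wiki/ URL, e.g. '.../wiki/Foo_bar' -> 'Foo bar'."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a /wiki/ page URL: %s" % url)
    return unquote(path[len(prefix):]).replace("_", " ")


def links_query_params(title):
    """Parameters for a MediaWiki API request listing article-namespace links."""
    return {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,  # main (article) namespace only
        "pllimit": "max",
        "format": "json",
    }


def linked_titles(api_response):
    """Collect link titles from a parsed prop=links JSON response."""
    titles = []
    for page in api_response.get("query", {}).get("pages", {}).values():
        titles.extend(link["title"] for link in page.get("links", []))
    return titles


# Hypothetical, abbreviated response for illustration only.
sample_response = {"query": {"pages": {"1": {"title": "Example list", "links": [
    {"ns": 0, "title": "Abscess"}, {"ns": 0, "title": "Palliative care"}]}}}}

page = title_from_url(
    "https://en.wikipedia.org/wiki/Wikipedia:Choosing_Wisely/"
    "American_College_of_Emergency_Physicians_watchlist"
)
print(page)
print(linked_titles(sample_response))
```
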

Disclaimer

Some specific organizations are named in the examples below. These organizations know nothing about this proposal and are not involved in it. They have a marginal relationship to Wikipedia in that they contribute content to the en:Choosing Wisely health campaign, which the non-profit organization en:Consumer Reports shares on Wikipedia through me. Blue Rasberry (talk) 18:00, 25 November 2014 (UTC)

Example

The en:American College of Emergency Physicians is curious about how many people use Wikipedia to get information about emergency medicine, which is their field of expertise. They add some information to some Wikipedia articles, and now would like traffic reports to measure how many people they can reach if they have their experts develop health articles on Wikipedia. After putting information in the articles, they list the articles they have developed at en:Wikipedia:Choosing Wisely/American College of Emergency Physicians watchlist, and they set the time range as May 1 2014 - July 31 2014. In return, they get a single number, which is the sum of pageview counts for each article in that list for each of the three months.

Here is an example of input and output:

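Please give a URL to be processed

  • https://en.wikipedia.org/wiki/Wikipedia:Choosing_Wisely/American_College_of_Emergency_Physicians_watchlist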

Please give a date from which to start the pageview count

  • May 1 2014

Please give the last day of the pageview count

  • July 31 2014

In return, the tool would give the following output:

The total number of pageviews received by the Wikipedia articles which were linked in https://en.wikipedia.org/wiki/Wikipedia:Choosing_Wisely/American_College_of_Emergency_Physicians_watchlist during the range of May 1 2014 - July 31 2014 was 523,350.
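
The prompt-and-answer exchange above could be handled along these lines. This is only a sketch: the function names are hypothetical, and parsing the month name with `%B` assumes an English locale.

```python
from datetime import datetime


def parse_day(text):
    """Parse a date written like 'May 1 2014' (English month names assumed)."""
    return datetime.strptime(text, "%B %d %Y").date()


def format_day(day):
    """Format a date like 'May 1 2014', with no zero padding on the day."""
    return "%s %d %d" % (day.strftime("%B"), day.day, day.year)


def summary_sentence(url, start, end, total):
    """Build the single-number report sentence with a comma-grouped total."""
    return (
        "The total number of pageviews received by the Wikipedia articles "
        "which were linked in %s during the range of %s - %s was %s."
        % (url, format_day(start), format_day(end), format(total, ","))
    )


start = parse_day("May 1 2014")
end = parse_day("July 31 2014")
sentence = summary_sentence("https://example.org/list", start, end, 523350)
print(sentence)
```

The `example.org` address is a placeholder; the tool would echo back whatever URL the user submitted.
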

Output variation - spreadsheet instead of single value

The minimal accepted output for this tool is a single number, which is a sum of pageviews.

A more useful output would be data which could be put into a spreadsheet and which showed monthly pageview counts for each article. Using the example above, here are some values for the emergency medicine articles in a useful format.

article name              May 2014  June 2014  July 2014    total
Foley catheter              13,114     11,988     11,990   37,092
Palliative care             60,077     54,262     49,617  163,956
Abscess                     48,768     39,840     39,979  128,587
Oral rehydration therapy    18,786     16,730     17,892   53,408
Fluid replacement            5,250      4,473      4,118   13,841
Intravenous therapy         46,942     42,349     37,175  126,466
grand total                                               523,350

A file should be exportable such that it can be viewed in Google Docs.
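
Plain CSV is probably the simplest exportable format here, since Google's spreadsheet editor can import it. The sketch below writes the table above as CSV using Python's standard `csv` module. The monthly counts are copied from that table, while the function name `to_csv` and the column layout are assumptions.

```python
import csv
import io

MONTHS = ["May 2014", "June 2014", "July 2014"]

# Monthly counts copied from the example table.
VIEWS = {
    "Foley catheter": [13114, 11988, 11990],
    "Palliative care": [60077, 54262, 49617],
    "Abscess": [48768, 39840, 39979],
    "Oral rehydration therapy": [18786, 16730, 17892],
    "Fluid replacement": [5250, 4473, 4118],
    "Intravenous therapy": [46942, 42349, 37175],
}


def to_csv(views, months):
    """Return CSV text with one row per article, a row total, and a grand total."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["article name"] + months + ["total"])
    grand_total = 0
    for title, counts in views.items():
        row_total = sum(counts)
        grand_total += row_total
        writer.writerow([title] + counts + [row_total])
    # Leave the month columns empty on the grand total row.
    writer.writerow(["grand total"] + [""] * len(months) + [grand_total])
    return buf.getvalue()


csv_text = to_csv(VIEWS, MONTHS)
print(csv_text)
```

Numbers are written without thousands separators so that a spreadsheet reads them as numbers rather than text.
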

Further variation

Consider the spreadsheet listed above. Suppose that instead of one set of articles to be examined, there were multiple sets. In this case, the tool would be configured such that it would compile multiple reports from a single URL.

For example, suppose the tool was given a single URL. In that URL, there are links to multiple lists of articles, perhaps for American College of Emergency Physicians, American Congress of Obstetricians and Gynecologists, and the American College of Cardiology. In this case, the tool should go one level deeper: rather than treating the links on the given page as the articles to measure, it should follow each link to its list and return the metrics for the articles linked from that list.

In this case, the tool would return multiple instances of the spreadsheets shown above, with one spreadsheet per link. The reason behind this is that multiple organizations may want reports for articles within their field of interest, and it should be possible to collect all of these reports with one action.
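
A minimal sketch of that extra level of traversal, assuming the tool already has some way to list the links on a page and to get one article's pageview total. Both are passed in as callables here; `get_links`, `get_total`, and the toy data are hypothetical stand-ins for real data access.

```python
def reports_for_index(index_title, get_links, get_total):
    """Build one report per list page linked from an index page.

    get_links(title) -> list of titles linked from that page
    get_total(title) -> pageview total for one article in the chosen range
    Both callables are hypothetical stand-ins for real data access.
    """
    reports = {}
    for list_title in get_links(index_title):
        # Second level: the articles linked from each list page.
        per_article = {a: get_total(a) for a in get_links(list_title)}
        reports[list_title] = {
            "articles": per_article,
            "total": sum(per_article.values()),
        }
    return reports


# Toy in-memory data, for illustration only.
LINKS = {
    "Index": ["List A", "List B"],
    "List A": ["Abscess", "Palliative care"],
    "List B": ["Fluid replacement"],
}
TOTALS = {"Abscess": 3, "Palliative care": 4, "Fluid replacement": 5}

reports = reports_for_index("Index", LINKS.__getitem__, TOTALS.__getitem__)
print(reports)
```

Each entry in `reports` corresponds to one spreadsheet, so a single run could serve several organizations.
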

Explanation of utility

Many organizations which do online educational outreach desire metrics which enable them to do impact evaluation. Commonly examined metrics include pageviews of an organization's own website, their Facebook Likes, their Twitter retweets and impressions, and their number of followers on various social media platforms. Currently, no analogous metric is available to export from Wikimedia projects. Wikipedia pageviews are the closest available counterpart to these commonly used metrics, and for that reason, a tool which presents them will enable organizations to compare the utility of supporting Wikipedia with the utility of investing in any other communication channel.

Scale of this

A typical use case is that a user would want a three-month report on 10-50 articles. An extremely active user may want a yearly report for 300 Wikipedia articles. The highest anticipated use case might be someone wishing to see a report for all articles in a Wikipedia category, who might therefore request a year's pageview metrics for 5,000-10,000 articles.

For the sake of this trial, if a scheme could be managed to deliver a three-month pageview report for up to 30 Wikipedia articles in one instance of running the tool, then that would be success. This amount of use should also meet the needs of 99% of anticipated users of this tool.
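
To get a feel for these scales, here is a rough estimate of how many data requests each case implies. It assumes one request per article per calendar month, which is how the stats.grok.se JSON endpoint is queried in the script under "Over Specified?"; a different data source could change these numbers entirely.

```python
from datetime import date


def months_spanned(start, end):
    """Number of calendar months touched by the range start..end inclusive."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def request_count(articles, months):
    """Requests needed, assuming one request per article per month."""
    return articles * months


trial_months = months_spanned(date(2014, 5, 1), date(2014, 7, 31))
for label, articles, months in [
    ("trial target", 30, trial_months),
    ("typical upper end", 50, 3),
    ("very active user", 300, 12),
    ("large category", 10000, 12),
]:
    print("%s: %d requests" % (label, request_count(articles, months)))
```
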

Over Specified?

#!/usr/bin/env python3
# -*- coding: utf-8  -*-
# Import stats.grok.se view count
# License: Public Domain
import requests
from datetime import date, timedelta
# The data lags behind, so don't set to the current date
begin = date(2014, 5, 1)
end   = date(2014, 6, 1) + timedelta(days=-1)
pages = "Foley catheter|Palliative care|Abscess|Oral rehydration therapy|Fluid replacement|Intravenous therapy".split('|')

# Get data
pageviews = {}
for page_title in pages:
    pageviews[page_title] = {}
    # Python's date type has no month arithmetic, so index months as year*12 + (month - 1)
    for i in range(begin.year*12 + begin.month - 1, end.year*12 + end.month):
        # One request per article per month, e.g. .../json/en/201405/Abscess
        req = requests.get("http://stats.grok.se/json/%s/%s/%s" % (
            'en',
            "%04d%02d" % (i//12, i%12+1),
            page_title,
        ))
        response_json = req.json()
        pageviews[page_title].update(response_json['daily_views'])
    # Sum the page views that fall inside the requested range
    page_hits = 0
    for viewdate, views in pageviews[page_title].items():
        if begin.strftime("%Y-%m-%d") <= viewdate <= end.strftime("%Y-%m-%d"):
            page_hits += views
        # Print daily view count in TSV for Excel's Pivot tables
        #    print("%s\t%s\t%s" % (page_title, viewdate, views))
    # Output the results
    print("%s: %s" % (page_title, page_hits))

It's pretty basic and doesn't take long to run. You can modify it to import into Excel's pivot tables to graph weekend effects. Also, in the time since this was proposed, Vipul Naik has built the tool: http://wikipediaviews.org/ --Dispenser (talk) 03:44, 8 December 2014 (UTC)

Dispenser Fascinating! Let me play with this. Blue Rasberry (talk) 20:58, 8 December 2014 (UTC)