New Readers/Raising Awareness in Mexico/Research

This page outlines possible methods for addressing two research goals: 1) measuring the immediate impact of the campaign, i.e., how many new (unique) devices visit Wikipedia as a result of the video, and 2) measuring the long-term impact of the campaign, i.e., how many new readers are retained after one month or two months. The first research question is relatively easy to answer, while the second is much more difficult.

RQ1: Immediate Impact

The video link goes to an intermediary landing page with Piwik/Matomo analytics instead of the usual webrequests logging, so there are a few non-standard aspects to be aware of in order to support research on impact.

Matomo will gather the necessary statistics on pageviews to the welcome landing page and what links are clicked on from there. If a deeper analysis is desired of what Wikipedia pages are visited within these sessions, the referer information (which will reference the landing page) combined with client-ip and user-agent from the webrequest logs can be used to reconstruct the pages visited. Individuals who download the mobile app and use that to view Wikipedia will not have this referer information, but an estimate of the count of them can be provided from the Matomo data.
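The session reconstruction described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the landing-page URL and the record field names (`client_ip`, `user_agent`, `ts`, `referer`, `uri_path`) are assumptions chosen for the example, and a real analysis would also need a session timeout.

```python
from collections import defaultdict

# Hypothetical landing-page URL; the real one is not given in this document.
LANDING_PAGE = "https://landing.example.org/welcome"


def pages_in_campaign_sessions(requests):
    """Return {(client_ip, user_agent): [uri_path, ...]} for devices that
    arrived from the landing page, starting at that first referred request.

    `requests` is an iterable of dicts with the assumed keys
    client_ip, user_agent, ts, referer, uri_path.
    """
    by_device = defaultdict(list)
    for req in requests:
        by_device[(req["client_ip"], req["user_agent"])].append(req)

    sessions = {}
    for device, reqs in by_device.items():
        reqs.sort(key=lambda r: r["ts"])
        # Index of the first request whose referer is the landing page.
        start = next(
            (i for i, r in enumerate(reqs)
             if (r.get("referer") or "").startswith(LANDING_PAGE)),
            None,
        )
        if start is not None:
            sessions[device] = [r["uri_path"] for r in reqs[start:]]
    return sessions
```

Devices that never send a landing-page referer (for example, app users) are simply absent from the result, which matches the limitation noted above.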

The webrequest logs can be used to answer questions such as how many of these sessions are associated with devices that are new in the last 30 days (and presumably, then, new readers). This can be estimated from the WMF-Last-Access cookie in the webrequest table: a device that accepts cookies but has no WMF-Last-Access cookie set is treated as new. If further details are desired about sessions associated with these devices in the last 90 days, requests associated with the same user-agent + IP can be examined. A caveat is that the longer the time period studied, the less likely it is that the combination of user-agent and IP address remains a good proxy for that individual. Other individuals might use the same user-agent + IP (e.g., in the case of a shared desktop), the individual’s IP address might change (e.g., they are using data on mobile), or their user-agent might change (e.g., the user-agent includes browser version information, and anecdotally Firefox updates every few weeks on average).
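As a rough sketch of the cookie-based rule, the function below classifies a single request. It assumes the raw `Cookie` header is available as a string and that a separate flag says whether the device accepts cookies; both the function name and the `accepts_cookies` parameter are hypothetical, and how that flag is derived from the logs is not covered here.

```python
def is_new_device(cookie_header, accepts_cookies):
    """Classify a request as coming from a new device.

    Returns True if the device accepts cookies but sent no WMF-Last-Access
    cookie, False if the cookie is present, and None if the device does not
    accept cookies (in which case nothing can be concluded).
    """
    if not accepts_cookies:
        return None
    # Collect cookie names from a header such as "a=1; b=2".
    names = {
        part.split("=", 1)[0].strip()
        for part in cookie_header.split(";")
        if part.strip()
    }
    return "WMF-Last-Access" not in names
```

Returning `None` for cookie-less devices keeps them out of both the "new" and "returning" counts rather than inflating either.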

RQ2: Long-term Impact

Measuring whether the users who clicked through to Wikipedia from the video are returning one week, two weeks, one month, etc. later is much more difficult. Because the capability for tracking users is not immediately available and due to the caveats about user-agent + IP address mentioned above, the accuracy of identifying all sessions associated with a particular device presumably falls off quite rapidly over time. This leaves two main options:

User-Agent + IP Address

Despite the caveats, a hash of these two fields (potentially combined with other fields such as browser language) can still be used to identify whether devices that clicked on the video link returned to Wikipedia within some period of time.

Pros: this is a rather direct method of estimating return visitors.

Cons: this will almost certainly be an underestimate, because returning devices whose IP address or user-agent has changed will not be matched, and it is difficult to predict how large the underestimate is.
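A minimal sketch of this hashing approach, assuming each request has already been reduced to a `(client_ip, user_agent, accept_language)` tuple. The function names and the choice of SHA-256 are illustrative assumptions, not a description of an existing pipeline.

```python
import hashlib


def device_key(client_ip, user_agent, accept_language=""):
    """Hash the identifying fields into a single opaque device key."""
    # A separator that cannot appear in the fields avoids ambiguous joins.
    raw = "\x1f".join([client_ip, user_agent, accept_language])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def return_rate(campaign_requests, later_requests):
    """Share of campaign devices whose key reappears in a later period."""
    campaign = {device_key(*r) for r in campaign_requests}
    later = {device_key(*r) for r in later_requests}
    if not campaign:
        return 0.0
    return len(campaign & later) / len(campaign)
```

For example, if two devices clicked through during the campaign and one of them shows up again with the same fields, `return_rate` gives 0.5. Any device whose IP address or user-agent changed in the meantime counts as not returning.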

Difference in Differences Causal Inference

Causal inference does not seek to identify specific devices associated with the campaign but depends on detecting changes in aggregate traffic that can reasonably be associated with the campaign. To use this method, we need to monitor aggregate traffic to a set of pages that are likely to be visited by individuals who viewed the campaign (referred to as “treated” pages) and a separate set of pages that are unlikely to be visited by individuals who viewed the campaign (referred to as “control” pages). Aggregate pageviews to the “treated” and “control” pages are measured before and after the campaign. If pageviews increase more for the “treated” pages than for the “control” pages after the campaign, the difference can be attributed to the campaign, under the assumption that the two sets of pages would otherwise have followed similar trends.
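The basic difference-in-differences estimate can be sketched as below. This is a simplified illustration that compares mean daily pageviews. The function and argument names are assumptions, and a real analysis would typically add a regression model and uncertainty estimates.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences on mean daily pageviews.

    Each argument is a list of daily pageview totals for that group and
    period. The result is the change in the treated pages minus the change
    in the control pages.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change
```

With made-up numbers, if the treated pages go from 100 to 130 daily views while the control pages go from 200 to 210, the estimate is 30 - 10 = 20 additional daily views.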

Pros: This approach does not require any webrequest data, just careful selection of the “control” and “treated” pages. It can therefore be done with publicly available data.

Cons: The drawback of this approach is that other external factors might affect the analysis (e.g., the “treated” pages chosen are coincidentally affected by some other external factor not related to the campaign such as a drop in Wikipedia usage due to the holidays). Selecting the “treated” and “control” pages can also be quite difficult if there is no strong hypothesis as to what types of Wikipedia pages might be visited by the individuals who will watch the video. Some insight might be gained from our research on characterizing Wikipedia reader behavior by identifying pages likely to be visited by readers from Mexico to Spanish Wikipedia with work- or school-related motivations, but this would be a rough estimate at best for the population that will likely be reached through this video campaign. Finally, it is not possible to know whether a detected change in pageviews is due to new readers from the campaign or already existing readers associated with the campaign who simply increased their page views.

Alternatives

An analysis of aggregate statistics that measure how many pages the new readers (as identified in RQ1) read in their first session could give an idea of whether these individuals are engaging with the site or checking it out and quickly leaving. We know of no research at this moment that ties a new reader’s first-session behavior to the likelihood that they return, but intuitively greater interaction is a better indicator of future returns.
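One way to summarize first-session engagement is sketched below, assuming the number of pages viewed in each new reader's first session has already been computed. The function name and the choice of statistics (median and single-page share) are illustrative assumptions.

```python
from statistics import median


def engagement_summary(pages_per_session):
    """Summarize first-session depth for a list of per-session page counts."""
    n = len(pages_per_session)
    if n == 0:
        return None
    single = sum(1 for p in pages_per_session if p == 1)
    return {
        "sessions": n,
        "median_pages": median(pages_per_session),
        # Share of sessions that viewed exactly one page and then left.
        "single_page_share": single / n,
    }
```

A high single-page share would suggest that many visitors check out the site and leave quickly, while a higher median suggests deeper engagement.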