Research:HTTPS Transition and Article Censorship
In June 2015, the Wikimedia Foundation started using HTTPS to encrypt all traffic to Wikimedia projects (see blog post). The transition to HTTPS presents a unique opportunity to investigate censorship of Wikipedia at the article level. The aim of this project is to investigate how article access patterns changed in different countries as a result of the transition to HTTPS and to determine if there are patterns indicative of censorship practices that relied on requests being sent via plain HTTP. If these patterns exist, we aim to build a model to determine if an article was censored in a given country.
We constructed a data set consisting of an hourly time series of pageview counts for each (article, country) pair. The data span May 2015 to July 2015, but can easily be extended to later months.
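As an illustration of the shape of this data set, the following is a minimal sketch of how per-request records could be bucketed into hourly counts. The text does not describe the actual pipeline or log schema, so the record format `(article, country, timestamp)` and the function name `hourly_series` are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def hourly_series(requests):
    """Count pageviews per (article, country, hour).

    `requests` is an iterable of (article, country, timestamp) tuples.
    This record format is hypothetical; the real logs may look different.
    """
    counts = Counter()
    for article, country, ts in requests:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(article, country, hour)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("Example_article", "XX", datetime(2015, 6, 1, 10, 5)),
        ("Example_article", "XX", datetime(2015, 6, 1, 10, 55)),
        ("Example_article", "XX", datetime(2015, 6, 1, 11, 0)),
    ]
    for key, views in sorted(hourly_series(sample).items()):
        print(key, views)
```

Keeping the hour in the key makes it straightforward to extend the series to later months by feeding in more records.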
We started by computing the difference in total views between the 2 weeks before and the 2 weeks after the transition for each (article, country) pair, and manually inspected the articles with the largest relative increase. This heuristic is quite weak: many articles that were trending in the weeks following the HTTPS transition, but are known not to be censored, also show a large relative increase in pageview counts. However, for some articles that we suspect to have been censored, there is a clear pattern: pageview counts are very low up to the transition, then rapidly increase after the transition and remain high. We have been able to find several hundred examples of access patterns that resemble the figure on the right. Our next steps involve hand-labeling a larger sample of time series and then building a model to find all time series that show this pattern in access rates. We will also investigate whether there are other characteristic patterns that are indicative of censorship.
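The before/after comparison described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the function names, the additive `smoothing` term (used here only to avoid division by zero for pairs with no views before the transition), and the exact definition of "relative increase" are all assumptions. The transition time is passed in as a parameter because the text gives only the month.

```python
from datetime import datetime, timedelta


def relative_increase(hourly_counts, transition,
                      window=timedelta(weeks=2), smoothing=1.0):
    """Relative change in total views across the transition.

    `hourly_counts` maps datetime -> views for one (article, country) pair.
    `smoothing` is an assumed additive constant, not taken from the source.
    """
    before = sum(v for t, v in hourly_counts.items()
                 if transition - window <= t < transition)
    after = sum(v for t, v in hourly_counts.items()
                if transition <= t < transition + window)
    return (after - before) / (before + smoothing)


def rank_pairs(series_by_pair, transition):
    """Sort (article, country) pairs by relative increase, largest first."""
    scores = {pair: relative_increase(counts, transition)
              for pair, counts in series_by_pair.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    t0 = datetime(2015, 6, 12)  # placeholder date chosen for illustration only
    demo = {
        ("Suspect_article", "XX"): {t0 - timedelta(days=1): 2,
                                    t0 + timedelta(days=1): 300},
        ("Stable_article", "XX"): {t0 - timedelta(days=1): 100,
                                   t0 + timedelta(days=1): 105},
    }
    for pair, score in rank_pairs(demo, t0):
        print(pair, round(score, 2))
```

A ranking like this surfaces trending articles alongside possibly censored ones, so it only narrows the set of candidates for manual inspection.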
Abstract of the resulting paper
"This study, conducted by the Internet Monitor project at the Berkman Klein Center for Internet & Society, analyzes the scope of government-sponsored censorship of Wikimedia sites around the world. The study finds that, as of June 2016, China was likely censoring the Chinese language Wikipedia project, and Thailand and Uzbekistan were likely interfering intermittently with specific language projects of Wikipedia as well. However, considering the widespread use of filtering technologies and the vast coverage of Wikipedia, our study finds that, as of June 2016, there was relatively little censorship of Wikipedia globally. In fact, our study finds there was less censorship in June 2016 than before Wikipedia’s transition to HTTPS-only content delivery in June 2015. HTTPS prevents censors from seeing which page a user is viewing, which means censors must choose between blocking the entire site and allowing access to all articles. This finding suggests that the shift to HTTPS has been a good one in terms of ensuring accessibility to knowledge. The study identifies and documents the blocking of Wikipedia content using two complementary data collection and analysis strategies: a client-side system that collects data from the perspective of users around the globe and a server-side tool to analyze traffic coming in to Wikipedia servers. Both client- and server-side methods detected events that we consider likely related to censorship, in addition to a large number of suspicious events that remain unexplained. The report features results of our data analysis and insights into the state of access to Wikipedia content in 15 select countries."