Research:Characterizing Wikipedia Citation Usage
See January 2020 preprint: https://arxiv.org/pdf/2001.08614.pdf
As per Wikipedia's Citation page, “Citations have several important purposes: to uphold intellectual honesty (or avoiding plagiarism), to attribute prior or unoriginal work and ideas to the correct sources, to allow the reader to determine independently whether the referenced material supports the author's argument in the claimed way, and to help the reader gauge the strength and validity of the material the author has used.”
Therefore, at a time when fake news has proven pivotal in many political and societal matters, it is of great interest to assess whether the external citations in Wikipedia are used (and checked) by its readers.
The goal of this project is to form an understanding of the role of external citations in Wikipedia reading. To this end, we plan to instrument all English Wikipedia articles for a limited amount of time, to capture user interactions with the footnotes and references. The expected outcomes of our project are: (1) gaining a deeper understanding of the citation usage patterns shown by readers, and (2) providing insights and potential recommendations to elicit higher interest in citations.
In the longer term, we would also like to develop a predictive model that can output the click frequency of any given external citation. The insights exposed by this model could inform the design of new or better tools for citation visualization, such as the Reference Tooltips.
We expect the methodology and the list of questions to change over time as we learn more about the data. The current work-in-progress list of questions we have brainstormed is given below.
Citation Usage in Wikipedia
- What are the most cited resources? Building on existing research on citations with identifiers, we will break down the analysis per type of resource (e.g., scientific article, newspaper article, company webpage, blog, social media, etc.), per URL domain, and per topic.
- How frequently are references clicked (e.g., what fraction of pageviews entails a reference click)?
- Which characteristics of a Wikipedia article impact how often the external citations are visited? We will use different techniques to highlight the role of article popularity, article topic, article quality, article saliency (e.g., article about a current, trending event) in citation consumption.
- What are the common characteristics of an external citation, and how is citation clickthrough affected by these? Characteristics include position on the page, creation time relative to the lifecycle of the article, reference type (e.g., further info, support for a fact, etc.).
Reader Behavior Analysis
- Can we study how distinct groups of Wikipedia readers interact with external citations? e.g., reference followers vs. ignorers, top-to-bottom readers vs. random section readers, etc.
- Which types of reading sessions lead readers to consult the external citations more frequently? We will evaluate the impact of session length, of entry point (search engine vs. random browsing), and of position within a Wikipedia browsing session (first article vs. end of a session).
There will be two data collection iterations. The aim of the first round is to check and tune the Schema and the timespan. The second round will gather more solid data, on which we will perform a large-scale analysis.
First round of data collection
We are in the early stages of understanding the data, so the data collection plans are a work in progress and we will iterate on them. As of June 14, 2018, here is what we know about the first data collection step:
- We will collect data for a few days, sampling 1 to 15% of traffic, depending on the sparsity of entries (we do not know the frequency of citation usage, so we may have to change this plan based on the initial validation steps)
- After this period, we will check the data quality. Once that’s verified, we intend to do data collection at 100% sampling rate for a period of one week.
- The schema for the data we will collect is here: Schema:CitationUsage. The schema itself does not store the IP address.
- While the Schema does not include the IP, the clientIP is collected automatically by the EventLogging capsule, and this information is purged (dropped) every 90 days; see Data retention and purging.
- Initially, we intend to purge all data at 90-day time intervals until we get a better sense of what kind of signal we can get from this kind of data.
- We won't collect data from logged-in users.
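The sampling step described in the list above can be illustrated with a minimal sketch. It assumes a per-session token and uses hash-based bucketing so that a session is consistently in or out of the sample; the function name `in_sample` and the token format are hypothetical, and this is not necessarily how EventLogging performs sampling.

```python
import hashlib


def in_sample(session_token: str, rate: float) -> bool:
    """Decide deterministically whether a session falls in the sample.

    `session_token` is a hypothetical per-session identifier and `rate`
    is the sampling rate as a fraction (e.g. 0.01 for 1%, 0.15 for 15%).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(session_token.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # The session is sampled if its bucket falls below the rate threshold.
    return bucket < rate * 2**64
```

Because the decision depends only on the token, raising the rate from 1% to 15% keeps every session that was already sampled and adds new ones, which makes it easier to compare validation runs at different rates.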
Discussion moved to the talk page.
The first round of data collection started on Thu, Jun 28, 11:13 PM (UTC) and ended on Mon, Jul 9, 11:06 PM (UTC)
Second round of data collection
The second round of data collection will start in the first week of September and will last for around one month. Compared to the previous round, we made a few changes:
- Schema changes: During the first round of analysis, we identified some issues and changed the schema accordingly. With the previous schema, we have data about users' interactions with citations. To fully understand engagement with citations, we also need data about page visits that do not convert into interactions with citations. So far, we have approximated this data by querying the webrequest table, as explained in the documentation of the analysis. To get a good estimate of this quantity, we added a 'pageLoad' event to the new Schema.
- Timescale changes: We are expanding the data collection period to one month. In our analysis, we noticed that capturing data for only one week leads to a biased representation of actual reader behavior. For example, we found in the first round of analysis that the most clicked references correspond to movies released that week, or to sports events popular during that week. We hope that capturing data over a longer period will smooth out the effect of such time-specific events.
- Anonymity: We care a lot about anonymity. In our analysis, we discard IP and user agent, i.e. the most sensitive data in our Schema. We publish the code used for our analysis to show the specific fields of interest to us, which do not include sensitive data. We don't collect data about logged-in users. Also, we plan to purge all sensitive data after 90 days, which means that, at a minimum, we will:
- Drop IP/user agent columns
- Drop all fields of geocoded_data which are more specific than "country"
- Drop areas/pages with few samples.
- Make the timestamp as coarse as possible
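The purging steps listed above could look roughly like the following sketch. All field names (`ip`, `user_agent`, `geocoded_data`, `timestamp`, `page_id`) are hypothetical stand-ins rather than the actual schema fields, and truncating timestamps to the day is an assumption chosen for illustration, since the plan only says "as coarse as possible".

```python
from collections import Counter
from datetime import datetime

# Hypothetical names for the sensitive columns to drop.
SENSITIVE_FIELDS = {"ip", "user_agent"}


def purge_event(event):
    """Return a copy of one event with sensitive fields removed or coarsened."""
    cleaned = {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}
    # Keep only the country-level part of the geocoded data.
    geo = cleaned.get("geocoded_data") or {}
    cleaned["geocoded_data"] = {"country": geo.get("country")}
    # Coarsen the ISO 8601 timestamp to the day (assumed granularity).
    ts = datetime.fromisoformat(cleaned["timestamp"])
    cleaned["timestamp"] = ts.date().isoformat()
    return cleaned


def drop_rare_pages(events, min_samples):
    """Drop all events for pages that have fewer than `min_samples` events."""
    counts = Counter(e["page_id"] for e in events)
    return [e for e in events if counts[e["page_id"]] >= min_samples]
```

A purge job under these assumptions would apply `purge_event` to every stored event and then call `drop_rare_pages` with a threshold chosen by the team.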
We started the data collection on Wed, Sep 5 2018, at 4:33 PM UTC, and temporarily stopped it after a few hours due to an overload of events. We are making changes to the schema to sample page load events at 50%, and will start the collection again during the week starting Sept 17 2018.
After carefully analyzing the event overload problem, we decided to split the schema into two.
- The Citation Usage schema records all readers' interactions with footnotes and references, and it is sampled at 100%.
- The Citation Usage Page Load schema records all readers' pageviews, and it is sampled at 33.3%, to avoid event overload.
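Because the two schemas are sampled at different rates, raw event counts cannot be divided directly. The sketch below shows one way to rescale them before computing citation events per page load; the function names are hypothetical, and it assumes that sampling is uniform so that dividing by the sampling rate gives an unbiased estimate of the true count.

```python
def estimate_total(observed_count, sampling_rate):
    """Scale an observed event count up to an estimated true count."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return observed_count / sampling_rate


def citation_events_per_page_load(citation_events, page_loads_observed,
                                  citation_rate=1.0, page_load_rate=0.333):
    """Estimate citation events per page load from two sampled streams.

    The defaults mirror the rates stated above (100% and 33.3%).
    """
    citations = estimate_total(citation_events, citation_rate)
    page_loads = estimate_total(page_loads_observed, page_load_rate)
    if page_loads == 0:
        raise ValueError("no page loads observed")
    return citations / page_loads
```

For instance, 50 citation events against 333 observed page loads corresponds to roughly 1,000 estimated page loads, i.e. about 0.05 citation events per page load.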
The second round of data collection with this new schema configuration started on Sep 24 2018, 6:16 PM UTC, and ended on Oct 25 2018, 11:05 AM UTC.
Third round of data collection
After finding and fixing some bugs in the data coming in from the second round of data collection (mainly related to section header characteristics and the storing of the citation identifier), we decided to collect a third round of data.
The third round of data collection started on March 21st, 2019, and ended on April 23rd, 2019.
Results also in https://arxiv.org/pdf/2001.08614.pdf
First round of data analysis
See the first round of analysis page.
Second round of data analysis
See the Second round of analysis page.
Third round: Analyzing reading sessions
See the Analyzing Reading Session page.