Research:Characterizing Wikipedia Citation Usage/Analyzing Reading Sessions
Here, we provide an analysis of the reading sessions of users as captured in the third round of data collection. The aim of this analysis is to characterize reading sessions according to their "citation need". Similar to the analysis from "Why We Read Wikipedia", we want to categorize readers' intents for exploring the citation space.
The idea is to aggregate similar reading sessions and then derive an a posteriori characterization from the analysis of the resulting clusters. The overall goal is to describe clusters of similar reading sessions, both qualitatively and quantitatively, in order to understand what makes readers interested in exploring inline citations.
- Global Analysis: First, we will aggregate reading sessions using a Markov chain, in order to describe the general reading behavior.
- Pattern Mining: Next, we will cluster reading sessions into meaningful groups of consistent sequences.
- Cluster Characterization: Finally, we will characterize these groups according to page characteristics (topic, popularity, etc.), citation characteristics (domain, template, etc.), session characteristics (length, number of pageloads, etc.) and user characteristics (location, interests, etc.).
The hypothesis here is that we will find at least three groups of citation interaction patterns in our clusters:
- On-surface patterns: Some references might be hovered or checked (click up and down).
- In-breadth patterns: Readers move outside Wikipedia by clicking on external references.
- In-depth patterns: For a specific, technical, or challenging information need, the reader enters Wikipedia and delves into one or multiple pages extensively.
We created session representations as sequences of actions for each reader (identified with a unique identifier made of IP+Session Token). Sequences are made of combinations of the following elements:
- pageload: the action of loading a Wikipedia article
- extclick: the action of clicking on an external link outside the reference section
- refclick: the action of clicking on an external link inside the reference section
- intclick: the action of clicking on an internal link to another Wikipedia page (in the same or in a new tab)
- fnhover: the action of hovering over a page-internal footnote link, which renders the reference tooltip for a given citation
- fnclick: the action of clicking on a page-internal footnote link, which takes the user to the reference section at the bottom
- upclick: the action of clicking from the reference at the bottom back to its anchor in the text
- END: the point at which the user stops interacting with the page, set after a 1-hour timeout (this definition of the "END" state might change with time)
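The session representation described above could be built roughly as follows. This is a minimal sketch, not the pipeline actually used for this analysis: the event layout `(reader_id, timestamp, action)` and the function name `build_sessions` are assumptions made for illustration, and only the action names and the 1-hour timeout come from the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# 1-hour inactivity timeout, as stated in the definition of END.
TIMEOUT = timedelta(hours=1)


def build_sessions(events, timeout=TIMEOUT):
    """Turn raw events into per-reader action sequences.

    `events` is an iterable of (reader_id, timestamp, action) tuples;
    this layout is hypothetical. A reader's events are split into a new
    sequence whenever the gap between two consecutive events exceeds
    `timeout`, and every sequence is closed with the END marker.
    """
    by_reader = defaultdict(list)
    for reader_id, ts, action in events:
        by_reader[reader_id].append((ts, action))

    sessions = []
    for reader_id, items in by_reader.items():
        items.sort(key=lambda pair: pair[0])  # chronological order
        current, last_ts = [], None
        for ts, action in items:
            if last_ts is not None and ts - last_ts > timeout:
                current.append("END")
                sessions.append((reader_id, current))
                current = []
            current.append(action)
            last_ts = ts
        current.append("END")
        sessions.append((reader_id, current))
    return sessions


# Toy data, invented for illustration only.
t0 = datetime(2018, 1, 1, 12, 0)
toy_events = [
    ("reader-a", t0, "pageload"),
    ("reader-a", t0 + timedelta(minutes=5), "fnhover"),
    ("reader-a", t0 + timedelta(hours=2), "pageload"),
    ("reader-b", t0, "pageload"),
]
toy_sessions = build_sessions(toy_events)
```

In this toy input, the two-hour gap for `reader-a` exceeds the timeout, so that reader contributes two separate sequences.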
Aggregating Reading Sessions
We aggregated the sequences of actions on a single page across readers and displayed them as a Markov chain. The arrows that go back to pageload refer to a reload of the same page after an event, which is mainly due to a click on the browser's back button.
- 70% of sessions consist of a single pageload (pageload -> END).
- After loading a page, readers click on an external reference 0.2% of the time, click on an external link 0.6% of the time, hover over a reference 0.8% of the time, and visit another Wikipedia page 22% of the time.
- After an external click, readers go back to the Wikipedia page 13.7% of the time.
- After a reference click, readers go back to the Wikipedia page 17.4% of the time.
- After a hover, the probability of an external click is only 1.3%, while a click that jumps to the reference section (fnclick) follows 6.7% of the time.
- After a jump to the reference section, readers click a link 34.7% of the time.
- The self-loop on intclick (28.3%) is generated by readers opening multiple tabs (at least two from the same page in sequence).
- The self-loop on pageload (5.7%) typically corresponds to the pattern: Google -> Wikipedia page -> back button to Google -> click on the Wikipedia page again.
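Percentages like the ones above are the transition probabilities of a first-order Markov chain, which can be estimated by counting how often each action is followed by each other action. The snippet below is a minimal sketch under that assumption, not the code used to produce the reported figures. The function name `transition_probabilities` and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order transition probabilities from action sequences.

    For every pair of consecutive actions (src, dst), count the
    transition and then normalize the counts per source state so that
    the outgoing probabilities of each state sum to 1.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


# Toy sequences, invented for illustration only.
toy_sequences = [
    ["pageload", "END"],
    ["pageload", "END"],
    ["pageload", "intclick", "pageload", "END"],
    ["pageload", "fnhover", "END"],
]
probs = transition_probabilities(toy_sequences)

for src, dsts in probs.items():
    for dst, p in sorted(dsts.items()):
        print(f"{src} -> {dst}: {p:.1%}")
```

A self-loop such as the one on intclick would show up here as a non-zero entry `probs["intclick"]["intclick"]`, which arises whenever the same action occurs twice in a row within a sequence.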