What is a session?
How can we use it as a metric?
How do we calculate it?
Where do we calculate it?
Use cases and requirements
The resulting approach must allow us to:
- Break sequences of user-triggered events into sessions;
- Calculate an approximate length for these sessions;
- Identify the number of events in the sessions;
- Identify the time between each event in each session;
- Do all of the above efficiently enough that it can be performed on an as-needed basis, at the scale of the Wikimedia logs.
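As a rough illustration of the per-session quantities listed above, the following sketch summarises a single session. The function name `summarize_session` and the input format (a list of event timestamps in seconds) are assumptions made for this example. Length is taken as the last timestamp minus the first, which is only approximate because the time spent after the final event is unknown.

```python
def summarize_session(timestamps):
    """Summarise one session given its event timestamps (in seconds).

    Hypothetical helper: the name and input format are illustrative only.
    """
    stamps = sorted(timestamps)
    # Time between each successive pair of events in the session.
    intertimes = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return {
        "events": len(stamps),
        # Approximate length: ignores time spent after the last event.
        "length": stamps[-1] - stamps[0] if stamps else 0,
        "intertimes": intertimes,
    }


summary = summarize_session([0, 600, 660])
print(summary)
```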
HCI prior art
Generating breakpoints between sessions is a common area of interest within the wider field of Human-Computer Interaction. It is particularly prominent in relation to search engines, where there is an interest in distinguishing "sessions" from "tasks" and investigating the relationship between the two, but it also appears in work dealing with web analytics generally, and with collaborative projects specifically. The industry-standard methodology is to generate a set of intertime values (the elapsed time between each successive pair of events associated with a user) and then "end" the session once an intertime value exceeds a single, specified number of seconds, which is taken to indicate inactivity.
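A minimal sketch of this threshold-based approach follows. The function name `sessionize`, the `(user_id, timestamp)` event format, and the default of 1,800 seconds are assumptions for illustration, not a description of any production implementation.

```python
from collections import defaultdict


def sessionize(events, threshold=1800):
    """Split (user_id, timestamp_seconds) events into per-user sessions.

    A new session starts whenever the intertime between two successive
    events from the same user exceeds `threshold` seconds.
    """
    by_user = defaultdict(list)
    for user, timestamp in events:
        by_user[user].append(timestamp)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for previous, current in zip(stamps, stamps[1:]):
            if current - previous > threshold:
                user_sessions.append([current])    # inactivity: start a new session
            else:
                user_sessions[-1].append(current)  # same session continues
        sessions[user] = user_sessions
    return sessions


events = [("u1", 0), ("u1", 600), ("u1", 3000), ("u2", 100), ("u1", 3100)]
print(sessionize(events))
```

Grouping by user before sorting keeps the comparison local to each user's own event sequence, so one pass over the sorted timestamps is enough to place every breakpoint.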
The de facto standard for this inactivity threshold is 30 minutes (1,800 seconds). It originates in a 1994 study by Catledge and Pitkow, which used client-side tracking to determine Internet user activity patterns. The mean time between user events was 9.3 minutes, leading to a session timeout value of 25.5 minutes, which is 1.5 standard deviations from the mean. This is commonly rounded to 30 minutes, giving the 1,800-second value. It is considered the industry-standard breakpoint, and was still in use as of 2014.
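The arithmetic behind these figures can be written out. This assumes the cutoff was computed as the mean plus 1.5 standard deviations; the standard deviation $\sigma$ below is derived from the two figures above rather than quoted from the study.

```latex
\begin{align*}
\mu + 1.5\,\sigma &= 25.5 \text{ min}, \qquad \mu = 9.3 \text{ min} \\
\sigma &= \frac{25.5 - 9.3}{1.5} = 10.8 \text{ min} \\
30 \text{ min} \times 60 \text{ s/min} &= 1800 \text{ s}
\end{align*}
```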
Criticisms of this standard center on two themes. The first is the idea of a global, 30-minute threshold value. Jones and Klinkner (2008) found that, in relation to search data, "this threshold is no better than random for identifying boundaries". Mehrzadi & Feitelson (2012) found that, when trying to find a single global threshold for their search dataset, "any chosen value will be too short for some sessions with relatively long breaks, but too long for other cases where it will erroneously concatenate sessions that should actually remain separated. In either case, the data about the longer sessions (which may be the more important ones) will be erroneous." This may be partly down to the choice of dataset: search logs are likely to contain different behavioural patterns from dedicated, knowledge-based websites, since a well-designed search engine directs users away from the site as fast as possible. Within the Wikimedia research, testing global thresholds produced a very different outcome. Rather than a global value, Mehrzadi & Feitelson suggest using "domain knowledge and intuition" to set per-user thresholds.
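To show how per-user thresholds differ mechanically from a global one, here is a small sketch. It is not Mehrzadi & Feitelson's procedure: the function names, the `thresholds` lookup table, and the fallback default are all hypothetical, and the sketch does not address how a per-user value would be chosen.

```python
def split_sessions(stamps, threshold):
    """Split one user's sorted timestamps wherever a gap exceeds `threshold`."""
    sessions = []
    for stamp in stamps:
        if sessions and stamp - sessions[-1][-1] <= threshold:
            sessions[-1].append(stamp)   # gap is small enough: same session
        else:
            sessions.append([stamp])     # first event, or gap too large
    return sessions


def sessionize_per_user(stamps_by_user, thresholds, default=1800):
    """Apply a per-user threshold (in seconds), falling back to `default`.

    `thresholds` maps user -> threshold; both it and `default` are
    illustrative assumptions, not values taken from the literature.
    """
    return {
        user: split_sessions(sorted(stamps), thresholds.get(user, default))
        for user, stamps in stamps_by_user.items()
    }


activity = {"reader": [0, 1200, 2500], "editor": [0, 1200, 2500]}
print(sessionize_per_user(activity, {"editor": 600}))
```

With identical timestamps, the user with the shorter threshold is split into three sessions while the other stays in one, which is the behaviour a single global value cannot express.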
The second is the use of a threshold-based approach at all. As well as identifying problems with the specific 30-minute value, Jones and Klinkner and Mehrzadi and Feitelson, along with Montgomery and Faloutsos (2001), were unable to find a single value that was uniformly useful. Instead, several non-threshold-related methodologies have been developed. Cooley et al. (1999), along with Srivastava et al. (2000) and Spiliopoulou et al. (2003), discuss Navigation-Oriented Heuristics (contrasted with Catledge and Pitkow's Time-Oriented Heuristics). These exploit the link- and search-based nature of most internet browsing, using the theory that pages that are not accessible from each other, and that are accessed by the same user, belong to different sessions:
a requested Web page P that is not reachable from previously visited pages, should be assigned to a different session. This heuristic also accounts for the fact that P need not be accessible from the page immediately accessed before it. Rather, the user may backtrack to a page visited earlier, from which P is reachable. These backward moves are not always registered by the Web server, because the pages to which they refer can be already available in the client’s cache. In that case, the heuristic reconstructs the shortest path of backward moves leading to P and adds it to the user’s session.
While fascinating and worth exploring, this approach is unlikely to be efficient enough for a network like ours. Wikimedia projects have 159,956,444 pages; even if we were able to match URLs to page titles consistently, the sheer number of pages and links is likely to make this too computationally expensive to perform on an ad-hoc basis.
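For concreteness, a simplified sketch of the navigation-oriented heuristic is given below. It assumes a hypothetical in-memory link table (`links`, mapping each page to the set of pages it links to), treats "reachable" as a direct link from any page already in the current session, and omits the reconstruction of backward moves described in the quotation. Building such a link table for every page is exactly the cost discussed above.

```python
def navigation_sessions(requests, links):
    """Group one user's ordered page requests into sessions.

    A page stays in the current session if it was already visited in it,
    or if some page in the current session links directly to it.
    Otherwise it starts a new session. `links` is a hypothetical
    dict mapping page -> set of pages it links to.
    """
    sessions = []
    for page in requests:
        current = sessions[-1] if sessions else None
        reachable = current is not None and (
            page in current
            or any(page in links.get(visited, set()) for visited in current)
        )
        if reachable:
            current.append(page)
        else:
            sessions.append([page])
    return sessions


links = {"A": {"B", "C"}, "B": {"D"}, "X": {"Y"}}
print(navigation_sessions(["A", "B", "C", "X", "Y", "D"], links))
```

In this example "C" joins the first session even though it is not linked from "B", because it is linked from the earlier page "A"; "D" starts a new session because only the current session's pages are checked.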
Wikimedia prior art
Within the Research and Development work around Wikimedia projects, several things have touched on session length. The first major work, by Aaron Halfaker and Stuart Geiger, looked at sessions of editing activity, and was published
- White, Ryen W.; Huang, Jeff (2010). "Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs" (PDF). Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval: 587–594. ISBN 978-1-4503-0153-4.
- Jones, Rosie; Klinkner, Kristina Lisa (2008). "Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs" (PDF). CIKM 08 (ACM).
- Jansen, Bernard J.; Spink, Amanda (June 2003). "An Analysis of Web Documents Retrieved and Viewed". 4th International Conference on Internet Computing.
- Jansen, Bernard J.; Spink, Amanda; Saracevic, Tefko (2000). "Real life, real users, and real needs: a study and analysis of user queries on the web" (PDF). Information Processing and Management 36. ISSN 0306-4573.
- Eickhoff, Carsten; Teevan, Jaime; White, Ryen; Dumais, Susan (2014). "Lessons from the Journey: A Query Log Analysis of Within-Session Learning" (PDF). WSDM 2014 (ACM).
- Huntington, Paul; Nicholas, David; Jamali, Hamid R. (2008). "Website usage metrics: A re-assessment of session data". Information Processing and Management (Elsevier) 44: 358–372. ISSN 0306-4573.
- Geiger, R.S.; Halfaker, A. (2013). "Using Edit Sessions to Measure Participation in Wikipedia" (PDF). Proceedings of the 2013 ACM Conference on Computer Supported Cooperative Work (ACM).
- Tyler, Sarah K.; Teevan, Jaime (2010). "Large Scale Query Log Analysis of Re-Finding" (PDF). Proceedings of the third ACM international conference on Web search and data mining (ACM): 191–200. ISBN 978-1-60558-889-6.
- Goseva-Popstojanova, Katerina; Singh, Ajay Deep; Mazimdar, Sunil; Li, Fengbin (2006). "Empirical Characterization of Session-Based Workload and Reliability for Web Servers" (PDF). Empirical Software Engineering (Springer Science) 11: 71–117. ISSN 1573-7616.
- Spiliopoulou, Myra; Mobasher, Bamshad; Berendt, Bettina; Nakagawa, Miki (2003). "A Framework for the Evaluation of Session Reconstruction Heuristics in Web-Usage Analysis" (PDF). INFORMS Journal on Computing 15 (2): 171–190. ISSN 1526-5528.
- Catledge, Lara D.; Pitkow, James E. (1995). "Characterizing Browsing Strategies in the World-Wide Web". Proceedings of the Third International World-Wide Web conference on Technology, tools and applications (Elsevier): 1065–1073. ISSN 0169-7552.
- Mehrzadi, David; Feitelson, Dror G. (2012). "On Extracting Session Data from Activity Logs". Proceedings of the 5th Annual International Systems and Storage Conference (PDF). SYSTOR '12. ACM. ISBN 978-1-4503-1448-0.
- Montgomery, Alan L.; Faloutsos, Christos (2001). "Identifying Web browsing trends and patterns" (PDF). Computer (IEEE) 34 (7). ISSN 0018-9162.
- Cooley, Robert; Mobasher, Bamshad; Srivastava, Jaideep (1999). "Data Preparation for Mining World Wide Web Browsing Patterns" (PDF). Knowledge and Information Systems (Springer) 1 (1): 5–32. ISSN 0219-3116.
- Srivastava, Jaideep; Cooley, Robert; Deshpande, Mukund; Tan, Pang-Ning (2000). "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data" (PDF). SIGKDD Explorations (ACM) 1 (2). ISSN 1931-0153.
- as of 2014-10-22 20:21:47 UTC