Research:Understanding search behavior of users

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T310021
Ricardo Baeza-Yates
Erik Bernhardson
Duration:  2022-June – 2022-August
This page documents a completed research project.
Understanding Search Behavior in Wikipedia - Report - Bruno Scarone

The objective of this research initiative is to provide a better understanding of how and why the Wikimedia Foundation's (WMF) internal search is used by the different types of users of the platform. The research was conducted at the request of the Search Platform team at WMF.

Research questions


The research primarily targets the following questions:

  • Understand differences in search behavior on the web vs mobile user interface (including browsers).
  • Understand country/regional differences, especially in emerging countries and specific languages.

These were selected from a prioritized list of research questions of interest to the Search Platform team. Other elements relevant to the study were taken from a related Phabricator task, namely:

  • Evaluation of common or relevant query patterns.
  • Top returned documents (articles) and top clicked through documents.

Data and Methods


Data sources


The data sources used in the analysis are listed in the table below, together with the data retention policy that applies to each:

Name                                                                          Max. data retention
Web request logs (wmf.webrequest table)                                       90 days
Search logs (event.mediawiki_cirrussearch_request table; abbreviated emcr)    90 days
discovery.query_clicks_daily table (abbreviated dqcd)                         90 days
wmf.mediawiki_wikitext_current table                                          -
event.searchsatisfaction table                                                -
Pageview hourly (wmf.pageview_hourly table)                                   -

Table: Data sources used for this report

Additional information about the data sources can be found in the report available on this page.

Evaluated metrics


The metrics computed as part of the project for the defined time ranges are listed below, grouped by type of analysis. Section references in what follows correspond to sections of the project's report, available on this page.

  • Both for general search behavior (Section 3) and search behavior based on client type (Section 4):
      1. Total number of sessions
      2. Distribution of number of clicks per session
      3. Dwell time (based on sessionized clicks and checkin time)
          1. Percentage of clicks that have an associated dwell time
          2. Average page length per dwell time (histogram) bin
      4. Average ranking position clicked on
      5. Number of words per query
      6. Top k queries
  • Search behavior based on language and country (Section 5):
      1. User hits per country for a given language
      2. Usage metric (number of hits normalized by size)
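
To make a few of the session-level metrics above concrete, the following is a minimal sketch in plain Python. It is not the project's code (which ran PySpark against production tables); the record layout `(session_id, query, clicked positions)`, the sample data and the function name `session_metrics` are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical search records: (session_id, query, clicked result positions).
# Field names and values are invented for illustration; they are not the
# schema of the production tables.
records = [
    ("s1", "albert einstein", [1, 3]),
    ("s2", "python", []),
    ("s3", "world war ii timeline", [2]),
    ("s3", "python", [1]),
]


def session_metrics(records, k=1):
    """Compute a few of the listed metrics from in-memory records."""
    clicks = Counter()   # session_id -> total number of clicks
    positions = []       # every clicked ranking position
    queries = []         # every submitted query string
    for session_id, query, clicked in records:
        clicks[session_id] += len(clicked)  # also registers click-free sessions
        positions.extend(clicked)
        queries.append(query)
    return {
        "total_sessions": len(clicks),
        # number of clicks -> number of sessions with that many clicks
        "clicks_per_session": Counter(clicks.values()),
        "avg_clicked_position": mean(positions) if positions else None,
        # words are approximated by whitespace-separated tokens
        "avg_words_per_query": mean(len(q.split()) for q in queries),
        "top_k_queries": Counter(queries).most_common(k),
    }


metrics = session_metrics(records, k=1)
```

Splitting queries on whitespace is a simplification; the report may tokenize queries differently, particularly for languages that do not separate words with spaces.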

Time ranges


The quantities analyzed for general search behavior and for behavior based on client type were computed for two one-week time ranges: 2-8/May/2022 and 4-10/Jul/2022. Each range spans the first complete week (Monday to Sunday) of its month; two ranges were used as an initial validation check of the results. For general search behavior, when the results for the two time ranges are similar, only one set of results is shown, since both are included in the analysis based on client type.
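
The "first complete week of a month" rule can be checked with a short sketch. The helper name `first_complete_week` is an assumption introduced here, not something taken from the project's code.

```python
from datetime import date, timedelta


def first_complete_week(year, month):
    """Return (monday, sunday) of the first Monday-to-Sunday week of a month."""
    first = date(year, month, 1)
    # date.weekday() is 0 for Monday, so this is the number of days until
    # the first Monday of the month (0 if the month starts on a Monday).
    days_to_monday = (7 - first.weekday()) % 7
    monday = first + timedelta(days=days_to_monday)
    return monday, monday + timedelta(days=6)


may_range = first_complete_week(2022, 5)
jul_range = first_complete_week(2022, 7)
```

For May and July 2022 this gives 2-8 May and 4-10 July, matching the ranges used in the analysis.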

When analyzing user hits per country for a given language (Section 5.1), two time ranges are used (again for validation purposes) at the monthly level (Feb/2021 and Jul/2022). The proposed usage metric (Section 5.2) is computed for 4-10/Jul/2022, since it depends on the number of words per project and the user hits, quantified and validated in Section 5.1.
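
As a rough illustration of a size-normalized usage metric, the sketch below divides user hits by the number of words per project. This is only one plausible reading of "number of hits normalized by size"; the exact definition is given in Section 5.2 of the report. The function name `usage_metric`, the project keys and all numbers are hypothetical.

```python
def usage_metric(hits, words):
    """Hits per word for each project (an assumed form of the normalization).

    hits:  dict mapping project -> number of user hits in the time range
    words: dict mapping project -> total number of words in the project
    Projects with a missing or zero word count are skipped.
    """
    return {
        project: hits[project] / words[project]
        for project in hits
        if words.get(project)
    }


# Hypothetical numbers, for illustration only.
example_hits = {"project_a": 100, "project_b": 50, "project_c": 10}
example_words = {"project_a": 1000, "project_b": 100}

usage = usage_metric(example_hits, example_words)
```

Normalizing by size lets small projects with comparatively heavy search use stand out; here `project_b` has fewer hits than `project_a` but a higher value of the metric.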

Infrastructure and software


All code used to generate the results for this research initiative was run on the Wikimedia Foundation's production cluster. In particular, the databases were queried using PySpark (the Python API for Apache Spark), version 2.4.4.



Further details about the work performed as part of the project, as well as the results obtained, are included in the project's report (also attached at the top of this page).


