Research:Understanding Curious and Critical Readers/Reader interactions with talk-pages and version-history

From Meta, a Wikimedia project coordination wiki

We analyze how much readers of Wikipedia engage with articles' talk-pages and version-history.

Motivation

There are various teaching materials about assessing the quality and trustworthiness of information in the context of Wikipedia. Beyond the actual encyclopedic content, version-history and talk-pages have been pointed out as two of the most useful features for understanding how the knowledge was created.

  • The Civic Online Reasoning Curriculum provides a course on how to use Wikipedia wisely when reading laterally (for more details on lateral reading as a strategy to assess credibility of information online, see the paper Lateral Reading: Reading Less and Learning More When Evaluating Digital Information[1]). In this they point out the importance of talk pages and version history to see who is behind the information and assess its trustworthiness.
  • Module 2 of the Reading Wikipedia in the Classroom guide aims to build teachers' competency in understanding, assessing, and evaluating information and media by showing them how to use various components of a Wikipedia article to determine the quality of information. After the subsection on the article itself (2.4 Overview of Wikipedia article structure), the following two subsections cover:
    • 2.5 Talk pages and other communication spaces on Wikipedia. The talk-page reveals not only the perspectives of contributors but also the quality assessments from WikiProject tags.
    • 2.6 View history and the process of knowledge creation. The version-history provides information about each edit, revealing, e.g., when the last edit was made (was it recent?), by whom (a registered editor or an anonymous one?), how many editors have worked on the article, and how much overall activity there is.

Therefore, we want to better understand how much readers in Wikipedia actually use these features. Specifically, our research questions are:

  • What is the overall engagement of readers with talk-pages and version-history of articles?
  • What (types of) articles lead to high engagement with their version-history and talk-pages?

Methodology

Webrequest-logs

We extract reader interactions with Wikipedia articles’ talk-pages and version-history from webrequest-logs for the month of January 2022 for all Wikipedias.

We perform the following filtering steps for both cases:

  • Filter out bots and automated traffic (keep only “agent_type”=“user”)
  • Filter by device to keep only desktop and mobile web (“access_method”!=“mobile app”)
  • We keep track of whether the reader also attempted to edit at least one article on the same day in the same Wikipedia. In this way we can distinguish between requests from editors and requests from readers who do not edit. We assign pseudo-ids to readers via a hash of user-agent, client-IP, and a daily changing salt. Edit-attempts are identified by checking whether the uri-query contains any of the following signatures: “action=edit”, “action=visualeditor”, or “&intestactions=edit&intestactionsdetail=full&uiprop=options”
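
The filtering steps above can be sketched in Python as follows. This is only an illustration, not the code used in the analysis: the function names, the dictionary-based row format, the choice of SHA-256, and the "|" separator are all assumptions. Only the field names and the edit signatures are taken from the list above.

```python
import hashlib

# Signatures of edit attempts, copied from the list above.
EDIT_SIGNATURES = (
    "action=edit",
    "action=visualeditor",
    "&intestactions=edit&intestactionsdetail=full&uiprop=options",
)


def keep_request(row: dict) -> bool:
    """Apply the bot filter and the device filter to one webrequest row."""
    return row["agent_type"] == "user" and row["access_method"] != "mobile app"


def pseudo_id(user_agent: str, client_ip: str, daily_salt: str) -> str:
    """Hash user-agent, client IP and a daily salt into a pseudo-id.

    The hash function (SHA-256) and the separator are assumptions of this sketch.
    """
    raw = "|".join((daily_salt, client_ip, user_agent))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_edit_attempt(uri_query: str) -> bool:
    """Return True if the uri-query contains one of the edit signatures."""
    return any(signature in uri_query for signature in EDIT_SIGNATURES)
```

Because the salt changes daily, the same reader receives a different pseudo-id on each day, so editing activity can only be linked to reading activity within a single day.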

Talk-pages

We keep all requests to a talk-page that originated from the corresponding Wikipedia article:

  • Keep only pageviews (“is_pageview”=1) of article talk-pages (“namespace_id”=1)
  • Only keep requests to talk-pages that came from the corresponding article in the main namespace of the same project. We expect the referrer to be of the form “https://<language>(.m).wikipedia.org/wiki/<page_title>” and only keep the request if <page_title> matches the title of the article talk-page.
  • Filter out repeated requests to the same article talk-page by the same reader in the same hour.
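
A minimal sketch of the referrer check and the hourly de-duplication described above. The helper names are hypothetical, and the sketch assumes that the talk-page title is given without its namespace prefix and that requests arrive as (reader, page, hour) tuples.

```python
import re
from urllib.parse import unquote, urlsplit

# Host names of the form <language>.wikipedia.org or <language>.m.wikipedia.org.
HOST_RE = re.compile(r"^(?P<lang>[a-z0-9-]+)(?P<mobile>\.m)?\.wikipedia\.org$")


def referrer_title(referrer: str, lang: str):
    """Return the page title from an article referrer, or None if it does not match."""
    parts = urlsplit(referrer)
    match = HOST_RE.match(parts.netloc)
    if parts.scheme != "https" or match is None or match.group("lang") != lang:
        return None
    if not parts.path.startswith("/wiki/"):
        return None
    title = unquote(parts.path[len("/wiki/"):])
    return title.replace(" ", "_") or None


def came_from_article(referrer: str, lang: str, talk_title: str) -> bool:
    """True if the referrer is the article that the talk-page belongs to."""
    return referrer_title(referrer, lang) == talk_title


def dedup_hourly(requests):
    """Keep the first request per (reader, page, hour) triple, in input order."""
    seen = set()
    kept = []
    for key in requests:
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

For example, came_from_article("https://en.m.wikipedia.org/wiki/Marie_Curie", "en", "Marie_Curie") is True, while a referrer from a search engine or from another article is rejected.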

Version-history

We keep all requests to the version-history of an article that originated from the corresponding Wikipedia article.

  • Keep requests to version-history by checking the uri-query:
    • Desktop: “uri_query” contains “action=history” and “title=<page_title>”. This is the same across different projects:
      • https://en.wikipedia.org/w/index.php?title=Marie_Curie&action=history
      • https://de.wikipedia.org/w/index.php?title=Marie_Curie&action=history
    • Mobile: the “page_title” is of the form “Special:History/<page_title>”. However, the signature “Special:History” varies across projects and we have to keep track of the different aliases:
      • https://en.m.wikipedia.org/wiki/Special:History/Marie_Curie
      • https://de.m.wikipedia.org/wiki/Spezial:Versionsgeschichte/Marie_Curie
  • Only keep requests to version-history that came from the corresponding article in the main namespace of the same project. For this we extract the page-title from the uri-query (see above) and keep only those titles that are in the main namespace and are non-redirects.
  • Filter out requests to version-history accessed via a feed reader (“uri_query” contains “feed=rss”)
  • Filter out requests iterating through the version-history (“uri_query” contains “offset”)
  • Filter out repeated requests to the same article version-history by the same reader in the same hour.
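
The URL patterns above could be recognized roughly as follows. This sketch works on full URLs rather than on the separate webrequest fields, the function name is hypothetical, and the alias tuple contains only the two examples from the text. A real analysis would need the aliases of every project.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Localized aliases of "Special:History"; only the two examples from the text.
HISTORY_ALIASES = ("Special:History", "Spezial:Versionsgeschichte")


def history_title(url: str):
    """Return the article title if the URL requests a version-history, else None."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # Skip RSS feed readers and requests that page through the history.
    if query.get("feed") == ["rss"] or "offset" in query:
        return None
    # Desktop form: /w/index.php?title=<page_title>&action=history
    if query.get("action") == ["history"] and "title" in query:
        return query["title"][0].replace(" ", "_")
    # Mobile form: /wiki/<alias>/<page_title>
    path = unquote(parts.path)
    for alias in HISTORY_ALIASES:
        prefix = f"/wiki/{alias}/"
        if path.startswith(prefix) and len(path) > len(prefix):
            return path[len(prefix):].replace(" ", "_")
    return None
```

The returned title would then still have to be checked against the list of main-namespace, non-redirect pages, which is not shown here.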

Features

We extract the following features for each article:

Reliability: We extract whether an article contains any of the 10 templates associated with reliability issues, following the approach from WikiReliability (Unreferenced, One_source, Original_research, More_citations_needed, Unreliable_sources, Disputed, POV, Third-party, Self-contradictory, Hoax). We use the wikitext_current table to extract the templates from the wikitext. This feature is only extracted for enwiki.
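
As an illustration of the template extraction, the following sketch scans a wikitext string for the 10 template names. It is a simplification and not the WikiReliability code: the regular expression, the case-insensitive matching, and the function names are assumptions, and template redirects (alternative names for the same template) are ignored.

```python
import re

RELIABILITY_TEMPLATES = (
    "Unreferenced", "One_source", "Original_research", "More_citations_needed",
    "Unreliable_sources", "Disputed", "POV", "Third-party",
    "Self-contradictory", "Hoax",
)

# Captures the name of a template call: the text after "{{" up to "|" or "}}".
TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def normalize(name: str) -> str:
    """Normalize a template name: underscores for spaces, lower case."""
    return name.strip().replace(" ", "_").lower()


def reliability_features(wikitext: str) -> dict:
    """Return a 0/1 indicator for each reliability template."""
    present = {normalize(name) for name in TEMPLATE_NAME_RE.findall(wikitext)}
    return {t: int(normalize(t) in present) for t in RELIABILITY_TEMPLATES}
```

For a wikitext containing "{{POV}}" and "{{More citations needed|date=January 2022}}", the indicators for POV and More_citations_needed are 1 and all others are 0.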

Results

What is the overall engagement of readers with talk-pages and version-history of articles?

We calculate the click-through-rate (CTR) of talk-pages (tp) and version-history (vh), respectively, by dividing the number of interactions with tp and vh by the total number of pageviews (pv) yielding a number between 0 and 1. We also report the inverse of the CTR (CTR-inv) which gives an integer number stating how many pageviews, on average, lead to one interaction with tp or vh.
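
The two quantities can be written down as a short sketch. The function names are hypothetical, and rounding CTR-inv to the nearest integer is an assumption about how the integer value is obtained.

```python
def ctr(n_interactions: int, n_pageviews: int) -> float:
    """Click-through-rate: interactions divided by pageviews."""
    if n_pageviews <= 0:
        raise ValueError("n_pageviews must be positive")
    return n_interactions / n_pageviews


def ctr_inv(n_interactions: int, n_pageviews: int) -> int:
    """Inverse CTR: number of pageviews per interaction, rounded to an integer."""
    if n_interactions <= 0:
        raise ValueError("n_interactions must be positive")
    return round(n_pageviews / n_interactions)
```

With made-up counts of 1,000 interactions in 896,000 pageviews, ctr(1000, 896000) is about 0.0011 and ctr_inv(1000, 896000) is 896, i.e. one interaction per 896 pageviews.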

We report the following stratifications:

  • Device-type: desktop, mobile web
  • Readers: all (all readers incl. editors) and non-editors (only readers that did not edit). In both cases we divide by the total number of pageviews from all readers to calculate the CTR. This slightly underestimates the CTR of non-editors, but the error is expected to be small since the number of pageviews from editors is very small compared to the number of pageviews from non-editors.
  • Projects: we report the CTR for the five largest Wikipedias in terms of the total number of pageviews in that month: enwiki, jawiki, ruwiki, dewiki, eswiki.

CTR-inv for version-history (1:n pageviews lead to an interaction with the article's version-history)

          desktop                     mobile web
wiki      all-readers   non-editors   all-readers   non-editors
enwiki    896           1,829         62,874        92,052
jawiki    754           1,884         33,959        46,803
ruwiki    982           2,274         92,022        140,489
dewiki    661           2,967         62,550        90,369
eswiki    977           3,287         50,021        97,807

CTR-inv for talk-pages (1:n pageviews lead to an interaction with the article's talk-page)

          desktop                     mobile web
wiki      all-readers   non-editors   all-readers   non-editors
enwiki    1,610         2,423         1,060         1,084
jawiki    2,607         48,755        74,583        109,003
ruwiki    1,934         2,930         76,558        105,606
dewiki    957           33,451        46,247        56,333
eswiki    1,340         6,613         102,662       192,003

Observations:

  • Engagement with version-history on desktop is roughly 1:500-1000 pageviews. For comparison, this is slightly lower than the engagement with citations, which in enwiki was observed at 1:300 pageviews overall and 1:200 pageviews for desktop (Piccardi et al., https://arxiv.org/abs/2001.08614).
  • Engagement with version-history on desktop is not only driven by editors: when considering only readers who are not editing, engagement drops only moderately, to roughly 1:1800-3300 pageviews (see the table above).
  • Engagement with version-history on mobile is almost non-existent at 1:50k-100k pageviews. This is about 100 times lower than on desktop. It is noteworthy that, while the button for the version-history on desktop is at the top, on mobile it is at the top only for logged-in users; otherwise it is at the bottom. One could hypothesize that a more prominent placement of the version-history button on mobile would lead to engagement closer to that observed on desktop.
  • Engagement with talk-pages in enwiki is at 1:1000-2000 pageviews across devices and readers. This is a similar rate as seen with the version-history on desktop. Interestingly, the engagement on mobile (1:1000) is slightly higher than on desktop (~1:2000). It is noteworthy that the button for the talk-page is at the top of the article on both desktop and mobile in enwiki. One could hypothesize that this is the reason we do not see a large difference in the engagement with talk-pages across device-types.
  • The engagement with talk-pages on mobile in other wikis is almost non-existent at 1:50k-100k. This is roughly 100 times lower than for enwiki. It is noteworthy that in these wikis the button for the talk-page on mobile is only visible for logged-in users; otherwise it is not visible at all. Comparing with the results in enwiki, it seems likely that a talk-page button at the top of the article on mobile would lead to a much higher engagement with talk-pages.
  • The engagement with talk-pages on desktop in other wikis varies strongly. When looking at all readers it is similar to enwiki at a rate of 1:1000-2600 pageviews. However, when only considering readers who did not edit, the engagement drops slightly to 1:3000-6000 for ruwiki and eswiki and substantially to 1:30k-50k pageviews for dewiki and jawiki. Note that we did not take into account how many pages have existing talk-pages: while in enwiki most pages have a talk-page, for the other wikis this share might be much lower. This needs to be checked.
  • Summary of the placement of buttons for version-history and talk-pages
    • Version-history: For desktop always at the top. For mobile at the bottom (logged-out) or at the top and bottom (logged-in).
    • Talk-pages: For desktop always at the top. For mobile in enwiki always at the top, in other wikis (ja, ru, de, es) at the top (logged-in) or missing (logged-out).

Takeaways:

  • Readers do engage with talk-pages and version-history of Wikipedia articles
    • in some cases (enwiki-desktop) the rate of engagement is on the same order of magnitude as engagement with citations
    • This engagement does not only come from editors; in many cases, readers who did not edit engage at a similar rate
  • The placement of the button for the version-history and the talk-page at the top of the article could increase engagement by a factor of 100
    • For version-history we see a 100-fold decrease in engagement on mobile, where the button is at the bottom, in comparison to desktop, where the button is at the top of the article
    • For talk-pages in enwiki we see similarly high levels of engagement for desktop and mobile, where the button is at the top of the article in both cases.
    • For talk-pages on mobile in other wikis, we see a 100-fold decrease in engagement where the button is not visible at the top of the article (it is only shown when the user is logged in; otherwise it is not visible at all)

What (types of) articles lead to high engagement with their version-history and talk-pages?

We calculate each article's CTR to version-history and talk-pages, respectively.

We will only focus on English Wikipedia.

The Wikipedia articles with the highest CTR

10 articles with the highest CTR to version-history (English Wikipedia)

page_id    page_title                          n_pv
128927     New_Knoxville,_Ohio                 1203
46665635   John_S._Middleton                   1487
12579734   Joyce_Aboussie                      1592
13280923   Bob's_Discount_Furniture            1851
66446493   James_R._Downing                    1753
41536639   Gavin_Patterson                     1131
23677893   EmblemHealth                        1219
55755834   Nicole_Berger_(American_actress)    1278
43350468   Larak_(Sumer)                       1103
39291682   Fidelis_Care                        2415

10 articles with the highest CTR to talk-pages (English Wikipedia)

page_id    page_title                              n_pv
28426635   Khaitan_Public_School,_Ghaziabad        1985
23465231   Laudatio_Iuliae_amitae                  3641
23849734   Mass_killings_under_communist_regimes   77843
39539073   Bhutanese_passport                      1290
69437544   Anti-Palestinianism                     1009
53188334   Karatala_Kamala_Kamala_Dala_Nayana      1248
39303114   Sharon_A._Hill                          1546
5523768    Exposure_compensation                   1574
38079028   SmartOS                                 1174
4209961    Krishnamurti's_Notebook                 1010

Correlation between an article's CTR to talk-pages and version-history

The two CTRs are only weakly correlated (Spearman rank-correlation = 0.17). This suggests that readers interact with talk-pages and version-history for different reasons.
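
For reference, the Spearman rank-correlation is the Pearson correlation of the ranks of the two variables. The following self-contained sketch, with hypothetical function names, shows the computation, with tied values receiving the mean of their ranks. It is not the code used for the analysis, which may well rely on a library routine instead.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank-correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here x and y would be the per-article CTRs to talk-pages and to version-history. A value near 0 means that the ranking of articles by one CTR says little about their ranking by the other.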

Which article features predict low/high CTR?

We train a linear regression model (scikit-learn's LinearRegression) to predict the CTR of talk-pages and version-history as the target variable, respectively. We use the following features:

  • Log of the number of pageviews of the article (same month)
  • Log of the number of edits to the article (same month)
  • Log of the number of edits to the talk-page of the article (same month)
  • Quality-score (0=low quality, 1=high quality)
  • Topic: each of the 64 topics as a separate feature (one hot encoding: 1 if the topic-prediction is larger than 0.4, 0 otherwise)
  • Reliability: each of the 10 reliability templates as a separate feature (one hot encoding: 1 if the template is present, 0 otherwise)

We use 10-fold cross validation and report averages and standard deviation of the regression coefficients (we only report the 10 coefficients with highest and smallest values, respectively).
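
A minimal sketch of such a procedure with scikit-learn, using synthetic data in place of the article features. How the folds were drawn and how the coefficients were collected in the actual analysis is not stated here, so the shuffling, the seed, and the helper name cv_coefficients are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def cv_coefficients(X, y, n_splits=10, seed=0):
    """Fit one linear model per training fold; return mean and std of the coefficients."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    coefs = []
    for train_idx, _ in folds.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        coefs.append(model.coef_)
    coefs = np.asarray(coefs)
    return coefs.mean(axis=0), coefs.std(axis=0)


# Synthetic stand-in data: 200 "articles", 3 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.0]) + rng.normal(scale=0.01, size=200)

mean_coef, std_coef = cv_coefficients(X, y)

# Feature indices sorted from the smallest to the largest mean coefficient.
order = np.argsort(mean_coef)
```

The first and last entries of order then give the features with the smallest and largest coefficients, which is how a top-10 and bottom-10 list could be read off for the real feature set.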

Observations:

  • Talk-pages
    • High quality predicts a low CTR
    • 3 reliability templates (Hoax, POV, Disputed) appear in the top-5 features predicting high CTR. While we would not overinterpret the high value of the regression coefficient for Hoax, it is still meaningful to observe that articles with these templates show higher CTRs. The templates do contain a link to the corresponding article’s talk-page; nevertheless it is interesting to see that readers follow these links.
    • Topics with low CTR: culture,
  • Version-history
    • High quality predicts a higher CTR
    • More edits to the article and its talk-page (namespaces 0 and 1) predict a higher CTR
    • The reliability template Third-party predicts a higher CTR (this template flags an article as potentially biased because every source named has a very close connection to the subject)

Takeaways:

  • Reliability issues are associated with a high CTR, but the types of issues differ between talk-pages and version-history.
  • High quality is associated with a lower CTR to talk-pages but a higher CTR to version-history.

Code

The code for the analysis is in this repository: https://gitlab.wikimedia.org/repos/research/curiosity/-/tree/critical-readers

References

  1. Wineburg, S., & McGrew, S. (2017). Lateral Reading: Reading Less and Learning More When Evaluating Digital Information. https://doi.org/10.2139/ssrn.3048994