Product and Technology Advisory Council/August 2025 draft PTAC proposals for feedback/Experimentation

This document is an archive of the work done by the Experimentation working group in PTAC to come up with its recommendations.

Purpose

How the WMF can build understanding and support for experimentation and product iteration.

Working Group Team Members

Problem statement

Clearly communicating with many audiences about experiments, with the goal of getting the most useful feedback out of an experiment (i.e., avoiding derailment). How do we collect feedback through experimentation in a way that involves a representative part of the communities and has clear success and failure criteria?

Diagnostic

Experiments receive a lot of pushback, which might be due to:

  • Lack of trust, due to past mistakes
  • Lack of predictability (when does it end, and what are the criteria for moving on to the next experiment versus the failure criteria for cancellation?)
  • Institutional momentum: timelines slip, sunk costs accumulate, etc.
  • Is this the right thing to build/test?
  • Only dealing with one side of the experiment and not the other (classic example: focusing on newcomers while paying no attention to the extra work created for editors)
  • Not presenting a full solution, because it is still an experiment

Communities are sometimes not aligned in their feedback:

  • Different priorities of different communities
  • Different levels of experience of groups
  • Differences between wikis in how disruptive an experiment can be

Communities are often surprised and have many questions about the process:

  • Information fragmentation and overload
    • "I was not told"
    • Too many places for staff and community to keep track of
  • The experiment's documentation page can be overwhelming if you have to read all of it. Often, the page is very domain- and process-specific, not always user-focused.
  • The community tries to avoid disruption: it interferes with people's ability to work and it changes the carefully built status quo.

Experiments are often big and require lots of work before they can be shared with the community, opening us up to the sunk cost fallacy, yet still skipping steps that are important for a full solution. Ours is a discussion-focused community: everything becomes a discussion, which can make a mess of trying to summarize the answers to the questions asked by the experiment (lack of focus).

Hypotheses

If there is a central overview of all ongoing and upcoming experiments, people will be less surprised.

  • Build a calendar keeping track of the major projects and their phases (though timing differs for every wiki, which complicates a single calendar). A minimal sketch of what one entry in such an overview could look like follows this list.
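As a rough illustration of the kind of entry such a central overview might hold, here is a minimal sketch in Python. The schema is hypothetical: the ExperimentEntry class, its field names, and the example values are all assumptions for illustration, not an existing WMF data model.

```python
# A minimal sketch of one entry in a hypothetical central experiment registry.
# All field names and example values are illustrative, not a WMF schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExperimentEntry:
    name: str                      # e.g. "Reference Check A/B test"
    team: str                      # owning WMF team
    phase: str                     # current phase label (see the phases hypothesis below)
    start: date
    planned_end: date              # a committed end date, not "until done"
    wikis: list[str] = field(default_factory=list)  # affected wikis
    success_criteria: str = ""     # what counts as success
    failure_criteria: str = ""     # what triggers cancellation
    overview_url: str = ""         # one canonical page to point people to

# Illustrative entry, loosely based on the Reference Check experiment
# described in the appendix (dates, wikis, and criteria are placeholders).
example = ExperimentEntry(
    name="Reference Check A/B test",
    team="Editing",
    phase="testing",
    start=date(2024, 2, 1),
    planned_end=date(2024, 4, 30),
    wikis=["eswiki", "frwiki"],    # illustrative subset of the partner wikis
    success_criteria="Revert rate of new-content edits decreases",
    failure_criteria="Revert rate increases or edit volume drops",
    overview_url="https://example.org/experiments/reference-check",
)
```

A calendar view would then simply be these entries sorted by start and planned end dates, optionally filtered per wiki.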

Better define what type of experiment a community is dealing with.

  • Defining labeled phases for experiments: exploration, testing, validation, iterative improvement, beta, graduation, termination/concluded (a sketch of these phases as a small state machine follows this list).
  • Set end dates for experiments and keep to those dates.
  • What feedback are we seeking from which user? [not sure if this works as a hypothesis to test]
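To make the labeled phases concrete, here is a minimal sketch of them as a small state machine in Python. The transition map is an assumption for illustration (which jumps are legal is exactly what this hypothesis would need to define), not WMF policy.

```python
# A sketch of the proposed experiment phases with allowed transitions,
# so a registry could reject undefined jumps. The transition map is an
# illustrative assumption, not an agreed WMF process.
from enum import Enum

class Phase(Enum):
    EXPLORATION = "exploration"
    TESTING = "testing"
    VALIDATION = "validation"
    ITERATIVE_IMPROVEMENT = "iterative improvement"
    BETA = "beta"
    GRADUATION = "graduation"
    CONCLUDED = "termination/concluded"

# From any active phase an experiment can also be terminated
# (e.g. when the failure criteria are met).
ALLOWED = {
    Phase.EXPLORATION: {Phase.TESTING, Phase.CONCLUDED},
    Phase.TESTING: {Phase.VALIDATION, Phase.CONCLUDED},
    Phase.VALIDATION: {Phase.ITERATIVE_IMPROVEMENT, Phase.CONCLUDED},
    Phase.ITERATIVE_IMPROVEMENT: {Phase.BETA, Phase.CONCLUDED},
    Phase.BETA: {Phase.GRADUATION, Phase.CONCLUDED},
    Phase.GRADUATION: {Phase.CONCLUDED},
    Phase.CONCLUDED: set(),
}

def advance(current: Phase, nxt: Phase) -> Phase:
    """Move an experiment to its next phase, refusing undefined jumps."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value!r} to {nxt.value!r}")
    return nxt

# Example: a normal progression, then an illegal jump that would raise.
phase = advance(Phase.EXPLORATION, Phase.TESTING)   # ok
# advance(Phase.TESTING, Phase.BETA)                # raises ValueError
```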

Smaller experiments are easier to validate, shorten the feedback loop, and avoid sunk costs.

  • Smaller is easier to test and faster to course correct on.
  • More JS gadgets, Toolforge tools, and user scripts instead of full on-wiki development? (A minimal Toolforge-style sketch follows this list.)
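As one concrete reading of the Toolforge-tool option, here is a minimal sketch of a throwaway prototype webservice in Python. The tool itself is hypothetical; it only calls the public Wikimedia REST API page-summary endpoint, so an experimental UI could be tried without touching production wikis.

```python
# A hypothetical lightweight prototype in the Toolforge-tool style:
# a tiny Flask webservice that fetches a page summary from the public
# Wikimedia REST API. Flask and requests are assumed to be installed.
import requests
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/summary/<title>")
def summary(title: str):
    # Public REST endpoint; no authentication needed for page summaries.
    r = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}",
        headers={"User-Agent": "experiment-prototype/0.1 (demo)"},
        timeout=10,
    )
    r.raise_for_status()
    data = r.json()
    return jsonify({"title": data.get("title"), "extract": data.get("extract")})

if __name__ == "__main__":
    app.run(port=8000)   # e.g. curl localhost:8000/summary/Wikipedia
```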

Other hypotheses:

  • Make a YouTube video introduction for every experiment, as it will be easier to digest than all the written content (though there are many languages to communicate the content in; the monthly meetings conducted by some teams were a form of this to some degree).
  • A centralized place to gather the output of experiment data (A/B testing graphs).
  • Having a better-prepared feedback section results in clearer answers (would be interesting to validate).
  • Getting people involved earlier (but we already tried many variations of this, I think).

Things we wonder about

  • Should the community have a say in the failure criteria, or in whether the experiment takes place in the first place? (Counterpoint: the community often might not have the domain knowledge to evaluate and understand the context behind the experiment, or might have inertia against certain features, for example Vector 2022.)
  • Is what the community perceives as feedback the same as what the product teams perceive as feedback? If not, how can we close that gap?
  • Giving communities data upfront might help, but often it's exactly the data we are still in search of. This is something we have tried, and it seems like it often simply causes more distraction.
  • Should the Wikimedia Foundation skew its development towards making smaller, iterative prototypes? (counterpoint: this is often not possible in the context of some features)

Experiment

  • Building a central overview of all ongoing and upcoming experiments
  • Clearly defining and communicating labeled phases for experiments: Exploration, testing, validation, iterative improvement, beta, graduation, termination/concluded.
  • Validate whether having a better-prepared feedback section results in clearer answers.

Appendix: Previous experiments

For the previous experiments, this table is meant to give a sense of the sorts of experiments we have run and that we want to run more of, more quickly and easily. It's a sample of about 10 experiments from the last couple of years, including some for readers, some for editors, some successes, and some failures. (A sketch of the kind of significance test behind the A/B results follows the table.)

Analysis of 10 experiments
Each entry below lists the run date, experiment name, description / purpose, experiment type, population, outcome, and notes (from the WMF).

Run date: May 2025
Experiment name: Tone Check model evaluation
Description / purpose: Checks, using Annotool, whether experienced volunteers agree with what the model identifies as promotional, derogatory, or otherwise subjective language; happens prior to deployment of an A/B test of the full feature.
Experiment type: Human model evaluation
Population: ~35 participants across 5 wikis, with a minimum of 30 diffs per language
Outcome: Volunteers across the 5 languages agreed with the model's detection of a tone issue in 95% of cases.
Notes (from the WMF): Difficult to recruit volunteers.

Run date: Nov 2024 - Mar 2025
Experiment name: Add a Link on English Wikipedia
Description / purpose: Deployment of the Add a Link structured task to English Wikipedia.
Experiment type: A/B test
Population: 508 participants in the mobile “treatment” group and 5,155 in the “control” group
Outcome: We ramped to 20% with the following results on mobile: a 33.7% increase in the constructive activation rate, a 3.7% increase in the constructive retention rate, and a 19.6% decrease in revert rate (analytics report / project description). Decision: discuss a 100% release with the English Wikipedia community, after also completing the improvements English Wikipedia community members requested (T390613).
Notes (from the WMF): WMF worked very closely with experienced editors to implement feedback and mitigate against a potential negative impact on backlogs. On the flip side, this was a very time- and resource-intensive experiment to retest a feature that had already been tested on other Wikipedias in the past.

Run date: Dec 2024
Experiment name: Temporary alpha release of Add a Link in the reader experience
Description / purpose: Testing in-article suggestions for logged-in editors with 0 edits, via an alpha release (with a planned unrelease).
Experiment type: Temporary A/B test
Population: 100% of newly created accounts at the pilot wikis (Spanish Wikipedia, French Wikipedia, and Egyptian Arabic Wikipedia). However, most new account holders never visited a page where a suggestion was visible; the alpha test resulted in fewer than 400 impressions per day.
Outcome: Learned that our task pool was too small, so we made changes to generate more link suggestions and get suggestions in front of more newcomers (project description). Decision: as planned, we rolled back after gathering initial data, then made adjustments to the feature and the link-suggestion data pipeline prior to running a full experiment.
Notes (from the WMF): This allowed us to move quickly: build, release, and immediately gather critical insights. But the results were not statistically significant and can only serve as early indicators of adoption or risks.

Run date: Nov 2024
Experiment name: Alt-text Suggested Edit on iOS
Description / purpose: Prompting users to add alt text to images after they have made a related edit.
Experiment type: A/B test
Population: 60-day experiment within the iOS app on the Spanish, Portuguese, Chinese, and French Wikipedias. “Add an image” editors were split into two experiment groups: Group B received a prompt to add alternative text after an image-recommendation edit, and Group C received a prompt to add alt text after an article edit.
Outcome: Decision: ramped down, because the treatment was not effective enough at causing new account holders to activate at higher rates, and the quality of the alternative text written by newcomers was below the threshold we set.
Notes (from the WMF): Limitation: on-wiki policy pages on images and alternative text should be the primary resource for guidance in this space, but many languages do not have a bespoke page for this topic.

Run date: Feb - Apr 2024
Experiment name: Reference Check A/B test
Description / purpose: Prompt users who have added more than 50 new characters to an article-namespace page to include a reference.
Experiment type: A/B test
Population: 50/50 test, showing Reference Check on 8 partner wikis to contributors with 100 or fewer edits and to unregistered users (the check was shown on 8,255 published new-content edits).
Outcome: The new-content edit revert rate decreased by 8.6% when Reference Check was available [report]. Decision: deployed to all Wikipedias except English.
Notes (from the WMF): A controlled experiment with statistically significant data. However, tests happen on specific wikis, and we aren't always able to infer that the results would be the same globally. In this case the English Wikipedia community has been reluctant to adopt, though we've seen a recent push towards deployment.

Run date: Sept 2024
Experiment name: Search Suggestions browser-extension experiment
Description / purpose: Provides suggestions to readers when they open the search bar on desktop.
Experiment type: Fidelity: prototype of the feature. A/B test among users who opted in to download a browser extension.
Population: A QuickSurvey was displayed to a small percentage of readers on the English and Spanish Wikipedias, inviting them to download a browser extension which added the feature to the page.
Outcome: Readers interacted with search suggestions at a higher rate than with the recommendations shown at the bottom of the page. Decision: proceed with the “Search suggestions A/B test” in Mar 2025 (see below).
Notes (from the WMF): The browser-extension setup allowed for an opt-in controlled experiment with statistically significant data outside of a production environment. Downsides: (1) opt-in experiment participants are inherently biased towards the feature (as shown by their willingness to download a browser extension); (2) not an option on mobile.

Run date: Mar 2025
Experiment name: Search suggestions A/B test
Description / purpose: Provides suggestions to readers when they open the search bar on mobile.
Experiment type: Fidelity: production A/B test
Population: The A/B test was run on anonymous mobile web users with the Minerva skin. It was enabled in tiers, starting with all wikis except enwiki on March 4 (UTC) and then deployed on enwiki on March 6 (UTC). The test was disabled on March 24, 2025.
Outcome: (1) An increase in clickthrough rate compared to similar features; (2) a small but statistically significant increase in session length. Decision: scale across wikis.
Notes (from the WMF): (1) An A/B test on logged-out readers allowed for unbiased results; (2) we can gather statistically significant data from production Wikipedia. Downsides: building an A/B test for logged-out users is difficult and requires a lot of engineering time and attention; our new experimentation platform will make this easier in FY25/26.

Run date: Feb 2025
Experiment name: Trivia game in the Android app
Description / purpose: Create a daily-play trivia game for the Android app utilizing the ‘on this day’ API.
Experiment type: Fidelity: prototype of the feature. Short-term release to a pilot Wikipedia.
Population: 20-day before/after experiment with logged-out users of the Wikipedia Android app on German Wikipedia.
Outcome: The game had high engagement: users who engaged with the game had 13.2% higher app retention; 11.2% of users who saw a prompt started a game, and of those users, 71.4% completed a game. Decision: A/B test and scale to 7 additional languages.
Notes (from the WMF): We were able to gauge engagement from a before/after comparison and gathered enough data on game performance to proceed to more formal experimentation (an A/B test) and scaling. Downsides: before/after comparisons carry inherent bias (it's possible that the increase in retention is because the type of users who enjoy the game already had high retention). This experiment setup is good for an early indication of potential success.

Run date: Oct 2024
Experiment name: Recommendation API comparison
Description / purpose: Test which existing or potential article-recommendation API is most preferred by readers.
Experiment type: Fidelity: static mockups. QuickSurvey with static mockups.
Population: The experiment code was loaded in the Wikipedia Recommendations browser extension; users of the browser extension saw the recommendations immediately.
Outcome: The MoreLike and Most Linked APIs performed best; editor-curated suggestions, such as the “See also” section, did not perform as well as the WMF APIs. Decision: explore combining the Most Linked API, then proceed with an A/B test on mobile.
Notes (from the WMF): Surveys are helpful for testing user thoughts/ideas without requiring any code to be written, and can be a useful brainstorming tool. Downsides: survey participants carry bias, even at scale.

Run date: April 2025
Experiment name: Search prominence in the iOS app
Description / purpose: A/B tested adding a more prominent search bar to the most-used view in the app, the article view.
Experiment type: Fidelity: production A/B test
Population: Logged-out users on the iOS app
Outcome: An increase in the number of users initiating search, though the feature does not necessarily result in a net increase in internal browsing of Wikipedia articles. Decision: scale to all languages.
Notes (from the WMF): We can gather statistically significant data from production Wikipedia, and the A/B test gave confidence in making such a large change to the most popular view in the app (the article view). Downsides: building A/B tests requires a lot of engineering time and attention.

Run date: May - Dec 2024
Experiment name: Citation Needed / Add a Fact
Description / purpose: Citation Needed: a Chrome extension for internet users to check statements against Wikipedia. Add a Fact: allows statements to be added to Wikipedia talk pages as suggested edits.
Experiment type: Short-term pilot releases to the general public; Wikipedia editors were nudged to try them via banner ads.
Population: ~1,900 installations of the Citation Needed and Add a Fact extensions
Outcome: The Citation Needed pilot encouraged us to pilot the additional Add a Fact functionality. Survey responses indicated that users liked the idea, but data showed they did not continue using the extensions. Decision: shut down the extensions because of low usage.
Notes (from the WMF): Banner ads sent to Wikipedia editors were successful in getting some people to try out the browser extensions.
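Several of the A/B entries above describe results as statistically significant. As a minimal sketch of the kind of check behind such a claim, the following two-sided two-proportion z-test compares revert rates between a control and a treatment arm. The counts are invented for illustration; the real analyses use the data in the linked reports and may use different tests.

```python
# A sketch of a two-sided two-proportion z-test, a standard check for
# whether two observed rates (e.g. revert rates in an A/B test) differ by
# more than chance. The counts below are made up for illustration.
from math import sqrt, erfc

def two_proportion_z_test(x_a: int, n_a: int, x_b: int, n_b: int):
    """Return (z, two-sided p) for H0: the two proportions are equal."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)        # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))          # two-sided normal tail
    return z, p_value

# Illustrative: reverts among new-content edits, control vs. treatment.
z, p = two_proportion_z_test(x_a=900, n_a=8000,   # control: 11.25% reverted
                             x_b=820, n_b=8000)   # treatment: 10.25% reverted
print(f"z = {z:.2f}, p = {p:.4f}")                # z ≈ 2.04, p ≈ 0.04
```

With these made-up counts the difference clears the conventional p < 0.05 bar; with much smaller samples, such as the alpha test's fewer than 400 impressions per day, the same relative difference would not.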