
Research:Initial Attribution Signal Measurement

Duration: November 2025 – December 2025
This page documents a completed research project.


Summary


To begin building an empirical basis for the Wikimedia Foundation / Attribution Framework, we compared 6 individual attribution signals across two contexts (SERP and LLM chatbot) in a 1,186-participant, survey-based experimental study. Although we were primarily interested in the effect that any individual signal might have on “trust” (i.e., perceived accuracy both of the overall information presented and of the information from Wikipedia), we also tried to establish the extent to which the signals’ effects depend on engagement with the text and on participants’ pre-existing beliefs and behaviors.

The tested signals appear to improve trust in the LLM context (although they perform similarly to one another), while their effects in SERP are harder to identify. Signals in LLM are particularly affected by the deeper reading and textual engagement that occurs there: LLM signals benefit somewhat from a high level of prominence, and especially from high levels of reader engagement. Attribution Count in particular was shown to boost trust when participants were paying closer attention to the information and sources they were shown. Last Update, on the other hand, can benefit trust when users are deeply engaged, but can actually lower trust when users aren’t paying as much attention. Trust in SERP, by contrast, appears to derive more from characteristics internal to the user than from differences between the signals themselves. Trending was shown to “work” in SERP regardless of participants’ level of engagement, and was also rated as highly important by the participants who saw it.

High-level observations


  • In LLM, all signals improved trust over no signal (Control).
  • Attribution Count in particular has several benefits for trust in LLM and doesn’t suffer when engagement is low.
  • In SERP, the signals act in ways that were more difficult to observe in this study, and no single signal was shown to be more effective than Control (no signal).
  • Signals generally benefit from reader engagement (i.e., interacting with the information and sources presented in the input). This engagement appears to be more beneficial in the LLM context.
  • References seems to be relatively reliable across both LLM and SERP, even though only about half (~50%) of the participants who saw it correctly interpreted its meaning.
  • Whether people understand the text presented in a given signal may matter less than we expected: signal comprehension rates were highly variable, ranging from over 90% for Views to less than 50% for Attribution Count, but the signal effects measured in this study were largely independent of comprehension.
  • Whether or not participants noticed the signals proved difficult to measure.
  • Future signal users will likely have pre-formed attitudes about which signals, or which signal-conveyed information, are appropriate for a given context or topic; Views emerged as almost universally understood, but under-valued by SERP participants compared to other signals.
  • Other factors that we can’t control matter quite a bit, including participants’ educational levels, English reading proficiency, pre-existing relationships with Wikipedia and search or AI platforms, and their general constellations of attitudes about the platforms and topics tested.
Attribution signals function differently in SERP vs. LLM

LLM trust as measured in this study functions in part through active engagement with the text, the information it contains, and the sources it references. Signals also tend to be more effective when participants remember noticing them.

  • All signals tend to outperform Control (no signal) in terms of trust in the information’s accuracy and trust in Wikipedia’s accuracy.
  • The “best” signal depends on the level of engagement given to the text by readers.
  • In LLM, participants appear to be paying active attention to the text they’re being shown, and in some cases different signals condition their reactions to this new text.
  • Trust in LLM is driven in part by comfort with AI platforms, education (with more highly educated participants having lower trust in AI in general), and gender (with male participants likewise showing lower trust).
  • Signal effects in LLM were independent of participants’ perceived value of the signals: trust operates along several interacting dimensions in LLM, but pre-formed opinions about the value of individual signals don’t appear to be one of them, at least in this study.

SERP trust is largely independent of signal type, signal presence or absence, or even understanding what the signals mean (at least in this study, and with these participants).

  • Individual signals don’t always outperform Control (no signal).
  • The “best” signal for SERP may depend on signal-external factors, such as the user’s beliefs, the topic, the interaction of SERP context and topic, etc.
  • In SERP, participants appear to read differently than participants did in LLM, and individual signals were shown to have little effect on their accuracy perceptions. Signals may operate at a more unconscious level.
  • Trust in SERP is driven in part by information behaviors, such as how often participants access Wikipedia, and by their English reading proficiency.
  • SERP signals may “work” better when they are subtle and don’t rely on differing engagement levels to have an effect. Trending and References were relatively stable across engagement levels.
  • SERP users may have pre-formed opinions about which signals are appropriate for this context—participants who valued References and Trending tended to have higher trust in their SERP output.


Methods


Study structure

Recruitment and participants

Participants were recruited from Prolific, a global participant panel provider, and invited to complete a paid survey hosted on the survey platform Qualtrics. Data collection took place in two waves: a 165-participant extended pilot and a 1,048-participant main data collection wave. Of these 1,213 participants, 24 were eventually removed from analysis because they failed attention checks.

Upon qualifying for the study by submitting a response to a screener survey on Prolific (which selected for at-least-monthly users of AI/LLM chatbot tools and at-least-weekly users of search engines), participants were routed to Qualtrics, where they completed the survey on their mobile devices.

Study tasks

Participants arriving from Prolific were shown the introductory text to a survey titled How do you evaluate online information? After recording their participant ID number, they were asked to complete three sections consisting of a pre-test, feature exposures, and a post-test.

Pre-test

Upon entering the survey, participants responded to several questions about their usage of and attitudes toward the various platforms implicated by the study, including: 2 questions about their comfort and confidence with AI tools; 3 questions gauging how frequently they use search engines, AI tools, and Wikipedia; and 9 questions measuring various dimensions of “trust” in those platforms.
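
For reference, the battery’s structure can be summarized as a simple tally, as in the sketch below; only the item counts and topics come from the description above, since question wording and scales aren’t reproduced on this page.

```python
# Illustrative summary of the pre-test battery described above. Only the
# item counts and dimensions come from this page; the question wording
# and response scales are not reproduced here.
PRE_TEST_BATTERY = {
    "ai_comfort_confidence": 2,  # comfort and confidence with AI tools
    "usage_frequency": 3,        # search engines, AI tools, Wikipedia
    "platform_trust": 9,         # dimensions of "trust" in those platforms
}

assert sum(PRE_TEST_BATTERY.values()) == 14  # total pre-test items
```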

Training module

After group assignment, participants were shown instructions and required to pass a short training module that onboarded them to the study experience. Participants were shown either a mockup of an LLM chatbot window in which the chatbot had responded to the question how long did it take to build the great wall of china?, or a mocked-up search results page showing the first three results for the query how long to build the great wall…. Directly underneath the images presenting these scenarios, participants were asked to answer two questions: one in which they had to identify a fact they had seen (information recall), and a second in which they had to identify the three sources explicitly named in their materials (source identification). Participants who didn’t select the correct responses were prompted by the system to Please select the correct answer that matches what’s shown in the [context].
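
A minimal sketch of this gating logic, assuming a generic ask callback standing in for the survey platform’s question widget, is shown below; only the two-question structure and the re-prompt text come from the description above.

```python
# Hypothetical sketch of the training-module gate: both items must be
# answered correctly before the participant may proceed. `ask` stands in
# for the survey platform's question widget and returns True only when
# the participant selects the correct response.
from typing import Callable

def run_training_gate(ask: Callable[[str], bool], context: str) -> None:
    items = [
        "information recall",     # identify a fact seen in the materials
        "source identification",  # identify the three named sources
    ]
    for item in items:
        while not ask(item):
            # Re-prompt text taken from the study description.
            print(f"Please select the correct answer that matches "
                  f"what's shown in the {context}.")
```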


Instructions


Next task: Evaluate AI chatbot responses

In this section you'll review 3 responses from AI assistants like ChatGPT, Gemini, or Grok.

For each one:

  • Read the scenario and response
  • Answer brief questions about the information

Please evaluate each response based on the content you see.


Signal exposure

Following successful completion of the training module, participants were shown images of 3 texts containing a signal (or no signal, in the case of Control). The texts were identical for all groups except for the signal shown. The texts presented queries that participants were asked to imagine having posed in a SERP or LLM chatbot interaction, and covered topics related to Movies, Medical, and Current Events.
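
For concreteness, the between-subjects design described here (2 contexts × 6 signals + Control, with topic as a within-subjects factor) can be sketched as below. The modular assignment scheme is an assumption; the page doesn’t describe the actual randomization procedure.

```python
import itertools
import random

# Between-subjects factors: context and signal (Control = no signal).
# Topic is within-subjects: every participant saw all three topics.
CONTEXTS = ["SERP", "LLM"]
SIGNALS = ["Control", "References", "Attribution Count", "Contributors",
           "Last Update", "Views", "Trending"]
TOPICS = ["Movies", "Medical", "Current Events"]

CELLS = list(itertools.product(CONTEXTS, SIGNALS))  # 2 x 7 = 14 groups

def assign(participant_index: int) -> dict:
    """Assign a participant to a design cell and randomize topic order."""
    context, signal = CELLS[participant_index % len(CELLS)]
    return {
        "context": context,
        "signal": signal,
        # The three scenarios were presented in random order.
        "topic_order": random.sample(TOPICS, k=len(TOPICS)),
    }
```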

Signal-specific post-exposure module

After all 3 scenarios were presented in random order, participants progressed to a signal-specific module asking them if they had noticed their signal, what they thought it meant, and how important they believed such a signal to be in the context of SERP or LLM chatbots:

Participants in the Control (no signal) group were shown the References post-exposure module; i.e., they were asked if they had noticed the References signal, what they thought it meant, and how important the information conveyed by References was for them to see in either SERP or LLM.

Post-test

In the post-test module, participants were asked to identify their level of educational attainment, self-assess their English reading proficiency, and respond to a 6-item battery of True/False questions designed to measure their level of latent “Wikipedia knowledge.”
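
If the battery were scored as a simple sum of correct answers (an assumption; this page doesn’t specify the scoring rule), the computation might look like the sketch below, with a placeholder answer key.

```python
# Hypothetical scoring of the 6-item True/False "Wikipedia knowledge"
# battery as a 0-6 sum of correct answers. The answer key is a placeholder;
# the page does not reproduce the items or the scoring rule.
ANSWER_KEY = {f"wk_{i}": True for i in range(1, 7)}

def wikipedia_knowledge(responses: dict[str, bool]) -> int:
    return sum(int(responses.get(item) == correct)
               for item, correct in ANSWER_KEY.items())
```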


Findings


Noticing the signals

After the training module (which contained no signal) and three experimental exposures in which participants were shown a signal and then re-shown the same signal while answering follow-up questions (i.e., 6 individual exposures per participant), noticing was measured with a direct post-exposure question: each group’s signal was visually reproduced under the title Did you notice this detail?, directly above the question Did you notice a statement like [signal text] on any information you saw from Wikipedia earlier in this study?. In general, the measured noticing rates showed high variability. The most-correctly noticed signal in both contexts was Contributors (51% in LLM and 59% in SERP), and the least-correctly noticed was Last Update in LLM (26%) and Views in SERP (30%). The Control group in LLM reported a particularly high rate of “false” noticing (34%), while the SERP Control group reported a less extreme rate of 14%.
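
The sketch below shows one way these per-group noticing rates could be tabulated from the post-exposure responses; the flat (context, group, noticed) record format is an assumption, not the study’s actual data layout.

```python
from collections import defaultdict

# Hypothetical record format: (context, signal_group, noticed).
# For Control groups, noticed=True is a "false" notice, since no
# signal was actually shown to them.
def noticing_rates(records):
    counts = defaultdict(lambda: [0, 0])  # group -> [noticed, total]
    for context, group, noticed in records:
        tally = counts[(context, group)]
        tally[0] += int(noticed)
        tally[1] += 1
    return {group: yes / total for group, (yes, total) in counts.items()}

# Toy usage with made-up records:
rates = noticing_rates([
    ("LLM", "Contributors", True),
    ("LLM", "Contributors", False),
    ("LLM", "Control", True),  # a "false" notice
])
```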

Several outcomes of this study indicate that the noticing measure may have functioned inconsistently:

  • Post-exposure noticing rates were highly variable between individual signals, ranging from 26%–51% in LLM and 30%–59% in SERP;
  • Accurate noticing rates tended to be quite low overall, with no group remembering having seen their signal at a rate higher than ~60%;
  • The Control groups reported unexpectedly high rates of noticing the References signal (which they hadn’t seen); the LLM Control group reported a “false” noticing rate of 34%;
  • The tested signals nonetheless appear to be “doing” something, at least in LLM: all signal groups reported higher perceptions of information accuracy than the Control group in LLM.

It is possible that LLM participants were implicitly conditioned via past experience to read their materials differently or more deeply than SERP participants. It is also possible that the LLM materials themselves contained multiple meta-information signals in addition to the tested attribution signals. In such a scenario, LLM chatbot users who notice any meta-informational cues (such as individual sources being indicated; any numbers presented in a subtly different font or color than the main text; etc.) may perceive that the information presented is more robust or of higher quality, regardless of whether or not they correctly noticed or understood the text of the signals studied here.

Comprehending the signals

Rates of signal comprehension showed even more variability than noticing, with Views emerging as the most commonly correctly understood signal text (91% in LLM and 95% in SERP), and Attribution Count as the least commonly correctly understood (34% in LLM and 45% in SERP). The Control group, who were shown the References post-exposure module, correctly understood References at a rate of 51% in LLM and 66% in SERP, vs the 50% and 63% recorded by the group that saw References during exposure. That is, at least for the References text (from 50+ sources), comprehension doesn’t appear to depend on previous exposure to the signal within this study.
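
As a sanity check on that comparison, the sketch below runs a two-proportion z-test on the Control-vs-exposed comprehension rates reported above. Per-group sample sizes are not reported on this page, so the n = 85 used here is an assumption for illustration only.

```python
# Two-proportion z-test comparing References comprehension for the Control
# group vs the group exposed to References. Rates come from the text above;
# the per-group sample size (n = 85) is assumed, not reported.
from statsmodels.stats.proportion import proportions_ztest

n = 85  # assumed participants per group
for context, p_control, p_exposed in [("LLM", 0.51, 0.50),
                                      ("SERP", 0.66, 0.63)]:
    successes = [round(p_control * n), round(p_exposed * n)]
    z, p = proportions_ztest(successes, [n, n])
    print(f"{context}: z = {z:.2f}, p = {p:.3f}")  # small gaps: likely n.s.
```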

Rating the importance of seeing [signal] in [context]

Finally, participants were asked in the post-exposure module to rate How important is it for you to see [this signal] in [this context]? These importance ratings were similarly variable, with References receiving the highest average importance rating from both LLM and SERP respondents (3.5/5). In contrast, Views received the lowest importance ratings among both LLM respondents (2.7/5) and SERP respondents (2.3/5). SERP participants were particularly skeptical about the value of Views when re-shown their signal in the post-exposure module: Views’ importance ratings were significantly lower than those of Attribution Count, the next-lowest-rated signal in SERP.

Effect of genre or topic

The interaction of signal and topic wasn’t a direct target of this research: three different but realistic topics were chosen for the source materials with the intention of minimizing the chance that any single topic would significantly affect signal performance. Participant ratings of perceived overall accuracy and Wikipedia accuracy, and their recorded rates of information and source recall, were averaged across the 3 exposures to 3 topics rather than treated individually. However, the topics themselves were associated with different rates for these variables in LLM vs in SERP.

Medical emerged as the “strongest” topic in LLM, with LLM participants generally finding the overall information presented in the Medical materials more accurate and showing increased rates of information recall. Rates of perceived Wikipedia accuracy and source ID were generally unaffected by topic, although when an individual group showed significant divergences for these variables, the Medical topic tended to score higher than the other two. In SERP, by contrast, the Current Events topic was associated with significantly lower scores than the other topics on most comparisons. Relatively fewer topic-related differences were found in SERP than in LLM, however.

To measure the effect that any individual signal may have had on “trust” (i.e., perceived accuracy of the overall and Wikipedia information presented in the materials), responses were aggregated across all three topic exposures: accuracy ratings were averaged, while information recall and source identification were scored as counts, giving each participant a total possible score of 0–3 points for each.
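
A sketch of this per-participant aggregation, with illustrative field names (none of which come from the study’s actual data), might look like the following.

```python
# Per-participant aggregation as described above: accuracy ratings are
# averaged over the three topic exposures; recall and source ID are
# counted, yielding 0-3 points each. Field names are illustrative.
def aggregate(exposures: list[dict]) -> dict:
    assert len(exposures) == 3  # one exposure per topic
    return {
        "overall_accuracy": sum(e["overall_accuracy"] for e in exposures) / 3,
        "wikipedia_accuracy": sum(e["wikipedia_accuracy"] for e in exposures) / 3,
        "info_recall": sum(e["recalled_fact"] for e in exposures),      # 0-3
        "source_id": sum(e["identified_sources"] for e in exposures),  # 0-3
    }
```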