Research:System Design for Increasing Adoption of AI-Assisted Image Tagging in Wikimedia Commons

Created: 23:00, 23 May 2024 (UTC)
Contact: Yihan Yu
Collaborators: David W. McDonald
Duration: June 2024 – October 2025
Grant ID: G-RS-2402-15230

This page documents a completed research project.


In this research, we aim to investigate designs to increase the adoption and satisfaction of AI-assisted tools within commons-based peer production (CBPP) projects, with a specific focus on Wikimedia Commons. While AI-powered automation tools have long been integrated into CBPP projects for indirect tasks like content moderation, the utilization of AI for direct content generation has surged with recent advancements in generative AI algorithms. However, the impact of AI-assisted tools on human contributors and the design considerations to enhance their interaction, adoption, and satisfaction remain uncertain. This study proposes to co-design an AI-assisted image tagging tool with Wikimedia Commons contributors and users to increase adoption and satisfaction. We will perform a study of the prior WMF attempt to provide a computer-aided tagging (CAT) tool to understand the factors that led to its deactivation. We will then investigate technology designs to improve AI-assisted image tagging for Structured Data on Commons. The successful completion of this project is expected to advance the development of an AI-assisted image tagging tool on Wikimedia Commons, promoting greater adoption, usage, and satisfaction among contributors. Additionally, the insights gained from this study can be generalized to enhance interaction and collaboration between human contributors and AI-powered automation tools in the broader Wikimedia tools ecosystem and other CBPP projects.

Introduction

AI-powered automation tools have long been integrated into commons-based peer production (CBPP) projects for indirect tasks such as content moderation and contribution quality. In recent years, with the rapid advancement of generative AI algorithms, there has been a surge in attempts to utilize AI-powered tools for generating direct content in CBPP, including creating Wikipedia articles and generating image annotations. AI holds the promise of enhancing content creation efficiency and consistency, thereby addressing content gaps in areas that may have received insufficient human contributions. However, the impact of AI-assisted tools on human contributors as well as the technology designs to enhance human contributors' interaction, adoption, and satisfaction with AI-assisted tools in CBPP remain uncertain. In this research, we address this gap by co-designing an AI-assisted image tagging tool with contributors and users of Wikimedia Commons, with the goal of increasing adoption and satisfaction.

Wikimedia Commons is a WMF project that makes multimedia resources available for free copying, usage, and modification. However, a lack of structured, machine-readable metadata about media files has hindered its accessibility, searchability, usability, and multilingual support. Recently, WMF researchers attempted to introduce computer-aided image tagging (CAT). Unfortunately, our prior research revealed low adoption of the CAT tool by Commons contributors. Participants reported unsatisfactory usability and performance of the tool and resistance to changing their existing workflow of creating and maintaining the local category system. The CAT tool was deactivated in September 2023.

In this project, we fill this gap by investigating Wikimedia Commons contributors’ lived experiences with the CAT tool using two complementary data sources: 595 user comments from 11 wiki pages and 16 in-depth interviews. Our analysis revealed seven key issues that contributed to CAT’s mixed reception and eventual deactivation. We provide community-informed suggestions for improving the usage, satisfaction, and adoption of CAT on Commons. We also contribute empirical insights into how Commons contributors, through interactions with AI-generated tags suggested by the CAT tool, situate the social and cultural knowledge they perceive in images within Wikidata’s ontology. Contributors engaged in far more than simply accepting or rejecting AI-generated tags. Their practices spanned multiple dimensions of evaluation and sensemaking: they attended to accuracy and specificity, weighed usefulness, made ethical judgments, situated images within broader cultural narratives and contexts, improved ontologies and knowledge infrastructures, and treated AI outputs as collaborative seeds. Together, these practices form a multi-layered interaction framework. Building on this framework, we offer design implications for AI-assisted image tagging systems. Our research questions are:

  1. What issues contributed to CAT’s mixed reception and eventual deactivation?
  2. How do Commons contributors, through their interactions with AI-generated tags, situate the social and cultural knowledge they perceive in images within Wikidata’s ontology?

Literature Review

Wikimedia Commons

Wikimedia Commons is one of the world’s largest online repositories of freely usable multimedia files, including images, audio, and video. Its content is contributed and curated by a global community of volunteers. According to recent statistics, the platform hosts over 118 million files contributed by approximately 13 million users. Commons supports more than 300 Wikipedia language editions as well as other Wikimedia projects such as Wikivoyage, Wikispecies, and Wikiversity, while also serving the broader internet public.

Despite its scale and importance, Commons remains relatively understudied and underappreciated in academic research [1], particularly in contrast to its sister project, Wikipedia. While both are examples of open knowledge work, prior work [2] [3] highlights key differences between the two. Wikipedia focuses on collaborative text production for an online encyclopedia, whereas Commons centers on the collection and curation of free-to-use multimedia. Although both platforms are built on the MediaWiki software, their functionalities, symbolic roles (reference vs. collection), and governance structures position them as fundamentally distinct socio-technical platforms, in line with Gillespie’s extended definitions of platform [4].

Overall, while Commons plays an important infrastructural and cultural role in both the Wiki universe and the broader open knowledge ecosystem, its unique activities, collaborations, and contributions merit more focused scholarly attention.

Commons Category System

Commons relies primarily on a category system to organize its extensive and fast-growing collection of multimedia files. This system functions as a folksonomy, a decentralized, user-driven classification method in which category tags are collaboratively created and applied without a strict controlled vocabulary. As a feature built into the MediaWiki software, the category system allows users to group files by associating them with subject-specific categories arranged in a hierarchical structure. Ideally, every file should be directly categorized and discoverable through this structure. However, while functional, the system presents significant limitations, many of which have been previously identified in research on collaborative tagging systems in Commons’ sister project, Wikipedia [5].

First, the category system is constrained by the underlying MediaWiki software, which was originally designed for text-based content. As a result, Commons metadata consists mainly of plaintext filenames, descriptions, and user-generated categories—formats that are not consistently machine-readable. This limits the platform’s ability to support effective search, retrieval, reuse, and integration with Wikidata or external databases.

Second, although the Commons community has collaboratively built and maintained this categorization system, it suffers from challenges typical of user-generated image tags [6]: they are often noisy, inconsistent, incomplete, or overly subjective. While such tags can aid organization and sensemaking, they also introduce fuzziness and ambiguity, as contributors choose tags based on personal preferences, tendencies, and beliefs [7] [8]. As a result, many tags are meaningful only to their original contributors, which reduces the overall precision of the collaborative tagging system.

Third, a significant number of Commons files remain uncategorized due to the time and labor required for manual tagging [9].

Finally, despite Commons’ commitment to multilingual inclusivity, English remains dominant in file annotations and category tags [10]. This language imbalance limits accessibility for non-English-speaking users, who face significant challenges when contributing to, searching for, and discovering relevant content. These linguistic barriers exacerbate broader inequities in participation and access—both within Commons and across the wider Wiki ecosystem [11].

Designing AI/ML-Assisted Tools in Wikimedia Projects

Much of the research on human–AI collaboration has focused on commercial, corporate-controlled platforms such as Facebook, Twitter, and TikTok. These systems are typically designed top-down and optimized for business metrics like engagement and retention. In contrast, Wikimedia projects are governed by decentralized communities of volunteers [12]. The Wikimedia Foundation does not make editorial decisions; rather, decisions about content creation, moderation, and tool deployment are made through open participation, consensus-building, and shared responsibility. These socio-technical differences shape how AI/ML tools are designed, adopted, and evaluated within Wikimedia contexts [13].

Prior research on AI/ML in Wikimedia projects has largely focused on tools deployed in English Wikipedia that support indirect tasks [14] [15], such as detecting vandalism, evaluating contribution or article quality, and suggesting tasks. While these tools are valuable for supporting coordination and moderation, they do not directly contribute to content generation. For example:

  1. Quality control tools [16] help patrollers evaluate edit quality, and identify and revert harmful edits.
  2. Recommendation systems, such as SuggestBot [17], suggest articles or tasks in need of attention based on an editor’s interests or activity history.
  3. Meta-algorithmic systems, such as ORES [18] [19] and LiftWing, represent a deeper integration of Wikimedia's participatory values into AI/ML infrastructure. These tools are not only used to make edit or article quality predictions in support of moderation—they also serve as platforms that enable community members to audit, contest, and adapt models to local needs and values to support transparent and collaborative development.

However, these systems have been studied almost exclusively in the context of Wikipedia and with a focus on editorial moderation and coordination, rather than direct content creation. Our research shifts focus to a different Wikimedia project: Wikimedia Commons, a multilingual, multimedia-oriented platform that communities across more than 300 language editions of Wikipedia contribute to and use. While applying tags may be seen as a metadata task or a form of content curation in other contexts, the Commons community views adding Depicts statements as a form of open knowledge creation. These statements provide semantic, multilingual information about what an image contains and why it matters, which enriches Commons as a shared media repository, improves image discovery, and contributes directly to the content layer that supports other Wikimedia projects. Importantly, these annotations do not merely describe what is visible in an image; they embed the image within broader conceptual and cultural frames. Such editorial decisions shape how knowledge is organized and how users around the world find and interpret visual media. From the community’s perspective, adding Depicts statements is not just a technical task; it is an editorial process that generates, structures, and shares knowledge, which constitutes a form of content creation that Commons is built to provide.

This context introduces unique challenges for AI-assisted tools. Compared to text-based tasks like evaluating edit quality, labeling visual content is more ambiguous, context-dependent, and subjective. A single image may require multiple labels, as it can contain several objects, nested elements, or conceptual hierarchies. Additionally, the interpretation of visual features can vary across cultural contexts. Unlike structured data with objective metrics (e.g., word count), image tagging depends heavily on culture, personal judgment, and lived experience—factors that are difficult to encode into consistent labeling rules.

In this study, we investigate how human contributors and AI collaborate in a multilingual, community-governed environment, with a focus on this form of content creation.

Methods

Research Ethics

This work has been reviewed by an Institutional Review Board (IRB) at the University of Washington. In July 2024, the University of Washington Human Subjects Division (HSD) determined that this study is human subjects research and that it qualifies for exempt status. This exempt determination is valid for the duration of the study.

Qualitative Analysis of Wiki Discussions

To understand the Commons editing community's needs, experiences, and concerns regarding the deactivated CAT tool, we conducted a qualitative analysis of Wiki discussions about the tool. We identified 11 Wiki pages (Table 1) where the CAT tool was discussed. These pages contained 595 comments across 172 topics, contributed by 160 unique users. Table 2 shows the distribution of user engagement in CAT-related discussions: 7 users left more than 10 CAT-related comments, 15 users contributed 5–10 comments, 59 users made 2–4 comments, and the majority—79 users—left only one comment.

Table 1. Analyzed Wiki Pages
Page | #Topics | #Comments
Page 1 | 6 | 37
Page 2 | 24 | 58
Page 3 | 111 | 349
Page 4 | 12 | 15
Page 5 | 9 | 11
Page 6 | 5 | 10
Page 7 | 1 | 7
Page 8 | 1 | 15
Page 9 | 1 | 5
Page 10 | 1 | 8
Page 11 | 1 | 80
Total | 172 | 595
Table 2. Distribution of User Engagement
#Comments | #Users
More than 10 | 7
5–10 | 15
2–4 | 59
1 | 79

The first author manually copied all 595 comments, along with the corresponding usernames and links to the user pages on Wikimedia Commons or Metawiki, into a spreadsheet for analysis. We then conducted open coding of the comments using a thematic analysis approach. This process identified seven emergent themes regarding Commons editors’ experiences with the deactivated CAT tool:

  1. Goals of the Structured Commons project
  2. Evaluations of the quality of suggested tags
  3. Definitions of “depicts” statements
  4. Differences between image tags and “depicts” statements
  5. Existing infrastructure (image titles, categories, descriptions)
  6. Documentation and instructions for the tool
  7. User interface (UI) issues, including difficulties skipping images, overwhelming notifications, editing suggested tags, adding additional tags, error messages, and waiting periods

Findings from this qualitative analysis informed both the recruitment of participants and the development of our interview protocol.

Interviews

To gain a deeper understanding of Commons editors’ experiences with the deactivated CAT tool and the issues identified in our qualitative coding of CAT-related discussions, we conducted interviews with Commons editors who had participated in those discussions.

Recruitment

We aimed to recruit all 160 unique users we identified as having participated in CAT-related discussions. However, we did not send invitations to 41 users for various reasons:

  1. Six user pages appeared to belong to Wikimedia Foundation researchers responsible for designing or developing the CAT tool.
  2. Twenty-six users did not have a user page on Wikimedia Commons or Meta-Wiki.
  3. One user page was attributed to a deceased Wikipedian (WP:RIP).
  4. One user was banned indefinitely from editing all Wiki projects.
  5. One user was marked as retired from editing Wiki projects (WP:RETIRE).
  6. Six users had user pages on Commons or Meta-Wiki but had not enabled the “Email this User” function.

We contacted the remaining 119 Commons editors through the built-in “Email this User” function on Wikimedia Commons and Meta-Wiki. Each editor received a personalized message that included their username, the purpose of the study, a link to our Meta-Wiki study page, the Wiki page where their participation in CAT-related discussions was identified, and an invitation to participate in an interview. We emphasized that participation in the study was voluntary and offered a $25 Amazon gift card as a token of appreciation for their time, effort, and expertise.

Of the 119 editors contacted, 29 responded. As of December 2024, we have completed interviews with 16 of them. Table 3 shows the progress of our interview recruitment.

Table 3. Recruiting Activity
Editors Contacted | 119
Editors Replied | 29
Editors Interviewed | 16

Data Collection and Analysis

Guided by the findings from the qualitative coding, we developed an interview protocol consisting of an opening script, four interview phases, and a closing script.

In the opening script, we introduce ourselves, explain the purpose of the interview, and outline how we handle and maintain the confidentiality of participants' data. We also request consent to audio-record the interview. Before participants provide consent, we clarify that, despite our efforts to anonymize data, Wikipedians may sometimes identify individuals or situations from the discussion. This ensures that participants are aware of any residual risks to confidentiality.

In the first phase of the interview, we ask introductory questions to understand participants’ engagement with Wikimedia Commons. These questions include how long the participant has been editing Wikimedia Commons and what types of work they typically do on the platform. We also ask participants to explain, in their own words, what they understand to be the goals of Structured Data on Commons. These questions help establish rapport and gather foundational information about the participants’ perspectives and experiences.

The second phase of the interview focuses on a discussion of sample images that participants recently uploaded to Wikimedia Commons. We show them three images they recently contributed and, for each image, ask them to describe the story behind the contribution, what the image depicts, and how they would tag the image themselves. Following this, we provide a list of tags generated by the Google Cloud Vision API for the image, translated to Wikidata items, to replicate the type of suggestions made by the deactivated CAT tool. We then ask participants to reflect on these tags, including whether they agree with any of the suggested tags, whether they would accept and add any of the tags to the “depicts” statements, how the tags could improve the discoverability of the image on Commons, and what metadata they find important for tagging such images.
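For readers who want a concrete picture of this step, the following minimal sketch shows one way such suggestions can be approximated: Google Cloud Vision label detection produces candidate labels for an image, which are then matched to Wikidata items via the public wbsearchentities endpoint. The function name and the simple one-to-one label lookup are illustrative assumptions; this is not the deactivated CAT tool’s actual pipeline.

# Minimal sketch: generate candidate tags for an image with Google Cloud
# Vision label detection, then look up matching Wikidata items. This only
# approximates the kind of suggestions CAT made; it is not CAT's pipeline.
import requests
from google.cloud import vision

def suggest_wikidata_tags(image_path, max_items=3):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    labels = client.label_detection(image=image).label_annotations

    suggestions = []
    for label in labels:
        # Search Wikidata for items whose label matches the Vision API label.
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbsearchentities",
                "search": label.description,
                "language": "en",
                "format": "json",
                "limit": max_items,
            },
            timeout=10,
        ).json()
        for hit in resp.get("search", []):
            suggestions.append({
                "vision_label": label.description,
                "vision_score": label.score,
                "wikidata_id": hit["id"],
                "wikidata_label": hit.get("label"),
                "wikidata_description": hit.get("description", ""),
            })
    return suggestions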

The third phase of the interview shifts to reflections on the deactivated CAT tool itself. We begin by asking participants what they remember about the tool and how it impacted their work on Wikimedia Commons. We then ask them to explain their understanding of the role of the category system on Commons, the purpose of the “depicts” statements, and the differences between these two elements. We also ask them to reflect on the goals of the CAT tool on Commons, whether they think the tool achieved those goals, and why they believe it needed to be deactivated.

In the final phase, we invite participants to share their thoughts on potential improvements to the CAT tool and explore broader issues, such as the risks associated with using AI/ML for tagging images and ways to mitigate these risks on Wikimedia Commons.

In the closing script, we invite participants to share any additional questions or comments. We also acknowledge their time and effort by offering a $25 Amazon gift card as a token of appreciation and confirm their preferred email address for sending the gift card.

We conducted all 16 semi-structured interviews in October, November, and December 2024. Fourteen of these were conducted in English using participants’ preferred teleconferencing applications (Zoom or Google Meet) or via phone calls. During 13 of the teleconference interviews, we shared our screens to display and discuss participants’ example contributions and the suggested tags. For the phone interview, we sent the participant a list of relevant links in advance and asked them to open the materials on their own devices prior to the session. The duration of these interviews ranged from 45 to 90 minutes. Additionally, two interviews were conducted through online chat. In these cases, the researchers emailed the interview questions to participants, waited for their responses, and subsequently asked follow-up questions. We believe this asynchronous approach provided flexibility for participants who preferred written communication or needed to work with a translator.

These interviews resulted in a dataset of 16 transcripts. The first author transcribed all interviews and documented relevant pages discussed during the sessions—such as user pages, user contributions pages, and Commons talk pages—to support data triangulation. The first author open-coded the transcripts using a thematic analysis approach [20] and recorded analytical memos. We iterated on the themes and memos and chose to report the emergent themes related to issues that inhibit contributors' collaboration with the CAT tool and themes that demonstrate how participants interacted with AI-suggested image tags to situate the social and cultural knowledge they perceived in images within Wikidata’s ontology.

Findings

RQ1: What issues contributed to CAT’s mixed reception and eventual deactivation?

We identified seven key issues that contributed to CAT’s deactivation and developed community-informed suggestions for future design.

Misalignment in the Perspectives of Structured Data on Commons

There is a misalignment between the WMF’s vision for the project and the community’s perspective. One major concern was the shift toward generic image tagging and searchability, which many felt would undermine Commons’ educational value and make it resemble stock photo websites. Participants argued that Commons’ strength lies in offering specific, educational images rather than broad, generalized search functionality, which distinguishes it from platforms like Google or Flickr. Participants suggested the need for a clearer understanding of Wikimedia Commons’ user base and their needs, and recommended revisiting the goals of the Structured Data on Commons project through open community discussions.

Unclear Definitions of the Depicts Statement

Contributors reported confusion about how to use the Depicts statement on Commons. The absence of clear rules or guidelines leaves editors uncertain about the appropriate level of specificity, whether to describe an image broadly (e.g., animal), more specifically (e.g., jumping spider), or even by parts (e.g., eyes, legs, leaf). Without shared standards, contributors rely on personal judgment or informal norms, which leads to inconsistent practices and occasional conflicts (e.g., disputes over whether an image “depicts Seattle”). Participants emphasized that this ambiguity is not primarily a flaw of CAT but of the broader lack of consensus around Depicts. They recommended that WMF collaborate with the Commons community to establish clearer guidelines, similar to the well-developed consensus around categories, supported by example-based tutorials to standardize usage.

Challenges in Applying Depicts via CAT

Beyond definitional ambiguity, we found that contributors, especially newcomers, struggled to apply Depicts statements responsibly when using CAT. AI-suggested tags were often inaccurate or irrelevant, requiring careful judgment that many new users lacked. The tool’s gamified design reinforced the misconception that “more tags are better,” which led to misuse and over-tagging (e.g., approving depicts: green for a landscape photo). These misapplications lowered data quality and created additional work for experienced editors. Participants suggested that future tools must pair clear guidelines with features that help users refine and thoughtfully evaluate AI suggestions, particularly supporting newcomers in making careful choices.

Lack of Integration between Categories and CAT

Participants agreed that structured data offers a promising solution to the known limitations of the current category system by providing more precise, machine-readable metadata and improving knowledge representation. However, there was strong consensus that structured data should not replace categories entirely but rather integrate with them. Many participants argued that the current approach to structured data fails to respect existing community contributions embedded in categories. Rather than “reinventing the wheel,” the system should build upon this foundation. Participants proposed two strategies: (1) using AI and machine learning to extract structured data from existing metadata (titles, descriptions, and categories), thus automating the transition while preserving existing knowledge; and (2) adopting a collaborative, domain-specific approach in which editors work together within subject areas to refine and transition categories into structured data to ensure that knowledge transfer is effective and scalable.
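As a rough illustration of the first strategy, the sketch below derives candidate Depicts items for a file by matching its existing Commons categories against Wikidata labels through the public MediaWiki and Wikidata APIs. The helper names and the naive take-the-first-match lookup are assumptions for illustration, not a vetted migration method.

# Minimal sketch: derive candidate Depicts items from a Commons file's
# existing categories via the public MediaWiki and Wikidata APIs.
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def categories_for_file(file_title):
    """Return the (non-hidden) category names attached to a Commons file."""
    resp = requests.get(COMMONS_API, params={
        "action": "query",
        "titles": file_title,          # e.g. "File:Jumping spider.jpg"
        "prop": "categories",
        "clshow": "!hidden",
        "cllimit": "max",
        "format": "json",
    }, timeout=10).json()
    page = next(iter(resp["query"]["pages"].values()))
    return [c["title"].removeprefix("Category:") for c in page.get("categories", [])]

def depicts_candidates(file_title):
    """Match each category name against Wikidata labels to suggest items."""
    candidates = {}
    for name in categories_for_file(file_title):
        hits = requests.get(WIKIDATA_API, params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "format": "json",
            "limit": 1,
        }, timeout=10).json().get("search", [])
        if hits:
            candidates[name] = hits[0]["id"]   # category name -> Q-id
    return candidates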

Ill-Specified AI/ML Tasks

Participants discussed the limitations of CAT, which relied on a generic computer vision model to suggest broad and unrefined image tags, resulting in irrelevant or unhelpful suggestions. Editors argued that AI/ML tools should be designed for specific, community-defined tasks to suit Commons’ complex knowledge organization needs. They suggested that AI could assist with well-defined, repetitive tasks (e.g., categorizing large sets of similar images or identifying attributes like flag inscriptions or firework colors). These targeted algorithms would allow human editors to focus on higher-level work. Participants pointed out the importance of aligning AI tools with community priorities and recommended participatory approaches where domain-specific Commons communities help define AI tasks and ensure the tools address meaningful needs.

Lack of Support for Collaborative Decision-Making and Evaluation

We found that tagging images in Commons is challenging, as editors often struggle to assign accurate and relevant tags independently. Due to uncertainty about how others might interpret or use an image, many editors opted for broad or generic tags, hoping that more knowledgeable community members would refine them. This reliance on collaboration calls for tools that facilitate collective decision-making and evaluation. Participants suggested adopting a “voting” approach, where tags are only accepted if confirmed by multiple users, and creating better discussion spaces to share examples of uncertainty or consult experienced editors.

Disconnect between CAT and Commons’ Search Functionality

Although CAT was designed to improve Commons search through structured data, contributors felt it delivered little tangible benefit. Many disengaged after realizing that adding Depicts statements, often both broad and specific, did not enhance search results and instead recreated a parallel folksonomy resembling the old category system. The core issue lies in search itself: Commons has yet to leverage Wikidata’s semantic relationships for disambiguation or hierarchical reasoning (e.g., linking dog to Beagle). As a result, contributors bear the burden of tagging without seeing improvements in search experience. Participants recommended that, alongside improving CAT, Commons should design and deploy a search mechanism that fully leverages Wikidata’s semantic relationships and aligns with best tagging practices across the Wikimedia ecosystem.
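To make the semantic-search idea concrete, the following sketch shows one way a search backend could expand a query item to its narrower Wikidata items through the subclass-of (P279) hierarchy, so that a search for dog (Q144) can also match files whose Depicts statements point to a more specific breed. The SPARQL query and endpoint usage are assumptions about how this could be wired up, not a description of how Commons search currently works.

# Minimal sketch: expand a search term to its narrower Wikidata items using
# the subclass-of (P279) hierarchy, so a search for "dog" (Q144) can also
# match files whose Depicts statement points to a more specific item.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def narrower_items(qid, limit=200):
    """Return Q-ids of items that are (transitive) subclasses of `qid`."""
    query = f"""
    SELECT ?item WHERE {{
      ?item wdt:P279* wd:{qid} .
    }} LIMIT {limit}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "commons-search-sketch/0.1"},
        timeout=30,
    ).json()
    return [b["item"]["value"].rsplit("/", 1)[-1]
            for b in resp["results"]["bindings"]]

# A Depicts-aware search could then match any tag in the expanded set, e.g.
# narrower_items("Q144") includes breed items modeled as subclasses of dog.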

We then discussed the broader implications of these findings for designing human–AI collaboration on Commons and for developing AI-assisted tools that support open knowledge work.

Designing for Human–AI Collaboration on Commons

Our findings show that building effective AI-assisted tools for Wikimedia Commons requires more than technical accuracy. It also requires alignment with community practices, values, and structures. First, tools must actively foster consensus because Commons lacks the governance and shared norms present on Wikipedia, which makes agreement around Depicts statements and AI-assisted tasks difficult. Second, participatory AI/ML approaches are needed to ensure that tools are trained for community-defined tasks and reflect contributors’ priorities, especially given Commons’ decentralized and multilingual nature. Third, legacy systems such as categories, though imperfect, remain central to community workflows and should be integrated rather than replaced. Finally, future tools must address linguistic and cultural gaps by leveraging Commons’ multilingual metadata and culturally specific knowledge to counteract biases in generic AI models.

Designing AI-Assisted Tools to Support Open Knowledge Work

Our findings highlight two key considerations for designing AI-assisted tools in open knowledge work. First, image tagging illustrates that AI should not attempt to eliminate ambiguity but instead support it. Images are inherently multimodal and culturally situated, which requires interpretation rather than definitive labels. Effective systems should make uncertainty visible, communicate confidence levels, and provide contextual explanations to help contributors refine or contest AI outputs. Second, tools like CAT show a shift from AI’s traditional role in moderation toward direct content creation, which raises questions about authorship and editorial responsibility. Contributors highlighted that AI should function as a co-creative partner who offers suggestions without overriding human judgment, so that collaboration preserves human agency while expanding the possibilities of open knowledge work.

RQ2: How do Commons contributors, through their interactions with AI-generated tags, situate the social and cultural knowledge they perceive in images within Wikidata’s ontology?

In part of this study, we also investigated how participants interacted with AI-suggested image tags to situate the social and cultural knowledge they perceived in images within Wikidata’s ontology. Our analysis shows that Commons contributors engage in far more than simply accepting or rejecting AI-generated tags. Their practices span multiple dimensions of evaluation and sensemaking: attending to accuracy and specificity, weighing usefulness, making ethical judgments, situating images within broader cultural narratives and contexts, improving ontologies and knowledge infrastructures, and treating AI outputs as collaborative seeds. Together, these practices reveal a multi-layered framework of interaction that characterizes human–AI collaboration in semantic image tagging.

Accuracy and Specificity

Contributors engaged with AI-suggested tags by carefully verifying their factual correctness, refining them for precision, and managing uncertainty. They rejected incorrect or misleading suggestions to preserve trust. They edited AI outputs to add scientific or cultural specificity that the algorithm flattened into overly generic categories. When possible, they refined tags to more precise levels (e.g., binomial names). They highlighted the need for tools that support hierarchical navigation of specificity rather than forcing overly broad or narrow choices. In cases of uncertainty, contributors preferred restraint, either withholding tags entirely or selecting safer, broader categories. This highlights a human editorial norm that values knowledge integrity over AI’s tendency to over-tag.

Usefulness

Contributors highlighted that the value of AI-suggested image tags lies not just in technical accuracy but in their usefulness for discovery, interpretation, and reuse. They resisted cluttering files with trivial or background descriptors and instead prioritized tags that highlight what is distinctive or central to an image. Usefulness was judged at multiple levels: for individual files where prominence and salience matter, at the scale of the Commons repository where overused generic tags like “sky” or “tree” lose discriminatory power, and across diverse use cases where niche or implied properties may expand discovery pathways. Contributors sought to ensure metadata improves Commons’ navigability rather than diluting it with noise.

Ethical Judgments

Contributors found that AI-assisted image tagging is never neutral and involves navigating ethical concerns as much as technical ones. They weighed privacy and consent when tagging identifiable people or places and balanced the value of specificity against the risks of exposure or misidentification. They also critiqued culturally inappropriate tagging that trivializes sensitive contexts. In addition, they warned about the dangers of AI errors in high-stakes domains such as medicine or law and reflected on systemic biases shaped by Commons’ contributor demographics and skewed training datasets that overrepresent Western contexts.

Narrative and Contextualization

Contributors treated AI-assisted image tagging not just as object recognition but as narrative knowledge work that enriches images with historical, cultural, and contextual meaning that AI alone could not capture. They corrected surface-level tags by situating images within broader frames, identifying historic events, establishing provenance to ensure credibility, preserving cultural heritage sites that no longer exist, and drawing on lived local experience to highlight contextually relevant details. Through these practices, contributors showed that the value of an image lies as much in what it represents and preserves as in what it depicts.

Ontology and Knowledge Infrastructure

Contributors also approached AI-assisted tagging as ontological work, which aims to align image metadata with Wikidata’s broader knowledge structures. They noted persistent coverage gaps, such as missing entries for everyday but significant entities, that limited tagging accuracy and often required creating new Wikidata items. They also found that many AI-suggested tags did not belong under Depicts but were better suited to other properties like main subject, instance of, or medium, and argued for richer property support. Finally, they highlighted the need to capture nested representations (e.g., a photo depicting an artwork that itself depicts a person), which current systems often flatten.

Collaboration

Contributors treated AI-suggested tags not as a final product but as the start of collaborative workflows. They used AI suggestions as provisional placeholders or backlogs for subject-matter experts to refine, and sometimes even left imperfect tags in place to prompt correction by others. In addition, they valued AI outputs as cues for research, with suggestions, accurate or not, that sparked inquiry into overlooked aspects of images. In this way, AI functioned less as an autonomous classifier and more as a partner that seeded community-based refinement and collective knowledge-building.

We further discussed the design implications of this multi-layered framework of interaction for future AI-assisted image tagging systems.

Encouraging Inquiry Through AI Prompts

Our findings show that Commons contributors often approached AI-suggested tags not as final answers but as starting points for inquiry. This highlights an important opportunity for design. Instead of positioning AI as an authoritative classifier, future systems can encourage contributors to use suggestions as sparks for research. The way suggestions are presented makes a difference [21]. A prompt like “This might be a gray swallow. Would you like to check references?” signals uncertainty, invites verification, and keeps contributors in charge. Linking suggestions directly to Wikidata entries or Wikipedia articles further supports this process by providing quick access to definitions, context, and references. These resources help contributors check, refine, and build on what the AI suggests. Systems should also support hierarchical navigation so contributors can move between broader and narrower terms (e.g., bird → swallow → gray swallow). This mirrors the way they already manage uncertainty and narrow down choices. With these design choices, AI shifts from giving answers to working alongside contributors as a partner in collaborative inquiry.
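A minimal sketch of the hierarchical navigation described above, assuming a suggested tag already resolves to a Wikidata item: it walks upward through subclass-of (P279) or parent-taxon (P171) statements so a contributor can fall back to a broader, safer term. The property choices and traversal depth are assumptions that would need community input.

# Minimal sketch: given a suggested Wikidata item, walk upward through
# subclass-of (P279) or parent-taxon (P171) statements so a contributor can
# choose a broader, safer tag when unsure about the specific suggestion.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def broader_chain(qid, max_steps=5):
    """Return the item plus a chain of progressively broader items."""
    chain = [qid]
    for _ in range(max_steps):
        entity = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities",
            "ids": chain[-1],
            "props": "claims",
            "format": "json",
        }, timeout=10).json()["entities"][chain[-1]]
        claims = entity.get("claims", {})
        parents = claims.get("P279") or claims.get("P171")  # subclass of / parent taxon
        if not parents:
            break
        snak = parents[0]["mainsnak"]
        if snak.get("snaktype") != "value":
            break
        chain.append(snak["datavalue"]["value"]["id"])
    return chain  # e.g. [specific-species item, genus/family items, ..., bird]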

Supporting Collaboration Through Staged Editing

We found that contributors often treated AI-suggested tags as a starting point within collaborative workflows. They often approached AI outputs as backlogs for specialist intervention or as cues for further research. To support this practice, systems should enable staged editing where contributors can accept provisional tags while clearly flagging them as “needs review.” These placeholders can function as structured backlogs that guide domain experts or experienced editors toward images requiring refinement. At the same time, systems should allow contributors to attach explicit uncertainty indicators such as “low confidence” or “possible alternatives.” Making uncertainty visible helps distribute work more effectively: general contributors can capture and organize provisional knowledge, while specialists focus on reviewing and validating it. This division of labor parallels practices on Wikidata [22], where some editors perform simple, repetitive edits to add and maintain data, while others take on more complex tasks such as creating classes or defining properties that require expertise in ontology engineering. In this way, AI shifts from acting as an autonomous classifier to becoming a collaborator whose suggestions seed broader editorial processes and sustain Commons’ culture of iterative, collective curation [23].
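One hypothetical way to represent staged editing in data, purely as an illustration rather than a proposed schema: each provisional tag records the model’s confidence and an explicit review status, so “needs review” backlogs can be queried and handed to specialists.

# Hypothetical record for a provisional, AI-suggested tag in a staged-editing
# workflow: the suggestion carries its confidence and a review status so it
# can sit in a "needs review" backlog until a specialist confirms or rejects it.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class ReviewStatus(Enum):
    NEEDS_REVIEW = "needs_review"     # provisional, awaiting a specialist
    CONFIRMED = "confirmed"           # validated by a reviewer
    REJECTED = "rejected"             # removed after review

@dataclass
class ProvisionalTag:
    file_title: str                   # e.g. "File:Jumping spider.jpg"
    wikidata_id: str                  # suggested Depicts value (Q-id)
    model_confidence: float           # model score in [0, 1]
    suggested_by: str                 # model or tool identifier
    status: ReviewStatus = ReviewStatus.NEEDS_REVIEW
    alternatives: list[str] = field(default_factory=list)  # other candidate Q-ids
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def review_backlog(tags: list[ProvisionalTag]) -> list[ProvisionalTag]:
    """Low-confidence, unreviewed tags first, for specialists to work through."""
    pending = [t for t in tags if t.status is ReviewStatus.NEEDS_REVIEW]
    return sorted(pending, key=lambda t: t.model_confidence)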

Supporting Narrative Tagging

Our findings show that contributors valued tags not only for identifying objects but also for conveying the broader stories images can tell. Rather than focusing only on what appears in an image, AI-assisted systems can prompt contributors to consider who might use the image and for what purpose. Questions such as “Why might this image matter historically?” or “What story does this image tell, and to whom?” encourage the addition of contextual metadata that situates images within cultural, historical, or social narratives, while also prompting contributors to think about multiple perspectives and use cases. This storytelling lens helps surface different kinds of metadata: the central subject that anchors the story, the contextual setting that shapes interpretation, and the background details that may serve niche or unexpected needs. In this way, AI shifts from simply recognizing objects to acting as a partner that sparks narrative thinking and supports richer, more meaningful tagging.

Prioritizing and Safeguarding Tags

Our findings suggest that not all image tags carry equal weight for understanding or discovering an image, and AI-assisted systems should help contributors make these distinctions. Systems can prioritize by surfacing candidate tags with suggested prominence, for example, distinguishing between the main subject and background detail, so contributors can focus on what is distinctive rather than trivial. At the same time, algorithms should recognize when certain labels are over-saturated at scale (e.g., sky, tree) and prompt contributors to consider whether such tags meaningfully aid discovery. To foster balance, systems should also highlight underrepresented geographies, languages, or cultural categories to encourage contributors to add tags that counter systemic skew rather than reinforce it. Safeguards are equally important. For high-stakes images, such as those in medical, legal, or financial contexts, tagging should be limited or flagged for expert review to avoid harmful mistakes, as suggested in prior work [24]. Similarly, AI should avoid suggesting names for private individuals or identifiable bystanders and reserve personal identification only for notable figures who already have Wikipedia articles or Wikidata entries. By combining prioritization with safeguards, AI-assisted tagging can both improve metadata quality and uphold ethical and cultural responsibilities on Commons.

Bridging Commons and Wikidata

The AI-assisted tagging tool connects image tags to the broader knowledge infrastructure of Wikidata [25], an ontology collaboratively created and maintained by Wikidata editors. This ontology is a work in progress [26], and our findings show that Commons editors often need to revise existing entries, add missing items, or represent nested relationships. Future system design should make these interactions easier and more approachable. For example, the interface could include short tutorials or inline guidance on how to edit Wikidata, along with a “quick add” option that allows contributors to create new items without leaving the tagging workflow. Systems should also expand beyond the basic Depicts statement by surfacing other structured data properties such as main subject, instance of, or medium, and by providing explanations or examples to help contributors select the most appropriate property. These features would not only support Commons editors in placing tags more accurately within the ontology but also strengthen the integration between Commons and Wikidata as interconnected knowledge infrastructures.
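For reference, the sketch below shows how a “quick add” action could write a Depicts (P180) statement to a file’s MediaInfo entity through the standard Wikibase wbcreateclaim module; authentication and error handling are elided, and the interface affordances discussed above are outside the API itself.

# Minimal sketch: attach a Depicts (P180) statement to a Commons file's
# MediaInfo entity (M-id) with the Wikibase wbcreateclaim module. Assumes an
# already-authenticated requests session; login handling is not shown.
import json
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def mediainfo_id(session, file_title):
    """The MediaInfo entity id for a file is 'M' followed by its page id."""
    resp = session.get(COMMONS_API, params={
        "action": "query",
        "titles": file_title,
        "format": "json",
    }).json()
    page_id = next(iter(resp["query"]["pages"]))
    return f"M{page_id}"

def add_depicts(session, file_title, item_qid):
    """Create a Depicts (P180) claim pointing at the given Wikidata item."""
    token = session.get(COMMONS_API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    value = {"entity-type": "item", "numeric-id": int(item_qid.lstrip("Q"))}
    return session.post(COMMONS_API, data={
        "action": "wbcreateclaim",
        "entity": mediainfo_id(session, file_title),
        "property": "P180",               # depicts
        "snaktype": "value",
        "value": json.dumps(value),
        "token": token,
        "format": "json",
    }).json()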

Taken together, these design implications highlight that AI-assisted tagging systems should not aim to replace human judgment but to support and extend it. By encouraging inquiry, supporting collaboration, enabling narrative framing, prioritizing tags while safeguarding against harm, and bridging Commons with Wikidata, future systems can better align with contributors’ multi-layered practices.

Conclusion

In this study, we present a qualitative analysis of 595 user comments and 16 interviews that explore Wikimedia Commons contributors’ lived experiences with the CAT tool. Our work provides an empirical understanding of seven key challenges that shaped CAT’s mixed reception and eventual deactivation, along with community-informed suggestions for improving the tool.

We develop a multi-layered interaction framework that illustrates how contributors, through engaging with AI-generated tags, situated the cultural and social information perceived in images within Wikidata’s ontology. Building on this framework, we propose design implications for developing AI-assisted image-tagging systems that better align with contributor practices.

Dissemination

  1. Research report for the Wikimedia community
  2. One paper submitted to CSCW (currently under R&R)
  3. One paper submitted to CHI (currently under review)
  4. Presentation of research progress and outputs at two events: Considering Cultural and Linguistic Diversity in AI Applications Workshop (CALD-AI workshop) and Wiki Workshop (12th edition).

References

  1. Yu, Y., & McDonald, D. W. (2022). Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-35.
  2. Yu, Y., & McDonald, D. W. (2022). Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-35.
  3. Yu, Y., & McDonald, D. W. (2023). "Why do you need 400 photographs of 400 different Lockheed Constellation?": Value Expressions by Contributors and Users of Wikimedia Commons. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW2), 1-34.
  4. Gillespie, T. (2010). The politics of ‘platforms’. New media & society, 12(3), 347-364.
  5. Thornton, K., & McDonald, D. W. (2012, October). Tagging Wikipedia: collaboratively creating a category system. In Proceedings of the 2012 ACM International Conference on Supporting Group Work (pp. 219-228).
  6. Wang, M., Ni, B., Hua, X. S., & Chua, T. S. (2012). Assistive tagging: A survey of multimedia tagging with human-computer joint exploration. ACM computing surveys (CSUR), 44(4), 1-24.
  7. Thornton, K., & McDonald, D. W. (2012, October). Tagging Wikipedia: collaboratively creating a category system. In Proceedings of the 2012 ACM International Conference on Supporting Group Work (pp. 219-228).
  8. Sen, S., Lam, S. K., Rashid, A. M., Cosley, D., Frankowski, D., Osterhouse, J., ... & Riedl, J. (2006, November). Tagging, communities, vocabulary, evolution. In Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work (pp. 181-190).
  9. Yu, Y., & McDonald, D. W. (2022). Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-35.
  10. Yu, Y., & McDonald, D. W. (2022). Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-35.
  11. Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia culture gap: quantifying content imbalances across 40 language editions. Frontiers in physics, 6, 54.
  12. Forte, A., Larco, V., & Bruckman, A. (2009). Decentralization in Wikipedia governance. Journal of Management Information Systems, 26(1), 49-72.
  13. Halfaker, A., & Geiger, R. S. (2020). ORES: Lowering barriers with participatory machine learning in Wikipedia. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), 1-37.
  14. Kittur, A., Suh, B., Pendleton, B. A., & Chi, E. H. (2007, April). He says, she says: conflict and coordination in Wikipedia. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 453-462).
  15. Viégas, F. B., Wattenberg, M., Kriss, J., & Van Ham, F. (2007, January). Talk before you type: Coordination in Wikipedia. In 2007 40th Annual Hawaii International Conference on System Sciences (HICSS'07) (pp. 78-78). IEEE.
  16. Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, P., & West, A. G. (2011, February). Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 277-288). Berlin, Heidelberg: Springer Berlin Heidelberg.
  17. Cosley, D., Frankowski, D., Terveen, L., & Riedl, J. (2007, January). SuggestBot: using intelligent task routing to help people find work in Wikipedia. In Proceedings of the 12th international conference on Intelligent user interfaces (pp. 32-41).
  18. Halfaker, A., & Geiger, R. S. (2020). ORES: Lowering barriers with participatory machine learning in Wikipedia. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), 1-37.
  19. Smith, C. E., Yu, B., Srivastava, A., Halfaker, A., Terveen, L., & Zhu, H. (2020, April). Keeping community in the loop: Understanding wikipedia stakeholder values for machine learning-based systems. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
  20. Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative research in psychology, 3(2), 77-101.
  21. Zhang, Q., Wen, R., Hendra, L. B., Ding, Z., & LC, R. (2025, April). Can AI Prompt Humans? Multimodal Agents Prompt Players' Game Actions and Show Consequences to Raise Sustainability Awareness. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (pp. 1-29).
  22. Piscopo, A., Phethean, C., & Simperl, E. (2017). Wikidatians are born: Paths to full participation in a collaborative structured knowledge base.
  23. Yu, Y., & McDonald, D. W. (2022). Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-35.
  24. Menking, A., & McDonald, D. W. (2020). Image Wishlist: Context and Images in Commons-Based Peer Production Communities. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), 1-21.
  25. Müller-Birn, C., Karran, B., Lehmann, J., & Luczak-Rösch, M. (2015, August). Peer-production system or collaborative ontology engineering effort: What is Wikidata?. In Proceedings of the 11th International Symposium on Open Collaboration (pp. 1-10).
  26. Piscopo, A., & Simperl, E. (2019, August). What we talk about when we talk about Wikidata quality: a literature survey. In Proceedings of the 15th International Symposium on Open Collaboration (pp. 1-11).