Research:Modeling collective meaning

Created

21:03, 17 July 2023 (UTC)

Contact

Aaron Halfaker

Microsoft

Collaborators

User:Ciell

Wikimedia

Duration: 2020-05 – 2023-12

Research:Projects

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Open collaboration systems like Wikipedia (or really any shared community of practice) operate by developing, sharing, and formalizing concepts that enable them to work together effectively. Machine learning technologies have proven a powerful technology for these massive collaborations to manage production at scale through modeling and applying shared concepts of vandalism, quality, topic, and other subjective aspects of collective work. In this paper, we explore approaches and practices for developing shared concepts in conjunction with capturing that meaning in the behavior of a machine learning model. This paper describes a case study of modeling article quality in Dutch Wikipedia while co-defining the meaning of quality with Dutch Wikipedians. This case study blurs the line between “social governance” mechanisms and the reification of norms in a ML model. We report insights about using applied machine learning as a tool for eliciting reflection and alignment around shared meaning as well as provide practical insights for designing participatory processes around developing probabilistic algorithmic systems.

Introduction[edit]

The development and application of shared concepts is central to the functioning of open collaboration systems like Wikipedia. From the central pillars (e.g., Verifiability^[1]) to shared understandings of what types of articles are welcome and what articles should be deleted, Wikipedians manage broad swaths of collaborative work by discussing, debating, recording, formalizing, and citing shared concepts in the form of "essays, policies, and guidelines". This process of developing and capturing shared understandings is well described by the research literature.

Machine learning technologies have become core infrastructure for managing the scale of Wikipedia. Machine learning models are used to detect vandalism^[2]^[3], measure the quality of articles^[4]^[5], route newly created articles to reviewers by topic^[6]^[7], detect policy and style issues in text^[8]^[9], and even to generate encyclopedia articles directly from source material^[10] -- as a few examples. Without exception, each of these models is designed to align an algorithm's behavior with a tangible shared concept -- often extensively documented in Wikipedia (see corresponding footnotes above). Perfect alignment between algorithm and documented norm is often seen as the gold standard to achieve.

In this paper, we report on a case study of building a machine learning model in the context where no documentation nor established norm were available to align the behavior of the algorithm to pre-established expectations. This case study allows us to unpack the relationship between meaning captured in norms and meaningful behavior of a machine learning model. Through this unpacking, we contribute a deeper understanding of what it means to align algorithmic behavior in a social context and position an ML model as mediator between meaning and work. This both provides implications for how system developers might consider the design of well-aligned algorithmic systems in social contexts and adds a new thread to a conversations about genre ecologies and articulation work.

First, we describe the context Dutch Wikipedia and the problem of defining article quality sufficiently to model. Next we highlight novel themes and concepts that emerged throughout the co-development process. Finally, we align these insights with the literature on aligning ML model behavior, the governance of the commons, articulation work and genre ecologies. Finally, we conclude with reflections on practical approaches to performing participatory algorithmic design in social settings.

Related work[edit]

Machine learning & Social production[edit]

Lots of different types of models to support different types of work: Vandalism, Quality, Topic, Toxicity, Writing quality
Systems for hosting models: ORES, Automoderator (https://dl.acm.org/doi/pdf/10.1145/3338243)
Social effects of ML-powered systems. E.g. rise and decline and governance

Alignment in ML[edit]

Alignment framing from Christian, B. (2020). The alignment problem: Machine learning and human values. WW Norton & Company.
Alignment is about choice, meaning, and values. An aligned model captures agreed upon meaning and supports values.
Alignment in what should be modeled
- What is important to model? That depends on the values of the person you ask.
- This is where Value-Sensitive Algorithm Design (VSAD) comes in.
Alignment in what is modeled (“Conceptual alignment”)
- When building ML technologies to support productive work in collaboration with humans, we are generally trying to build models that exemplify "human concepts". (Machine Teaching)
  - Meaning/concepts is about fleshing out the details of a value. (Machine Teaching)
  - The problem of transferring an abstract human concept to a machine intelligence is discussed extensively in the machine teaching literature.
- But often, ML models are not just serving a single human and their own private concepts. ML models are instead trained on community behavior with the intention of modeling a shared concept.
(TODO: Literature about how groups of people form and reify shared concepts. "Norms"?)
- https://reagle.org/joseph/2010/gfc/chapter-3.html
- Wikipedia: a self-organizing bureaucracy (tandfonline.com)
- Articulation work - Schmidt
- Angry Mastodons - Morgan
Alignment in the effects of using models.
- VSAD is about values applied to how something is modeled and how a model is used.
- But a more fundamental problem for alignment is capturing meaning in the first place.

Ostrom's principles[edit]

"A resource arrangement that works in practice can work in theory."
"She showed that when natural resources are jointly managed by their users, in time, rules are established for how these are to be cared for and used in a way that is both economically and ecologically sustainable." https://en.wikipedia.org/wiki/Elinor_Ostrom
- Focus on participation of locals to adapt the use of common resources local conditions.
Relevant principles:
- #1 Clearly defining the group boundaries (and effective exclusion of external un-entitled parties) and the contents of the common pool resource
  - (old) Boundaries of Wikipedia are complex -- uniformly the majority of Wikimedians in good standing (AGF) have rights. In ML development, deployment, maintenance, and use -- we employ the same inclusive boundaries.
  - Wikimedians are resource appropriators. They are the entitled parties. Devs play a supporting (not directing) role.
- #2 The appropriation and provision of common resources that are adapted to local conditions;
  - The appropriation and provision of ML models is adapted to the application context (via locals)
- #3 Collective-choice arrangements that allow most resource appropriators to participate in the decision-making process;
  - Processes that allow most editors to participate in decision-making
- #4 Effective monitoring by monitors who are part of or accountable to the appropriators;
  - Effective monitoring by editors or someone trusted-by and accountable-to editors.
- #5 A scale of graduated sanctions for resource appropriators who violate community rules;
  - ML powered tool developers and users (appropriators) are held to the same standards and subject to the same sanctions as Wikipedians -- which are graduated (see https://nl.wikipedia.org/wiki/Wikipedia:Richtlijnen_voor_moderatoren and Banning of a vandal paper)
- #6 Mechanisms of conflict resolution that are cheap and of easy access;
  - Mechanisms of bug/modeling issue resolution that are responsive and accessible to editors
- #7 Self-determination of the community recognized by higher-level authorities
  - Self-determination of the community recognized by developers/modelers/maintainers
- #8 In the case of larger common-pool resources, organization in the form of multiple layers of nested enterprises, with small local CPRs at the base level.
  - Wikipedians organize themselves in nested governance structures: The appropriators (CPR), The working group, The language community, and The Wikimedia community. (e.g. 13 century paintings of Mohammad) ML development and appropriation is subject to the same nested governance structures.

Think of ML model developers as the "fence builder" of the pasture. "They have expertise on different kinds of barbed wire, but not inherent authority."

Standpoint epistemology[edit]

Standpoint epistemology gives us a lens to view the concerns and views of different groups of people who may or may not be represented in decision-making processes.
- Shared community meaning definition and alignment is about developing and implementing a shared standpoint.
- Without a shared standpoint, it would be impossible to have an aligned model.
TODO: Do any of the scholars in this space discuss "shared" standpoints?

Isn’t this just VSAD?[edit]

It is VSAD in that stakeholders are involved in the early stages of algorithm development. VSAD is a goal – not a specific method. The steps described represent a common pattern of project management (identify stakeholders, build prototypes, refine, and evaluate).
VSAD discusses involving “stakeholders”. We’re engaging with the articulation mechanisms of a community.
The focus here is less about identifying the (sometimes conflicting) “values” expressed among stakeholders and more about developing "shared meaning"

Developing a collective meaning - Articulation work[edit]

Articulation work (Schmidt & Bannon) - build a shared understanding of what aspects of quality are important and meaningful to the work of building an encyclopedia.
- Mixture of prescription and description (Morgan, Injunctive Norms)
While it’s not novel to see the development, formalization, and reinterpretation of norms through documentation as articulation work, it is somewhat novel to see model development as a component of articulation work
- Policy/Guideline - Formal description of a norm
- Essay - In context discussion of a norm
- Models/Bots - Deterministic* application of a norm at scale
  - ML models are different in that they generally have more complex and unknown qualities that are shaped by the input data.
  - But Bots also go through audit cycles and encourage reflection on the underlying norm being operationalized. (Geiger, Lives of Bots)
There’s actually a rich history of seeing the roles algorithms (and specifically ML models) in the articulation of cooperative work in Wikipedia
- Vandalism detection models (Halfaker papers)
- Encoding rules in bots (Geiger, Muller-Birn)
TODO: Talk about Wikimedian Consensus process here. Also “Policy lens”

The case: Dutch Wikipedia Article Quality[edit]

In May of 2019, we attended the Wikimedia Hackathon, a yearly in-person event organized by the Wikimedia Foundation that “brings together developers from all around the world to improve the technological infrastructure of Wikipedia and other Wikimedia projects.” As part of our activities at that event, we ended up talking to technically inclined Wikipedians from Dutch Wikipedia who had heard about Article Quality models used in English Wikipedia and were interested in what it might take to set up such a model for Dutch Wikipedia.

We worked together to file a request to build the models in the relevant task tracking system\footnote{\url{https://phabricator.wikimedia.org/T223782}} and to populate the request with basic questions that are useful for understanding how a community like Dutch Wikipedia already thinks about quality:

How do Wikipedians label articles by their quality level?
What levels are there and what processes do they follow when labeling articles for quality?
How do InfoBoxes work? Are they used like on English Wikipedia?
Are there "citation needed" templates? How do they work?

We developed these questions through past experiences building similar models for other Wikipedia communities. The answers to these questions were surprisingly complicated. Many Wikipedia communities adopted an article quality scale similar to English Wikipedia, but Dutch Wikipedians reported on the task that they did not have a complete scale, and instead, they had some processes for tagging the lowest quality articles ('Beginnetje') and highest quality articles ('Etalage'), but everything between had no definition. This contrasts to English Wikipedia were with levels from Stub, Start, C, B, GA, and FA in ascending order with strict definitions\footnote{c.f., ^[4]}.

At this point, it became clear to us and our Wikipedian collaborators that setting up an article quality model like English Wikipedia would require the complicated work of defining a scale of article quality. Participants in the discussion made it clear that they would not be interested in merely adopting a scale from another wiki. And so we set about doing so using the mechanisms that Wikipedians tend to follow to build consensus and shared understandings about their work. Our Dutch Wikipedian collaborator posted to De Kroeg (“The cafe”)\footnote{\url{https://w.wiki/655w}}, a central discussion space, about the potential of bringing article quality models to the wiki and included information about how they had been used in other wikis. The proposal was met with light skepticism – concerns about whether an AI could detect quality in the first place – but an agreement was reached that it was acceptable to start experimenting and allow people to use the predictions on an opt-in basis.

Quality scale	Model
v1b - A and E class are borrowed from pre-defined, on-wiki concepts. B is defined as “not quite A”, D is defined as “no longer E” and C is defined by article length\footnote{\url{https://w.wiki/5qzc}}	v1 - Trained by gathering examples of template introductions/removals and and using length constraints. We expected this model to be wrong but to help probe editors to elicit reflection what their quality scale should look like
	v1.1 - Trained using the same data as v1 but with minor technical improvements to the way that “Infobox” templates and references are tracked.
v2 - B, C, and D classes are more clearly defined. For example, C-class requires the presence of an “Infobox” and D-class requires that there is at least one source in the article.\footnote{\url{https://w.wiki/5qzd}}
v2.1 - Refined version of v2 based on reflection when applying v2 scale to articles. Source requirement moved C-class and softened.\footnote{\url{https://w.wiki/5qzi}}
	v2 - Trained using a mixture of data sourced from template usage (for A- and E-class) as well as the results of labeling activities and on-wiki re-labeling using the v2.1 scale.

Over the course of the next 1.5 years, we engaged in an iterative sense-making and engineering process that leveraged Wikipedian processes for performing articulation work\footnote{Articulation work here refers to its usage in a CSCW context where different actors negotiate shared understanding and come to agreements of how to work together. C.f. \cite{suchman94supporting}.} and online spaces for the co-development of a machine learning model along with guidelines for how quality should be assessed in Dutch Wikipedia. Beyond the discussion on De Kroeg, we created a “project page” for the effort\footnote{\ANON{\url{https://w.wiki/66qM}}} where we described the model, hosted technical descriptions of the quality scale (see \ref{table:quality_scale_and_models}), and posted prediction sets for auditing. Our Dutch Wikipedian collaborator gathered a small community of local Wikipedian collaborators around these documents and discussions in order to iterate with us. Table \ref{table:quality_scale_and_models} describes the alignment between iterations of the quality scale and the model.

In the rest of this section we describe aspects of this collaboration that make salient the co-development of meaning and model.

Audit sets & labeling campaigns[edit]

When we first set out to model article quality for Dutch Wikipedians, we wanted to use as much past work as we could before trying to define any new aspects of quality. As mentioned above, Dutch Wikipedians had already developed formal processes and definitions for the top and bottom quality classes (beginnetje and etalage respectively). Through discussion with our Wikipedian collaborators, we settled on a rough scale that added three quality levels between these two extremes:

B-class: Articles that had been submitted for review as etalage but rejected were assumed to be high quality but not high enough quality
D-class: Articles that were tagged as beginnetje, but the tag was removed were assumed to be slightly higher quality than beginnetje.
C-class: Articles that were between B- and C-class. We ultimately decided to set a formal length criteria for these articles (between 3000 and 5000 bytes of text)

It was apparent to all involved that this scale was overly simplistic but we suspected that, through exploring the limitations, we might elicit the latent shared understanding of the quality of articles from Dutch Wikipedians. Based on past work aligning model behavior with communities of Wikipedians (e.g. \cite{halfaker20ores} & \cite{asthana21automatically}), we planned to seek feedback and prompt iteration on the quality scale through the auditing process.

The initial audit[edit]

The first step in our auditing process involved generating article quality predictions for all articles in Dutch Wikipedia and randomly sampling 20 articles from each predicted class for review (5 classes X 20 predictions = 100 articles in the assessment set). We used article text from the June 2021 database dump of Dutch Wikipedia\footnote{\url{https://dumps.wikimedia.org/}} to generate predictions. Since the quality of Wikipedia articles is highly skewed with the vast majority of articles in the lower quality range, this stratified approach allowed our collaborators to assess the performance across the scale. We posted the predictions with links to the specific version of the article we scored to a wiki page and invited Dutch Wikipedians to leave open ended comments about whether or not each article.

Some evaluations clearly implied adjustments to the naive v1b scale. For example, on an article predicted to be C-class, \anon[one Wikipedian][User:Effe] commented (translated):

Not a 'good' article, and I would personally rate it as D because of its focus on a summary, and the lack of further sources beyond the one report. But, strictly speaking, does it seem to meet the criteria?

While this is just one example, there are many things going on in this comment. First, it directly critiques the model’s prediction and suggests that the article in question should be rated lower (D-class). It also raises concerns about “focus on a summary” with regards to writing quality and calls out the lack of quality sources. It also calls into question the naive C-class criteria we started with (3000-5000 characters) as capturing what this editor imagines the C-class \emph{should represent}.

Many such comments were met with follow-up discussion. For example, \anon[another Wikipedian][User:Sylhouet] left the following comment on an article predicted to be D-class:

Only two sources that are also inaccessible, uninformative, clumsily edited. As far as I'm concerned, at the bottom of E.

While \anon[a third Wikipedian][User:Ciell] challenged the downgraded assessment with:

[...] Please note that E is meant for real beginnetje, unless we find a (measurable) way (and agreement on this) to also include poor quality articles [...]

In this example, we can see \emph{source quality, information quality, and editing quality} being raised as important criteria to add to the scale. We can also see another editor ensuring that there is still space in the lower end of the scale for articles that are even more lacking.

Beyond these substantive and meaningful concerns about the nature of quality and how a scale might be applied to these articles, our collaborators also noticed that our process for collecting articles for review seemed to miss some critical features of quality such as the presence of Infoboxes\cite{Structured data content that appears in the upper right side of high quality articles}. Others noted that pages that were not seen as articles were included in the set – such as “list articles” like Lijst van spelers van Middlesbrough FC (List of Middlesbrough FC players). We were able to address these issues directly through improved feature engineering and sampling methods (which lead to v1.1 of the model).

Labeling and re-auditing[edit]

After this initial audit, the majority of our energy was put into supporting our Wikipedian collaborators in developing their quality scale. The initial audit provided substantial new insights into what did and did not belong at each quality level. The quality scale v2 (see Table~\ref{table:quality_and_models}) emerged as a well-articulated description of the new consensus.

With a well articulated definition, we felt it was time to ask our collaborators to help us build a new version of the model by manually labeling articles. This would allow us to encode this new shared understanding in examples that the quality model could learn from. So, we set out to build a stratified sample of articles to label. Since the consensus about requirements for the two extreme classes (beginnetje and etalage) had not changed, we only really needed to gather labels for the middle quality classes (D, C, and B). So we set out to do the best we could to gather a stratified sample of likely mid-class articles for labeling. We applied version 1.1 of the model (same as version 1 with minor improvements to some feature code and removing some “list articles” from our sample) to Dutch Wikipedia articles again and sampled 25 D-predicted articles, 50 C-predicted articles, and 25 B-predicted articles for labeling. We sampled more C-predicted articles because we expected that the predictions for that quality class were very poor due to the naive initial specification and therefore the actual labels for that group would be distributed across D and B classes. In order to ensure consensus on the label, we configured Wiki Labels\footnote{\url{https://meta.wikimedia.org/wiki/Wiki_labels}} to require 3 labels per article from different editors.

After the labeling campaign finished, we observed a large amount of disagreement. 56 of the articles had a disagreement between the labelers. Table N presents the first 8 rows of the 56 row audit table we constructed. With this table, we went back to the Wikipedians who performed the labeling work and asked them about why there might be so much disagreement. See vignette N for an overview of the discussion that followed and a reflection on the co-development of machine learning model and collective meaning. As the discussion progressed, \anon[one Wikipedian][User:Sylhouet] and \anon[another Wikipedian][User:DimiTalen] added/corrected labels in the “Definitieve label” column.

With the completion of this discussion and labeling work, we were ready to train an updated version of the model. Despite all of this work, true C-class articles still remained elusive. Even after gathering all of the C-class labels from this process and all “Strange Ducks” (See vignette N) that Wikipedians agreed should be considered C-class, we managed to gather 32 examples to train with. Given that all of the other classes had far more examples, and it was useful to train the model with a balanced sample, we ended up training v2 of the model on 32 examples from each quality class (32 \times 5 = 160 total observations). Despite this limited training set, we achieved 80.8% accuracy across the classes and agreed to deploy the model for testing with our Wikipedian collaborators.

Strange Ducks: Mispredictions as a focus of articulation work[edit]

Origin. One of the authors of this paper is an administrator and active contributor on Dutch Wikipedia. She often works in a roll as a “tool coach”\footnote{A term coined by Sumana Harihareswara–a technical contributor to Wikimedia projects–used to describe someone who fills a role in technical projects that technical contributors tend to not be great at\cite{https://www.harihareswara.net/posts/2021/sidestepping-the-pr-bottleneck-four-non-dev-ways-to-support-your-upstreams/#coaching-and-cheerleading}} for various technical projects around Wikimedia. So she is experienced in bridging the relationship between technical contributors (our dev team) and the needs of Wikipedians. When we shared the first version of the article quality model with the Dutch wiki community, she searched for an effective strategy for communicating the strange and inconsistent behaviors that people would see with the rest of the development team (technical, Wikipedian, but not Dutch Wikipedians).

Implementation. She came up with a strategy of providing a space–a section on a wiki page–for our collaborators to share any behaviors they thought were wrong, unexpected, or otherwise worthy of discussion. She named this section “Vreemde eenden in de bijt” which roughly translates to “Strange duck in the pond” – a dutch euphemism for odd things that don’t belong. She then developed a weekly cadence to (1) review the submissions in discussions with Wikipedians on the associated talk page and (2) bring a summary of those discussions to the development team. As a local community member, she was able to help answer our developers’ questions about why a behavior is considered to be strange and what behaviors might be more aligned with Dutch Wikipedian expectations. She was also able to discuss these issues with their submitters in their native language and situated within their shared cultural context.

A key to making the Strange Ducks strategy work was getting the model’s predictions in front of Wikipedians in the course of their natural work. So, we developed a javascript-based “gadget” that Wikipedians could enable in their preferences. This gadget surfaced quality predictions on screens that editors will usually see in the course of browsing and editing Wikipedia. See Figure for screenshots and details about the gadget.

Together, the ArticleQuality.js gadget, the Strange Ducks repository, and our “tool coach’s” bridging role between Wikipedians and the development team represent a sociotechnical system for continuous reflection about the behavior of the model and the implications of the guidelines (documentation of collective meaning)

Figure N: Screenshots of the ArticleQuality.js gadget’s features.

Discussion. This interaction pattern formed a cross-lingual, cross-cultural bridge between Dutch Wikipedians and our processes for developing and refining the article quality model. It also formed a space for reflection and renegotiation of shared understandings between Wikipedians. While the purposeful audits and labeling campaigns provided focused opportunities for review and reflection among our Wikipedian collaborators, the ArticleQuality.js and Strange Ducks strategy is continuous and enabled specific concerns to be raised, with specific examples, at any point in time. Unlike our collaborative auditing and labeling sessions, reflection via Strange Ducks can happen on any schedule as needed and without any developer lead effort.

Strange ducks is a good example of a “high potential interaction” in an active learning sense. If a Wikipedian flags something as a Strange Duck, there are three potential explanations for why it was strange. See Figure N for a breakdown of the decision-tree that applies equally to Strange Ducks and any prediction needing discussion from a labeling campaign. The model’s prediction is incorrect and does not reflect the guidelines (fix the model) The guidelines don’t match Wikipedian understanding of what quality is (fix the guideline) The models’ prediction is correct, it matches the guidelines, and the guidelines are correct (fix submitter’s understanding)

Deciding which case a Strange Duck fits into is a matter of discussion, but regardless of the results of the discussion, the outcome is valuable to the functioning of the entire system. Either the model needs to change, the guidelines need to change, or this is an opportunity to build consensus around the guidelines and how they should be applied.

Exploring the implications of a rule[edit]

In a practical way, the work of developing and refining a machine learning model is a valuable tool for exploring and refining collective understanding and agreement. By participating in the process of model development with collective auditing, the same individuals who are interested in arriving at a functional and interpretable set of norms and practices can explore the implications of specific rules they might consider in their design.

The citation requirement. After the release of v1b of the article quality model trained using data gathered using the preliminary version of the article quality scale, we organized an audit of the model. We applied the model across all current versions of articles in Dutch Wikipedia and then randomly sampled 20 articles from each category for a total of 100 articles in total. This activity provided a useful technical probe for exploring how to define the poorly described middle classes in the quality scale.

Through this discussion, the engaged editors came to a consensus around a clear definition of “C” and “D” class articles. With a clear definition, we decided it was time to attempt to gather labels that reflect this new consensus. Given that the definition of A and E class articles had not changed, we sought to gather as much data as possible about B, C, and D-class articles. We went back to our dataset of v1 model scores and sampled 25 B-class, 50 C-class, and 25 D-class predictions. We chose to sample a higher number of C-class predictions because we suspected that the model was least accurate for this class and we would need more manual labeling to gather an appropriate number of labels.

We loaded this sample into the WikiLabels system, configured the system to require three independent labels per article and invited people active in the discussion on Dutch Wikipedia to label these articles. Once the labeling work was completed, we were surprised to see that the vast majority of observations (81/100) were labeled as D or E class. And so we turned back to discuss this surprising result with the editors.

A discussion grew around a requirement for “Het artikel bevat minstens één bron.” (The article contains at least one source.) Several editors reported that they applied this source criteria very strictly and observed that old styles of sourcing content (e.g. via a comment associated with an edit) could be the reason that some seemingly high quality articles are getting labeled as lower quality. The discussion quickly turns into reflection about what aspects of quality they wish to capture in their scale. For example:

“Kijkend naar deze uitslagen, denken jullie dat de 'geen bron-voorwaarde' (beide voorbeeldartikelen hebben geen bron) in de C-versie van de kwaliteitsschaal juist is? Of moet deze misschien versoepeld worden?” (Looking at these results, do you think the 'no source condition' (both sample articles have no source) in the C version of the quality scale is correct? Or perhaps it should be relaxed?)
“Van mij mag de grens 'bron/geen bron' wel een niveau hoger” (For me, the boundary 'source/no source' may be a level higher)
“[...] ik denk dat het bron-criterium wel een goede reflectie is van de kwaliteit.” (I think that the source criterion is a good reflection of the quality.)
“Jouw voorstel om de broneis te verplaatsen naar C spreekt me wel aan.” (Your proposal to move the citation requirement to C appeals to me.)

Based on this discussion, our local collaborator updated the quality scale to reflect the consensus to move the requirement to C-class and to soften the language of what can be considered a source (“eventueel als een algemene bron onder een kopje ‘literatuur’ of ‘externe link’” which translates to "possibly as a general source under a heading 'literature' or 'external link'"). We invited editors involved in this discussion to re-label the articles with a high level of disagreement by posting a table of the article revisions and original labels on a wiki page^[11]. After this new labeling process, we arrived at 12 new examples of “C” class articles and 18 new examples of “D” class articles.

By combining the results of this audit with labels gathered from A and E class articles from the previous models, we were able to arrive at a new dataset that is aligned with the updated version of the quality scale. We used this dataset to train and release a new version of the article quality model that more closely reflected this updated scale.

Fitting it all together: The meaning cascade[edit]

The article quality documentation on English Wikipedia is formalized as a guideline ^[12]. In other words, the documentation describing the norms and practices around article quality assessment capture the collective meaning of the concept of article quality itself for Wikipedians. These documents form a genre ecology that captures the formal and informal concepts used by Wikipedians to articulate (work together) in Wikipedia. From concrete work practices to statements of principle, the entire cascade of meaning is intentionally kept in alignment. Several meta-processes exist in Wikipedia for maintaining this alignment -- all of which are built on top of peer discussion and documentation practices.

In the case of Dutch Wikipedia's article quality, we can see the behavior of a machine classifier at the intersection of several different branches of CSCW scholarship.

Genre ecologies: A ML model is a new type of genre that more tightly couples principles and work practices.
Commons governance: An ML model represents a shared commodity among Wikipedians and thus the work to govern behavior and access can (and maybe should) follow Ostrom's principles.
Articulation work: Both in development and in application, a well aligned model both represents the result of careful articulation work (through discussion, audit and refinement of the model behavior) and a mechanism of articulation in and of itself (e.g. by driving attention to high potential, low quality articles).
Alignment in social contexts: ...

Reflective process as a means to model refinement[edit]

The work of developing and refining a machine learning model in this context extends the rules from the on-wiki text documentation into the behavior of the model itself. In the same way that one might code rules into a guideline\footnote{Wikipedia’s best practice documentation.} as documentation, one can see rules play out in the behavior of machine learning model.

As \ref{figure:model_cascade} suggests, in an approximate way, a policy document in Wikipedia documents and represents a way of understanding the collective meaning of Wikipedians (genre: principle documentation^[13]). Guidelines represent a way of understanding policies (genre: best practice document^[13]). We assert that machine learning models designed to apply guidelines also represent a way of understanding that guideline in a specific setting. As an algorithm, the ML model represents a set of executable rules that can be applied to any new, valid input. In our case, we can apply these rules in a repeatable and immediately objective way to any Wikipedia article.

All models are wrong^{[citation needed]} is a common aphorism that we feel is useful when considering the implications of this cascade of modeling. Models, as algorithmic mediators of process, “enact the objects they are supposed to reflect or express” but they are inherently imperfect in that they ‘‘transform, translate, distort, and modify the meaning or the elements they are supposed to carry.”^[14] Rather than developing a perfect model (which could be inherently impossible as a model must simplify and will inherently distort) our goal is to design a model that is useful. With a machine learning model, usefulness is often measured through the application of fitness statistics\footnote{en:Model evaluation}, but in our case, the crowd auditing pattern allowed us to go beyond detecting the rate of errors, but to ask, “How much does this type of error affect the usefulness of the model?” and “Is this an error in the model or is it an opportunity to reflect on the guidelines (best practices) or policy (principles); or is it an opportunity to re-negotiate collective meaning?” These questions focus issues of alignment on the intended use of the model and away from the impossible and less useful idea of perfection.

Further, this is not a special case for ML models. This pattern of error detection, utility assessment, and refinement is consistent across the cascade (from collective meaning to ML model) – just as we can detect modeling bugs through the application of the model and reassessment, we can detect bugs in best practices by exploring whether the model's “bugs” are failures to accurately represent the guideline, whether the guideline fails to accurately represent the policies (the principles behind guidelines), or whether the policy fails to usefully reflect the shared collective meaning of the members of the community of editors. In the case of citation requirements, the question of whether or not model behavior was a bug or not brought the discussion all of the way up to the level of collective meaning. How should we think about citations when it comes to quality?

Applying the meaning cascade to the citation issue[edit]

In the case of the citation requirement, the original guidelines that required at least a single citation for being included in D-class seemed like a reasonable application of the collective understanding and policy around article quality. But in practice, when reviewing the predictions of a model based on that guideline and labeling new articles, it became clear that this was a misalignment between the collective understanding and practice. Through review and discussion, guidelines were updated to reflect this understanding given the new knowledge about the context of work. And thus the modeling process was updated and applied in order to ensure that the new model (v3) would reflect this update to the meaning cascade.

Bibliography
^[2] ^[4] ^[6] ^[8] ^[14] ^[13]

References[edit]

↑ See Wikipedia:Verifiability
↑ ^a ^b Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, P., & West, A. G. (2011). Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In Computational Linguistics and Intelligent Text Processing: 12th International Conference, CICLing 2011, Tokyo, Japan, February 20-26, 2011. Proceedings, Part II 12 (pp. 277-288). Springer Berlin Heidelberg.
↑ WP:Vandalism
↑ ^a ^b ^c Warncke-Wang, M., Cosley, D., & Riedl, J. (2013, August). Tell me more: an actionable quality model for Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration (pp. 1-10).
↑ WP:Content assessment
↑ ^a ^b Halfaker, A. (2017, August). Interpolating quality dynamics in Wikipedia and demonstrating the Keilana effect. In Proceedings of the 13th international symposium on open collaboration (pp. 1-9).
↑ WP:WikiProject Directory
↑ ^a ^b Asthana, S., Tobar Thommel, S., Halfaker, A. L., & Banovic, N. (2021). Automatically labeling low quality content on wikipedia by leveraging patterns in editing behaviors. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1-23.
↑ Wikipedia:Manual of Style, Wikipedia:Verifiability, Wikipedia:Neutral point of view
↑ .
↑ https://nl.wikipedia.org/wiki/Gebruiker:EpochFail/Kladblok
↑ Wikipedia's policy and guideline pages describe its principles and agreed-upon best practices. Policies are standards all users should normally follow, and guidelines are generally meant to be best practices for following those standards in specific contexts.
↑ ^a ^b ^c Morgan, J. T., & Zachry, M. (2010, November). Negotiating with angry mastodons: the wikipedia policy environment as genre ecology. In Proceedings of the 2010 ACM International Conference on Supporting Group Work (pp. 165-168).
↑ ^a ^b Introna, L. D. (2016). Algorithms, governance, and governmentality: On governing academic writing. Science, Technology, & Human Values, 41(1), 17-49.

[1] See Wikipedia:Verifiability

[adler11vandalism-2] Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, P., & West, A. G. (2011). Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In Computational Linguistics and Intelligent Text Processing: 12th International Conference, CICLing 2011, Tokyo, Japan, February 20-26, 2011. Proceedings, Part II 12 (pp. 277-288). Springer Berlin Heidelberg.

[3] WP:Vandalism

[wang13actionable-4] Warncke-Wang, M., Cosley, D., & Riedl, J. (2013, August). Tell me more: an actionable quality model for Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration (pp. 1-10).

[5] WP:Content assessment

[halfaker17keilana-6] Halfaker, A. (2017, August). Interpolating quality dynamics in Wikipedia and demonstrating the Keilana effect. In Proceedings of the 13th international symposium on open collaboration (pp. 1-9).

[7] WP:WikiProject Directory

[asthana21issues-8] Asthana, S., Tobar Thommel, S., Halfaker, A. L., & Banovic, N. (2021). Automatically labeling low quality content on wikipedia by leveraging patterns in editing behaviors. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1-23.

[9] Wikipedia:Manual of Style, Wikipedia:Verifiability, Wikipedia:Neutral point of view

[10] .

[11] ttps://nl.wikipedia.org/wiki/Gebruiker:EpochFail/Kladblok

[12] Wikipedia's policy and guideline pages describe its principles and agreed-upon best practices. Policies are standards all users should normally follow, and guidelines are generally meant to be best practices for following those standards in specific contexts.

[morgan10angry-13] Morgan, J. T., & Zachry, M. (2010, November). Negotiating with angry mastodons: the wikipedia policy environment as genre ecology. In Proceedings of the 2010 ACM International Conference on Supporting Group Work (pp. 165-168).

[introna15algorithms-14] Introna, L. D. (2016). Algorithms, governance, and governmentality: On governing academic writing. Science, Technology, & Human Values, 41(1), 17-49.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]