Talk:Machine learning models/Production/Language agnostic link-based article topic


Hi all! In order to structure feedback on this page, we are broadly seeking to answer the following questions:

  • What aspects of the model card are useful, informative, or helpful?
  • What aspects of the model card are confusing or unhelpful?
  • Are there any features or sections that aren't on the model card that you would like to see?

Of course, please feel free to leave any other general comments you have on this talk page. Thanks! -HTriedman (WMF) (talk) 18:44, 17 March 2022 (UTC)

Clarifying whether a model predicts or is based on data referring to individuals vs content

I think one of the most useful privacy distinctions that could be expressed more clearly on the cards (e.g., with icons and a simple one-line item: “This model predicts things about: Editors, Readers, Content”) is whether:

  • Was the model trained on data that identifies individual editors or readers?
  • Does the model attempt to predict the behavior of editors or readers?

If the answer is no to both of these, the risks are categorically lower IMO, and they mainly have to do with inherited biases in the content corpus the model was trained on. Steven Walling • talk 16:01, 18 March 2022 (UTC)

Hi — I've made some updates to the model card (specifically the infobox) that try to address this point. Any more feedback? HTriedman (WMF) (talk) 21:34, 28 March 2022 (UTC)

Who is the intended audience?

You asked for a review of this from Wikimedia-l, a Wikimedia community email list.

My opinion: this page is incomprehensible to 99% of that list's subscribers, because it relies on concepts that are unfamiliar to general audiences.

Wikimedia-l subscribers include Wikimedia community members from every demographic. If your intent is to provide information to the sort of people who subscribe to Wikimedia-l, then transfer control of this text away from the model creators and give it to any creative writer who has never heard the term "machine learning". Tell them to cut out all the parts they do not understand, then to rewrite the rest without any further guidance. What they publish would be a good estimate of what this page communicates to Wikimedia-l and the Wikimedia community in general.

If you want this kind of reality check then here are people who can give you this kind of feedback in a day for cheap - https://www.fiverr.com/search/gigs?query=creative%20writing

If this document is not for the general Wikimedia community, then note at the top who the intended audience is.

I greatly appreciate this kind of reporting and want to see more of it, but please simplify if non-researchers are part of the audience. Bluerasberry (talk) 17:23, 18 March 2022 (UTC)

Link to context?

I love this and hope that it helps set a precedent for other organizations.

In order to help us review the contents of the card, could you provide some links to the Wikimedia model card initiative, and to the published research that inspired this work? That way we can better understand the emerging practices, compare other examples, and look for any gaps in our cards. Adamw (talk) 12:11, 21 March 2022 (UTC)

Hi @Adamw! Thanks so much for the feedback — we've done a lot of research into the existing literature to get to this point. I'll link some of the more foundational papers in this subfield, and note that I spent a few months interviewing their authors/other domain experts in fall/winter 2021.
Let me know if you have any other questions/comments/thoughts!
- HTriedman (WMF) (talk) 21:31, 23 March 2022 (UTC)
PS— Here are a couple of other examples of model cards:
- HTriedman (WMF) (talk) 21:44, 23 March 2022 (UTC)

Namespaces outside of 0

What about draft articles? Those often show up in sandboxes and user space. There's also the draft namespace. It seems like this model would be useful and appropriate for those pages too. --EpochFail (talk) 15:17, 22 March 2022 (UTC)

Terminology

When advertising this effort, it might be useful to define what the term "model card" means. The first thing that comes to mind is some kind of model or blueprint for a playing card. When opening the examples, however, it becomes clear that "model" stands for "machine learning model" and "card" stands for "synopsis".

Other expressions like "Language Agnostic Link-Based Article Topic" are also quite a mouthful. Sentence case is recommended, and prepositions are often useful. Nemo 22:49, 23 March 2022 (UTC)

Great step

Thanks for doing this. It is setting a new style standard, implicitly -- like infoboxes did in the first place. As such, I think the first and most important goal should be

1. Start (very) simple at the top. Concise statements, a single template that can be used for many models. A short lede summary that's understandable to non-ML readers.

Below that, you can add other detailed information, which might be more specific to this model, or more discursive, or more complicated. Most existing things that call themselves model cards are complex, fiddly, and require lots of clicks to peruse... a bit like the mandatory 30-second video intro to a website. See what you can compress into a single view.

2. Make things like "false positive/negative rate" a tunable parameter in a parameterizable model, wherever possible, rather than choosing a fixed tradeoff.

For +/- rates in particular, publish the ROC curve in the top-level summary, rather than a point on it. Different audiences or use cases will need different tradeoffs, based on the risks or expectations in their use.
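
For instance, here is a rough sketch of how a card could surface the whole tradeoff rather than a fixed operating point, assuming held-out binary labels and per-article confidence scores are available (all names and numbers below are hypothetical):

    # Sketch: compute the full ROC curve for one topic, then choose a
    # threshold per use case instead of hard-coding a single tradeoff.
    # y_true/y_score are hypothetical stand-ins for evaluation data.
    import numpy as np
    from sklearn.metrics import roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1])                 # held-out labels
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # model confidences

    fpr, tpr, thresholds = roc_curve(y_true, y_score)

    # A recall-sensitive consumer might tolerate more false positives:
    target_tpr = 0.9
    i = int(np.argmax(tpr >= target_tpr))                 # first point meeting the target
    print(f"threshold={thresholds[i]:.2f} -> TPR={tpr[i]:.2f}, FPR={fpr[i]:.2f}")

Publishing the (fpr, tpr, thresholds) triples alongside the card would let each consumer pick their own operating point.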

3. Other things I'd appreciate: metrics, the ability to test the model against a sample/upload

I like having a selection of metrics + datasets, and showing the table of their combinations.
I like the way Google's model cards include an option to upload an image to see how it's classified.

SJ talk 00:00, 24 March 2022 (UTC)

Updates are looking very nice! Quick thoughts:
  • Split models into a model page (e.g., Language agnostic link-based article topic model) and its model card (Language agnostic link-based article topic model/card). Then the lede can describe the model itself by reference and focus on the card highlights. This also makes it more useful for people who need to know when the model has substantively changed: they can watch the model-card page rather than the model-description page (which might be updated any time someone comes up with better language).
  • Explain + link LiftWing
  • In the infobox: add "Maintainer" (link to lead maintainer) and "Updates" (type / frequency of updates, or link to process)
23:23, 9 April 2022 (UTC)

Input schema & features

I would suggest providing an explicit "input schema" or other "intended input domain" information.
  • The "Data pipeline" section could be clearer about the intended and functional inputs the model/system will accept: in this case, any Wikipedia article? (See also EpochFail's comment above.)
  • The system inputs should probably be described in some consistent way. E.g., it seems like this model has two inputs: (a) the text of an article at a particular revision (or perhaps the current revision only? or non-revdeleted revisions only? or "page not deleted" only?), and (b) an article-to-Wikidata-ID mapping (taken at what timestamp?). There is significant ambiguity here right now; e.g., is "identifying wikilinks in an article" a preprocessing step of the model, or are the identified wikilinks retrieved from some other common source (e.g., pagelinks)?
  • Overall, apart from the details about the input data, I love the output schema and I like the rest of the provided info. I do think a specific input/output example in a collapsed block might be useful for demonstrating the transformation the model induces on its input. Suriname0 (talk) 03:57, 24 March 2022 (UTC)
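
To make the ambiguity concrete, here is one guess at what an explicit input/output schema could look like. The field names, taxonomy labels, and the two-input framing are my assumptions for illustration, not the model's actual interface:

    # Hypothetical input/output schema for the topic model.
    # All field names and labels here are illustrative assumptions.
    from typing import List, TypedDict

    class ModelInput(TypedDict):
        wiki_db: str       # source wiki, e.g. "enwiki"
        page_title: str    # article in namespace 0
        revision_id: int   # which revision's wikilinks are used?
        # Implicit second input: a wikilink -> Wikidata-ID mapping.
        # At what snapshot/timestamp is that mapping taken?

    class TopicScore(TypedDict):
        topic: str         # one of the 64 taxonomy labels
        score: float       # model confidence in [0, 1]

    example_output: List[TopicScore] = [
        {"topic": "STEM.Chemistry", "score": 0.93},
        {"topic": "History and Society.Transportation", "score": 0.12},
    ]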

Hi @Suriname0 — thanks for this useful feedback! I've incorporated most of your comments into the "Data pipeline" section, and added a collapsible box with an example that seeks to give more of an intuition about the data pipeline. Any other comments? HTriedman (WMF) (talk) 21:56, 28 March 2022 (UTC)
I think it's great! Really awesome initiative! Suriname0 (talk) 04:40, 29 March 2022 (UTC)

Lead, Audience, Additional Details

Thanks for this! I'm coming mostly from a researcher perspective here and speaking as someone who regularly uses various Wikipedia/WMF APIs, models, and data products.

Overall and as others have said, it took me a moment to figure out what a model card is, but I get it now. Judging it as a template, I think it would be improved with some thinking about what kind of information belongs in the lead and who the audience is for the lead and the card as a whole.

Content-wise: Perhaps focus the lead on the specific problem(s) the model is trying to solve or is useful for? The first sentence is key. Maybe a "how can we" type sentence followed by a "why we care" type sentence would be a good formula, e.g. "How can we predict what general topic an article is in, and do so consistently across many languages? This is useful for X, Y, and Z, but difficult because of A, B, and C." Re-ordering the information sections that follow would help with clarity -- model performance seems less important than several of the sections that follow it, and the 'information hub' could be an infobox on the right side (does the hub need a name, or could it just be unlabeled?). The motivations section is closer to what a lead might be, but not quite there. The abstract of the WWW'21 paper actually seems like a pretty good candidate for sourcing lead text, with tweaks as appropriate. It might also be useful to explain how the 64 topics the model uses were derived.

Audience-wise: I can see a couple of uses for the card, both as a quick reference for myself when using the model and as a footnote in papers published using it; maybe it would be useful to think about documenting release versioning in the card so that it can be cited properly. Maybe it would make sense to have a more narrative page distinct from the card, so that the card can stay short? I am thinking about how this card contrasts with the ORES MediaWiki article; I refer to that page often when using/writing about ORES and have found it useful. For my purposes as a reader, the key pieces of information are: the link to the paper and the interface, the section labelled 'Data' at the bottom (there's nothing like knowing the pipeline to know what the model is really doing), the list of what to use this model for, and the recommendation of a 0.5 threshold (with the explanation that this is a confidence, not a relatedness proportion). These bits are a little scattered on the page.

Unrelated -- have you thought about adding a feedback loop for some form of false positive/false negative reporting? Thanks again for the effort here; I'm looking forward to using this. Khascall (talk) 04:38, 24 March 2022 (UTC)

Hi @Khascall! I've made some changes (particularly to the top section) that you suggested here — the model information hub is now an infobox, I adapted your "how can we"/"why we care" approach, I moved the initial model performance table down, and I put in an example data transformation for a given page. Any more thoughts on how it looks now? HTriedman (WMF) (talk) 21:46, 28 March 2022 (UTC)
This is a nice improvement! Thanks for adding the BibTeX-ified citation, too.
A couple of small notes:
  • In the ethics section, the first bullet leads off with "This taxonomy..." -- I am guessing the taxonomy is the 64 topics; not to be obsessive on this point, but I really am wondering where the 64 came from. Anyway, maybe this really wants to be something more like "The model fits articles into a taxonomy developed..." -- since the card is about the model (not the taxonomy).
  • I'm having a bit of trouble following the third bullet in the ethics section -- if the model is based on the Wikidata items of the articles in the article's wikilinks, how are WikiProjects involved in the trouble this describes? The bullet goes on to say "film labels largely are missing actors" -- is film one of the 64 topics, so this should be "articles in the film topic are missing actors"? And then, are the actors missing as in the articles don't tend to talk about the actors, or is it that the actors don't tend to have wikilinks because they'd likely be redlinks -- i.e., the issue is that a content gap leads to poor recall? In that case it's not exactly a WikiProject gap, it's just another face of the content gap, I think? Or, to the extent that there's a WikiProject coverage gap, perhaps that case needs to be made more explicitly. Perhaps a rephrase like this would help: "Content gaps and lack of wikilinking can lead to biases in recall for certain topics. For example, the model may fail to identify articles about Nollywood films (i.e. films produced in Nigeria) because of a lack of articles about Nigerian films in general and, to the extent they do exist, a lack of wikilinks to those articles. Thus recall is lower for articles about Nigerian films and actors than for Hollywood (US) films and actors." Reading onward, I think I see some of the answer to my own question -- later I see that WikiProject labels are the ground truth. I take it that somewhere it has been demonstrated (cite?) that WikiProjects sometimes fail to attend to some areas, e.g. the Global South, in their efforts -- I don't disbelieve this but haven't reviewed the evidence on this front. So I take it the argument is that bias in the ground-truth data becomes a model-wide weakness -- that seems eminently plausible -- but now I am wondering why explicit discussion of this source of ground truth doesn't figure more prominently in the card. Instead it says that the training data is 30 million articles' worth of Wikidata entries; tweaks to the lead section should clear this up.
Aaaaand since you asked for feedback, although I'm no aesthetics expert, I'll mention it does seem really box-box-box-ified now :) :). That makes it look a bit off on a wide screen, for example. That said, I suppose lots of boxes is not unusual on a reference card. I don't see a lot of obvious candidates to remove in a de-boxifying pass, except maybe Performance Notes -- as a list of bullets, I suppose it could be treated like the Ethics bullets. Best wishes -- Khascall (talk) 01:37, 29 March 2022 (UTC)

Example inputs and outputs

It would be helpful to me to see what the model outputs. That would help me answer questions like "Can I get a confidence estimate for each topic or is it just a prediction for the most likely topics?" in an intuitive way. --EpochFail (talk) 19:33, 24 March 2022 (UTC)
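
For illustration, if the model does return a confidence score per topic, consuming it with the 0.5 cutoff the card recommends might look like this sketch (topic names and scores are invented, not real model output):

    # Sketch: per-topic confidence scores filtered at a 0.5 threshold.
    # Labels and scores are invented for illustration.
    hypothetical_scores = {
        "STEM.Chemistry": 0.97,
        "STEM.Physics": 0.41,
        "History and Society.Transportation": 0.08,
    }

    THRESHOLD = 0.5  # the card recommends 0.5; consumers can tune this
    predicted = [t for t, s in hypothetical_scores.items() if s >= THRESHOLD]
    print(predicted)  # -> ['STEM.Chemistry']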

Hi @EpochFail! Just added an example model I/O section with a model output to address this point. As for the question about namespaces outside of 0, the model was trained only on articles from ns0, so predicting on other namespaces may lead to unintended or faulty behavior. To your other question about navlink templates — I don't actually know the answer to that. Are links in the navlink templates included in the pagelinks database that this model gets training data from? If so, then it probably has an effect of some magnitude.
Any other thoughts/questions/comments on the structure of this document? HTriedman (WMF) (talk) 22:11, 28 March 2022 (UTC)

Does this model have strange behavior with navlink templates?

So some articles have navlink templates. See Tungsten carbide as an example: the bottom of the article has two expandable templates with a bunch of links to "Tungsten compounds" and "Salts and covalent derivatives of the carbide ion". This is a very different link pattern than what is seen in the body text. Does the presence or absence of these huge piles of related links help or hinder prediction accuracy? That might be very important for draft articles, which might not include the relevant navlink template yet. See Minnesota for a more extreme example. --EpochFail (talk) 19:38, 24 March 2022 (UTC)
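
One rough way to probe this question empirically (my sketch, not anything from the card): compare the links written directly in an article's wikitext against the links recorded in the pagelinks table after template expansion; the difference approximates the navbox contribution. The endpoints below are the standard MediaWiki Action API, but the comparison itself is only approximate, since title normalization differs between the two sources:

    # Rough probe: how many of an article's links come from transcluded
    # templates (navboxes) rather than its own wikitext?
    # Uses the standard MediaWiki Action API; exploratory sketch only.
    import requests
    import mwparserfromhell

    API = "https://en.wikipedia.org/w/api.php"
    TITLE = "Tungsten carbide"

    # 1. Links written directly in the article's wikitext.
    r = requests.get(API, params={
        "action": "parse", "page": TITLE, "prop": "wikitext", "format": "json",
    }).json()
    wikitext = r["parse"]["wikitext"]["*"]
    direct = {str(l.title).strip()
              for l in mwparserfromhell.parse(wikitext).filter_wikilinks()}

    # 2. Links in the pagelinks table (includes template-transcluded links).
    all_links, params = set(), {
        "action": "query", "titles": TITLE, "prop": "links",
        "pllimit": "max", "format": "json",
    }
    while True:
        q = requests.get(API, params=params).json()
        for page in q["query"]["pages"].values():
            all_links.update(l["title"] for l in page.get("links", []))
        if "continue" not in q:
            break
        params.update(q["continue"])

    print(f"direct wikitext links: {len(direct)}")
    print(f"pagelinks total:       {len(all_links)}")
    print(f"template-contributed (approx.): {len(all_links - direct)}")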

Some confusing details

  • In some places the term "WikiProjects" is used. I have rarely seen it written like this, with this capitalization. I can see it referring to two different things: either simply to our cluster of about 1,000 Wikimedia wikis, or more specifically to individual wiki projects in the project namespaces of these wikis. Not sure if it makes much of a difference, but it might be worth clarifying.
  • A sentence explains that the list of 64 topics mostly reflects English Wikipedia. This is quite surprising, given that one of the main properties of the model is supposed to be that it is language-agnostic. There is some contradiction here that I feel should be mentioned and explained earlier in the document.
  • What does "0.877 (micro); 0.836 (macro)" mean? This appears in a few places and is labeled as a "percentage", but it doesn't look like one and doesn't match the rest of the column, which actually lists percentages. (One guess is sketched below this list.)
  • The model relies solely on links. I wonder if there are wikis where the rules and culture of linking are so different that the model can't produce useful results any more? For example: when we decide which word is worth being a link in an article, we focus on rare, hard-to-understand things. We rarely link basic concepts like "sport". This is just not useful for the reader in most contexts, even if it might indeed be the best possible topic for an article. I understand this is not how the model works; it doesn't use literal links but the network of links (did I get that right?). Still, I find it worth exploring how linking practices differ between wikis, languages, and cultures.

--TMg 17:14, 7 April 2022 (UTC)
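
On the "(micro)/(macro)" question above: my reading, which is an assumption since the card doesn't say, is that these are micro- and macro-averaged values of a score such as F1 computed across the 64 topics, which would explain why they are fractions rather than percentages. A small sketch of the difference:

    # Sketch of micro vs. macro averaging for a multi-label classifier
    # (my assumption about what "0.877 (micro); 0.836 (macro)" denotes).
    from sklearn.metrics import f1_score

    # Hypothetical per-article topic labels for a 3-topic toy problem.
    y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
    y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

    # Micro: pool all topic decisions, then score once.
    print(f1_score(y_true, y_pred, average="micro"))
    # Macro: score each topic separately, then average the scores.
    print(f1_score(y_true, y_pred, average="macro"))

Micro-averaging pools every topic decision before scoring, so frequent topics dominate; macro-averaging scores each topic separately and then averages, weighting rare topics equally. That both values sit in [0, 1] rather than being percentages matches the figures quoted in the card.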