Research talk:Knowledge Gaps Index/Taxonomy

Feedback collection September 2020

We invite you to provide your feedback about the first draft of the taxonomy between now and 2020-09-30. We encourage you to share your feedback as comments to the questions below. We understand that some of you may prefer a private pathway to provide feedback. In that case, please share your feedback through the Google Form that we link from here.

Relevant material to learn about the taxonomy:

Short summary and motivation (pdf on Commons)
A presentation (video on Commons, video on YouTube)
The complete write-up of the draft of the taxonomy of knowledge gaps (full paper on arXiv)

Consider the 35 gaps identified in the taxonomy: Are there gaps that we have failed to identify?

For an overview of all gaps, check Figure 1 on page 5 in the full paper (or the image on Commons); for the gaps on readers, contributors, and content, see the respective tables in the summary (pages 2, 3, 4) or the full paper (pages 6, 12, 16).

Add your response here and sign.
My understanding was that the four main knowledge gaps on Wikipedia were its requirement for reliable sources, its recency bias, the ethnicity bias and the gender gap. The requirement for reliable sources skews knowledge gaps in several ways. It depends partly on how accessible those sources are: free, online sources in the same language as the article are more likely to be used than offline, paywalled or other-language sources. It also depends on whether reliable sources actually exist. Here we have known knowns, such as the nineteenth-century cricketer who we know played in a first-class match but whose first name is now lost to history; known unknowns, such as the protagonists in a Dark Ages battle for which we only have archaeological evidence; and unknown unknowns, such as the history of preliterate tribes, especially those who got swept aside by war. Lastly, the requirement for reliable sources and the general notability guideline give us huge knowledge gaps regarding the lives of people and other subjects who don't meet our notability criteria, in some cases because they died too young, in other cases because discrimination stopped them from getting a post then reserved for men or aristocrats. The recency bias is a product of Wikipedia only starting in 2001: coverage of many areas of popular culture and history is very incomplete earlier than this, with gaps filled in by small numbers of enthusiasts who by default have great influence. Certain sports and TV series that still have fans are well covered for decades prior to the birth of Wikipedia; others less so. The ethnicity gap is mentioned in the report, but the main emphasis seems to be an internal US one. The extreme contrast between the amount of editing from Africa and from Europe is barely picked up, and there is a probably false implication that the gender gap is larger, not least because it tends to be talked of first in the report.
The gender gap in the community is well established and clear; the gender gap in content needs to be put into context against the secondary sources available to us. Classic examples such as Fellows of the Royal Society and Members of the House of Commons need pointing out: we have had articles on all female members of the Royal Society for some years, so if and when people fill in the gaps and create articles on missing fellows and MPs, the gender gap will appear to increase if we simply measure the male-to-female ratio and don't index it to sources such as indexes of biographies, lists of Nobel prize winners, etc. WereSpielChequers (talk) 19:13, 11 September 2020 (UTC)
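The point above about indexing the ratio to a reference source can be illustrated with a small sketch. The class and function names and all numbers below are hypothetical, chosen only to show the arithmetic; they are not taken from the paper or from any real dataset.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Counts for one gender within a single reference source (hypothetical data)."""
    in_source: int      # people of this gender listed in the reference source
    with_article: int   # of those, how many already have an article


def raw_female_share(women: GroupCounts, men: GroupCounts) -> float:
    """Share of existing biographies that are about women (the unindexed measure)."""
    total = women.with_article + men.with_article
    if total == 0:
        raise ValueError("no articles to compare")
    return women.with_article / total


def coverage(group: GroupCounts) -> float:
    """Fraction of the people listed in the source who have an article."""
    if group.in_source == 0:
        raise ValueError("source lists nobody in this group")
    return group.with_article / group.in_source


# Illustrative numbers only: every listed woman has an article, many men do not.
women = GroupCounts(in_source=100, with_article=100)
men_before = GroupCounts(in_source=900, with_article=600)
men_after = GroupCounts(in_source=900, with_article=900)  # missing men filled in

share_before = raw_female_share(women, men_before)  # 100 / 700
share_after = raw_female_share(women, men_after)    # 100 / 1000
```

With these made-up numbers the raw female share falls from about 14% to 10% once the missing men are added, even though coverage of the women listed in the source is complete throughout. Comparing `coverage(women)` with `coverage(men_before)` or `coverage(men_after)` is one possible way to index the measure to the source.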
I have always been interested in the gaps since I noticed the heavily USA-biased Wikipedia content having to do with my area of work back in 2004. Since then I have come to realize that highlighting gaps is only limited by our ability to measure them.

I started measuring various parts of the gender gap out of curiosity back in 2013 as soon as I figured out a way to measure it using external websites. Later I made queries with Wikidata thanks to the tools of User:Magnus Manske. Now with SPARQL I am able to get even more complex questions answered. I recently updated one of my early 2014 slides based on a dataset en:RKDartists I use a lot with my d:WD:SOAP work on Dutch 17th-century paintings (the dataset is heavily Dutch-biased, but is much larger than Dutch 17th-century painters). Click through to see earlier snapshots of the same slide, based on queries done on those earlier dates. The Gendergap (no space) is what I define as the percentage of female contributors to Wikimedia projects, which we are still unable to measure, but which we can approximately infer through surveys on various language projects. The relationship of the Wiki contributor Gendergap to various gender gaps in societies exists but is not measurable. Examples in societies that are measurable are in publishing, academic professor positions, politician positions, awards, national TV & film productions etc. The gender gap in coverage on Wikipedia language versions theoretically reflects such societal gaps 1-1, but we know from external databases that this relationship is not 1-1 and it would be good to measure these regularly in order to monitor our progress to highlight areas needing attention. Jane023 (talk) 10:05, 18 September 2020 (UTC)
The taxonomy of 35 gaps is very comprehensive. I understand that some gaps are categories that apply both to readers and contributors (through sociodemographics) and to how they are represented in content. I understand that "language" and "geography" are two special categories that should be part of each of the three dimensions. Language is essential as it is a defining characteristic for a project and I would include it as part of the content. The language gap is huge in Africa and Oceania, but not in Europe. Geography is also an essential category as it is useful to segment groups of editors or readers and I would include it as part of the sociodemographics facets. Marcmiquel (talk) 15:23, 21 September 2020 (UTC)
As WereSpielChequers mentioned, there is a recency bias in content, which could be thought of as a "time gap", because some centuries are not covered with the same depth in every language. This would be an addition to make to the "diversity" facet, which I think should be called "topic". Diversity is an overarching theme in every dimension and we might get more diversity as we bridge the different gaps. But the "time gap" can also be part of contributors and readers, as we have a problem retaining newcomers and the communities are becoming more and more aged. The lack of newcomers or editors with different lifespans in the projects also has an impact on the content created. Even though community growth and retention is another avenue of research, I would include "time gap" as part of the Contributors dimension (in particular in the Contextual gaps) as it is very desirable to bridge this gap. I would also rename "Contextual Gaps" to "Engagement", as it includes gaps related to motivation and community role and, maybe, this newcomers gap. Marcmiquel (talk) 15:23, 21 September 2020 (UTC)
I think it would be useful to have a "Wikipedia and free knowledge awareness gap" in the "Information Need" facet. This may be useful to explain aspects prior to engaging in reading content in Wikipedia, complementary to motivation and information depth. Marcmiquel (talk) 15:23, 21 September 2020 (UTC)
Where would Braille go in this taxonomy? Readability or Multimedia? Both? Other places under readers and contributors? How about spoken language pronunciation remediation? The MediaWiki, databases, tools, and bot infrastructure? Those all seem to be missing from the taxonomy. James Salsman (talk) 15:35, 23 September 2020 (UTC)
"We define readers as all users who connect to the site to consume Wikimedia content. While there exists a body of research[18][19][20][21][22][23] studying how content consumption happens outside of Wikimedia, e.g., voice assistants, search engines, or third-party apps, we scope down our definition of readership to readers who come directly to the projects to access content." This misses out several groups of people, all of whom arguably fall into our remit of making the sum of all knowledge freely available to all, but some of whom probably should be left out. I'm happy to concede that some people will be too young to directly use our content. I leave it to others to decide where this comes between fifteen months and fifteen years - I know my limitations, and deciding whether a child is old enough to have unfettered use of the internet is not a decision that I feel qualified to make. There are people who are unable to access our sites because they lack IT access, literacy, or literacy in a language where we have meaningful amounts of content; all of which we have long sought to address and with some success. And there are people who can't access our sites because they live in countries that ban our sites; an issue where we have long taken a principled position that a more commercial organisation would probably not have done. WereSpielChequers (talk) 23:40, 27 September 2020 (UTC)
There is a “background” gap defined for readers but not for contributors. Is there a reason for that? What makes “background” irrelevant for contributors if it is relevant for readers? From the Swedish chapter’s point of view, I’m happy that you actively include accessibility as a facet for readers, with the objective that “readers with different technical setup and skills can easily access Wikimedia projects”. That is very much in line with our work with Wikispeech, and an area where I think that the Wikimedia movement can do more. Eric Luth (WMSE) (talk) 12:07, 29 September 2020 (UTC)
N-grams which appear frequently in LexisNexis news stories but have no Wikipedia articles are called "LNNP" terms in [1] and described in Section 3 as either "domain-specific (e.g., forward-looking statements) or being a general phrase (e.g. tens of thousands)." The domain-specific N-grams should almost certainly be included in the taxonomy, and do not appear to be included, explicitly or even implicitly, at present. 107.77.165.30 18:51, 11 October 2020 (UTC)
I feel sociodemographics in Readers / Contributors, and probably even content diversity, are missing some important gap types: geography, nationality, politics, religion and race/ethnicity. Geography seems like a very obvious thing to me (there is a "locale" category but AIUI it is not the same); we know there are vast disparities there. It is a subclass in content diversity, but the distribution of readers and editors seems just as important. Nationality is maybe the same thing, but political or regulatory intervention (e.g. blocking access to Wikipedia, as happened in Turkey) can create gaps on a national level which might not be easy to capture with geographical metrics. Politics is an area where we know the composition of editors is very disproportionate, at least in large wikis (anecdotally, the mismatch between e.g. conservative/liberal editor ratio and population ratio is of a similar size to the gender gap), and if left unchecked that kind of gap is liable to be used as an attack vector by actors who are trying to discredit Wikipedia or (in the case of governments and such) trying to regulate or ban it, so it seems like an important thing to get a handle on. Religion and ethnicity are usually the attributes by which an oppressed minority is differentiated from the majority, so if we want to get a handle on the extent to which minorities get excluded from content consumption and production on Wikimedia projects, those seem to be decent proxies. --Tgr (talk) 08:24, 12 October 2020 (UTC)
In Content>Accessibility, I think it would be good to have something like familiarity or background knowledge required - anecdotally, there are big differences between wikis, and even topics / contributor teams within wikis, in how much of the content can be grasped by e.g. a high-schooler or a university student. (Maybe that could go under readability, but currently that seems to focus on much lower-level metrics.) --Tgr (talk) 08:24, 12 October 2020 (UTC)
I wonder if a "currentness" category would make sense in Content>Policy? (It's not really a policy, but seems closest to those two.) Just like the same article might be well-referenced on some wikis and poorly on others, it might be up-to-date on some and outdated on others. E.g. articles mass-created and then abandoned by some bot or translation tool five years ago might look good by other metrics like topic coverage, depth or verifiability; currentness would identify the core problem with them. --Tgr (talk) 08:24, 12 October 2020 (UTC)
I wonder if "impactful topics" should have a Readers counterpart? Sure, some topics are impactful simply by having the potential for transforming the reader's life fundamentally (say, basic healthcare information), but to a large extent impact means matching content coverage to what readers care about. So knowing readers' topics of interest seems quite useful. --Tgr (talk) 08:51, 12 October 2020 (UTC)

Are there gaps that could be improved or should not be part of the taxonomy?

For example, do you have suggestions for how to improve the naming of the different gaps? Or are some gaps misrepresented or lacking crucial aspects?

In your paper, you mention "culture gaps" and "culture-specific content gaps", but in the graphic you have "cultural context" as the gap, which seems like something slightly different to me. What is meant by "cultural context" here? Would it be better to call this gap just "culture" instead (perhaps hewing closer to the paper)? Kaldari (talk) 01:10, 11 September 2020 (UTC)
You should do a structural analysis of why gaps persist (working toward an action plan for each gap). For example, the Multimedia Gap persists because of the selection bias toward verbal editors rather than visual learners, and the bias of WMF not to invest resources in rich multimedia. Slowking4 (talk) 02:33, 16 September 2020 (UTC)
In section 4.3 Content diversity (which I would call Topic gaps, as I said before), I think it would be good to mention that these gaps are found both at article and in-article level. I mostly measure gaps at article level, but there is a lot of literature on how certain points of view are neglected because of political conflicts, cultural idiosyncrasies, etc. Sometimes these are referred to as "cultural biases" or "language point of view", as they show the differences between the same article in different language editions, but in the end, they are gaps at a different level. I would only mention these two levels at the beginning of the subsection or in the table. Marcmiquel (talk) 15:42, 21 September 2020 (UTC)
In the dimensions Readers and Contributors there are two gaps ("Background" and "Additional Characteristics") which I would define in more detail, even though there may not be enough sources, so that they are properly addressed and the categories are consistent across sections. These two gaps could be recategorized as "Cultural group", "ethnicity" and "sexual orientation" and introduced in both readers and contributors. In the content dimension there is already a "cultural context topics gaps", which includes both the topics from a cultural group (e.g. an ethnic group or a national group of people) and the cultural context (e.g. the topics related to a geographical context in one language... which we could say also relate to a cultural group). Having "Cultural group" in readers and contributors would make it consistent. I think it is important to be consistent with the same categories across dimensions so that we are able to compare the same gap in different dimensions of Wikimedia. Marcmiquel (talk) 15:42, 21 September 2020 (UTC)

A few comments on my end (based on the short summary; if it is explained in the long study, you may disregard the comment):
- In my opinion, “verifiability” is a more multifaceted concept than what the description of it in the summary reflects. I wouldn’t say that it is only about reliability. A source might be reliable, but inaccessible for others. Views on that vary greatly across projects. We often understand a “good source” as a published book rather than a web-link, but if I can access the cited book from no library in my country and it is out of stock to buy, how can I verify the content? The book might for sure still be reliable – but the information cited in the article is not verifiable. In Wikipedia terms, wouldn’t an unreliable website in a sense be more verifiable than an inaccessible book? Or to give another example, that we often come across in the WikiGap Campaign: Can or should a Swedish language article cite a source in Hindi, even though most Swedish speakers would not be able to understand or check the source? A constant issue in WikiGap, especially when contributors in the Global South write on English Language Wikipedia, is that they use sources that are not understood by users with powers, and articles might get deleted, just because admins on the project don’t understand, or don’t know how to read or access, the cited information. I assume you know most of these tendencies of course, but summarizing verifiability as being “only” about reliability risks being a bit too reductive, in my view.
- I think it is great that “Structured Data” is included as a content gap. Many articles, especially in smaller language versions with few active community members, might have counterparts in other languages, but because the few active editors don’t know about or how to use Wikidata, they don’t create the proper connections. I am not sure to what extent structured efforts from WMF, or perhaps the use of bots, can solve this issue, but I see it as quite important to look more into.
- I do wonder what is meant by “impactful topics”. Several community-based efforts have been launched, some of them dating way back, trying to define what topics are central to all projects (like “Lists of articles every Wikipedia should have”). Those efforts are often criticized for their eurocentrism. Wikimedia Sverige have tried globally, together with international partners, to define objects that are lacking from Wikipedia with a more global coverage. But these objects are still within certain themes, and might not fall under the definition of “common interest”. So: how do we know what is of common interest? And who decides?
- My last comment is about “geography”. I would hope, though I don’t know if that is possible, that the focus is not only between geographic regions or populations, but also within. Many countries, for example, might have a very good coverage of the capital or city regions, but very poor coverage of villages, towns and the countryside. Because of historic or political reasons, different regions might have lacking data, information or restrictions on users. I would define this diversity gap as important as well. Eric Luth (WMSE) (talk) 12:10, 29 September 2020 (UTC)
I propose to add important Wikidata properties (missing dimensions) like Country (P17, P495), Nationality (P27), Domain (P101), Profession (P106), Region (P706), City/administrative territory (P131, P19, P20, P1071), Location (P276), Date of creation (P571) as important knowledge gaps, besides the evident properties like Language (P103, P1705), Alphabet (P282), Birth date (P569). Geert Van Pamel (WMBE) (talk) 08:31, 30 September 2020 (UTC)
For content diversity, tracking depth and not just the existence of an article is important (are the articles on the topic stubs or feature-length articles? at least on some wikis, ORES should make judging that easy). I'm probably stating the obvious here, but I didn't see it mentioned in the paper. --Tgr (talk) 08:36, 12 October 2020 (UTC)
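A minimal sketch of what tracking depth rather than mere existence could look like, assuming each existing article on a topic already carries a quality label on the English Wikipedia assessment scale (Stub, Start, C, B, GA, FA). How the labels are obtained (for example from a classifier such as ORES) is left out, and the function names and sample data are hypothetical.

```python
from collections import Counter

# English Wikipedia assessment classes, ordered from least to most developed.
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]


def depth_profile(labels):
    """Return the share of articles in each quality class for one topic.

    `labels` is an iterable of quality-class strings, one per existing article;
    labels outside QUALITY_ORDER are ignored.
    """
    counts = Counter(label for label in labels if label in QUALITY_ORDER)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {quality: counts[quality] / total for quality in QUALITY_ORDER}


def stub_share(labels):
    """Share of a topic's articles that are stubs (0.0 if the topic has no articles)."""
    return depth_profile(labels).get("Stub", 0.0)


# Hypothetical example: two topics with the same article count but different depth.
topic_a = ["Stub", "Stub", "Stub", "Start"]
topic_b = ["Start", "C", "B", "FA"]
```

Both made-up topics have four articles, so a count-based coverage metric would treat them as equal, while `stub_share` separates them.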

Are you aware of any sources we have missed for the knowledge gaps in the tables?

Check the tables with the gaps on readers, contributors, and content in the summary (pages 2, 3, 4) or the full paper (pages 6, 12, 16). For example, a major community initiative dedicated to addressing a specific gap.

There will be gaps we miss this time around. Like the encyclopedias, this is a task that will not be finished in one attempt. In a way it could be considered a metaencyclopedia (Metapedia?) - an encyclopedia about encyclopedias. · · · Peter (Southwood) ^(talk): 10:05, 11 September 2020 (UTC)
While I probably wouldn't be the person who can add information here in any case, I think you would get more useful answers if you provided the existing sources in a more manageable fashion. The summary doesn't make any mention of them, and in the full paper you'd have to continuously jump between tables, text and the reference section to see what sources are mentioned already, which seems quite cumbersome for someone who just wants to check out what sources you are using on the one or two topics that are their expertise. --Tgr (talk) 08:40, 12 October 2020 (UTC)
...

What are possible use-cases of this taxonomy for you?

For example, our motivation for the taxonomy is to build a comprehensive framework for understanding and measuring knowledge gaps.

Three things come to mind immediately:
1. There will be gaps I may be able to help fill
2. There will be fields of knowledge I am currently unaware of that may be of interest to me
3. This looks like it could be a good starting point for some navigational tool(s) that will help find existing content that is not obvious enough to just use a search engine (search engines work really well when the right search string is used, not so well when you don't yet know what to call what you are looking for). To identify gaps it will be necessary to first identify and classify existing content. The framework of the existing content will suggest some of the gaps, and could be very useful in itself. · · · Peter (Southwood) ^(talk): 10:00, 11 September 2020 (UTC)
I think the visuals would interest those potential partners we might be doing a presentation to (such as those working in the field of knowledge for example). So outreach could be a use-case. Anthere (talk)
The Taxonomy can be used to track scholarly research outputs about knowledge gaps using Scholia. --Csisc (talk) 18:09, 19 September 2020 (UTC)
I think any stakeholder in the Wikimedia movement should be able to use this taxonomy or part of it in one way or another, for the simple reason that the taxonomy encompasses everything (readers, contributors and content). How? Communities: Understanding the content gaps mainly and obtaining actionable resources to bridge them. And in the WMF, Advancement: understanding the content gaps to obtain resources and engage in partnerships; Community engagement: encouraging the use of tools, indexes, etc. in the community/affiliate programs to bridge the content and contributor gaps; Product: understanding the contributor and reader gap so they can improve the technology (interface, etc.) and improve the accessibility. Marcmiquel (talk) 16:20, 21 September 2020 (UTC)
We are working with international partners to develop efficient ways of finding missing topics. I think that this taxonomy might make this much easier, as we can show, in a summarized form, what kind of gaps we actually need to overcome, in our case especially in regards to contributor and content gaps. What I would like to see is perhaps a more externally understandable version of this taxonomy. If verifiability is a gap, what does that mean in Wikimedia terms? I think the information is a good foundation for instructional material for Wikimedia affiliates as well as external partners on how to help overcome the gaps. Eric Luth (WMSE) (talk) 12:10, 29 September 2020 (UTC)
I think such research will help us to work in a much more efficient way in terms of research, community building and filling knowledge gaps. Rajeeb (talk) 15:41, 30 September 2020 (UTC)
The taxonomy provides a comprehensive framework for targeting considerations as I work with small African-language WPs where all three top-level dimensions require development. Where to focus contributor efforts for optimizing results and triggering a multiplier effect? How to enlist external partners (e.g. public education strategists, academics, professionals in the private sector) to engage in content supervision or similar involvement? The whole Vital articles set-up is due for a serious diversity overhaul, for which I apparently need to join the appropriate task force. May I also mention the urgent circumstances of the coronavirus and COVID-19 restrictions making free digital access to knowledge a cornerstone of education in our time. -- Deborahjay (talk) 20:43, 30 September 2020 (UTC)
Contributor gaps could guide targeted outreach by the WMF, affiliates and wiki communities. Reader gaps could be a starting point for identifying deeper problems with contributors and content; better understanding reader interests, skill levels, access barriers etc. can also be useful for prioritizing content work and technical work. Content gaps can guide content work and help people organize on-wiki interventions, and guide targeted outreach. Better verifiability and neutrality metrics could help defend against spam and disinformation campaigns. --Tgr (talk) 08:48, 12 October 2020 (UTC)
...

Do you have any other thoughts you would like to share with us about the taxonomy of knowledge gaps?

Anything that is not covered above.

I have not read very far yet, but what I have seen is promising regarding how you are going about this. More later, perhaps. · · · Peter (Southwood) ^(talk): 09:36, 11 September 2020 (UTC)
It does not feel right not to be able to copy edit typos and spelling errors in the document;-/ · · · Peter (Southwood) ^(talk): 17:44, 11 September 2020 (UTC)
Over a decade ago marketing had a rule of thumb that people would be on the internet for two years before they started using it for shopping (this may have been a dodgy rule of thumb and may be out of date, but I have never seen the Wikipedia equivalent). Editing Wikipedia is clearly not an entry level hobby for new internet users, but it would be good to know how long the typical Wikipedian was online for before they first edited, and of course whether that varies by Mobile v Desktop. This obviously then will need to be correlated with the history of internet access in particular countries. Current internet access rates are only the tip of the iceberg - a community that has only come online in the last five years is likely to have generated less content than a community that has been online since before Wikipedia was started. Apologies if this is in but missed by me, I have only had time to read the report once. But at the end I was wondering if this had been missed or I had missed it. WereSpielChequers (talk) 19:24, 11 September 2020 (UTC)
One super practical thought... please always remember to tag (categories, or even structured data) the documents you publish on Wikimedia Commons, so that people have a chance to find them even if they have no idea of the existence of this page on meta. When I reworked the Gender gap portal, this was one of the tendencies I noticed with reports published by WMF staff... they were put on Commons... but no categories added. This makes content curation very complicated and ultimately... yeah... accessibility issues... knowledge gaps! So please... you did a fabulous job with the taxonomy, a great job with the pdf presentation and with the video presentation from Miriam. Just do remember to tag those resources to help us find them when we "wander" on Commons ;)
Once a gap is found, work on that gap reveals all sorts of related gaps that are nearly impossible to fill because of a lack of sources. For example with gender gaps, once we have interconnected family trees on Wikidata we notice that lots of men sprout magically from other men, without any female intervention to speak of, sometimes several generations over. Using provenance information and the simple fact that men often had multiple wives before a male heir, slowly patterns of inheritance and related family trees can be brought together. In the case of colonialism, the conquering nation rewrote the histories of conquered nations in the conquering language and eradicated the stories in the native language. This was often due to negligence rather than deliberate obstruction of records, but the effect is the same: only limited sources survive to revive stories of the conquered people. I think it is safe to say that no "taxonomy of gaps" will ever be complete, and indeed they will be very messy. When it comes to women's stories, such as the history of women's fashion, birth practices and early childcare, very few early sources are available that were written by women. We can't go back and fill in those missing pieces, but with the help of big data and overlapping scientific research we can guess stories based on adjacent surviving stories. This kind of thing is not something that is easily sorted into different branches of an ontology but is more the kind of insight that can be interpreted after the fact. So for me, reading the term "taxonomy of gaps", my first thought is about the language of an original story and its topic vs. the language of the source used to verify it. It is sad that so many small Wikipedias need to reuse English sources for lack of local sources. It would be nice to make this issue visible by measuring & monitoring it across language versions, but I am not sure how that fits into this definition of "taxonomy of gaps".
Jane023 (talk) 16:56, 18 September 2020 (UTC)
It would be interesting to apply semantic annotations, topic modelling, word and graph embeddings, and semantic similarity measures to research publications about knowledge gaps, to extract and retrieve all terms about knowledge gaps and their links with interesting related concepts. Such an automatic process can assign the most used term to every concept of the taxonomy, enrich the taxonomy with non-taxonomic relations, and consequently turn the taxonomy into a large ontology. I would be interested in performing this work. --Csisc (talk) 18:00, 19 September 2020 (UTC)
To ensure the sustainability of the ontology, it would be interesting to involve it in a large-scale collaborative knowledge graph, particularly Wikidata. This way, it can be updated and adjusted using human efforts and bot edits. The only barrier to doing this is the licensing matter. The Taxonomy should be released under the CC0 License. --Csisc (talk) 18:07, 19 September 2020 (UTC)
If I had to say one more thing, perhaps I would highlight the fact that in readers and contributors there is the "accessibility facet", which I would not include there. I think it makes more sense to consider some of its gaps part of the sociodemographics. I think that it makes sense to consider "tech skills" as part of the contributor and next to education. But I would not include Internet connectivity as a gap in both readers and contributors (what's the difference between them in this regard if they live in the same territory?). I think this opens a debate on how we think about the barriers, which are apparently gaps in the context or in the Wikimedia movement. The difference between the "accessibility gaps" and the rest is that the others are part of the desired outcome. We want diversity in terms of gender, age, language, etc. in readers and contributors (also in content), while in this case we do not want diversity in terms of "internet connectivity". It just is. There are some barriers or gaps in the context and in Wikimedia that matter a lot and that we should be able to include in the taxonomy, measure, and use to understand the gaps in the three main dimensions: content, readers and contributors. For example, the usability of the Wikimedia websites or freedom of speech are very important to allow contributors to edit... The first is related to the "Wikimedia structure" and the second is related to a political and geographical "context". We know they are missing sometimes. I would suggest the creation of a fourth dimension named "Medium" or "Context and structure", in which to include all these other gaps. I think this would be beneficial in engaging many other parts of the movement in measuring and bridging the gaps related to the outcomes (readers, contributors and content) and those that can be seen as the barriers that cause them (context and Wikimedia structure). Marcmiquel (talk) 16:20, 21 September 2020 (UTC)
Finally, I would like to congratulate the team for putting forward such a very necessary project. I've been working on the Diversity Observatory for three years and researching gaps, diversity and participation for a long time and I think that a project like this is fundamental to make the movement more mature and grow larger and more diverse. Marcmiquel (talk) 16:20, 21 September 2020 (UTC)
Okay, a bunch of things: (Sorry for not sorting these.)
- First of all, the lack of a wiki page was bugging me, so I converted the full paper to wikitext and posted it here.

@Yair rand: I really appreciate you doing it. Given the limited time we had, we worked in LaTeX to have an easier time managing references. I can totally see how this can bug Wikimedia readers. Thank you! @Pbsouthwood: see the link for the wiki page above as you weren't happy with the pdf either if I read one of your comments accurately. --LZia (WMF) (talk) 15:33, 23 September 2020 (UTC)

- The "structured data gap" doesn't really seem like it's the same category as anything else here. I don't see how it's a "gap" in the way these other things are. It doesn't match to the mentioned definition's "unbalanced coverage across its inner categories". It seems to basically be, "we don't have enough structured data". That's no more of a distinct gap than "we don't have enough textbooks in Wikibooks". It's a general content category. What are the "inner categories" which present imbalance?
- The summary talks about knowledge gaps across the Wikimedia projects, but the content seems to be almost entirely about Wikipedia plus a few bits on Commons and Wikidata. Even within Wikipedia, there's a clear focus on the English Wikipedia. This is concerning, especially if this taxonomy is to achieve widespread use around Wikimedia.

@Yair rand: I understand your point about different project types. I have a clarifying question about your comment about a clear focus on English Wikipedia (which is something we actively want to change across much of our work). Can you expand on the sense in which you observe this focus? Is it through the examples of community activities? Or is it when we use references to back things up and the references are focused on enwiki and what happens there? --LZia (WMF) (talk) 15:33, 23 September 2020 (UTC)

- The Readers and Contributors sections each have "Sociodemographics" subsections that mostly match up, except that what is called "Background" in the Readers section (section 2.1.7) is called "Additional characteristics" (section 3.1.7) in the Contributors section, and is also curiously absent from the chart/graphic, the only entry listed in the content but absent from the chart. It deals with an important element, including political/cultural/ideological/religious diversity of the editor community, and I think it shouldn't be unlisted in certain presentations of the taxonomy.
- "Contextual gaps" is renamed "motivations" in the chart. Why is this?
- Re the "Contributor motivation gap": I'm not sure this is a thing we really care about, except for monitoring for general awareness purposes. Why would it be a problem if one Wikipedia has a lower or higher number of people who are contributing for a particular reason? This isn't explained in the text.
- The taxonomy might be helpful for structuring page navigation systems on Meta.
There's a lot of good stuff here. Hope to see more in this area. --Yair rand (talk) 05:45, 23 September 2020 (UTC)
Thank you for working on this! My comment is something that I brought up with the Research team, and they encouraged me to post it here so that more people could follow along and weigh in. I was confused for a while about the terminology of "Knowledge Gaps". The confusing part is that I'm used to thinking about "knowledge" as synonymous with "content", and so it was surprising to see "contributors" and "readers" as part of "knowledge". I think the idea is that "readers" and "contributors" obviously have gaps, just like content does, and we need all three of those things in order to make the sum of all knowledge available to everyone. But I do think it's a bit of a new definition of "knowledge" for a lot of people who are hearing about this work. I think it could be interesting to redefine "knowledge" via this project, but I just wanted to flag that would be a change for some people. -- MMiller (WMF) (talk) 05:00, 25 September 2020 (UTC)
(Adding a comment from wikimedia-l here so we don't lose track of it in the final analysis of feedback.) Add a Discussion section to the paper and address the meta-gap question as part of it. --LZia (WMF) (talk) 23:43, 25 September 2020 (UTC)
The 2010 strategy identified a List of things that need to be free (not much progress on which has happened). The 2020 strategy also expressed a wish for curating more forms of knowledge than we currently do. This might be stretching the concept of gaps a bit, but these could be seen as a new dimension of gaps: whole areas of knowledge that we could have content about but don't. --Tgr (talk) 08:58, 12 October 2020 (UTC)
"We define readers as all users who connect directly to the projects to access Wikimedia content excluding how content consumption happens outside of Wikimedia, e.g., voice assistants, search engines, or third-party apps." - while I can certainly see the difficulty of including indirect audiences in the research, and while users receiving Wikipedia content in some strongly digested form such as Google knowledge panels would be an apples-to-oranges comparison, offline distribution of Wikipedia content (Kiwix etc.) would be important to at least estimate, as it covers, or tries to cover, important reader gaps such as people with no internet access at all. --Tgr (talk) 09:08, 12 October 2020 (UTC)
Thanks for pioneering this research! I think it is a massively important area of self reflection, and from the report it seems you have thought things through very thoroughly. --Tgr (talk) 09:09, 12 October 2020 (UTC)
It would be amazingly useful to have similar research and insights about our community of technical contributors and our technical information. For example, understanding the types of technical knowledge and skills that developers have is crucial to being able to provide the right content in documentation and to provide appropriate "on-ramps" to support developers in contributing at various levels and gaining additional tech skills where/when they want to. --TBurmeister (WMF) (talk) 16:30, 08 October 2021 (UTC)

If you want to continue to engage with this research, please sign with your username below.

We will keep you updated on major developments in this project.

User talk:Suriname0 en.wikipedia.org
User talk:Pbsouthwood en.wikipedia.org
User talk:WereSpielChequers en.wikipedia.org
User talk:marcmiquel ca.wikipedia.org
User talk:Anthere fr.wikipedia.org
User talk:Fuzheado en.wikipedia.org
User talk:DDJJ nl.wikipedia.org
User talk:Jane023 en.wikipedia.org
User talk:Marajozkee bn.wikipedia.org
User talk:Nealmcb en.wikipedia.org
User talk:Csisc www.wikidata.org
User talk:James Salsman meta.wikimedia.org
User talk:Eric Luth (WMSE) se.wikimedia.org
User talk:Rosiestep en.wikipedia.org
User talk:Deborahjay en.wikipedia.org
User talk:Tgr hu.wikipedia.org
User talk:AGreen (WMF) www.mediawiki.org
User talk:TBurmeister (WMF) www.mediawiki.org
User talk:hfordsa en.wikipedia.org

General feedback

It's great to see a whole paper on this -- essential! I am finding it hard to map the taxonomy onto my own framing of knowledge gaps, can you help me connect them?

My confusion: I don't see a focus on the tremendous central gaps in knowledge on the projects.

For content, this is depth + breadth + freshness -- many areas are excluded, or underattended, or abandoned.
Easy to observe that even on the largest projects, en/de, the # of topics that exist on some other project but not on theirs approaches 50% - and to infer that we're not even 10% of the way to basic topical coverage.

Even the excellent Covid Handbook, compiled by Wikipedians and others, chose not to use a Wikimedia project for that essential global educational collaboration.
For contributors, this is reach. We have been discouraging contribution for so long it is easy to forget that we once proudly described WP as the encyclopedia "anyone can edit".
Most people who engage in collaborative multi-party editing today do so on Google Docs or some other proprietary tool, not on MediaWiki and certainly not on a Wikimedia project.

Most people who have contributed to a WM wiki in the past no longer do so, and no longer think of doing so as the natural way to share knowledge. This should be explored and investigated.
For readers, this is reach in much of the world.
Here we may reach a majority of the world through reusers of our knowledge, but that excludes many.

Those excluded are left out, among other things, by age, language level, preferred tools + intake formats. We have long avoided having a print or media arm, despite its natural audience and advantages.

Most people reached in this way don't know they are reading the result of WM work, and wouldn't notice if we were replaced by a proprietary intermediary.

Most people reached don't know how to add to, customize, rebalance that knowledge to incorporate what they know.

I do see an excellent discussion of systemic bias. That is treated largely as static bias of what is there, with little attention to the dynamic bias of what we exclude or disallow or discourage. The dynamic bias seems like what we can most directly change.

Here are the first things I think of around coverage gaps. Only the 0th item seems to fit the current taxonomy...

0) exclusion via lack of awareness, interest, or expertise
1) exclusion via deletionism
2) exclusion via topic notability norms (including pop culture + current events)
3) exclusion via source notability + limiting source formats
4) exclusion via license pessimism
5) exclusion via file format (!) and codec pessimism
6) exclusion of dense specialist knowledge via review bottlenecks
7) exclusion via knowledge type [model, dataset, map layer, genealogy tree] 
8) exclusion / rejection via behavior on the projects
9) exclusion / rejection under 1-4 via differential application of policy

Some of these, like file-format and review-bottleneck exclusion, are primarily technical restrictions.

The knowledge-type exclusion was once a question of resourcing. When Jimbo asked early on "what would we do if we had $1 million", most of the answers focused on types of knowledge we would support and free. Now it's just a lack of recognition of the validity and import of those communities and knowledge stores.

Some of these, like 1-4 above, are social+regulatory+technical restrictions that could be alleviated with simple tools (including extensions, alternatives, and sandboxes) -- just as nupedia's social restrictions were alleviated w/ the technical solution of a wiki for the drafting stage.

And the last two are purely social restrictions, projecting systemic bias in the community of practice onto who joins and what contributions are welcomed. I'd like to see that subset of gaps addressed directly, and not split up across other parts of a taxonomy.

Please help me see how these fit into the taxonomy. –SJ talk

Examples

Two quick instances, to illustrate the above:

Media: This is presented in the taxonomy primarily as an alternate mode of conveying information, for accessibility.
- However, videographers and video production, and sound designers and audio production, often have no linearized-text corollary: they natively capture a different kind of knowledge.
- However, video-sharing sites have 1000x to 100,000x the raw volume of material that is on the wikis, largely on topics of social interest, current events, popular culture, how-tos, and discussions or reviews of other works.
- Many of the most common formats on video-sharing sites are excluded by norm or by structure in current Wikimedia projects: but none are outside the scope of our vision!
- Video + audio sharing and meme/idea propagation are themselves an essential form of long-running collaboration that speaks to the essence of wiki-nature in a more-than-textual world.
- The most common video formats -- all of which can be transcoded on upload to a free format, and many of which are built into phones and other hardware such that video-creators have no other format options on creation -- are excluded from the projects, immediately leaving that community out of our collaborative.
- Video communities are generally excluded by not having easy ways to browse, annotate, and edit videos. It is great to see VideoWiki as a lab project, but it is tiny compared even to the community of folk who do post vids to Commons.

Data: dataset file formats are largely excluded by Commons. Tabular data is treated as a special case in a dozen areas: on Wikidata, on SD-on-Commons, converted to wikitables, embedded in a PDF.
- However, collaborative data is a major part of the public and scholarly record, and tables in articles have become increasingly widely used and important.
- There is no shared namespace for datasets, no way to easily reference them with an [inter]wikilink, no way to capture structured data about their provenance.
- The same holds true, in spades, for models trained on data and used to generate other outputs. While Abstract Wikipedia takes a step towards a world where generative elements such as functions get their own namespace, it is not currently designed to be a hub for more general models.
- Large and rapidly growing communities of data sharing and coordination do not have anywhere in the Wikiverse to work together. This leads them to create their own more siloed spaces elsewhere.

Alternatives

I also don't quite see where positive alternatives fit; though a few alternative projects + lab experiments are helpfully cited or mentioned.

The easiest and most impactful thing we could do to fill many current gaps, including gaps in the community of contributors, is to return to inviting everyone to edit and to contribute whatever they have to share as free knowledge with the world. We have stopped doing that, for the most part; but hopefully have not forgotten how or why to do so.

The second easiest is to start tracking down interesting and broadly supported suggestions for new projects, reaching out to them, and asking them what gaps they see and experience, what they need, what they are currently doing instead.

Suggestion:

Let us create spaces to share rough drafts of knowledge.
Make them free of deletionism and preemptive filtering.
In particular, do not delete work that does not harm others

Avoid filtering by notability, file format, and lack-of-license-certainty.
Give contributors a personal workspace in which they do not need to fight with others -- in which they are welcomed or left alone rather than dissuaded
Differentiate those spaces from the current [main spaces] of current projects. Downrank contributions in those draft spaces, to avoid unwarranted halo effect from the reputation of more curated wikiwork. And to avoid gaming.
Allow that those draft spaces may themselves become popular and impactful, in their own right.

–SJ talk 16:24, 26 September 2020 (UTC)

Sj, Do you have any ideas on how to incorporate these ideas but avoid becoming a repository of garbage, propaganda, misinformation and spam? · · · Peter (Southwood) ^(talk): 12:01, 28 September 2020 (UTC)

Pbsouthwood, indeed I do.

First, there is a class of information that we explicitly filter: harassment is expunged from even the edit history; spam in the form of links to blacklisted sites is blocked on edit, and other spam and nonsense are speedily deleted. None of that would change for this more open space. A few other speedy-deletion criteria might apply: those for which there is no chance of the deleted material ever turning into useful free knowledge. And a few normal deletion criteria would apply: hoaxes, mis/disinformation, self-promotion.
Second, WP itself is, in comparison to the (50 thoroughly-edited articles of) nupedia, a 'repository of stubs, possible garbage, propaganda, misinformation... and many nice things too'. We manage in part with page and section and sentence banners and templates noting problems, and in part with soft security: review and clean-up after the fact. A proper playground for drafts would be another step in that direction.
Third, as noted above: strongly differentiate the draft space, visually and semantically. Give it its own domain, google-rank, perhaps default nofollow for links, and a visually different skin.
Fourth, specify the positive goals of the draft space. This attracts those who are driven to realize the potential of the space, and to maintain it over time. There's not a lot of overlap b/t people who thrive in a highly flexible environment that appreciates Ignore-All-Rules, and those who thrive around policies and wikilawyering.

A space anyone can -- and is encouraged to -- edit. A permanent draft space for articles that are not yet notable. A space that is light on deletion, and uses tags and categories instead for flagging problems. A permanent archive for media in copyright limbo, while they await definitive clearance. Excellent for people working on pop culture, current events, biographies, or any niche subject with limited mainstream media.

Sj, This may be a bit of a culture shock for some Wikipedians, particularly those who currently would prefer to see the existing draft space closed down. I wonder whether it would survive introduction. I am not against the concept, but it would take some careful communication and other preparation to gain acceptance on ENWP. I think that the very people who one would expect to do the gruntwork of deleting the inevitable spam and nonsense would have a problem with what they might consider more spam and nonsense than they currently have to handle, which is already too much. More tools might help, but they would have to be flexible and nuanced tools, or they would just be wielded like another hammer. One of the problems we have is people having to maintain topics on which they are deeply ignorant, because someone has to do it. When that is combined with a lack of confidence, excessive reliance on the letter of the law, a touch of obsession and stress, we get a few of the perennial behavioural problems. I think the editing environment would be less bitey if there was less overload on the maintenance crew. It would also be necessary to hash out a set of policies and guidelines for the new version of draft space before opening it for business, or the problems would be settled by trial by combat, with the status quo having a bigger set of weapons. · · · Peter (Southwood) ^(talk): 07:41, 30 September 2020 (UTC)

I think it can be done, but we need to rethink the concept of draftspace. Having drafts in a separate namespace removes them from much of the collaborative editing, reduces the disincentive to spammers of getting that red "this page has been previously deleted" message, and generally is a complicated clusterf***. If drafts were articles in mainspace that were set as "NoIndex", and only visible to logged-in editors, then we could raise the bar on business-related articles "broadly construed" to having at least two reliable and independent secondary sources before such articles were published into mainspace. WereSpielChequers (talk) 08:32, 30 September 2020 (UTC)

I like the idea of a default "NoIndex + login to view (+ author can view)" option, and a different skin, for the current idea behind drafts. I think a flexible sandbox would still benefit from being a separate site, specifically to minimize policy creep and complexity. PBS -- I would aim for heavy reliance on automated tools and a new community of practice, one comfortable with starting small, simpler policies, greater inclusion, and longer timeframes. At first the users could be, one at a time, good contributors to other projects who found they couldn't continue their work there without it being removed. Today, those people tend to abandon that project, or leave altogether never to return, often believing that the projects are intentionally excluding their contributions for topical reasons. –SJ talk 17:44, 30 September 2020 (UTC)

Specific comments

This is a really great effort and it’s no small feat to gather all this research together in one frame. My comments all surround one missing piece in this, and that is the issue of power. For me, it is power that is missing from the document as a whole – the recognition that editing Wikipedia is as much about power as it is about internet access or demographic features. Framing Wikipedia’s problems as gaps that need to be filled is a mistake because it doesn’t enable us to see how Wikipedia is a system governed by unequal power dynamics that determine who is able to be an effective contributor. More specific comments below:

In Section 3, you leave out technical contributors from your definition of contributor. I understand why you might do this but I think it is a mistake as you note: “software and choices made in its design certainly are highly impactful on what types of contributors feel supported and what content is created.” As argued in my paper with Judy Wajcman (Ford, H., & Wajcman, J. (2017). ‘Anyone can edit’, not everyone does: Wikipedia’s infrastructure and the gender gap. Social Studies of Science, 47(4), 511–527. https://doi.org/10.1177/0306312717692172) gendering on Wikipedia happens at the level of infrastructure and code and it matters who is developing software tools.
In Section 3.1.4, you frame language fluency as less important given that “lower fluency individuals can be important for effective patrolling in small wikis [114], increase the diversity of contributors, and allow for the cross-pollination of content that might otherwise remain locked up in other languages [74]”. But it is important to recognise that there are potential problems when editors from powerful language groups (Europe and North America) contribute to small language encyclopedias (e.g. see Cebuano Wikipedia). https://www.quora.com/Why-are-there-so-many-articles-in-the-Cebuano-language-on-Wikipedia
In Section 2.1.7 you write about “ethnicity and race” in the context of “Sociodemographic gaps”. I worry that we have virtually no critical race scholarship of Wikipedia and that the sentence you begin with “Ethnicity and race are very contextual as to what it means about an individual’s status and access to resources” downplays the extent to which Wikipedia is a project that prioritises knowledges from white, European cultures. It seems to be a significant gap in our research, one which this strategy will not solve given the emphasis on metrics as an evaluation tool. I urge the group to discuss *this* severe gap with critical race scholars and to start a conversation about race and Wikipedia.
In Section 3.3.3, you write about the “tech skills gap” and the research that has found that “high internet skills are associated with an increase in awareness that Wikipedia can be edited and having edited Wikipedia” so that “edit-a-thons in particular can help to bridge this gap”. In some early work with Stuart Geiger, we noted that it isn’t just tech skills that are required to become an empowered member of the Wikimedia community. Rather, it is about “trace literacy” – “the organisational literacy that is necessary to be an empowered, literate member of the Wikimedia community”. We wrote that “Literacy is a means of exercising power in Wikipedia. Keeping traces obscure help the powerful to remain in power and to keep new editors from being able to argue effectively or even to know that there is a space to argue or who to argue with in order to have their edits endure.” Our recommendation was that “Wikipedia literacy needs to engage with the social and cultural aspects of article editing, with training materials and workshops provided the space to work through particularly challenging scenarios that new editors might find themselves in and to work out how this fits within the larger organizational structure.” Again, this is about power, not skills. (There is a slideshow of the paper from the OpenSym conference and the paper is at https://www.opensym.org/ws2012/p21wikisym2012.pdf and https://dl.acm.org/doi/10.1145/2462932.2462954)
Section 4.1 looks at “Policy Gaps” although I’m not sure it is appropriate to talk about policies here as gaps? What’s missing here is notability policies and it is in the notability guidelines where the most power to keep Other voices out is exercised. More work needs to be done to investigate this but the paper above is a start (and perhaps there are others).
Section 4.2.2 talks about “Structured data” as a way of improving knowledge diversity and initiatives such as Abstract Wikipedia aiming to “close (the) gap”. Authors should recognise that structured data is not a panacea and that there have been critiques of these programmes within Wikipedia and by social scientists (see, for example, https://journals.sagepub.com/doi/full/10.1177/0263775816668857 Ford, H., & Graham, M. (2016). Provenance, power and place: Linked data and opaque digital geographies. Environment and Planning D: Society and Space, 34(6), 957–970. https://doi.org/10.1177/0263775816668857 open access at https://ora.ox.ac.uk/catalog/uuid:b5756cd4-6d1e-4da1-971e-37b384cd18ca/download_file?file_format=pdf&safe_filename=EPD_final.pdf&type_of_work=Journal+article)
The authors point to metrics and studies of the underlying causes of Wikipedia’s gaps in order to evaluate where the gaps are and where they come from. It is very important to recognise that metrics alone will not solve the problem, but I’m dismayed to see how little has been cited in terms of causes and interventions and that the only two papers cited are a literature review and a quantitative study. Quantitative research alone will not enable us to understand causes of Wikipedia’s inequality problems and qualitative and mixed methods research are, indeed, more appropriate methods for asking why questions here. For example, this study that I conducted with Wikipedians helped us to understand that the usual interventions such as editathons and training would not help to fill targeted gaps in articles relating to the South African education curriculum. Instead, the focus needed to be on bringing outsiders in – but not by forcing them to edit directly on wiki – this simply wouldn’t happen, but to find ways of negotiation required for engaging with new editor groups in the long-term project of filling Wikipedia’s gaps. Again, the focus is on the social and cultural aspects of Wikipedia and an emphasis on power. (See https://journals.sagepub.com/doi/full/10.1177/1461444818760870 and https://osf.io/preprints/socarxiv/qn5xd)
In terms of the methodology of this review, I noticed that the focus is on the field of “computational social science” which “tries to characterise and quantify different aspects of Wikimedia communities using a computational approach”. I strongly urge the authors to look beyond computational social science to the social science and humanities venues (including STS journals like Science, Technology and Human Values and the Social Studies of Science as well as media studies venues such as New Media and Society).

Also, I’m unsure what this document means in terms of research strategy, but I recommend three main gaps that could be addressed in a future version of this:

A closer engagement with social science literature including critical data studies, media studies, STS to think about causes of Wikipedia inequality.
A dialogue with critical race scholars in order to chart a research agenda to investigate this significant gap in Wikipedia research.
A moment to think about whether the framing of the problem in terms of “gaps” is the most effective way of understanding the system-wide inequalities within Wikipedia and Wikimedia.

Finally, I believe that regular demographic surveys of Wikimedia users would be incredibly helpful for research and would move us beyond the data that we can regularly access (i.e. metrics) which, as you point out, does not reveal the diversity of our communities. I wish that I had more time to point to other research here, but I work for a university that is under severe strain at the moment and this was all I managed to find time for. I hope it is useful and I look forward to the next version of this!--Hfordsa (talk) 00:47, 30 September 2020 (UTC)

Regular demographic surveys of Wikimedia users -- Yes, a thousand times yes!

Specifically: I'd love ~quarterly surveys, shown to 1% of users / readers until there is a threshold of responses, where the question-set is limited to a single page. A few perennial questions can help anchor comparisons across multiple iterations. –SJ talk 01:26, 30 September 2020 (UTC)

I'd settle for a resumption of annual surveys - provided they were seasonally consistent. I still think the annual survey was one of the best results from the 2009 Strategy program, just a shame it only happened annually, for one year..... WereSpielChequers (talk) 08:37, 30 September 2020 (UTC)

I was thinking quarterly so they become no big deal -- any ideas that don't get included can be tweaked in the next one. Or if no new ideas arise, it can be the same survey every quarter. One of the many things that prevented the previous surveys from going out was the tremendous decision-friction in settling on the questions and timing and implementation, and then having to publish a custom paper from the results. Something more continuous, with more of a dashboard to visualize responses, feels more fitting and easier to get underway. –SJ talk 17:46, 30 September 2020 (UTC)

Hfordsa A question for clarification.

What is "critical race scholarship"?

Also a fan of regular demographic surveys, for all classes of Wikimedia users. · · · Peter (Southwood) ^(talk): 07:56, 30 September 2020 (UTC)

I agree with Hfordsa’s points and her three main suggestions --Simulo (talk) 12:09, 4 November 2020 (UTC)

Methodology and stuff

I love this work, and it's super important! Thanks so much for doing it, and congratulations!!! It's clear that a huge amount of effort and careful reflection went into this. Also, apologies for joining the discussion so late.

My comments are mostly about methodology, conceptual framework and links to other research.

Core concept

A broad selection of topics is covered in this work. This is not a problem in itself. However, the topics' organization under the core concept of knowledge gap feels a bit forced.

While the definition of knowledge gap as a “[disparity] in participation or coverage of a specific group of readers, contributors, or content” (paper, p. 4) technically does encompass all the categories (gaps) you set out, some categories are still extremely different from others.

For example, there are sociodemographic disparities in readership and contributorship, which are themselves social inequities we seek to change. That's different from, say, a lack of multimedia content, which is, as far as I understand, a possible cause of unequal access to knowledge, and not itself a problem whose solution has intrinsic ethical value. The way you would understand and evaluate disparity in these two cases is also very different.

I feel the clearest explanation for why these topics are considered together is the initial motivation for this work, as described in the presentation video: the topics are all important for “track[ing] our progress” towards the goal of knowledge equity.

Since it doesn't seem that the expected uses of this work actually require the topics to all belong to a single class of thing, you might consider just doing away with that central element. So, instead of a “taxonomy of knowledge gaps”, you could just have “topics of interest and proposed definitions for future research to support knowledge equity”, or something like that. It's not as catchy, but I think it's a clearer and more accurate description of this diverse product. The topics could still be grouped and linked in different ways, as appropriate.

~~Research result tries to be too many things at once~~ (Edited, see below.)

I feel there's a tension among the various things this research product is expected to be. Is it a tool for classifying stuff? A composite index? A framework for encouraging discussions? A literature review? These are different things, each with different requirements.

Here is an example of tension among these expectations. From what I understand, the main work undertaken to construct these categories was a review of previous research and community projects and documents. So, in that sense, it's like a literature review. However, it's described, first and foremost, as a composite index (p. 2). But you wouldn't expect a composite index to emerge from a literature review. Rather, it would be based on a theory of how multiple, interrelated, measurable phenomena contribute to a more complex, but still conceptually coherent, phenomenon. It might include hypotheses about how the phenomena under study interact.

However, the concepts and theory in this paper don't quite seem to work the way I'd expect for a composite index. (For example, I don't see justifications for why certain categories should be included for readers but not contributors, or a clear, unifying conceptual framework for disparity across all the categories. Sorry if I've missed something.) But it also doesn't have a comprehensive overview or summaries of the works surveyed. So it's not quite a literature review, either.

Edit: Oops!! I just noticed that I misread the part about the composite index: this work isn't intended to be a composite index, but rather a first step towards one. Really sorry!!!! I still feel a bit funny about the primary reliance on the literature survey for definitions to be used in a composite index, though.

Conceptual framework for understanding society

Unequal participation in knowledge consumption and production, and bias in knowledge representation, are social phenomena. That nearly all the social interactions we'd study here are mediated by computers doesn't make them less social.

I feel this work would benefit from engaging with one or more theories of society (of which there are tons). Doing so would help tease out unacknowledged assumptions, formulate hypotheses, guide research decisions, and make links to research on related phenomena outside Wikimedia projects.

For example, a theory of societies as cultural systems might suggest explaining both the gender bias in article content and the gender gap among contributors as expressions of unequal gender roles in the cultures of contributors. And since contributors to large projects certainly belong to many different cultures, such a theory would highlight possible links to cross-cultural studies of gender roles and inequality.

(BTW, there are also some really great things in the conceptual framework you do provide. For example, the guiding principles [p. 23] are fantastic! Also, see below for more thoughts on links to other research.)

Specific suggestions for changes in categories and definitions

Harmonize sociodemographic categories for readers, contributors and content. Having subtle differences in these categories and definitions across dimensions will make research to link them more difficult; harmonizing them will make it easier.
Include categories for geographic location, rural vs. urban setting (“locale”) and cultural background across all dimensions.
Instead of “income”, use established measures of socioeconomic status across all dimensions.

Links to other research

Here are more ideas about other research to link to:

For sociodemographic disparities: cross-cultural studies of discrimination, racism, and oppression of all types.
For content gaps related to sociodemographics, including subtle biases within articles: Critical Discourse Analysis.
For a discussion of sources and research on historically marginalized groups: Microhistory.

Expansion and publication of literature review

I would love to learn more about the literature reviewed for this paper. How about a separate paper with summaries of research and other documents reviewed, organized by topic? That could also provide more transparency into the specifics of how the literature influenced decisions about the categories and definitions set out here.

Apologies if I've focused mostly on criticisms, and many congratulations once again! There's some really amazing work here!!!! :D

AGreen (WMF) (talk) 10:40, 19 October 2020 (UTC); edited 02:38, 20 October 2020 (UTC)