Research:Developing Metrics for Content Gaps (Knowledge Gaps Taxonomy)/Content-Gap Mapping

From Meta, a Wikimedia project coordination wiki

On this page, we develop a general recipe for content mapping. From the overview in the previous subsection, we observe that each gap involves many (often implicit) choices when mapping the gap to the relevant content. We aim to formalize these choices in two steps for identifying the relevant articles for each gap.

We propose a process to map content to one or more gaps and obtain a selection of articles that is both accurate and extensive (see the Diagram).

Step 1. Operationalization

Explain the gap and define the scope with the grouping system and expected entities (ontology).

  • Definition of the gap.
  • Grouping system: the categories into which the gap can be divided.
  • Ontologies: the entities that are relevant for the groupings.
Grouping system

The gap is defined as a disparity in content coverage over different groups. Therefore, the starting point is to define a set of groups for a gap in order to reveal differences in coverage. For gender, the groups may be more or less known, such as “men”, “women”, “non-binary”, etc. In most cases, however, the grouping is not obvious and requires an explicit choice. An external source can lend more authority to the particular choice of a grouping. For example, for geography one can use the ISO 3166 standard.

Additional notes on groupings:

  • Groupings determine the level of granularity of the analysis. For example, grouping time (which is continuous) into decades would prevent us from observing disparities in coverage from year to year.
  • Groupings can also be defined as a hierarchy with different levels (e.g., for geography, countries are part of a continent and regions are part of countries). For example, ISO 3166-2 defines the relationship between countries and their regions.
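
The effect of granularity can be illustrated with a minimal sketch. The years below are made-up values and `decade_of` is a hypothetical helper, not part of any existing tool; the point is only that counts per decade hide the year-to-year differences.

```python
from collections import Counter

def decade_of(year: int) -> str:
    """Map a year to a decade label such as '1990s'."""
    return f"{(year // 10) * 10}s"

# Hypothetical years associated with a handful of articles.
article_years = [1991, 1991, 1991, 1998, 2003, 2003, 2007]

# Fine-grained grouping: one group per year.
per_year = Counter(article_years)

# Coarse grouping: one group per decade; differences within a decade disappear.
per_decade = Counter(decade_of(y) for y in article_years)

print(per_year)
print(per_decade)
```

With the decade grouping, the concentration of articles in 1991 is no longer visible; only the totals per decade remain.
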
Ontology

In many cases, for a given gap and grouping, we only consider a subset of the content (i.e. the articles) as relevant because the mapping between groups of the gap to all articles is either unfeasible or simply would not make sense. Often gaps focus on certain types of entities, such as persons (for the gender gap), or places (for the geography gap), as well as time period, object, or concept. Therefore, the choice of a suitable ontology is crucial in order to be able to easily interpret the gap in terms of Wikimedia content. The choice of an ontology and the corresponding entity-types is heavily informed by the gap’s grouping defined in the previous step.


The following table shows the different possible operationalizations of the gender gap, along with some combinations that are currently supported by community practices. The underlined combination is the most common one.

name | gap | grouping (e.g.) | source | gap 2 | grouping | source | ontologies
Gender biographies | gender | women | - | - | - | - | person
Gender biographies + Geography | gender | women | - | geography | countries | ISO 3166 | person
Gender biographies + Professions | gender | women | - | professions | | | person
Knowledge domains + Gender | gender | women | - | knowledge d. | | | person
Gender discourse (feminism topics) | gender | women | - | - | | | all kinds of entities
Gender topics of interest | gender | women | - | - | | | all kinds of entities
Gender-related creations | gender | women | - | - | | | all kinds of entities


We operationalized the five gaps with the following grouping and ontologies:

name | gap | grouping (e.g.) | source | ontologies
Gender biographies | gender | women, men and non-binary | | person
Sexual orientation biographies | LGBTQ+ | sexual orientations | | person
Geographical places | geography | countries/continents | ISO 3166 | places
Cultural Background | cultural back. | local (language speaking territories - countries and regions) and non local | Ethnologue, ISO 3166, ISO 3166-2 | all kinds of entities
Time gap | time | decades (tbd) | | all kinds of entities

Step 2. Map articles to the grouping(s)

Diagram with the different steps to map articles to a gap.

Routine A. Map articles of a Wikimedia project to the grouping with univocal annotation.

  • Identify annotation of articles that relates to a grouping in Wikidata and in the desired Wikimedia language project, create features, and make a selection.

What is an annotation? Any piece of content whose meaning can be related to the grouping, e.g., a specific Wikidata property-value triplet, coordinates, etc.

What is a feature? The field in the database we created in order to store the annotation. We can retrieve the annotation or process it to create several features from the same annotation. e.g., the number of selected properties of an article, has_coordinates_binary, etc.

We use the grouping system and the expected ontologies to look for annotations and create features. Once we have all the features in the database, we can look into their values in order to assign the articles to the gap and its groupings. E.g., for the gender gap, we use the Wikidata property for “instance of” a person (to limit the scope) and then the property for gender.
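
As a minimal sketch of how several features can be derived from the same annotation, the snippet below turns a simplified item into a record of features. The item layout (a dictionary from property ids to lists of values), the `SELECTED_PROPERTIES` tuple, the function name, and the example values are all assumptions made for illustration.

```python
from typing import Dict, List

# Simplified, hypothetical view of a Wikidata item: property id -> list of values.
Item = Dict[str, List[str]]

# Properties counted in this illustration (an arbitrary choice, not a fixed list).
SELECTED_PROPERTIES = ("P17", "P27", "P131", "P276")

def make_features(item: Item) -> Dict[str, int]:
    """Derive several features from the annotation of a single item."""
    return {
        # 1 if the item has at least one coordinate location (P625), else 0.
        "has_coordinates_binary": int(bool(item.get("P625"))),
        # How many of the selected properties are present on the item.
        "num_selected_properties": sum(1 for p in SELECTED_PROPERTIES if item.get(p)),
    }

# Illustrative item; the values are placeholders for real statements.
example_item = {"P625": ["45.46,9.19"], "P17": ["Q38"], "P131": ["Q490"]}
features = make_features(example_item)
print(features)
```
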

  • Make a decision on the mapping based on the accuracy (precision) or the extensiveness of the selection (recall).

If the precision of the features is too low, the selection of articles contains “undesired content”. We need to manually assess the precision of the collected content (its relation to the grouping). If there is undesired content, we need to identify which features are responsible for it. Depending on the percentage of false positives, we may need to discard a feature and use another one.

If the recall of the features is too low, there will be articles outside the selection that could have been collected. Are the mapped Wikimedia project items all the items that relate to the grouping? If yes, we stop. Otherwise, these articles become “primary” and we look for other articles in 2B. If annotations from users have low coverage, this is problematic because we would always miss a large part of the articles. In that case, comparing the number of articles in each group would be unreliable. One example would be race/ethnicity.
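
The two criteria can be made concrete with a small sketch. It assumes that a manually assessed sample tells us which articles are truly relevant; in practice this reference set is only known for the sample, not for the whole project. The function name and the article ids are hypothetical.

```python
def precision_recall(selected: set, relevant: set) -> tuple:
    """Precision and recall of a selection against a manually assessed reference set."""
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical article ids: "D" is undesired content, "E", "F", "G" were missed.
selected = {"A", "B", "C", "D"}
relevant = {"A", "B", "C", "E", "F", "G"}

precision, recall = precision_recall(selected, relevant)
print(precision, recall)
```

Here three of the four selected articles are relevant (precision 0.75), but only three of the six relevant articles were found (recall 0.5), so the selection would need to be extended in 2B.
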


Routine B. Map articles of a Wikimedia project to the grouping with ambiguous annotation.

We take into account the articles mapped ("primary articles") in 2A.

  • Identify annotation of articles that relates to a grouping in Wikidata and in the desired Wikimedia language project, create features using the annotation and the “primary articles”, make a selection based on these features.

Some of these features may be created directly from the annotation. For other features, we may need to traverse a graph (e.g., the category graph) or do some calculations. We set a manual threshold or train a classifier to make a decision on the grouping membership based on the features.

  • Make a decision on the mapping based on the accuracy (precision) or the extensiveness of the selection (recall).

Manual Assessment. We check whether the article selection is precise. Are the mapped Wikimedia project items all the items that relate to the grouping? If yes, we stop. Otherwise, we need more annotation.

Example: Gender Gap

Step 1. Gap groups, characteristics, and ontologies

We want to select a collection of articles for the gender gap, taking into account only the biographies. We assume the different genders are men, women, and non-binary. We do not use external sources for the grouping system; instead, we take the values available in Wikidata.

We assume that biographies are articles that represent the entity person.

Step 2. Map articles to groups
Univocal annotation

We take the Qitems (articles) that have the properties:

  • P31 (instance of): Q5 (human)
  • P21 (sex or gender).

We do not use Wikipedia or any other project annotation.
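
A minimal sketch of this selection over already-downloaded item data is shown below. The item layout, the item ids, and the `GENDER_LABELS` mapping from P21 values to the three groups are assumptions for illustration; a real pipeline would read the statements from a Wikidata dump or query service.

```python
from collections import Counter

# P21 values assumed for the three groups in this sketch.
GENDER_LABELS = {"Q6581097": "men", "Q6581072": "women", "Q48270": "non-binary"}

def map_to_gender_groups(items):
    """Return {item id: group} for items that are humans (P31 = Q5) and have a P21 value."""
    mapping = {}
    for item_id, claims in items.items():
        if "Q5" not in claims.get("P31", []):
            continue  # not a biography under the chosen ontology
        for value in claims.get("P21", []):
            if value in GENDER_LABELS:
                mapping[item_id] = GENDER_LABELS[value]
                break
    return mapping

# Hypothetical items keyed by placeholder ids.
items = {
    "item_1": {"P31": ["Q5"], "P21": ["Q6581072"]},
    "item_2": {"P31": ["Q5"], "P21": ["Q6581097"]},
    "item_3": {"P31": ["Q5"]},                         # human without P21: not mapped
    "item_4": {"P31": ["Q515"], "P21": ["Q6581072"]},  # not a human: excluded
}

groups = map_to_gender_groups(items)
counts = Counter(groups.values())
print(groups, counts)
```

Items that are humans but lack P21 stay unmapped, which is where low annotation coverage would show up as low recall.
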


Example: Local Content (or Cultural Context Content)

Step 1. Gap groups, characteristics, and ontologies

We want to select a collection of articles for a language's cultural context, in other words, the cultural background of the territories where the language of a Wikipedia language edition is spoken either as a native or as an official language. The groups are thus CCC content and non-CCC content.

We do not limit the scope of the articles to a particular type of entity. We assume that the articles we collect may represent “places”, “people”, but also many other types of entities.

We identify articles as CCC (for the respective language/Wikipedia) in the following way: we assume that each language can be associated with territories at a country or at a subcountry level (region), and we use Ethnologue as the external source for this grouping. The subcountry level is only taken into consideration where the language is not spoken in the entire country.

We also assume that the language name, the territory name, and the demonym for its inhabitants can be used as keywords to identify related content.

  • List of language-native speaking territories (Ethnologue)

Obtaining a collection of Cultural Context Content (CCC) for a language requires associating each language to a list of territories, in order to collect everything related to them as a context.

We consider the territories associated with a language to be the ones where that language is spoken as a native (indigenous) language or where it has reached the status of official language. We selected political divisions of the first and second level (that is, countries and recognized regions). Many languages could be associated with countries, i.e. first-level divisions, and second-level divisions were used only when a language is spoken in specific regions of a country.

In order to identify such territories, we used ISO codes. First- and second-level divisions correspond to the ISO 3166 and ISO 3166-2 codes. These codes are widely used on the Internet as an established standard for geolocation purposes. For instance, Catalan is spoken as an official language in Andorra (AD) and in the Spanish regions of Catalonia (ES-CT), Valencia (ES-VC), and the Balearic Islands (ES-IB). For the Italian Wikipedia, the CCC comprises all the topics related to the following territories (see dark blue in the map): Italy (IT), Vatican City (VA), San Marino (SM), Canton Ticino (CH-TI), Istria county (HR-18), Pirano (SI-090) and Isola (SI-040), whereas for the Czech language it only contains the Czech Republic (CZ). A widespread language such as English comprises 90 territories, considering all the countries where it is native and the former colonies where it remains an official language, which implies that its CCC is composed of several contexts.
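
The mix of first-level and second-level codes can be handled with a simple membership check, sketched below for the Italian territories listed above. The function name is hypothetical, and the rule that a subdivision matches when its country is mapped as a whole is an assumption of this sketch.

```python
# Territories for Italian as listed in the text: countries and selected regions.
ITALIAN_TERRITORIES = {"IT", "VA", "SM", "CH-TI", "HR-18", "SI-090", "SI-040"}

def in_language_territories(iso_code: str, territories: set) -> bool:
    """True if an ISO 3166 / ISO 3166-2 code falls inside the mapped territories.

    A subdivision code matches either directly (e.g. CH-TI) or through its
    country when the whole country is mapped (e.g. IT-25 through IT).
    """
    country = iso_code.split("-")[0]
    return iso_code in territories or country in territories

print(in_language_territories("IT-25", ITALIAN_TERRITORIES))  # subdivision of a mapped country
print(in_language_territories("CH-TI", ITALIAN_TERRITORIES))  # region mapped explicitly
print(in_language_territories("CH-ZH", ITALIAN_TERRITORIES))  # other region of Switzerland
```
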

  • List of keywords (territory name, language name, and demonym)

The language-territories mapping forms a database with the ISO code, the Wikidata qitem, and some related words for each territory. In particular, we include the native words used to name each territory, the demonyms of its inhabitants, and the language names (e.g., eswiki españa mexico … español castellano).

This word list was initially generated by automatically crossing language ISO codes, Wikidata, Unicode, and the Ethnologue databases, which contain the territories where a language is spoken and their names in the corresponding language. The generated list for each territory was subsequently revised and extended manually (using information from the specific articles in the corresponding Wikipedia language edition). Wikipedians were invited to suggest changes and corrected a few lines of the database (e.g. regions where Ukrainian is spoken in countries surrounding Ukraine).

Step 2. Map articles to groups
2A: Univocal annotation

Wikidata properties were used as additional features to qualify articles. Every article corresponds to one entity in Wikidata identified by a qitem, and has properties whose values correspond to the qitems of other entities. Such entities might in turn correspond to the language or to the territories associated with it, bringing valuable information for our aim. Hence, we created several groups of properties and qualified each article in order to ascertain whether it is reliably or potentially part of CCC.

  • Keywords on title: this feature (the second feature) was obtained by looking at article titles and checking whether they contain keywords related to a language or to the corresponding territories (e.g., “Netherlands National football team”, “List of Dutch writers”, etc.).
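
A minimal sketch of the title check follows. The keyword list for Dutch is an assumed example rather than the actual word list from the mapping, and matching on whole words (case-insensitively) is a design choice of this sketch rather than something stated in the method.

```python
import re

def title_has_keyword(title: str, keywords) -> bool:
    """True if the title contains any of the keywords as a whole word (case-insensitive)."""
    keywords = list(keywords)
    if not keywords:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

# Assumed keywords for the Dutch language context (illustrative only).
dutch_keywords = ["Netherlands", "Dutch", "Nederland"]

print(title_has_keyword("Netherlands National football team", dutch_keywords))
print(title_has_keyword("List of Dutch writers", dutch_keywords))
print(title_has_keyword("Dutchess County, New York", dutch_keywords))
```

Whole-word matching avoids counting titles such as “Dutchess County, New York”, where the keyword only appears as part of a longer word.
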

  • Geolocated articles: P625 (coordinate location). The feature is derived from the geocoordinates found in Wikidata and in the MySQL geotags table. As the usage of geocoordinates and ISO codes is not uniform across language editions and may contain errors, a reverse geocoder tool was used to check the ISO 3166-2 code of the territory of each geolocated article.

Geolocated articles not in CCC (reliably non-CCC)

Articles that are geolocated in territories associated with other languages are directly excluded from being part of a language’s CCC. Even though there might be some exceptions, articles geolocated outside the territories specified in the language-territory mapping for a language are reliably part of some other language’s CCC.

We may map Qitems that contain statements including:

  • Country properties: P17 (country), P27 (country of citizenship), P495 (country of origin), and P1532 (country for sport).

Entities for which some of these properties refer to countries mapped to the language, as established in the language-territories mapping, are directly qualified as reliably CCC. These entities are often places or people.

Country not in CCC property

For the Wikidata properties country_wd presented above, we checked whether they referred to territories not associated with the language. Hence, similarly to the previous feature, they are reliably related to some other language CCC.

  • Location properties: P276 (location), P131 (located in the administrative territorial entity), P1376 (capital of), P669 (located on street), P2825 (via), P609 (terminus location), P1001 (applies to jurisdiction), P3842 (located in present-day administrative territorial entity), P3018 (located in protected area), P115 (home venue), P485 (archives at), P291 (place of publication), P840 (narrative location), P1444 (destination point), P1071 (location of final assembly), P740 (location of formation), P159 (headquarters location) and P2541 (operating area).

The location properties are applied iteratively. Entities for which some of these properties have as a value a territory mapped to the language are directly qualified as reliably part of CCC. Most often, these properties have cities or other more specific places as values. Hence, the method first uses the territories from the Languages Territories Mapping to obtain a first group of items, and then iterates several times to crawl down to more specific geographic entities (regions, subregions, cities, towns, etc.). In this way, all such articles are eventually qualified as located in a territory or in any of its contained places. Note that not all of the location properties imply the same relationship strength.
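
The iteration can be sketched as a repeated expansion from the seed territories until no new item is added. The `located_in` dictionary stands in for the values of the location properties, and the place names are used instead of qitems purely for readability; both are assumptions of this sketch.

```python
def crawl_locations(seed_territories, located_in):
    """Expand from the mapped territories down to more specific places.

    located_in maps an item to the set of places it is located in
    (i.e. the values of its location properties).
    """
    reliable = set(seed_territories)
    changed = True
    while changed:  # iterate until a pass adds nothing new
        changed = False
        for item, places in located_in.items():
            if item not in reliable and places & reliable:
                reliable.add(item)
                changed = True
    return reliable

# Hypothetical data, with names standing in for qitems.
located_in = {
    "Lombardy": {"Italy"},
    "Milan": {"Lombardy"},
    "La Scala": {"Milan"},
    "Zurich": {"Switzerland"},
}

reliably_located = crawl_locations({"Italy"}, located_in)
print(sorted(reliably_located))
```

Each pass reaches one level further down (region, city, venue), which mirrors the crawl from territories to their contained places.
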


Location not in CCC

For the Wikidata properties location_wd presented above, we checked whether they referred to territories not associated with the language. Hence, similarly to the previous feature, they are reliably related to some other language CCC.

  • Strong language properties: P37 (official language), P364 (original language of work) and P103 (native language).

The following Wikidata properties provide full confidence and build on the qitems/articles previously selected:

  • Part_of: P361 (part of). Entities associated through this property with one of the entities already qualified as reliable CCC were also directly qualified as part of CCC. This property is used mainly for characterizing groups, places and work collections.
  • Created_by: P19 (place of birth), P112 (founded by), P170 (creator), P84 (architect), P50 (author), P178 (developer), P943 (programmer), P676 (lyrics by) and P86 (composer).

Entities associated through some of these properties with one of the entities already qualified as reliably CCC are also directly qualified as reliably part of CCC. Although some of these relationships can be fortuitous, we considered them important enough to qualify an article as CCC, assuming a broader interpretation of which entities are involved in a cultural context. These properties are usually used for characterizing people and works.

Precision: The selection is precise.

Recall: We cannot stop at this point; we need more content.

2B: Ambiguous annotation

Primary Articles / Annotation

The following Wikidata properties are direct and provide partial confidence.

  • Weak language properties: P407 (language of work or name), P1412 (language spoken) and P2936 (language used).

These properties are related to a language but present a weaker relationship with it. Therefore, entities associated through some of these properties with the language (or one of its dialects) may be related to it only tangentially. Hence, they were qualified as potentially CCC.

Wikidata

The following Wikidata properties provide partial confidence and take qitems/articles from 2A:

  • Affiliation properties: P463 (member of), P102 (member of political party), P54 (member of sports team), P69 (educated at), P108 (employer), P39 (position held), P937 (work location), P1027 (conferred by), P166 (award received), P118 (league), P611 (religious order), P1416 (affiliation) and P551 (residence).

Entities associated through some of these properties with one of the entities already qualified as reliably CCC are potentially part of CCC. Affiliation properties represent a weaker relationship than created_by. It is not possible to assess how central this property is in the entities exhibiting it, hence these were qualified as potentially CCC.

  • Has_part: P527 (has part) and P150 (contains administrative territorial entity).

Entities associated through some of these properties with one of the entities already qualified as reliably CCC are potentially part of CCC, as they could be bigger instances of the territory that might include other territories outside the language context.

Wikipedia

  • Categories: we may collect categories including the keywords and iterate the category graph to collect all the articles.

At each iteration, for every article we find, we can assign the level, i.e. the distance in jumps from the top category. In the case of multiple paths, we assign the minimum number of jumps to the top category, and also create another feature indicating the number of paths between the article and the top category.

Each article on Wikipedia is assigned directly to some categories, and categories can, in turn, be assigned to higher-level categories. We then started from the same list of keywords used for feature 2, and identified all the categories including such keywords. For example, “Italian cheeses” or “Italian cuisine”. We then took all articles contained in these categories, and iteratively went down the tree, retrieving all their subcategories and the articles assigned to them. In this way, we did not only get a binary value, but also discrete indicators for an article: the shortest distance in the tree from a category containing a relevant keyword, and the number of paths connecting the article to one of such categories. As the category trees may be noisy, we did not consider this feature reliable, and we assigned the articles retrieved in this way to the group of potentially CCC articles.
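
The two category-based indicators can be sketched with a level-by-level traversal. The dictionaries `subcats` and `members`, the example categories and articles, the `max_depth` cut-off (used here to stay safe against cycles in the category graph), and the convention that an article directly inside a top category is at distance 1 are all assumptions of this sketch.

```python
from collections import defaultdict

def category_features(top_categories, subcats, members, max_depth=3):
    """Return (distance, paths): shortest distance and number of paths per article."""
    distance = {}             # article -> minimum number of jumps from a top category
    paths = defaultdict(int)  # article -> number of paths from the top categories
    # category -> number of paths reaching it at the current depth
    frontier = {cat: 1 for cat in top_categories}
    for depth in range(max_depth + 1):
        next_frontier = defaultdict(int)
        for cat, n_paths in frontier.items():
            for article in members.get(cat, ()):
                # Depth only increases, so the first assignment is the minimum.
                distance.setdefault(article, depth + 1)
                paths[article] += n_paths
            for sub in subcats.get(cat, ()):
                next_frontier[sub] += n_paths
        frontier = next_frontier
        if not frontier:
            break
    return distance, dict(paths)

# Hypothetical category data for illustration.
subcats = {"Italian cuisine": ["Italian cheeses", "Italian desserts"]}
members = {
    "Italian cuisine": ["Pizza"],
    "Italian cheeses": ["Mozzarella", "Mascarpone"],
    "Italian desserts": ["Tiramisu", "Mascarpone"],
}

distance, paths = category_features(["Italian cuisine"], subcats, members)
print(distance)
print(paths)
```

In this toy graph, “Mascarpone” is reached through two subcategories, so it gets two paths at distance 2, while “Pizza” sits directly in the top category.
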

  • Inlinks/Outlinks: We may want to count the number of inlinks and outlinks pointing towards the articles mapped in 2A for articles in the Wikimedia project. We can also compute this number relative to the total number of inlinks and the total number of outlinks.

We may want to compute features for a negative groundtruth, in 2A (Geolocated articles in other territories) and in 2B (Located articles in other territories and Inlinks/Outlinks to geolocated articles not in the language-related territories).

This feature aims at qualifying articles according to their incoming and outgoing links, starting from the assumption that concepts related to the same cultural context are more likely to be linked to one another. Hence, for each article, we counted the number of links coming from other articles already qualified as reliably CCC (inlinks from CCC), and computed the percentage in relation to all the incoming links (percent of inlinks from CCC) as a proxy for relatedness to CCC.

Likewise, for each article, we counted the number of links pointing to other articles already qualified as reliably CCC (outlinks to CCC) and the corresponding percentage with respect to their total number of outlinks (percent of outlinks to CCC). We expect a high percentage of outlinks to CCC to imply that an article is very likely to be part of CCC, as its content refers to that cultural context.
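
A minimal sketch of these four link features is given below. The link dictionaries, the article ids, and the choice to report 0.0 when an article has no links at all are assumptions for illustration.

```python
def link_features(article, inlinks, outlinks, reliably_ccc):
    """Counts and percentages of links from/to articles already qualified as reliably CCC."""
    ins = inlinks.get(article, set())
    outs = outlinks.get(article, set())
    n_in = len(ins & reliably_ccc)
    n_out = len(outs & reliably_ccc)
    return {
        "inlinks_from_ccc": n_in,
        "percent_inlinks_from_ccc": 100.0 * n_in / len(ins) if ins else 0.0,
        "outlinks_to_ccc": n_out,
        "percent_outlinks_to_ccc": 100.0 * n_out / len(outs) if outs else 0.0,
    }

# Hypothetical link data: article "X" is linked from A-D and links to E and F.
inlinks = {"X": {"A", "B", "C", "D"}}
outlinks = {"X": {"E", "F"}}
reliably_ccc = {"A", "B", "E"}

x_features = link_features("X", inlinks, outlinks, reliably_ccc)
print(x_features)
```

The same function could be reused for the negative features by passing the set of geolocated articles not mapped to the language instead of `reliably_ccc`.
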

Inlinks / Outlinks to geolocated articles not in CCC (potentially non CCC)

The last feature aims at qualifying articles according to how many of their links relate to territories which are not mapped to the language. Similarly to Feature 12, the number of inlinks and outlinks to geolocated articles not mapped to the language were counted along with their percentage equivalents (i.e. inlinks from geolocated not in CCC, percent inlinks from geolocated not in CCC, outlinks to geolocated not in CCC, percent outlinks to geolocated not in CCC). Articles qualified by these features are potentially part of other languages’ CCC.

  • Classifier

We need to convert some non-numerical features to binary.

We want to use negative sampling to augment the negative ground-truth sample.

We will only pass to the classifier, for testing, the articles that have a feature based on the category graph (e.g., the distance level or the number of paths).
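
The two preparation steps above can be sketched as follows. This is only an illustration under stated assumptions: the feature names, the helper functions, and the idea of drawing extra negatives at random from articles that are neither positive nor already known negatives are hypothetical choices, and no particular classifier is implied.

```python
import random

def binarize(features: dict) -> dict:
    """Turn non-numerical features into 0/1 flags; keep numerical values unchanged."""
    out = {}
    for name, value in features.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            out[name] = int(bool(value))  # e.g. a qitem string -> 1, None -> 0
        else:
            out[name] = value
    return out

def sample_negatives(candidates, positives, known_negatives, k, seed=0):
    """Draw up to k extra presumed negatives from articles not yet labelled."""
    pool = sorted(set(candidates) - set(positives) - set(known_negatives))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(pool, min(k, len(pool)))

# Hypothetical feature record and article ids.
row = binarize({"country_wd": "Q38", "keyword_title": None, "category_distance": 2})
extra_negatives = sample_negatives(["a", "b", "c", "d", "e"], ["a"], ["b"], k=2)
print(row, extra_negatives)
```

Randomly sampled negatives are only presumed negatives, so some of them may in fact belong to CCC; this is a known trade-off of negative sampling.
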