
Research:Fossil fuel industries and climate change on Wikipedia

From Meta, a Wikimedia project coordination wiki
Created: 18:44, 12 January 2026 (UTC)
Contact/Collaborators: Nat Hernández (no affiliation)
Duration: 2025-10 – 2026-01
This page documents a completed research project.


Fossil fuel industries and climate change on Wikipedia: Automatic topic coverage assessment


We present a language-independent approach to analyze whether a given topic is represented in a set of Wikipedia articles of interest. In particular, we study whether climate change impact is represented in articles about fossil fuel industries, but we describe the methodology and provide the corresponding software code so that the analysis can be reproduced for other articles and topics.

We propose three separate and complementary automatic approaches to assess whether an article refers to a given topic: (1) by the topic of its internal wikilinks, as inferred from (a) their categories and (b) their associated Wikidata item ontology; (2) by the semantic similarity between its paragraphs and the topics of interest, using embedding models; and (3) by the nature of its web references. We provide both detailed and summarized results for each of these approaches.

We also discuss possible adjustments to offer these analyses as a service through an API, to be leveraged from a dedicated website or on-wiki using specialized templates.

Introduction


Wikipedia is an important part of the free knowledge ecosystem. In the Wikimedia community, individual contributors and groups usually focus on the topics they are most interested in, sometimes gathering in so-called WikiProjects. These projects usually have a list of Wikipedia articles they help maintain.

Assessing whether these articles touch on a specific topic of interest is a cumbersome and time-consuming task, given that these lists are sometimes long and that articles change all the time. In addition, some WikiProjects span multiple languages, raising the number of articles even further. For example, there are multiple WikiProjects aiming to improve Wikipedia articles related to climate change.[1]

In this project we wanted to come up with an automated, language-independent methodology to assess whether, and to what extent, topics of interest are covered in a series of Wikipedia articles. In particular, we wanted to evaluate whether articles about fossil fuel industries mention the impact that these companies have on climate change, in the Wikipedia language editions of the six official UN languages plus Portuguese. However, we wanted our approach to be extensible to other Wikipedia languages, as well as to other topics and articles, and, eventually, to be offered as a service through an API, to be used by on-wiki templates and gadgets or on a dedicated web page.

We proposed a three-way complementary approach, analyzing different aspects of the articles selected:

  1. Wikilink topics
  2. Semantic meaning of textual content
  3. Reference types

In each of these analyses, we define a series of queries and then match the article contents to them to decide whether and to what extent articles refer to the topics of interest.

In the following sections we explain how articles were selected and how each of these approaches was implemented, provide summarized and detailed results, and explain how to reproduce them and how to use our software for other, similar cases.

Methods


As introduced above, we used three separate and complementary approaches to evaluate whether and to what extent Wikipedia articles about fossil fuel industries refer to the impact of these companies on climate change.

We designed and ran our automatic analyses using Jupyter Notebooks, hosted in the Wikimedia PAWS environment. See below in this section for how to run the same or similar analyses yourself. In general, all analyses follow similar steps: (1) retrieval of article titles from a Wikidata Query Service (WDQS) query, (2) downloading, processing and caching of relevant article content, and (3) matching of queries against the processed content and output of results. The notebooks also provide code to iteratively run the previous steps for different Wikipedia language editions and to compare the results across them.

As mentioned in the Introduction, in developing these notebooks we tried as much as possible to leave room for other analyses, involving other articles, topics and languages, and to provide an efficient foundation in case these analyses are offered as a service in the future.

Article selection


To get a list of articles of interest we defined a query in WDQS to fetch all Wikidata items which are an instance of business (Q4830453) in the industry of petroleum industry (Q862571), and the corresponding Wikipedia articles linked to them.
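This selection can be sketched with a small SPARQL-building helper. The query below is an illustrative reconstruction, not necessarily the exact one used in the notebooks; it assumes the standard instance of (P31) and industry (P452) properties and the QIDs given above.

```python
# Illustrative reconstruction of the WDQS article-selection query; the
# exact SPARQL used in the notebooks may differ. QIDs from the text:
# business (Q4830453) and petroleum industry (Q862571).

def build_selection_query(lang: str) -> str:
    """Build a SPARQL query fetching petroleum-industry businesses and
    their articles in the given Wikipedia language edition."""
    return f"""
    SELECT ?item ?article WHERE {{
      ?item wdt:P31 wd:Q4830453;   # instance of: business
            wdt:P452 wd:Q862571.   # industry: petroleum industry
      ?article schema:about ?item;
               schema:isPartOf <https://{lang}.wikipedia.org/>.
    }}"""

# The query string can then be sent to the WDQS endpoint at
# https://query.wikidata.org/sparql (e.g. with requests, format=json).
```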

Wikilinks

In this approach we use outbound wikilinks (i.e., internal links to other Wikipedia articles) as a proxy to infer what an article is talking about. To do so, we analyze the nature of the linked-to articles in two different ways:

  1. Linked-to-articles' categories
  2. Linked-to-articles' Wikidata item’s ontology

Article data


Initially we used data from the pagelinks table (through one of the Wiki Replica databases), as this is much faster than downloading and parsing the articles' wikitexts. However, this proved problematic, since the table stores all outgoing internal links from the articles, including links present in transcluded templates. While in some cases this may be desired, as with article excerpts, in others it adds a lot of noise, as with navigation templates, often included at the end of articles.

We therefore decided to use the MediaWiki Action API to download and parse the wikitext instead.[note 1] This in turn brings some further advantages, such as knowing which sections the wikilinks appear in (not possible from the pagelinks table).[note 2] In addition, having the wikitext is needed for our semantic approach.

Wikitext fetching and parsing

To fetch article wikitext we use the MediaWiki Action API. After fetching the wikitext, we parse it using mwparserfromhell[2] to extract wikilinks. We could have used an API endpoint returning parsed text directly, but those available are limited to one article per request, whereas raw wikitext endpoints allow up to 50 articles per request.

To avoid fetching and processing the same wikitext every time the notebook’s kernel is restarted, we calculate and save a SHA-1 hash for each article wikitext downloaded, along with a list of wikilinks extracted. Next time the notebook is run, these hashes are compared against the revision hashes available from the database replica to decide whether wikitext must be downloaded and processed again or not.
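The caching logic can be sketched as follows. The file layout and function names are hypothetical; note also that, in practice, the replica's rev_sha1 values are base-36 encoded, so a conversion is needed before comparing them with hex digests.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("wikitext_cache")  # hypothetical cache location

def wikitext_sha1(wikitext: str) -> str:
    """Hex SHA-1 of the article wikitext, used as the cache key."""
    return hashlib.sha1(wikitext.encode("utf-8")).hexdigest()

def save_links(title: str, sha1: str, wikilinks: list) -> None:
    """Cache the extracted wikilinks along with the wikitext hash."""
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / f"{title}.json").write_text(
        json.dumps({"sha1": sha1, "wikilinks": wikilinks}))

def load_cached_links(title: str, current_sha1: str):
    """Return cached wikilinks if the stored hash matches, else None
    (meaning the wikitext must be downloaded and parsed again)."""
    path = CACHE_DIR / f"{title}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        if entry["sha1"] == current_sha1:
            return entry["wikilinks"]
    return None
```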

mwparserfromhell's wikilink extraction capabilities are limited. For example, it doesn't collapse double spaces in link targets. Therefore, to normalize wikilink target titles so that they can be found in the database tables, we process them through a MediaWiki Action API Parse action request.
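As an illustration, a hand-rolled cleanup might look like the sketch below. This is a simplified stand-in for the Parse-action request: real MediaWiki title normalization handles namespaces, HTML entities and several other cases.

```python
import re

def normalize_target(target: str) -> str:
    """Simplified wikilink-target normalization (a stand-in for the
    MediaWiki Parse-action cleanup; real normalization handles many
    more cases, such as namespaces and HTML entities)."""
    title = target.split("#", 1)[0]               # drop any section anchor
    title = title.replace("_", " ")               # underscores equal spaces
    title = re.sub(r" {2,}", " ", title).strip()  # collapse double spaces
    # The first letter of a title is case-insensitive on most wikis:
    return title[:1].upper() + title[1:]
```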


Once extracted and cleaned up, for each wikilink target article we fetch its categories and its associated Wikidata item from the database replica.

For querying, originally we wanted to also fetch category ancestors and Wikidata item classes and superclasses, recursively, up to a given height (bottom-up approach). However, the category and class trees grow rapidly, so we decided to go the other way round: starting from the category or Wikidata class given in a query and traversing down to a given depth (top-down approach).

The bottom-up approach may be reconsidered in the future for efficiency, especially in the case of offering these analyses as a service through an API, to avoid having to traverse down the category or Wikidata class tree for each new query.

Querying


Once wikilinks have been extracted from the articles selected, they can be matched against a series of custom queries based on (a) the categories they belong to, and (b) the Wikidata item they are associated with.

Category queries

A category query is made of a Wikidata QID linked to a Wikipedia category (e.g., Category:Greenhouse gas emissions (Q8500689)) and an optional subcategory depth; e.g., Q8500689|3.

A wikilink in an article will be marked as matching a query if the linked-to article is under the query category or one of its subcategories (down to the depth level specified).
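The matching rule can be illustrated with a toy sketch. The subcategory tree below is hypothetical and hard-coded; in the notebooks the tree comes from PetScan, as described next.

```python
from collections import deque

def categories_to_depth(root: str, subcats: dict, depth: int) -> set:
    """Collect a category and its subcategories down to `depth` levels
    (top-down traversal)."""
    found, queue = {root}, deque([(root, 0)])
    while queue:
        cat, level = queue.popleft()
        if level < depth:
            for sub in subcats.get(cat, []):
                if sub not in found:
                    found.add(sub)
                    queue.append((sub, level + 1))
    return found

def wikilink_matches(target_categories: set, query_categories: set) -> bool:
    """A wikilink matches if its target article belongs to any category
    in the query's category set."""
    return bool(target_categories & query_categories)

# Hypothetical toy subcategory tree:
toy_tree = {"Greenhouse gas emissions": ["Carbon emissions"],
            "Carbon emissions": ["Carbon offsets"]}
```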

We use Wikidata QIDs to refer to categories, instead of the Wikipedia category names directly, to make comparison across Wikipedia language editions possible. Note that although some Wikipedia categories may not be linked to an item in Wikidata, most are expected to be. This has been analyzed in a separate notebook for the English and Spanish Wikipedias, and discussed in Spanish Wikipedia.

Note that once the Wikipedia category has been retrieved for the given QID, the corresponding subcategories are fetched entirely from the corresponding Wikipedia, using PetScan.

The current top-down approach implies fetching the category and its subcategories for each query and comparing them against the categories of each wikilink. This adds some extra time every time a query is run, which may limit interactivity and re-implementation as an API service. To prevent this, we may move to a bottom-up approach instead, where category ancestors are fetched recursively for each wikilink target article. As explained above, we tried this first, but it proved resource-expensive, because the category tree grows rapidly. In the future we may reconsider it, including alternatives for traversing the category tree, such as that provided by the WDQS category graph.

Wikidata item ontology queries

Another way to query the nature of a wikilink target article is by the class of the Wikidata item associated with it.

Most Wikipedia articles are linked to a corresponding item in Wikidata.[3] In turn, an item in Wikidata can be an instance of (P31) or a subclass of (P279) other items.[4] This way, analogously to the Wikipedia category trees used in the previous section, we may use Wikidata class trees to describe wikilink target items.

In this context, a Wikidata query is made of a QID and a depth value. Using WDQS, the corresponding tree of items is retrieved and then matched against the items associated with the wikilink target articles.
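The same matching idea can be sketched on a toy subclass tree. In the notebooks the tree is retrieved from WDQS; the in-memory map below is a hypothetical stand-in, not the real class structure.

```python
def subclass_tree(root: str, p279_children: dict, depth: int) -> set:
    """Collect a Wikidata class and its subclasses down to `depth`
    levels; p279_children maps a class QID to its direct subclasses."""
    found, frontier = {root}, [root]
    for _ in range(depth):
        frontier = [child for parent in frontier
                    for child in p279_children.get(parent, [])
                    if child not in found]
        found.update(frontier)
    return found

def item_matches(instance_of: set, query_classes: set) -> bool:
    """A target item matches if any of its P31 classes is in the tree."""
    return bool(instance_of & query_classes)

# Toy edges (not the real ontology): Q574376 = human impact on the
# environment, Q5381256 = health and environmental impact of the
# petroleum industry.
toy_children = {"Q574376": ["Q5381256"]}
```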

Semantic


This approach consists of representing parts of the articles as points (vectors) in a multidimensional semantic space using a language model. These are usually referred to as embeddings. We then represent natural language queries in the same space using the same model. This way we can evaluate the semantic similarity between queries and article parts by measuring the distance between the vectors representing them. This should allow us to match expressions conveying similar meanings, in a way more flexible than exact or fuzzy text matching. For example, for query "carbon emissions", while exact or fuzzy matching techniques would only match textually similar passages such as "carbon emission" or "carbon dioxide emissions", a semantic approach should be able to match semantically similar passages such as "methane emissions" or "greenhouse gas emissions".
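The distance computation at the core of this approach can be shown in plain Python. In the notebooks the vectors come from a Sentence Transformers model (model.encode); the three-dimensional vectors below are made up purely to illustrate the ranking step.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for model.encode() output:
query_vec = [0.9, 0.1, 0.0]                       # "carbon emissions"
passages = {"methane emissions": [0.8, 0.2, 0.1],
            "company history":   [0.1, 0.1, 0.9]}

# Rank passages by semantic similarity to the query:
ranked = sorted(passages,
                key=lambda p: cosine_similarity(query_vec, passages[p]),
                reverse=True)
```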

For this approach there are a few important decisions to be made, mainly how to split articles into smaller parts, what language model to use to represent parts into vectors, and what queries to use.

It is worth noting that a similar approach has been used previously in Wikimedia projects for semantic search prototypes. See for example:

Note that semantic search in Wikimedia projects seems to be a field in active development at the moment.[5] It would therefore be worth revisiting the specifics of our implementation in the near future.

Embedding model


We use the Python library Sentence Transformers[6] to load embedding models and run inference with them (i.e., calculate embeddings). However, choosing an embedding model is not an obvious decision, and there are multiple criteria to consider.

We wanted to use multilingual models so that the same concepts would be represented similarly in the resulting semantic space, irrespective of the language in which article parts and queries are written. This is important for comparing results across languages in our study.

Embedding models are neural networks trained on huge amounts of data. Their size is usually measured by the number of model parameters. Bigger models usually have larger context windows, which determine the maximum length of the text passages we can feed into them. However, bigger models are also more computationally demanding at inference time.

We experimented with a few models and finally settled on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, although other promising models, such as Qwen3-Embedding-0.6B, remain to be tried.

Article data


We fetch article wikitexts and split them into sections using mwparserfromhell. Then we convert wikitext into plaintext using mwedittypes'[7] wikitext_to_plaintext function. Note that this doesn’t parse the wikitext, it just removes wikitext syntax. We then split the resulting plaintext into paragraphs.

To feed paragraphs into the embedding model, we split them into shorter passages according to the model’s context window. To do so, we split paragraphs into sentences with mwedittypes tokenizer.[note 3] We then group sentences into passages shorter than the model’s context window. As a result, each paragraph is split into one or more shorter passages.

In some cases this results in passages that are still longer than the context window, because individual sentences may be too long. In that case we just feed them into the model, even if they are longer than supported. On the other hand, passages that are too short may not convey enough meaning and are ignored.

Article and corresponding section and subsection titles are added for additional context at the beginning of all passages before embedding.

At the end of this process, we come up with a long list of passages, each made of one or more sentences. Each passage belongs to a paragraph (as explained above, there may be more than one passage per paragraph), which in turn belongs to a section, which belongs to an article.
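The grouping step above can be sketched as follows. This is a simplified version: it measures passage length in characters rather than model tokens, and the limits are hypothetical.

```python
def group_sentences(sentences, max_chars=200, min_chars=20):
    """Group consecutive sentences into passages of at most max_chars.
    Passages shorter than min_chars are dropped, mirroring the rule that
    very short passages may not convey enough meaning. A single sentence
    longer than max_chars is kept as its own (overlong) passage."""
    passages, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            passages.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        passages.append(current)
    return [p for p in passages if len(p) >= min_chars]
```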

Embedding models benefit from having graphics processing units (GPUs) available. Unfortunately, these are not available for community projects from Wikimedia servers. Therefore, the notebook has been written such that article passages are hashed and embeddings are cached and reused as long as passages remain the same. This way, embeddings using a given model must be calculated only once most of the time and can be reused across sessions. As a result, although the first time the code is run on PAWS for a given set of articles and a given model it will take considerable time, the following times it is run it should be much faster.

Alternatively, embeddings can be calculated on a more powerful device, such as on a personal computer, or on Google Colab,[note 4] and cached embeddings uploaded to PAWS. The notebook includes a cell to download cached embeddings from our PAWS instance.

References


Finally, the third approach we considered was the analysis of the articles' external references. Of the three approaches, this is possibly the most indirect way to assess the topics discussed in the articles. But it is the simplest, and it provides additional information that may be useful as well, for example whether references are academic, governmental, etc.

There is a similar and much more detailed tool, the Internet Archive Reference Explorer (IARE),[8] available online at https://internetarchive.github.io/iare/. But it is better suited for analyzing single articles, rather than a list of them as in our research.

Article data


Analogously to what we did in the Wikilinks analysis, we first started getting articles' external links through the externallinks table of the database replicas, as this is faster and better suited to later offering these analyses as a service. However, this data source only provides URLs, no matter where in an article they are found, and without further context. As in the Wikilinks analysis, we found this too noisy. For example, the More citations needed template automatically adds external links to search for sources on Google and JSTOR.

Therefore, we decided to fetch and parse articles' wikitext instead. For parsing we also used mwparserfromhell.[note 5] We extracted all links found in references, i.e., within <ref> tags, whether or not they appeared inside templates. Each reference was given an arbitrary unique ID. A reference may have one or more links (e.g., journal article references, including both journal link and DOI link; or archived references, including both original and archive URLs). DOIs found in references were converted to URLs and extracted too.
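The extraction step can be illustrated with the regex sketch below. The notebooks use mwparserfromhell for this; this simplified version misses many real-world cases (self-closing refs reusing a name, nested templates, bare DOIs, etc.).

```python
import re

def extract_ref_links(wikitext: str):
    """Extract (ref_id, url) pairs from <ref>...</ref> contents,
    converting DOIs found in |doi= parameters to URLs. A simplified
    regex sketch, not the mwparserfromhell-based notebook code."""
    links = []
    refs = re.findall(r"<ref[^>/]*>(.*?)</ref>", wikitext, re.DOTALL)
    for ref_id, ref in enumerate(refs):
        for url in re.findall(r"https?://[^\s|}\]<]+", ref):
            links.append((ref_id, url))
        for doi in re.findall(r"\bdoi\s*=\s*(10\.\S+?)(?=\s*[|}])", ref):
            links.append((ref_id, f"https://doi.org/{doi}"))
    return links
```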

Once extracted, the list of links for each article is cached along the current article's hash. This way we don’t have to reprocess the articles as long as they remain unchanged.


Some links found in references are links to archived pages on the Wayback Machine or archive.today. In these cases, we replaced these archive links with their corresponding original link.
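This replacement can be sketched as follows; the patterns cover only the common Wayback Machine and archive.today URL shapes and are a simplification.

```python
import re

ARCHIVE_PATTERNS = [
    # Wayback Machine: https://web.archive.org/web/<timestamp>/<original>
    re.compile(r"https?://web\.archive\.org/web/[^/]+/(https?://.*)"),
    # archive.today mirrors embedding the original URL in the path
    re.compile(r"https?://archive\.(?:today|ph|is|li)/(?:[^/]+/)?(https?://.*)"),
]

def unwrap_archive(url: str) -> str:
    """Return the original URL archived by a Wayback Machine or
    archive.today link; other URLs pass through unchanged."""
    for pattern in ARCHIVE_PATTERNS:
        match = pattern.match(url)
        if match:
            return match.group(1)
    return url
```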

We split URLs into their constituent parts:[9]

  • Scheme
  • Netloc
  • Path
  • Params
  • Query
  • Fragment

In turn, the netloc part is further split using tldextract[10] into:

  • Subdomain
  • Domain
  • Suffix

Finally, the suffix is further split into generic top-level domain (gTLD) and country-code top-level domain (ccTLD), based on IANA's Root Zone Database.[11]

Querying

[edit]

In this case, queries are defined as functions applied to the link properties, mostly the netloc or one of its sub-parts, namely the domain, suffix, or gTLD.
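Putting the splitting and querying steps together, a link query can be sketched as a predicate over the split URL. The suffix split below is naive (last hostname label); the notebooks use tldextract and IANA's Root Zone Database instead, and the example predicates are hypothetical.

```python
from urllib.parse import urlsplit

def link_parts(url: str) -> dict:
    """Split a reference URL into parts. Naive sketch: the suffix is
    just the last hostname label (unlike the tldextract/IANA-based
    split used in the notebooks) and the params part is not separated."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".") if parts.hostname else []
    return {"scheme": parts.scheme,
            "netloc": parts.netloc,
            "path": parts.path,
            "query": parts.query,
            "fragment": parts.fragment,
            "domain": labels[-2] if len(labels) > 1 else "",
            "suffix": labels[-1] if labels else ""}

# Example queries: predicate functions over the link parts.
def is_governmental(parts: dict) -> bool:
    return parts["suffix"] == "gov"

def is_academic(parts: dict) -> bool:
    return parts["suffix"] == "edu"
```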

Code


For the relatively limited scope of this particular project, we created a series of Jupyter Notebooks on Wikimedia PAWS to run the analyses, one notebook per analysis approach (i.e., wikilinks, semantic, and references). However, this was exploratory; to better support other users and projects, a dedicated service running on Toolforge or Wikimedia Cloud VPS would be useful, as described above.

To quickly inspect what was done in each notebook, you can simply check their HTML previews at https://public-paws.wmcloud.org/User:Diegodlh/wmuy-caad/.[note 6] But forking them into your own Jupyter instance (local or on PAWS, Google Colab, etc) offers more options, such as navigating the notebooks' sections from the Table of Contents sidebar, or running them.

Running the notebooks on PAWS proves helpful because of the privileged access to Wikimedia-hosted APIs and database replicas available. However, as discussed in the Semantic section above, this platform is limited because of the lack of GPU availability.

Each analysis approach has a corresponding notebook and a folder where cache and output are saved. See the corresponding README.md file in each folder for a detailed description of the cache and output files.

Running


The code in the notebooks is provided freely under the GNU GPLv3 license and is available on Wikimedia GitLab at https://gitlab.wikimedia.org/diegodlh/wmuy-caad#. The notebooks can be forked on PAWS, locally, or on third-party platforms such as Google Colab, to run analyses on the same or different articles and queries.

If you want to run the analyses again, or run custom analyses (e.g., for another group of articles, or for other queries), you will have to copy (fork) them into a Jupyter instance of your own. Note that notebooks relying on access to database replicas will only work on PAWS.

For custom analyses, adjust the necessary parameters according to your needs: most importantly, the list of article titles, the list of queries and, for comparisons across Wikipedia language editions, the list of languages. The list of titles may be specified manually, or retrieved programmatically from SPARQL queries or other sources. For example, the utils.py module alongside the notebooks provides a function to fetch titles from a PetScan query PSID.

Results


Full results can be complex, and we don't mean to provide a full overview here. You can check them in the corresponding output/ directory for each analysis, as explained in the Code section above. In this section we provide some example results.

At the time of writing (these results may change if the analyses are run again, because of the dynamic nature of Wikimedia projects), a total of 1,961 articles were analyzed, from Wikipedias in 7 different languages. Note that the number of articles differs across languages. This is because, although the WDQS query is the same for all languages, not all matching Wikidata items have an article in every Wikipedia language.

Wikilinks
Figure 1: Mean number of unique wikilinks per article across Wikipedia language editions.

A total of 28,043 unique wikilinks were analyzed. Note that the mean number of unique wikilinks per article differs across languages (see Figure 1). For example, while the average number of unique wikilinks per article in the Portuguese Wikipedia is around 23, it is around 40 in the English Wikipedia, for the articles selected. This may be related to different wikilinking practices, as well as to different article lengths, which we haven't analyzed.

See the first rows of the xx_category_comparison.csv and xx_wikidata_comparison.csv output tables for the number of articles and the mean number of unique wikilinks per article in every Wikipedia language.

Categories


Wikilinks analyzed from all Wikipedia language editions selected are categorized under 109,059 base-level categories (i.e., not including their corresponding ancestor categories).

For querying we experimented with the following 3 queries:

Figure 2: Wikilink category query results. Proportion of articles with at least one wikilink matching each of the queries defined, or at least one of them ("any"). Comparison across different Wikipedia language editions.

For example, in English Wikipedia, 88% of the articles selected have at least one wikilink whose target is in the “Greenhouse gases” category or one of its subcategories, whereas in Spanish Wikipedia this percentage is 6% (see Figure 2).

Although this may in fact mean that articles in Spanish Wikipedia link to fewer articles about Greenhouse gases, it may also be related to different categorization criteria across languages.

For example, the English Wikipedia article for ConocoPhillips links to the Carbon dioxide and Greenhouse gas articles, both in the Greenhouse gases category. On the other hand, the corresponding Spanish Wikipedia article, which is much shorter, doesn't link to any article in the corresponding category. This does indeed reflect more limited coverage in Spanish Wikipedia.

On the other hand, while both YPF's English and Spanish articles link to the Petrobras article, in English Wikipedia the Petrobras article is under one of the Greenhouse gases subcategories, which is not the case in Spanish Wikipedia. Here the difference doesn't speak of less coverage in Spanish Wikipedia, but rather of broader categorization criteria in English Wikipedia.

Therefore, these results should be considered with caution. Readers may be interested in forking the notebooks and trying with different queries, maybe some with shallower depths or other root categories.

Full results are available in the output/ directory of the wikilinks analysis folder:

  • {lang}_category_groups.csv: list of categories and subcategories for each query
  • {lang}_category_matches.csv: list of all wikilink matches found
  • {lang}_category_pivot.csv: results per article
  • {lang}_category_summary.csv: summary of Wikipedia language results
  • xx_category_comparison.csv: summarized results compared side-by-side across languages

See corresponding README.md file for a full description of these files.

Wikidata item ontology


For querying we experimented with the following 3 queries:

Figure 3: Wikilink Wikidata item ontology query results. Proportion of articles with at least one wikilink matching each of the queries defined, or at least one of them ("any"). Comparison across Wikipedia language editions.

For example, in English Wikipedia, around 5% of articles selected have at least one wikilink associated with a Wikidata item of the “human impact on the environment” (Q574376) class or one of its subclasses. In Spanish Wikipedia this percentage is a bit higher: 7% (see Figure 3).

For instance, the wikilink to "Mudanças climáticas" found in the Portuguese Wikipedia article for Petrobras is matched because it is linked to the item climate change (Q125928), which is an instance of the class human impact on the environment (Q574376).

Note that the queries are exploratory and may be problematic. For example, the human impact on the environment (Q574376) | depth=3 query is too broad and matches environmental impact topics other than climate change. For instance, wikilinks to articles linked to the item oil spill (Q220187) are matched because it is an instance of health and environmental impact of the petroleum industry (Q5381256), a second-level subclass of human impact on the environment (Q574376).

Likewise, queries carbon footprint (Q310667) | depth=0 and carbon dioxide equivalent (Q1933140) | depth=0 do match the specific concepts they represent, but fail to match relevant wikilinks such as those linked to greenhouse gas (Q167336). Additional queries, such as greenhouse gas (Q167336) | depth=1 and greenhouse gas emissions (Q106358009) | depth=2 may help capture these.

Full results are available in the output/ directory of the wikilinks analysis folder:

  • {lang}_wikidata_groups.csv: list of Wikidata items for each query
  • {lang}_wikidata_matches.csv: list of all wikilink matches found
  • {lang}_wikidata_pivot.csv: results per article
  • {lang}_wikidata_summary.csv: summary of Wikipedia language results
  • xx_wikidata_comparison.csv: summarized results compared side-by-side across languages

See corresponding README.md file for a full description of these files.

Semantic

Figure 4: Mean number of paragraphs per article (in the articles selected) across Wikipedia language editions.

We experimented with the following queries, although a better understanding of which queries work and which don't is still needed:

  • "Greenhouse gas emissions"
  • "Climate change impact"
  • "Climate change mitigation"
  • "Global warming"

In article and language summary tables, we used a 0.45 similarity threshold to dichotomously decide whether a passage (and therefore the paragraph it belongs to) matches a query or not.
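The dichotomization can be sketched as follows; the data structure (a paragraph id mapped to its passages' similarity scores) is hypothetical.

```python
THRESHOLD = 0.45  # similarity threshold used in the summary tables

def article_matches(passage_scores: dict, threshold: float = THRESHOLD):
    """Decide whether an article matches a query: a paragraph matches if
    any of its passages scores at or above the threshold, and an article
    matches if any of its paragraphs does. passage_scores maps a
    paragraph id to the similarity scores of its passages."""
    matching = {pid for pid, scores in passage_scores.items()
                if any(score >= threshold for score in scores)}
    return bool(matching), matching
```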

Figure 5: Semantic query results. Proportion of articles with at least one paragraph matching each of the queries defined, or at least one of them ("any"). Comparison across different Wikipedia language editions.

For example, around 5% of articles in English Wikipedia have at least one paragraph matching the “Greenhouse gas emissions” query. This proportion is higher in Spanish Wikipedia: around 7%; and even higher in Russian Wikipedia: around 12% (see Figure 5).

See Table 1 for a list of top-three matching passages for query “Greenhouse gas emissions” in all Wikipedia language editions. Note that two of the matching paragraphs are irrelevant (false positives). It is reasonable to expect that the false-positive rate would increase with lower similarity scores.

Some false negatives can also be found. That is, relevant passages scoring under the threshold. Lowering the threshold may reduce false negatives, though at the price of increasing false positives.

Also note that the queries we used help us evaluate whether an article refers to a given topic, but they don't reveal how it refers to it. For example, in Table 1, paragraphs referring to companies' efforts to reduce greenhouse gas emissions appear alongside others referring to their negative impact on climate change.

Table 1: Top-three matching passages for query “Greenhouse gas emissions”
Lang Article Passage English translation Score
en TotalEnergies TotalEnergies. Controversies. Environmental and safety records. 9% of global emissions from 1998 to 2015. - 0.668
en ConocoPhillips ConocoPhillips. Environmental record. Carbon footprint. ConocoPhillips reported Total CO2e emissions (Direct + Indirect) for the twelve months ending 31 December 2020 at 16,200 Kt (-4,300 /-21% y-o-y). Importantly, the figure does not include Scope 3 end-use emissions resulting from the consumption of fossil fuels produced by the company. - 0.666
en Equinor Equinor. Environmental record. Statoil was responsible for 0.52% of global industrial greenhouse gas emissions from 1988 to 2015. - 0.666
es Ecopetrol Ecopetrol. Ecopetrol es responsable del 0,27% de las emisiones industriales de gas de efecto invernadero a nivel mundial entre 1988 y 2015. Ecopetrol. Ecopetrol is responsible for 0.27% of global industrial greenhouse gas emissions between 1988 and 2015. 0.645
es Petróleos Mexicanos Petróleos Mexicanos. PEMEX está entre uno de los principales emisores de dióxido de carbono a nivel global, según un estudio de InfluenceMap. Petróleos Mexicanos. PEMEX is among the world's leading emitters of carbon dioxide, according to a study by InfluenceMap. 0.573
es Moeve Moeve. En términos de impacto ambiental, en 2021 Moeve se ubicó entre las diez empresas con mayores emisiones de gases de efecto invernadero en España, alcanzando 4,9 millones de toneladas de equivalentes de CO2. Moeve. In terms of environmental impact, in 2021 Moeve ranked among the ten companies with the highest greenhouse gas emissions in Spain, reaching 4.9 million tons of CO2 equivalents. 0.563
pt Petrobras Petrobras. A Petrobrás é responsável por 0,77% da emissão de gases efeito estufa na indústria no período entre 1988 até 2015 e, portanto, um dos maiores responsáveis para as mudanças climáticas que causam "riscos à saúde, aos meios de subsistência, à segurança alimentar, ao suprimento de água e ao crescimento econômico. Petrobras. Petrobras is responsible for 0.77% of greenhouse gas emissions in industry between 1988 and 2015 and is therefore one of the main contributors to climate change, which causes “risks to health, livelihoods, food security, water supply, and economic growth.” 0.582
pt Chevron (empresa) Chevron (empresa). De acordo com uma pesquisa de 2019, a Chevron foi responsável pela emissão de 43,35 bilhões de toneladas de CO₂ entre 1965 e 2017, sendo a segunda empresa com as emissões mais altas do mundo, atrás apenas da Saudi Aramco. Chevron (company). According to a 2019 study, Chevron was responsible for emitting 43.35 billion tons of CO₂ between 1965 and 2017, making it the second-highest emitter in the world, behind only Saudi Aramco. 0.536
pt ExxonMobil ExxonMobil. Os resultados da pesquisa de 2019 mostram que a ExxonMobil, com emissões de 41,90 bilhões de toneladas de equivalente CO₂ entre 1965 e 2017, foi a empresa com a quarta maior emissão mundial durante esse período. ExxonMobil. The results of the 2019 survey show that ExxonMobil, with emissions of 41.90 billion tons of CO₂ equivalent between 1965 and 2017, was the company with the fourth highest emissions worldwide during that period. 0.522
ar إني إني. الاستدامة. تُعدّ شَّرِكَة إني مَسؤُولَة عَن 0.59 بالمائة من انبعاثات غازات الاحتباس الحراري العالميَّة مُنذ 1988 إلى 2015. Eni. Sustainability. Eni is responsible for 0.59% of global greenhouse gas emissions from 1988 to 2015. 0.733
ar توتال إنرجيز توتال إنرجيز. وعلى غرار شركات الوقود الأحفوري الأخرى تمتلك «توتال إنرجي» تاريخاً معقداً من التأثيرات البيئية والاجتماعية بما في ذلك الخلافات متعددة الأطراف. فوفقاً لتقرير CDP Carbon Majors للعام 2017 كانت الشركة واحدةً من أكبر مئة شركةٍ تُنتج انبعاثات الكربون على مستوى العالم، وكانت مسؤولةً عن 0.9 ٪ من الانبعاثات العالمية ما بين العامين 1998 إلى 2015. Total Energies. Like other fossil fuel companies, Total Energies has a complex history of environmental and social impacts, including multiple controversies. According to the 2017 CDP Carbon Majors report, the company was one of the 100 largest carbon emitters globally, responsible for 0.9% of global emissions between 1998 and 2015. 0.635
ar بي بي بي بي. السجل البيئي. سياسة المناخ. في فبراير 2020 حددت بي بي هدفًا لخفض انبعاثات غازات الاحتباس الحراري إلى الصفر بحلول عام 2050، حيثُ تسعى بي بي للحصول على انبعاثات كربونية صافية صفرية في عملياتها وأنواع الوقود التي تبيعها الشركة، بما في ذلك الانبعاثات من السيَّارات والمنازل والمصانع. وقالت الشركة إنها تعيد هيكلة عملياتها في أربع مجموعات تجاريَّة لتحقيق هذه الأهداف. BP. Environmental record. Climate policy. In February 2020, BP set a target to reduce greenhouse gas emissions to zero by 2050, as BP seeks to achieve net-zero carbon emissions in its operations and the fuels it sells, including emissions from cars, homes, and factories. The company said it is restructuring its operations into four commercial groups to achieve these goals. 0.634
fr TotalErg TotalErg. Contribution au réchauffement climatique. Émissions de gaz à effet de serre. Selon The Guardian, Total serait à l’origine de 0,95 % des émissions industrielles mondiales de gaz à effet de serre entre 1988 et 2015. TotalErg. Contribution to global warming. Greenhouse gas emissions. According to The Guardian, Total was responsible for 0.95% of global industrial greenhouse gas emissions between 1988 and 2015. 0.799
fr TotalEnergies TotalEnergies. Contribution au réchauffement climatique. Émissions de gaz à effet de serre. Selon The Guardian, Total serait à l’origine de 0,95 % des émissions industrielles mondiales de gaz à effet de serre entre 1988 et 2015. TotalEnergies. Contribution to global warming. Greenhouse gas emissions. According to The Guardian, Total was responsible for 0.95% of global industrial greenhouse gas emissions between 1988 and 2015. 0.798
fr TotalErg TotalErg. Contribution au réchauffement climatique. Émissions de gaz à effet de serre. Elles correspondent aux scopes 1 et 2 d'un bilan des émissions de gaz à effet de serre (BEGES) et s'élèvent à environ 40 millions de tonnes équivalent en 2019 selon Total. Les émissions dues à la combustion des produits pétroliers et du gaz naturel utilisés par les consommateurs, qui correspondent à une partie du d'un BEGES. TotalErg. Contribution to global warming. Greenhouse gas emissions. These correspond to scopes 1 and 2 of a greenhouse gas emissions inventory (GHG) and amount to approximately 40 million tons of CO2 equivalent in 2019, according to Total. Emissions from the combustion of petroleum products and natural gas used by consumers, which correspond to part of the of a GHG inventory. 0.798
ru Новатэк Новатэк. Деятельность. Показатели деятельности. Согласно исследованиям некоммерческой организации , «Новатэк» ответственен за 0,14 % выбросов глобальных индустриальных парниковых газов в период с 1988 до 2015 года. Novatek. Activities. Performance indicators. According to research by a non-profit organization, Novatek was responsible for 0.14% of global industrial greenhouse gas emissions between 1988 and 2015. 0.762
ru Лукойл Лукойл. Компания ответственна за 0,75 % выбросов глобальных индустриальных парниковых газов в период с 1988 по 2015 год. Lukoil. The company is responsible for 0.75% of global industrial greenhouse gas emissions between 1988 and 2015. 0.633
ru TotalEnergies TotalEnergies. Деятельность. Подразделения по состоянию на 2021 год:. Газ и возобновляемая энергетика — добыча природного газа, переработка его в сжиженный газ, газовые тепловые, солнечные и ветряные электростанции общей мощностью 43 ГВт; выручка 30,7 млрд долларов. TotalEnergies. Activities. Divisions as of 2021: Gas and renewable energy — natural gas production, processing into liquefied natural gas, gas-fired, solar, and wind power plants with a total capacity of 43 GW; revenue of $30.7 billion. 0.597
zh 殼牌 殼牌. 从1988年到2015年,壳牌占了全球工业温室气体排放量的1.67%。 Shell. From 1988 to 2015, Shell accounted for 1.67% of global industrial greenhouse gas emissions. 0.721
zh 雪佛龍 雪佛龍. 新政策和发展. 雪佛龙已经逐步开始减少温室气体排放,来获得更清洁的能源。在美国的石油公司中,雪佛龙因为在可替代能源方面的巨大投资并定下目标来减少温室气体排放而获得了最高的评价。雪佛龙为全球地热能最大生产商,所产生的能源足可供应给超过七百万家庭使用。 Chevron. New Policies and Developments. Chevron has progressively begun reducing greenhouse gas emissions to achieve cleaner energy. Among U.S. oil companies, Chevron has earned the highest ratings for its substantial investments in alternative energy and its established goals to reduce greenhouse gas emissions. As the world's largest producer of geothermal energy, Chevron generates enough power to supply over seven million households. 0.631
zh 道达尔能源 道达尔能源. 参见. 2007年英国汽油污染 TotalEnergies. See also. 2007 United Kingdom petrol contamination 0.561

Full results are available in the output/ directory of the semantic analysis folder:

  • {lang}_matches.tsv: list of all passages and their similarity scores for each query
  • {lang}_pivot.tsv: results per article
  • summary.tsv: summarized results compared side-by-side across languages

See corresponding README.md file for a full description of these files.

References

[edit]
Figure 6: Mean number of web references per article across Wikipedia language editions.

Reference links were matched against the following queries:

  • isGov: netloc suffix is .gov or one of the equivalents listed in English Wikipedia’s .gov article.
  • isAcad: netloc is doi.org or dx.doi.org (note that DOIs found in references are first converted to their corresponding doi.org links).
  • isOfficial: netloc matches company’s official site as listed in Wikidata.
  • isInfluenceMap: domain is one of InfluenceMap’s influencemap, lobbymap, financemap, or carbonmajors. InfluenceMap is a think tank focused on analyzing business and finance impact on climate change.[12]
  • isOrg: gTLD is org.
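As an illustration, matching a reference URL against queries of this kind reduces to simple checks on the parsed netloc. The sketch below uses Python's urllib.parse; the predicate names mirror the queries above, but the suffix list is abridged and the helper itself is hypothetical (the actual notebooks may differ):

```python
from urllib.parse import urlparse

# Abridged suffix list; the full analysis also uses the international
# equivalents listed in English Wikipedia's .gov article.
GOV_SUFFIXES = (".gov", ".gov.uk", ".gc.ca")
DOI_HOSTS = {"doi.org", "dx.doi.org"}

def classify_link(url: str) -> dict:
    """Return a boolean match for each reference query (illustrative subset)."""
    netloc = urlparse(url).netloc.lower()
    return {
        "isGov": netloc.endswith(GOV_SUFFIXES),
        "isAcad": netloc in DOI_HOSTS,
        "isOrg": netloc.rsplit(".", 1)[-1] == "org",
    }

print(classify_link("https://doi.org/10.1000/xyz"))
# → {'isGov': False, 'isAcad': True, 'isOrg': True}
```

Note that the queries are not mutually exclusive: a doi.org link matches both isAcad and isOrg, since its gTLD is org.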
Figure 7: Reference query results. Proportion of articles with at least one web reference matching each of the queries defined. Comparison across Wikipedia language editions.

For example, in English Wikipedia, 7% of articles cite at least one source with a DOI. The corresponding figures are 3% and 4% for the Spanish and Portuguese Wikipedias, respectively.

In the case of governmental sources, these percentages are 29%, 19% and 11% for English, Spanish and Portuguese Wikipedias, respectively.

Although understanding the nature of references may be of interest in itself, this approach has very limited capacity to evaluate whether the topic of interest is represented in the articles.

Full results are available in the output/ directory of the references analysis folder:

  • {lang}_links.csv: A list of links found in references and whether they match each query
  • {lang}_refs.csv: Query results aggregated per reference
  • {lang}_arts.csv: Query results aggregated per article
  • {lang}_lang.csv: Summary of results for a given Wikipedia language edition
  • {lang}_netloc.csv: Link count per netloc
  • {lang}_gtld.csv: Link count per gTLD

See corresponding README.md file for a full description of these files.

Discussion

[edit]

This project was mostly an exploration. We wanted to evaluate different ways of assessing whether, and to what extent, a group of Wikipedia articles refer to a given topic. In particular, we focused on articles about fossil fuel companies and their impact on climate change, but we aimed for a methodology that could be applied to other articles and topics as well. We also wanted it to be as efficient as possible, laying the foundations for offering it as an online service in the future.

In this section we briefly summarize what we accomplished, the limitations of our approaches, and future directions.

Wikilinks

[edit]

We evaluated whether climate change impact is mentioned in a list of articles in different language Wikipedias by characterizing the wikilinks found in these articles, via the categories and the Wikidata items related to them. Given a set of boolean queries, we provided a list of matching wikilinks for each of these queries and compared results across articles and across languages.

On the one hand, using wikilink categories to assess subtopics in an article presents some limitations. One of these is that sometimes articles are categorized too broadly. For example, in the English Wikipedia, the ExxonMobil article is categorized under the Climate change denial category. Hence, if an article links to the ExxonMobil article one may wrongly assume that it is talking about Climate change denial. Article categorization criteria may also differ among Wikipedia language editions, thus complicating comparison of results across languages.

One possible workaround for the situation described above may be using a weighted score: a category match would score higher the smaller the number of categories a wikilink belongs to. Alternatively, one may also try limiting the depth of the category queries.
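The weighted score idea can be sketched as follows. This is a toy illustration with hand-written category sets; in practice the categories of each wikilink would be fetched from the wiki:

```python
def weighted_category_score(wikilinks: dict, query_categories: set) -> float:
    """Sum, over wikilinks matching the query, a weight inversely
    proportional to the number of categories each wikilink belongs to.
    `wikilinks` maps a linked article title to its set of categories."""
    score = 0.0
    for title, categories in wikilinks.items():
        if categories & query_categories:
            # A match in a narrowly categorized article counts more.
            score += 1.0 / len(categories)
    return score

links = {
    "ExxonMobil": {"Climate change denial", "Oil companies", "1999 establishments"},
    "Global warming": {"Climate change"},
}
print(weighted_category_score(links, {"Climate change denial", "Climate change"}))
# → 1.3333333333333333 (1/3 from ExxonMobil plus 1 from Global warming)
```

Under this scheme a link to a broadly categorized article such as ExxonMobil contributes much less to a "Climate change denial" match than a link to a narrowly categorized one.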

On the other hand, wikilink Wikidata item ontology analysis seems to be more specific, although it may lead to false negatives. It should also be noted that, although all selected articles have an associated Wikidata item (because they were retrieved with a WDQS query), this may not be the case for all wikilinks.

In both cases, the queries used were exploratory and we think they only scratch the surface of what could be possible with our approaches. We would like to invite others to experiment with these and other queries to come up with other and possibly more insightful results. Wikipedia categories and Wikidata items found to be related to WikiProject Climate Change may be worth considering.

A third way of characterizing wikilinks was considered: using the articletopic they are automatically classified into. However, we found technical limitations to implementing this approach (namely, the lack of access to CloudElastic replicas from PAWS),[13] and in the end we did not think it would be informative enough to be worth the effort.

Semantic

[edit]

We implemented a way to automatically and language-independently assess whether a group of Wikipedia articles mention a specific topic using a semantic search approach with embedding models, and we achieved reasonably good results using CPUs only. However, it is worth noting that working with such language models presents technical and theoretical limitations.
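The core of this approach reduces to comparing embeddings by cosine similarity and keeping passages that score above a threshold. The sketch below uses toy vectors in place of real embeddings (which in our case came from sentence-transformers models); the function names are illustrative:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def semantic_matches(paragraph_vecs, query_vec, threshold=0.5):
    """Return (index, score) pairs for paragraphs whose embedding is at
    least `threshold`-similar to the query embedding."""
    scored = [(i, cosine(p, query_vec)) for i, p in enumerate(paragraph_vecs)]
    return [(i, s) for i, s in scored if s >= threshold]

# Toy 3-dimensional "embeddings"; real ones would come from an embedding model.
paragraphs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0]
print(semantic_matches(paragraphs, query))  # indices 0 and 1 pass the threshold
```

In the actual analysis, the similarity scores reported in the tables above play the role of `s`, and the threshold is a tunable parameter.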

Concerning technical limitations, our analyses and any future services we may provide are limited by the lack of GPU availability in Wikimedia-provided resources. This has been discussed elsewhere, for example on Phabricator.

Note that some alternatives remain which we haven’t tried yet, such as using OpenVINO for CPU inference,[14] or using a custom inference service running on Wikimedia Cloud VPS.[note 7] Also note that current developments and experimentation with OpenSearch 3 and vector search in Wikimedia may simplify these approaches in the future.[5][15][16]

Other limitations are not technical but rather inherent to the use of language models in general: how have these models been trained and what biases may come with them? Will results reproduce and reinforce these biases? Would they work better in some languages than in others?

Another textual approach we could have tried, in addition to semantic search, is exact or fuzzy keyword matching. This would be more limited than semantic search, since it requires a detailed list of keywords, which in addition would not be language-independent; however, it would be more transparent and reliable.
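For comparison, such a keyword-matching baseline could look like the following sketch; the keyword lists are hypothetical and would need careful curation per language:

```python
import re

# Hypothetical, language-specific keyword lists (illustrative only).
KEYWORDS = {
    "en": ["climate change", "greenhouse gas", "global warming", "carbon emissions"],
    "es": ["cambio climático", "gases de efecto invernadero"],
}

def keyword_matches(text: str, lang: str) -> list:
    """Return the keywords found in a passage (case-insensitive exact phrases)."""
    found = []
    for kw in KEYWORDS.get(lang, []):
        if re.search(re.escape(kw), text, flags=re.IGNORECASE):
            found.append(kw)
    return found

passage = ("Chevron was responsible for emitting 43.35 billion tons of CO2, "
           "contributing to climate change.")
print(keyword_matches(passage, "en"))
# → ['climate change']
```

This makes the transparency trade-off concrete: every match can be traced to an explicit keyword, but each language needs its own curated list, and paraphrases (e.g. "warming of the planet") are missed unless added explicitly.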

References

[edit]

We managed to provide a way to rapidly analyze the nature of web references in a series of Wikipedia articles. Although our outputs are not as detailed as those available from other specialized tools, such as Internet Archive's Reference Explorer,[8] our approach delivers basic information for a large number of articles at once.

This last approach was the least useful to evaluate whether articles mention a given topic, given that it is difficult to infer this solely from the URL of the cited source. However, we believe it provides useful complementary information that may help WikiProject members and independent contributors attain their goals.

General

[edit]

We prototyped a methodology to automatically and language-independently evaluate whether and to what extent a given topic is represented in a list of articles of interest. We did so using three separate and complementary approaches.

Results from different analyses can be interpreted separately. However, in the future it could be useful to check whether they are correlated. We could do this at the article level, or more finely at the section or paragraph level. For example: is a paragraph more likely to match one query given that it matches another?
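Such a per-paragraph correlation check could be sketched as follows, comparing the overall rate of one query with its rate conditional on another (the query names and data here are invented for illustration):

```python
def conditional_rate(rows, a, b):
    """Given per-paragraph boolean query results, compare the overall
    positive rate of query `a` with its rate among paragraphs where
    query `b` is positive."""
    overall = sum(r[a] for r in rows) / len(rows)
    given_b = [r for r in rows if r[b]]
    conditional = sum(r[a] for r in given_b) / len(given_b) if given_b else 0.0
    return overall, conditional

# Toy per-paragraph results for two hypothetical queries.
rows = [
    {"mentions_emissions": True,  "cites_doi": True},
    {"mentions_emissions": True,  "cites_doi": False},
    {"mentions_emissions": False, "cites_doi": False},
    {"mentions_emissions": False, "cites_doi": False},
]
print(conditional_rate(rows, "mentions_emissions", "cites_doi"))
# → (0.5, 1.0): in this toy data, emissions mentions are more frequent
#   among paragraphs that cite a DOI
```

A conditional rate well above the overall rate would suggest the two queries are positively associated, which could then be tested more formally.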

Something else we could have checked is whether differences between Wikipedia language editions are tied to the official language of the country where companies are based.

The three approaches described in this report are all "top-down": we define a series of queries and then survey the article contents (namely wikilinks, passages and references) to decide whether they match the queries or not. We also wanted to try an analogous "bottom-up" strategy, inferring from the article contents the topics they mostly cover. We tried this but encountered difficulties, such as managing a rapidly growing category tree and the limited ability of topic modelling techniques to capture marginally represented topics. Consequently, we decided not to proceed with this bottom-up description, but we mention it here as a potentially interesting complementary analysis to pursue.

Future directions

[edit]

In this exploratory project we proposed and implemented a way to automatically and language-independently evaluate topic coverage in a group of Wikipedia articles. Although we openly shared our code so that others can evaluate their own topics and articles, we understand that doing so requires some technical skill.

In the future, we would like to refactor our notebooks into a service provided through an API, so that it can be leveraged from Template gadgets on-wiki, or from a dedicated website.

Toward this end we think we could benefit from better normalizing our outputs. For example, we could return a list of elements for each analysis (wikilinks, passages, and references, respectively) and a 0-1 matching score for each element-query pair. In addition, we could provide results at the paragraph level, returning for each article and analysis a list of paragraphs, whether each includes at least one matching element, and the article section it belongs to.
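As a sketch of what such a normalized API output might look like (the field names are illustrative, not a finalized schema):

```python
import json

# Hypothetical normalized response for one article and one analysis.
# Each element carries a 0-1 score per query, plus the section it belongs to.
response = {
    "article": "ExxonMobil",
    "language": "en",
    "analysis": "semantic",
    "elements": [
        {
            "element": "ExxonMobil was the company with the fourth highest emissions...",
            "section": "Environmental record",
            "scores": {"climate_change_impact": 0.52, "lobbying": 0.31},
        }
    ],
}
print(json.dumps(response, indent=2))
```

Keeping the same envelope across the three analyses (with wikilinks, passages, or reference URLs as the `element`) would let a website or on-wiki template consume all of them through a single code path.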

Acknowledgements

[edit]

We would like to thank Isaac Johnson and David Coronel for their careful and detailed suggestions.

We would also like to thank the Climate Action Against Disinformation (CAAD) coalition for their financial support.

See also

[edit]

References

[edit]
  1. "Climate change portal/Resources". Meta-Wiki. Retrieved 2026-01-26. 
  2. "MWParserFromHell v0.7 Documentation". mwparserfromhell 0.7.2 documentation. Retrieved 2026-01-26. 
  3. "Help:Sitelinks". Wikidata. Retrieved 2026-01-27. 
  4. "Help:Basic membership properties". Wikidata. Retrieved 2026-01-27. 
  5. a b Wikimedia Foundation Information Retrieval working group. "Report: Improving How Readers Find Information on Wikipedia". MediaWiki. Retrieved 2026-01-28. 
  6. "SentenceTransformers Documentation". Sentence Transformers documentation. Retrieved 2026-01-27. 
  7. "Edit Types". Wikimedia GitLab. Retrieved 2026-01-27. 
  8. a b Internet Archive (2026-01-22), Internet Archive Reference Explorer (IARE) App, GitHub, retrieved 2026-01-27 
  9. "urllib.parse — Parse URLs into components". Python documentation. Retrieved 2026-01-27. 
  10. Kurkowski, John (2026-01-26), tldextract, GitHub, retrieved 2026-01-27 
  11. "Root Zone Database". IANA. Retrieved 2026-01-27. 
  12. "Home". InfluenceMap. Retrieved 2026-01-29. 
  13. "Getting CirrusSearch data from PAWS". Wikimedia Cloud Services mailing list. 2025-10-13. Retrieved 2026-01-30. 
  14. "Speeding up Inference". Sentence Transformers documentation. Retrieved 2026-01-30. 
  15. "T412338 Q2 FY2025-26 Goal: Semantic Search - Embeddings Service for MVP". Phabricator. Retrieved 2026-01-30. 
  16. "T409898 Set up OpenSearch instance supporting vector search". Phabricator. Retrieved 2026-01-30. 

Notes

[edit]
  1. It is worth noting that because we started writing the notebooks with the database approach in mind, some of this code remains in the final notebooks, which are a somewhat complex combination of the database and API approaches.
  2. Although possible, what article section wikilinks appear in has not been indicated in our analysis.
  3. It has been suggested that we may use mwtokenizer instead.
  4. Although Google Colab offers GPU computation for free, there are limits which are not transparent; see related discussion on Reddit.
  5. Alternatively, using wiki-references-extractor was considered but not tried.
  6. PAWS notebook previews sometimes result in a 502 gateway error. This has been reported to the Wikimedia Cloud Services mailing list. In the meantime, you may add ?format=raw at the end of the URL to download and fork the notebook on your own instance.
  7. As suggested by Isaac Johnson. See for example https://github.com/wikimedia/research-api-endpoint-template/blob/mentorship-search/model/main.py.