WikiCred/2022 CFP/WiSCoM - Wikipedia Source Controversiality Metrics

WiSCoM - Wikipedia Source Controversiality Metrics
A WikiCred 2022 Grant Proposal
Project Type: Research + Output
Author: Andreas Kaltenbrunner (Krtik)
Contact: andreas.kaltenbrunner(_AT_)isi.it
Requested amount: 9,960 USD
Award amount: Unknown
What is your idea?

The main goal of WiSCoM is to generate and assess actionable metrics for source controversiality in Wikipedia. To guarantee universality (i.e. applicability to all Wikipedia language editions) and knowledge equity, and to avoid dependence on the specifics of a given language, we will rely solely on language-agnostic approaches, mainly using data from editing activity.

The proposal will build upon existing work of the Contropedia project, which has already shown the potential of language-agnostic approaches to measure the controversiality of wikilinks in a given article, and will develop similar techniques to approximate the controversiality of a given source across multiple articles and Wikipedia language editions.

Unlike the Contropedia project, WiSCoM will calibrate and assess the quality of the developed metrics by comparing them with the existing lists of Reliable/Perennial sources in the 9 language editions where such a list exists (at the time this proposal is written).

To deliver an actionable first proof-of-concept prototype, WiSCoM will start by focusing only on articles related to the topic of Climate Change. This choice is motivated by recent reports by BBC News (Dec 2021) of knowledge integrity issues on non-English Wikipedias, ranging from neglect - science that's out of date - to a lack of balance, to active disinformation. This issue has caught the attention of different movement volunteers and WMF staff, providing abundant data and contextual qualitative knowledge. Once proven useful for articles related to the topic, WiSCoM will seek support to be scaled up to the entire set of articles across the over 300 language editions of Wikipedia.

Workplan:

The proposal consists of 3 phases:

  • Phase 1: Data preparation

    WiSCoM will build a comprehensive set of articles related to the chosen topic in the existing language editions of Wikipedia.

    To identify the subset of Wikipedia articles relevant to climate change, WiSCoM will follow a methodology similar to the one used, for example, in Markusson et al. (2016) [1] for articles related to Geoengineering and detailed in this blog-post [6] of the EMAPS project. The idea is to start at the Wikipedia page for the Category Climate Change in the English Wikipedia, and then expand and manually prune subcategories up to a certain depth. Additionally, WiSCoM will contrast and extend the obtained list of articles with this community curated list.

    Once the set of articles relevant to the topic is identified, WiSCoM will find their corresponding versions in other language editions through inter-wiki links and retrieve their complete edit histories. Finally, all the references appearing in the revisions of these Wikipedia articles will be collected and standardised. In particular, WiSCoM will identify the source domain of every reference link.

  • Phase 2: Metric development & Evaluation

    Using the data collected in Phase 1, and based upon the definition of controversiality used in the Contropedia project (see Borra et al., 2015 [2]), WiSCoM will develop metrics of controversiality first related to a specific reference and then aggregated to the corresponding source domains. Alternatively, the possibility of directly developing controversiality metrics for the source domain will also be explored.

    The approach of Borra et al. (2015) [2] is based on counting substantial disagreeing edits (i.e., edits which involve some deletions of text but are not marked as vandalism) of sentences which contain the element under study (a reference in our case). This approach is only a starting point that will serve as a baseline and will be adapted to reach better results in the evaluation phase.

    To evaluate the metrics against the categories used in the lists of Reliable/Perennial source domains, WiSCoM will map them into at least three categories (Generally reliable, Generally unreliable, undecided), with optional separation into more categories. For this mapping, WiSCoM will explore both simple, straightforward discretizations of the controversiality metrics and explainable machine learning algorithms that predict the categories using different variants of the controversiality metrics and other language-agnostic descriptors as input features.

    WiSCoM will perform several iterations of metric design and evaluation, focusing each time on how to improve upon the misclassified source categories. It will also compare the results with the scores obtained from the https://iffy.news/ project, when available, and from Media Bias Monitor and Media Bias Fact Check, as used in a recent study (Yang and Colavizza, 2022) [3] about potential biases of news sources on Wikipedia.

    Finally, to avoid overfitting, the project will also verify its results and the underlying processing pipeline on articles related to another set of topics not directly related to climate change. We plan to focus on articles related to vaccine hesitancy, starting from the Vaccine hesitancy category page, a timely topic given the recent controversies about the COVID-19 vaccines.

  • Phase 3: Building of a prototype

    The final step of WiSCoM will be the development of a proof-of-concept prototype capable of extracting the developed metrics for the sources of a set of given input articles.
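As an illustration of the baseline described in Phases 1 and 2, the following minimal Python sketch counts disagreeing edits per reference and aggregates them by source domain. It is not the project's implementation: the edit records, their field names (`deleted_text`, `is_vandalism`, `refs_in_sentence`) and the `example.org`/`example.com` URLs are hypothetical, and extracting such records from real revision histories is assumed to have happened beforehand.

```python
from collections import defaultdict
from urllib.parse import urlparse


def source_domain(url: str) -> str:
    """Return the host of a reference URL, lower-cased and without 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def disagreeing_edit_counts(edits):
    """Count, per reference URL, the edits that delete text in a sentence
    containing that reference and are not marked as vandalism."""
    counts = defaultdict(int)
    for edit in edits:
        if edit["deleted_text"] and not edit["is_vandalism"]:
            for url in edit["refs_in_sentence"]:
                counts[url] += 1
    return dict(counts)


def aggregate_by_domain(ref_counts):
    """Sum the per-reference counts over the source domain of each reference."""
    totals = defaultdict(int)
    for url, n in ref_counts.items():
        totals[source_domain(url)] += n
    return dict(totals)


# Hypothetical toy records; real input would come from article edit histories.
toy_edits = [
    {"deleted_text": "old claim", "is_vandalism": False,
     "refs_in_sentence": ["https://www.example.org/a"]},
    {"deleted_text": "blanked", "is_vandalism": True,      # ignored: vandalism
     "refs_in_sentence": ["https://www.example.org/a"]},
    {"deleted_text": "", "is_vandalism": False,            # ignored: no deletion
     "refs_in_sentence": ["https://example.com/x"]},
    {"deleted_text": "disputed figure", "is_vandalism": False,
     "refs_in_sentence": ["https://example.org/b", "https://example.com/x"]},
]

domain_counts = aggregate_by_domain(disagreeing_edit_counts(toy_edits))
print(domain_counts)
```

On the toy records, two qualifying edits touch `example.org` and one touches `example.com`. Raw counts like these would still need normalising (for instance by how often a domain is cited) before domains can be compared, which is one of the adaptations left open by the baseline.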


Why is it important?

Wikimedia's Strategy 2030 report identified misinformation and disinformation as threats to the Wikimedia movement's goal of making free knowledge available to all. Specifically, "Wikimedia projects are vulnerable to government, political, cultural, or profit-driven censorship and misinformation campaigns, as well as outright falsified content".

Furthermore, there is evidence that identifying non-reliable sources is an effective tool to combat disinformation and increase the knowledge integrity of Wikipedia. This project will combine the knowledge and community effort stored in the few language editions that already have perennial sources lists with the interaction between community members through edits and reverts around references, in order to generate measurements of source controversiality. This will make information about source credibility available to editors as well as readers, and allow the generation of perennial source lists in many other language editions.


Link(s) to your resume or anything else (CV, GitHub, etc.) that may be relevant

Links to pages of the project PI Andreas Kaltenbrunner:

Links to pages of co-PI Kyriaki Kalimeri:

Links to pages of co-PI Yelena Mejova:


Is your project already in progress?

No, but we are in an advanced planning phase.


How is this project relevant to credibility and Wikipedia?

WiSCoM will provide both Wikipedians and non-Wikipedians with metrics to assess the controversiality of sources. Wikipedians can use these metrics to choose credible source domains when citing references in Wikipedia articles. Furthermore, since special focus is placed on the language-agnostic nature of the algorithms and measurements, this will be especially helpful for improving credibility in understudied languages, for which few if any tools or algorithms would otherwise be available.

WiSCoM will furthermore assist Wikipedians in the curation of new lists of Reliable/Perennial sources in language editions where such a list does not yet exist. It can also give an initial assessment of new source domains, to inform Wikipedians when adding sources to the existing lists, and of source domains whose credibility might have changed over time.

Finally, the controversiality metrics developed may also prove useful outside of Wikipedia by giving credibility signals about a source domain to any interested user.


What is the ultimate impact of this project?

The project benefits three key demographics:

  • Wikipedia editors: Editors of language versions with no list of perennial sources will be able to access the source controversiality metrics, which will allow them to better assess the reliability of the source of a potential reference. The articles in these Wikipedias would benefit as editors gain the ability to combat misinformation and disinformation more effectively.
    Furthermore, these metrics can assist them in building a list of perennial sources more efficiently. Even editors on language versions with perennial sources lists can benefit as it will allow them to assess new sources not yet added to the list or detect changes in credibility of sources already listed.
  • Internet users: Media consumers will benefit as they gain access to more reliable information in languages which are not the focus of large research and development efforts. The source controversiality score may furthermore be used to evaluate the credibility of websites, for example by combining the results of this project with the browser plugins developed in the Sourceror project. This could help improve the media literacy of internet users.
  • Researchers: The controversiality score developed by the proposal and the list of reliable sources generated through it will be a valuable resource to be used in misinformation and disinformation research.

Furthermore, a scoring system providing the developed metrics on demand (e.g., through an API developed in a future extension of the project) would be an important asset for Wikimedia developers building tools for anti-disinformation activities. It would also be of interest to related projects such as CiteUnseen or the Sourceror project.


Can your project scale?

Yes, the project is potentially scalable to all Wikipedia language editions and to all types of articles and sources which have been referenced above a certain number of times (to be determined in the course of the project). Computational limits may arise only from server capacity and bandwidth, and these requirements are not foreseen to be very high.


Why are you the people to do it?

Andreas Kaltenbrunner, as PI of the Contropedia project (http://contropedia.net/), has already shown the potential of language-agnostic computational approaches to measure the controversiality of wikilinks in a given article, as well as his capacity to coordinate a project of this magnitude and bring it to a successful end. He has co-authored more than 10 scientific articles about different aspects of Wikipedia and will have the support of his employer, the ISI Foundation.

The ISI Foundation (www.isi.it) is a private research institution located in Turin, Italy. Its mission focuses on promoting scientific interchange and cooperation at the highest degree of quality, both in terms of creativity and originality of research. It aims to be a centre of high-level interdisciplinary training in the fields of Data Science, Complex Systems and Life Sciences, with an outstanding record of participation in EU framework research and innovation programs. More recently, the institute has opened a research line on Data Science for Social Impact. The activity in this research domain is mainly focused on bringing the expertise of ISI researchers in data science, data mining and machine learning to the service of non-profit organisations working in the fields of social innovation, philanthropy, international development and humanitarian action. Within this domain, the ISI Foundation has also established research partnerships with various organisations of the United Nations such as UNICEF, Data2X and the UN Global Pulse.

Every year, the ISI Foundation awards up to 10 scholarships for pre-doctoral researchers through the Lagrange Program. It has agreed to assign one of the 2023 recipients of this grant to the project (if the application is successful), thus adding a significant amount of financial support to the project. Furthermore, the project's co-PIs Kyriaki Kalimeri and Yelena Mejova, as well as the ISI Foundation in general, have considerable expertise in topics related to Vaccine hesitancy (see for example Crupi et al., 2022 [4]), which will help in the evaluation phase of WiSCoM with articles in this topic category.

Furthermore, WiSCoM has a letter of support from Pablo Aragón, a researcher at the Wikimedia Foundation, who has declared his interest in providing active research and technical guidance for the project.


What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?

The proposal will help to reach the goal of Knowledge Equity, as it will develop methods that can be applied to smaller and less privileged language editions, which often correspond to communities that have been left out by structures of power and privilege in the offline world.


What are the challenges associated with this project and how you will overcome them?

The main challenge of the project will be to process large amounts of Wikipedia articles to extract edit and revert activity around references. While similar work has been done before in the context of the Contropedia project, the size of the edit history of articles is ever increasing, so careful planning and filtering will be needed to achieve the project goals in time. Starting the project on only a subset of articles (about the topic Climate Change) should help to mitigate this risk and provide insight into how to scale the project to the entire set of Wikipedia articles in the future.

However, the focus on a single topic (Climate Change) may also lead to overfitting the metrics to this topic. This will be mitigated in the evaluation phase by testing the approach on a different subset of articles related to a second, different topic (Vaccine hesitancy).
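The cross-topic check can be pictured with a small Python sketch: thresholds that discretize a controversiality score into three categories are calibrated on one topic and then scored on the other. Everything here is a hypothetical illustration, not the project's method: the scores, labels, domain names and candidate thresholds are invented, and the direction of the mapping (higher controversiality meaning less reliable) is an assumption that the project itself would have to test.

```python
def classify(score, low, high):
    """Map a score to one of three categories.
    Assumption: higher controversiality corresponds to lower reliability."""
    if score < low:
        return "generally reliable"
    if score > high:
        return "generally unreliable"
    return "undecided"


def accuracy(scores, labels, low, high):
    """Share of labelled domains whose predicted category matches the label."""
    domains = [d for d in scores if d in labels]
    if not domains:
        return 0.0
    hits = sum(classify(scores[d], low, high) == labels[d] for d in domains)
    return hits / len(domains)


def calibrate(scores, labels, candidates):
    """Pick the (low, high) pair with the best accuracy on the given topic."""
    pairs = [(lo, hi) for lo in candidates for hi in candidates if lo <= hi]
    return max(pairs, key=lambda p: accuracy(scores, labels, *p))


# Hypothetical toy data; labels would come from Reliable/Perennial source lists.
climate_scores = {"a.example": 0.1, "b.example": 0.5, "c.example": 0.9}
climate_labels = {"a.example": "generally reliable",
                  "b.example": "undecided",
                  "c.example": "generally unreliable"}
vaccine_scores = {"a.example": 0.15, "d.example": 0.55}
vaccine_labels = {"a.example": "generally reliable",
                  "d.example": "generally unreliable"}

# Calibrate on the first topic, then evaluate on the held-out topic.
low, high = calibrate(climate_scores, climate_labels, [0.2, 0.4, 0.6, 0.8])
in_topic = accuracy(climate_scores, climate_labels, low, high)
held_out = accuracy(vaccine_scores, vaccine_labels, low, high)
print(f"in-topic accuracy {in_topic:.2f}, held-out accuracy {held_out:.2f}")
```

A large gap between the in-topic and held-out accuracy would be the signal that the thresholds have been fitted too closely to the first topic.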


How will you spend your funds?

We will use 6,000 USD to purchase and maintain a server to support development and host the data and the prototype, and 3,960 USD to cover the time of the PI, Andreas Kaltenbrunner, spent supervising a student (whose cost will be covered entirely by another grant awarded by the host institution itself).


How long will your project take?

1 year


Have you worked on projects for previous grants before?

Yes, but not for WikiCred. Andreas Kaltenbrunner has been PI of the Contropedia project (an EU FP7 EINS grant) and has worked on several EU, Spanish and Italian national projects.

References

  1. Markusson, N., Venturini, T., Laniado, D., & Kaltenbrunner, A. (2016). Contrasting medium and genre on Wikipedia to open up the dominating definition and classification of geoengineering. Big Data & Society, 3(2), 2053951716666102.
  2. Borra, E., Weltevrede, E., Ciuccarelli, P., Kaltenbrunner, A., Laniado, D., Magni, G., Mauri, M., Rogers, R., & Venturini, T. (2015). Societal controversies in Wikipedia articles. In Proceedings of the 33rd annual ACM conference on human factors in computing systems - CHI 2015 (pp. 193-196).
  3. Yang, P., & Colavizza, G. (2022). Polarization and reliability of news sources in Wikipedia. arXiv preprint arXiv:2210.16065.
  4. Crupi, G., Mejova, Y., Tizzani, M., Paolotti, D., & Panisson, A. (2022). Echoes through Time: Evolution of the Italian COVID-19 Vaccination Debate. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 16 (pp. 102-113).