Grants:Programs/Wikimedia Research Fund/Using Computational Linguistics to Generate Systematic Reviews for WikiProjects: A Prototype for Invasion Biology

status: not funded
Using Computational Linguistics to Generate Systematic Reviews for WikiProjects: A Prototype for Invasion Biology
start and end dates: July 2023 - July 2024
budget (USD): 50,000 USD
fiscal year: 2022-23
applicant(s): Fernando Andutta Pinheiro, Daniel Mietchen and Lane Rasberry



Affiliation or grant type

University of São Paulo; Leibniz Institute of Freshwater Ecology and Inland Fisheries / FIZ Karlsruhe / Ronin Institute / IGDORE; University of Virginia


Wikimedia username(s)

Fernando Pinheiro Andutta: User:Fpa1981
Daniel Mietchen: User:Daniel_Mietchen
Lane Rasberry: User:Bluerasberry

Project title

Using Computational Linguistics to Generate Systematic Reviews for WikiProjects: A Prototype for Invasion Biology

Research proposal


Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.


In this project, we aim to develop Natural Language Processing (NLP) tools that support the automated review of scientific publications and related data in a given knowledge domain. The goal is to create and update materials that help contributors maintain and improve content on Wikimedia projects. Our current method uses n-grams to identify relevant articles within a large cluster of publications. We plan to improve it by incorporating additional quantitative measures, which we will test on several use cases drawn from WikiProject Invasion biology. This will allow us to identify the relevant literature on a chosen research topic more efficiently.


With the growing amount of scientific literature being published and made available online, it can be difficult to identify the research relevant to a specific topic. NLP techniques can help contributors create and maintain science-related content on Wikimedia projects like Wikipedia. Computational methods can make it easier for these projects to keep their science content accurate and up to date.

Creating, maintaining, or updating Wikimedia content on a scientific topic often involves conducting a literature review, similar to writing a Systematic Review (SR) for a journal. An SR follows a set of techniques that allow researchers to gather and summarize scientific papers in a consistent and reproducible way. When two researchers independently conduct an SR on the same topic, they should arrive at a similar selection of articles for their final review. This agreement is what makes the review process reliable and reproducible.
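The reproducibility criterion described above can be made concrete by measuring how much two independently produced article selections overlap. The following is a minimal sketch, assuming the Jaccard index as the agreement measure; the proposal does not name a specific measure, and the function name and publication identifiers are hypothetical.

```python
def jaccard(selection_a, selection_b):
    """Jaccard index of two selections: |A & B| / |A | B|.

    Returns 1.0 when both selections are empty (assumed convention).
    """
    a, b = set(selection_a), set(selection_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical article selections from two independent reviewers.
reviewer_1 = {"pub1", "pub2", "pub3", "pub4"}
reviewer_2 = {"pub2", "pub3", "pub4", "pub5"}

agreement = jaccard(reviewer_1, reviewer_2)
print(f"{agreement:.2f}")  # 3 shared articles out of 5 distinct ones: 0.60
```

A value close to 1 would indicate that the two reviews selected nearly the same articles, while a value close to 0 would indicate little overlap.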


The curation step of an SR needs improvement because the articles are currently selected mostly by hand. To address this, we will optimize our current methods and explore a scoring system based on n-grams that can reduce a large cluster of publications (LCP), such as that of WikiProject Invasion biology (currently ca. 40,000 publications), to a smaller cluster of publications (SCP). By comparing publications within and outside of the SCP, we can estimate parameters like semantic similarity and relatedness across hundreds or thousands of documents. The primary test cases will be an invasive species, an invaded locality and a specific invasion type, complemented with additional examples as needed.
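The reduction from an LCP to an SCP can be illustrated with a small sketch. It is not the applicants' implementation: the scoring rule (the share of a document's bigrams that also occur in a topic phrase), the threshold value, the function names and the toy abstracts are all assumptions made for illustration.

```python
import re


def ngrams(text, n=2):
    """Return the word n-grams of a lowercased text as strings."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_score(text, topic_ngrams, n=2):
    """Fraction of the document's n-grams that also occur in the topic."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in topic_ngrams)
    return hits / len(grams)


def reduce_cluster(lcp, topic, threshold=0.1, n=2):
    """Keep publications whose score reaches the threshold (the SCP)."""
    topic_ngrams = set(ngrams(topic, n))
    return {
        pub_id: text
        for pub_id, text in lcp.items()
        if ngram_score(text, topic_ngrams, n) >= threshold
    }


# Toy LCP: identifiers and abstracts are invented for illustration only.
lcp = {
    "pub1": "Zebra mussel invasion alters freshwater lake food webs",
    "pub2": "Spread of the zebra mussel in European rivers",
    "pub3": "Soil nitrogen dynamics in temperate grasslands",
}
scp = reduce_cluster(lcp, topic="zebra mussel invasion", threshold=0.1)
print(sorted(scp))  # ['pub1', 'pub2']
```

The sketch covers only the reduction step. In practice the threshold would need tuning, and a comparison of publications inside and outside the SCP would require an additional similarity measure.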


  • Prof. Joseph Harari, adviser and High-Performance Computing (HPC) support,
  • Department of Computational Linguistics at the University of São Paulo, Brazil
  • Google Scholar Profile


Approximate amount requested in USD.

  • 50,000 USD

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

This budget supports the software and data aspects of integrating NLP workflows for Systematic Reviews with Wikidata. It also covers preparing the results for public dissemination and reporting to the Wikimedia community and the Wikimedia Foundation.

Besides the applicants, it supports a PhD candidate at the University of São Paulo who will perform some of the NLP-related tasks.

  • Andutta’s stipend: 18,000 USD
  • Mietchen’s stipend: 9,000 USD
  • Rasberry’s stipend: 9,000 USD
  • PhD student: 6,000 USD
  • Hardware, software and others: 8,000 USD


Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

This project aligns with three points of the 2030 Wikimedia Strategic Direction:

(1) Improve User Experience

The methods and code generated in this project can improve the user experience of wiki editors who work on science-based Wikimedia content and of researchers who aim to produce Systematic Reviews.

(2) Manage Internal Knowledge

This project aims to provide highly reproducible methods that can be used to manage and cluster existing, available information across a wide range of topics.

(3) Innovate in Free Knowledge

This project is an innovation that aims to improve accuracy and reduce human bias relative to the results of traditional Systematic Reviews.


Plans for dissemination.

• Continuing our presentations through Wikimedia Research-related venues.

• Resulting scholarly publications in open-access journals.

• A PhD candidate will prepare and give short presentations about this work at different institutes of the University of São Paulo.

• Bringing together researchers who have already produced Systematic Reviews and wish to re-assess their work.

• Bringing together MSc and PhD candidates willing to produce a brand new Systematic Review using the proposed methods and code.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

Two applicants (Rasberry and Mietchen) are active long-term Wikimedia contributors, while Andutta (User:Fpa1981) started contributing more recently. We started this project about a year ago and have since given a few presentations at the following Wiki conferences:

I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, and the application itself along with all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of my research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.


