Research:Wikidata for the People of Africa

Created: 19:18, 5 June 2024 (UTC)

Contact: Laurette Marais, Council for Scientific and Industrial Research (South Africa)

Collaborators:

Laurette Pretorius, Stellenbosch University

Aarne Ranta, University of Gothenburg

Krasimir Angelov, University of Gothenburg

Roné Wierenga, Council for Scientific and Industrial Research (South Africa)

Duration: 2024-July – 2025-June

Grammatical Framework, Wikidata

Supported by Wikimedia Research Fund

Grant application


This page documents a planned research project.
Information may be incomplete and change before the project starts.

Wikidata is an international, multilingual project aimed at creating “a free and open knowledge base that can be read and edited by both humans and machines”. Reading and editing by humans is facilitated through natural language labels and descriptions of items. However, Wikidata's support for Bantu languages lags far behind its support for languages like English.

The Bantu language family comprises between 440 and 680 languages in Sub-Saharan Africa, and more than 350 million people (30% of the population of Africa) speak one or more Bantu languages as a mother tongue. Most of these languages are severely under-resourced and only marginally represented in Wikipedia and Wikidata. The purpose of this project is to show how the presence of one such language, isiZulu, can be extended in the Wikidata knowledge graph that supports other Wikimedia projects, including Wikipedia.

We aim to employ natural language generation to expand Wikidata entries with high-quality labels and descriptions in the Bantu languages using Grammatical Framework (GF), with an initial focus on isiZulu and the geopolitical domain. GF separates abstract syntax (a language-independent representation of what is to be said) from concrete syntax (the rules for expressing it in a particular language). Relying on this distinction should significantly reduce the effort of extending the solution to other Bantu languages, since the abstract syntax can be shared and only a new concrete syntax has to be written.
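To make the abstract/concrete distinction more tangible, the following is a minimal Python analogy, not GF code and not the project's grammar. The class `CountryIn`, the functions `linearize_eng` and `linearize_ger`, and the choice of English and German as the two output languages are all assumptions made purely for illustration; isiZulu is deliberately left out, because its morphology needs a real GF concrete syntax rather than string templates.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CountryIn:
    """Abstract syntax (hypothetical): 'a country in some region', with no words attached."""
    region: str  # language-neutral key, e.g. "Africa"


def linearize_eng(tree: CountryIn) -> str:
    """Concrete syntax for English: maps the abstract tree to an English string."""
    regions = {"Africa": "Africa", "Europe": "Europe"}
    return f"country in {regions[tree.region]}"


def linearize_ger(tree: CountryIn) -> str:
    """Concrete syntax for German: same abstract tree, different words."""
    regions = {"Africa": "Afrika", "Europe": "Europa"}
    return f"Land in {regions[tree.region]}"


# Adding a language means adding one entry here; the abstract tree is unchanged.
CONCRETE: Dict[str, Callable[[CountryIn], str]] = {
    "en": linearize_eng,
    "de": linearize_ger,
}


def describe(tree: CountryIn, lang: str) -> str:
    """Linearize one abstract tree with the concrete syntax selected by `lang`."""
    return CONCRETE[lang](tree)


if __name__ == "__main__":
    tree = CountryIn("Africa")
    for lang in CONCRETE:
        print(lang, describe(tree, lang))
```

In this sketch, supporting a further language only requires a new linearization function, which mirrors the reason a shared abstract syntax is expected to lower the cost of adding more Bantu languages.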

Methods

The terminology phase of the project will involve collecting and employing existing terminology resources and developing new terminology in consultation with expert isiZulu linguists and lexicographers. This will be informed by an analysis of the geopolitical domain as represented in Wikipedia, to determine which concepts should be modelled and in what combinations.

The analysis and terminology will form the basis of a Grammatical Framework (GF) application grammar, which will be the core of a natural language generation (NLG) system for generating descriptions of countries in isiZulu. Facts about countries will be drawn from Wikidata itself, so that the GF grammar generates isiZulu descriptions that are both true and grammatically correct. Once verified, the generated descriptions will be contributed to Wikidata, thereby expanding its support for isiZulu.
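As a rough illustration of the fact-gathering step, the sketch below asks the public Wikidata Query Service for countries that have no isiZulu (`zu`) description yet, together with their continent. It is not the project's pipeline. The function names, the `User-Agent` string, and the decision to identify countries as instances of `Q6256` (country) with continent `P30` are assumptions; a real system might select countries differently, for example via the class for sovereign states.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(lang: str = "zu", limit: int = 50) -> str:
    """Build a SPARQL query for countries lacking a description in `lang`."""
    return f"""
SELECT ?country ?continent WHERE {{
  ?country wdt:P31 wd:Q6256 ;
           wdt:P30 ?continent .
  FILTER NOT EXISTS {{
    ?country schema:description ?d .
    FILTER(LANG(?d) = "{lang}")
  }}
}}
LIMIT {limit}
""".strip()


def parse_bindings(payload: dict) -> List[Dict[str, str]]:
    """Reduce SPARQL JSON results to rows of bare entity IDs such as 'Q258'."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append(
            {name: cell["value"].rsplit("/", 1)[-1] for name, cell in binding.items()}
        )
    return rows


def fetch_countries(lang: str = "zu", limit: int = 50) -> List[Dict[str, str]]:
    """Run the query against the live endpoint (requires network access)."""
    url = ENDPOINT + "?" + urlencode({"query": build_query(lang, limit), "format": "json"})
    request = Request(
        url,
        headers={
            # Hypothetical identifier; replace with real contact details.
            "User-Agent": "isizulu-descriptions-sketch/0.1 (example only)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))


if __name__ == "__main__":
    for row in fetch_countries(limit=5):
        print(row)
```

Each returned row could then be mapped to an abstract syntax tree and linearized by the grammar; that mapping is specific to the grammar and is not shown here.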

Timeline

July 2024 - December 2024: isiZulu labels and descriptions for the countries in Wikidata as set out in Table 1. The audience is isiZulu contributors to and users of Wikidata, and we hope this project will encourage that audience to grow. These will be released under a CC0 license.

October 2024 - March 2025: Open-source baseline GF-based NLG system for isiZulu descriptions. The audience is researchers interested in extending or adapting the system. This will be released under the LGPL.

January 2025 - June 2025: Scientific publication detailing the process and findings. The audience is the scientific community interested in Wikidata and Bantu languages.

June 2025: Project report discussing the above outputs, research findings and potential future work. The intended audience is the chairs of the Wikimedia Foundation Research Fund 2024.

Policy, Ethics and Human Subjects Research

We do not foresee that this project will disrupt any ongoing work. Automatically generated descriptions will adhere to the Wikidata guidelines for descriptions.
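Some of the formal requirements in the description guidelines lend themselves to an automated pre-check before human verification. The sketch below is one possible form of such a check and is not part of the project as described. The function name `check_description` is hypothetical, and the 250-character default for `max_len` is an assumption about the applicable length limit. Capitalisation is intentionally not checked, since isiZulu words such as "isiZulu" carry word-internal capitals.

```python
from typing import List


def check_description(text: str, max_len: int = 250) -> List[str]:
    """Return a list of formal problems found in a candidate description.

    An empty list means no problem was detected by these checks; it does not
    mean the description is true or grammatical.
    """
    problems: List[str] = []
    if not text.strip():
        problems.append("empty")
    if text != text.strip():
        problems.append("leading or trailing whitespace")
    if text.rstrip().endswith("."):
        problems.append("ends with a full stop")
    if len(text) > max_len:  # max_len is an assumed limit, see lead-in
        problems.append("longer than max_len")
    return problems


if __name__ == "__main__":
    for candidate in ["country in Africa", "country in Africa. "]:
        print(repr(candidate), check_description(candidate))
```

A check like this only filters out formally invalid candidates; judging truth and grammaticality still requires comparison with the source facts and review by isiZulu speakers.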

Results

To be completed.

Resources

To be completed.

References