Research:Wikidata for the People of Africa
Introduction
Wikidata is an international, multilingual project aimed at creating “a free and open knowledge base that can be read and edited by both humans and machines”. Reading and editing by humans is facilitated through the use of natural language labels and descriptions of items. However, its support for Bantu languages lags far behind its support for languages like English.
The Bantu language family comprises between 440 and 680 languages in Sub-Saharan Africa and more than 350 million people speak one or more of the Bantu languages as mother tongues (30% of the population of Africa). These languages are mostly severely under-resourced and marginally represented in Wikipedia and Wikidata. The purpose of this project is to show how the presence of such a language, viz. isiZulu, can be extended in the Wikidata knowledge graph that supports other Wikimedia projects, including Wikipedia.
We employ natural language generation for expanding Wikidata entries with high quality labels and descriptions in the Bantu languages using Grammatical Framework (GF), with an initial focus on isiZulu, and the geopolitical domain. Relying on the distinction GF makes between abstract and concrete syntax will ensure that the effort of expanding the solution to other Bantu languages is significantly reduced.
The Project
Labels and descriptions, i.e. the aspect of Wikidata that makes the data accessible to humans, are lacking for the Bantu languages in many items. As with all Wikimedia projects, Wikidata aims to involve both contributors and users in its goal, which is “to provide support for Wikipedia, Wikimedia Commons, the other wikis of the Wikimedia movement, and to anyone in the world” by collecting structured data in “a free, collaborative, multilingual, secondary knowledge base”.
The contribution of labels and descriptions in a new language has the effect of empowering both those who wish to contribute to Wikidata as well as those who use Wikidata, whether directly or via other Wikimedia projects. This contributes to the dissemination of information to under-resourced communities. However, the task naturally extends to communication in the other direction: the people of Africa should be enabled to add new data items and relations between such items, such as geographical places in their regions, which leads to the dissemination of African knowledge to the rest of the world. Additions to the Wikidata knowledge graph can take the form of contributing new items and properties, but also of contributing triples that relate items to each other via properties. In other words, the availability of labels and descriptions for existing items and relations directly enables the second kind of contribution. Contribution of new items and properties is a natural next step, and it should be possible to contribute new knowledge to Wikidata in one’s own language.
Therefore, the aim of this project is to address the general lack of Bantu language labels and descriptions in Wikidata. We use natural language generation (NLG) to address the problem in a scalable way.
Not only is the severely resource-scarce status of the Bantu languages a key reason for the lack of such labels and descriptions, but it also constrains how the problem should be approached. In resource-scarce contexts, efficient use of resources is critical, whether it be digital language resources, human resources or financial resources:
- Any linguistic data created in such a project must be of a high quality, since this data may form a significant part of the natural language content that is available for these languages. This is especially true in the case of Wikidata, where the formal nature of the data requires that it be accurately and precisely represented in natural language.
- By initially focusing on a single language and a specific domain, applying insights gained during the project to other Bantu languages is made more efficient.
- The Bantu languages exhibit substantial linguistic similarities. To exploit this effectively, any solution for a single language must be readily extensible to other Bantu languages.
In this project it is shown how Grammatical Framework (GF) can be used to address the lack of Bantu language labels and descriptions in Wikidata. The present focus is on isiZulu labels and descriptions within the geopolitical domain.
The following questions arise:
- What terminology must be collected/developed to describe items in the geopolitical domain in Wikidata?
- How can a GF grammar be used to model descriptions of countries in isiZulu so that it is readily extensible to other Bantu languages? A related question is to what extent the cross-lingual API of the GF common abstract syntax is useful for the Bantu languages.
- How can data extracted from Wikidata be used to generate GF abstract syntax trees that correspond to correct isiZulu descriptions of countries?
- How can a workflow be designed that minimizes the effort that would be required to expand the solution to new languages and domains? A corollary is: What are the concrete steps required to expand the solution to new languages and domains?
Using Grammatical Framework (GF)
Grammatical Framework (GF) is a computational grammar framework for the development of multilingual grammars and may be considered “the de-facto open source general framework for developing resources for engineering multilingual CNLs”[1].
GF grammars are characterized by an interlingua architecture, where an abstract syntax models a domain of utterances in a language independent way, and a set of concrete syntaxes defines how the utterances are expressed in different natural languages.
Besides distinguishing between abstract and concrete syntaxes, another important distinction is between GF resource grammars and GF application grammars.
GF resource grammars define syntactic categories and functions and serve as a linguistic software library for application grammar development. Application grammars define semantic categories and functions that model domains of application, such as utterances relating to geopolitical concepts.
The descriptions of items generated in this project are a form of ontology verbalisation, since the descriptions contributed to Wikidata are, in effect, multilingual verbalisations of knowledge already present in the Wikidata knowledge graph. The example in Figures 1 to 4 shows how two triples from the Wikidata knowledge graph are combined via GF to produce an isiZulu utterance that serves as an apt description of the item. The distinction in GF between abstract and concrete syntax is the hinge for this kind of multilingual natural language generation, making GF ideally suited for this project.
The Approach
We sketch the solution by means of an example. Figures 1 and 2 show the language independent information available about Botswana in Wikidata. Figure 1 shows what is seen on the Wikidata webpage for Botswana, while Figure 2 represents pertinent structured data about Botswana as RDF triples. This knowledge is accessible to speakers of English, for example, via the English labels and descriptions indicating what is represented by items such as Q963 and Q123480 and properties such as P31 and P361.
A description of an item is “a short phrase designed to disambiguate items with the same or similar labels”, which can be done effectively by stating key facts about the item. Indeed, such facts are exactly what is already encoded in the Wikidata knowledge graph. The fact that Botswana is an “instance of” a landlocked country and is “part of” Southern Africa is exactly the kind of information that can be used to generate a good, unambiguous description of the item. If a mechanism can be developed for combining such structured data into natural language strings, such descriptions could be generated automatically.
GF is ideally suited to this task. Figure 3 depicts a GF abstract syntax tree expressing a useful description based on this information using typical Bantu language constructions. The construction of this abstract syntax tree from structured data is done, either directly or via a GF application grammar, within a program that interacts with the GF C runtime. Figure 4 shows the tree linearised as an accurate isiZulu description of Botswana using the isiZulu resource grammar.
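As a minimal sketch of the starting point for this process, the two triples from the Botswana example can be held as plain tuples and grouped by property. The function name `facts_for` is hypothetical, and because the text does not give the item identifier for Southern Africa, a placeholder string stands in for it.

```python
from collections import defaultdict

# Triples from the Botswana example: (subject, property, object).
# Q963, Q123480, P31 and P361 are named in the text. The region's item ID is
# not given there, so a placeholder string is used instead of a real QID.
TRIPLES = [
    ("Q963", "P31", "Q123480"),             # Botswana, instance of, landlocked country
    ("Q963", "P361", "<Southern Africa>"),  # Botswana, part of, Southern Africa
]


def facts_for(item, triples):
    """Group the objects of an item's triples by property."""
    facts = defaultdict(list)
    for subject, prop, obj in triples:
        if subject == item:
            facts[prop].append(obj)
    return dict(facts)


botswana = facts_for("Q963", TRIPLES)
print(botswana)
```

The grouped facts are what a later step would map onto the arguments of an abstract syntax tree.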
By answering the questions posed above, we expand the ability of Wikidata to be read and edited by humans, especially those in Sub-Saharan Africa, with its many Bantu languages. This will also benefit the multilingual support that Wikidata provides to other Wikimedia projects. Moreover, by identifying appropriate data which is not yet in the Wikidata knowledge graph and for which isiZulu labels and descriptions can be generated via a GF grammar, the project enables the contribution of new knowledge to the Wikidata knowledge graph. In other words, new items or new properties of existing items will then be contributed to Wikidata with isiZulu labels and descriptions.
The procedure was as follows:
- Terminology: The terminology phase of the project involved an attempt to collect and employ existing terminology resources. Several avenues for collection were pursued, culminating in confirmation from the South African National Language Service that no such terminology for isiZulu, or any other language, has yet been developed. Consequently, new terminology was developed in consultation with expert isiZulu linguists/lexicographers. The terminology required to provide labels and descriptions for the chosen entities within the geopolitical domain was informed by an analysis of the geopolitical domain as represented in Wikipedia, to determine what concepts should be modelled and in what combinations. From this, a list of English terms and phrases was compiled and translated by professional, mother-tongue isiZulu language practitioners. We used a data language such as YAML to represent and maintain a multilingual lexicon of the relevant terminology, automatically convertible to GF lexicon modules. This analysis and the procured terminology formed the basis of a Grammatical Framework (GF) application grammar.
- GF application grammar: We followed a standard methodology for domain-specific GF grammar development. This GF grammar formed the core of a Natural Language Generation system for generating descriptions of countries in isiZulu. A full description of the process is outside the scope of this proposal. It has been described by Ranta[2] as well as in various other scientific publications, but typically follows these steps:
  - Elicit a representative sample of text from the domain
  - Analyze the sample to design a model of the domain in the form of a GF application grammar
  - Create a GF lexicon as required by the domain
- NLG component: develop a system that queries Wikidata and constructs suitable GF trees. Wikidata itself was used to determine facts about countries in order to generate true and grammatically correct isiZulu descriptions via the GF grammar. This system was written in Python, and interacts with the compiled GF grammar via the pgf package, which provides Python bindings to the GF C runtime.
- Verification: expert linguists were consulted to evaluate the accuracy of the generated descriptions.
- Wikidata contribution: we contributed the newly developed labels and descriptions using the Wikidata REST API.
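The conversion from a data-language lexicon to GF lexicon modules, mentioned in the terminology step, might look roughly like the following. This is a sketch under stated assumptions: the module name `GeoLex`, the category `Term`, the entry identifiers and the helper functions are all hypothetical, the YAML loading step is replaced by a literal dictionary, and the concrete module stores bare strings where the project's real lexicon would use resource grammar constructors. The two isiZulu head nouns are taken from examples elsewhere in this document.

```python
# A small lexicon as it might look after loading a YAML file (loading itself
# is omitted so the sketch needs no third-party package). The two isiZulu
# entries come from examples in this document; identifiers are hypothetical.
LEXICON = {
    "kingdom_Term": {"eng": "kingdom", "zul": "umbuso"},
    "capital_city_Term": {"eng": "capital city", "zul": "inhlokodolobha"},
}


def abstract_module(name, lexicon):
    """Render an abstract GF module declaring one function per entry."""
    funs = "\n".join(f"    {ident} : Term ;" for ident in sorted(lexicon))
    return f"abstract {name} = {{\n  cat Term ;\n  fun\n{funs}\n}}\n"


def concrete_module(name, lang, lexicon):
    """Render a concrete GF module with plain-string linearisations.

    Simplification: a real module would call resource grammar constructors
    rather than store bare strings.
    """
    lins = "\n".join(
        f'    {ident} = {{s = "{entry[lang]}"}} ;'
        for ident, entry in sorted(lexicon.items())
    )
    suffix = lang.capitalize()
    return (
        f"concrete {name}{suffix} of {name} = {{\n"
        f"  lincat Term = {{s : Str}} ;\n  lin\n{lins}\n}}\n"
    )


print(abstract_module("GeoLex", LEXICON))
print(concrete_module("GeoLex", "zul", LEXICON))
```

Keeping the lexicon in one data file means that a revised term only has to be changed in one place before the modules are regenerated.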
Deliverables
- New isiZulu labels and descriptions for the countries in Wikidata. The envisaged audience is isiZulu contributors and users of Wikidata. This will be released under a CC0 license.
- Open-source baseline GF-based NLG system for isiZulu descriptions. The envisaged audience is researchers interested in extending/adapting the system. This will be released under LGPL.
- A scientific publication detailing process and findings. The envisaged audience is the scientific community interested in Wikidata and Bantu languages.
- A project report discussing the above outputs, research findings and potential future work. The intended audience is the Research Fund chairs of the Wikimedia Foundation Research Fund 2024.
Seen together, these outputs represent a blueprint for how this kind of work could continue and be expanded in future. We regard this blueprint in itself as an essential and significant contribution to the expansion of the digital presence of the Bantu languages within the Wikimedia projects and beyond.
Insights
Procurement of terms and phrases
The terminology utilised in the application grammar is the result of having the relevant English terminology professionally translated. This course of action was followed after an extensive search for existing relevant terminology, including contacting several linguistics and language teaching departments at South African universities, meeting with lexicographers and language bodies actively developing isiZulu terminology and engaging with the National Language Service of South Africa. A formal request for the development of official terminology has been submitted to the director of the National Language Service, but in the spirit of mother-tongue driven content development, professional translations procured from mother-tongue speakers have been utilised for the time being. The translation process entailed two mother-tongue speakers working as translators independently translating labels and a third mother-tongue speaker reviewing these translations. This process addressed some inconsistencies in the translations due to dialectal variance and the unavailability of standardized vocabulary. If new developments occur with regard to terminology development, the lexicon resource module of the GF grammar could be updated and all affected labels and descriptions re-generated. It should be noted that the lack of terminology extends to all South African languages, besides English and to a lesser degree Afrikaans.
Incorporation of terminology in GF lexicon
The terms procured from mother-tongue language practitioners can be divided into terms that could directly serve as labels for the targeted entities, and terms and phrases that could be used to generate descriptions for the targeted entities. The former terms were, to a large extent, simply loaned as is. In isiZulu, this entails placing the head nouns of the terms in noun class 5 by adding the appropriate prefix, namely "i". This is not always done consistently with regard to hyphenation: the existing isiZulu labels often employ only the "i", whereas our procured terms tended to use "i-" (for nouns referring to languages, the appropriate prefix is "isi" or "isi-"). The developers of our procured terms opted for preserving the original orthography of the loan words, along with hyphenation.
Many terms require possessive constructions, such as "Kingdom of Egypt" rendered as "umbuso waseGibhithe" (no loan word) or "capital city of Japan" rendered as "inhlokodolobha yase-Japan" (loan word). Note the morphological variation in the "wa" and "ya" possessive morphemes due to agreement with the head noun. The lexicon module was developed to utilise the built-in possessive constructor of the isiZulu GF Resource Grammar to correctly render such constructions, as well as other relevant linguistic constructors.
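To make the agreement pattern concrete, the two examples above can be reproduced with a small lookup of possessive morphemes per head noun. This is only an illustrative sketch: the function and table names are hypothetical, the split of "waseGibhithe" and "yase-Japan" into morpheme, linking "s" and a place form beginning with "e" is an assumption read off the two examples, and the project itself relies on the possessive constructor of the isiZulu GF Resource Grammar rather than on string concatenation.

```python
# Possessive morphemes for the two head nouns used in the examples above.
# A real implementation would derive the morpheme from the noun class via the
# isiZulu resource grammar; this lookup table is a stand-in.
POSSESSIVE_MORPHEME = {
    "umbuso": "wa",          # head noun of "Kingdom of Egypt"
    "inhlokodolobha": "ya",  # head noun of "capital city of Japan"
}


def possessive_phrase(head, place_form):
    """Join a head noun and a place form that starts with "e"."""
    morpheme = POSSESSIVE_MORPHEME[head]
    # The "s" linking the morpheme to the place form follows the pattern seen
    # in both examples, e.g. "wa" + "s" + "eGibhithe".
    return f"{head} {morpheme}s{place_form}"


print(possessive_phrase("umbuso", "eGibhithe"))
print(possessive_phrase("inhlokodolobha", "e-Japan"))
```

The hyphen in "e-Japan" is carried through unchanged, which mirrors the decision to preserve the original orthography of loan words.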
GF application grammar
During the project, a GF application grammar was developed for expressing geopolitical concepts in idiomatic isiZulu. This grammar makes use of the isiZulu GF Resource Grammar as a linguistic software library. The entities covered by the grammar are countries, capital cities, currencies and languages.
The abstract module defines the semantics of descriptions for the targeted entities. For example, descriptions for a language might make reference to the language's status as an official language of a certain country, the number of countries it is spoken in or the number of speakers of the language. Functions for combining these concepts into useful descriptions were developed. For example, the function SpokenCountriesNumberOfSpeakers accepts two arguments of type NumberMod to express that a language is spoken in a specific number of countries and by a specific number of speakers.
SpokenCountriesNumberOfSpeakers : NumberMod -> NumberMod -> LanguageFeature ;
LanguageFeatureDescription : LanguageFeature -> Description ;
Or, similarly, a country may be described in terms of the kind of country it is (an island country, landlocked country, sovereign state etc.), as well as its region. Functions for dealing with the possibility of one or both were developed.
CountryKindDescription : CountryKind -> Description ;
CountryRegionDescription : CountryRegion -> Description ;
FullCountryDescription : CountryKind -> CountryRegion -> Description ;
The grammar itself does not model knowledge of real countries, languages, cities or currencies, but simply enables statements about them that are semantically coherent.
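Because the grammar holds no facts itself, the choice between the three country description functions falls to the calling program. A minimal sketch of that choice, producing the tree in GF's textual notation, could look as follows. The Python function name and the constant names `LandlockedCountry` and `SouthernAfrica` are hypothetical. Only the three GF function names come from the signatures above.

```python
def country_description_tree(kind=None, region=None):
    """Return a GF abstract syntax tree (as text) for a country description.

    `kind` and `region` are names of abstract syntax constants of type
    CountryKind and CountryRegion, or None when no such fact is available.
    """
    if kind and region:
        return f"FullCountryDescription {kind} {region}"
    if kind:
        return f"CountryKindDescription {kind}"
    if region:
        return f"CountryRegionDescription {region}"
    raise ValueError("at least one of kind or region is required")


# LandlockedCountry and SouthernAfrica are hypothetical constant names.
print(country_description_tree("LandlockedCountry", "SouthernAfrica"))
print(country_description_tree(kind="LandlockedCountry"))
```

A tree string of this form would then be handed to the GF runtime for linearisation by the isiZulu concrete syntax, a step not shown here.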
The isiZulu concrete module that implements this abstract module is responsible for rendering the semantics in idiomatic isiZulu. In theory, an abstract GF module is often considered to be language independent, but in practice, it is often the case that the abstract syntax reflects aspects of the languages intended to be implemented. This came strongly to the fore in dealing with numbers on the fly.
In isiZulu, as in other Bantu languages, numbers are linguistically heterogeneous. The lexical items for the numbers two to five are adjectives that utilise a specific set of morphemes when used to modify nouns, while all numbers above five are nouns or noun phrases. A full discussion of this aspect of isiZulu and its related languages, and of their complex agglutinating morphology, cannot be given here.
However, a sense of the complexity that results is sketched briefly. For example, expressing "three countries" is done by employing an adjectival construction, "emazweni ama-3", while expressing "seven countries" employs a copulative construction as "emazweni ayisi-7". Furthermore, the noun classes of the noun-based numbers are different, and hence "ten countries" is expressed using the same construction but a different noun prefix, resulting in "emazweni ayi-10".
The simplest approach to dealing with such linguistic matters was to directly enable this in the abstract module. This course of action was chosen given the intent of this project to similarly serve other Bantu languages, such as Siswati (a closely related language), as well as the other languages in Southern Africa and beyond. This approach therefore enabled expressing digit-based numbers using a wide variety of morphological constructions, leaving it to the NLG component to determine the appropriate construction given a specific number.
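The decision left to the NLG component can be illustrated with a small sketch that covers only the constructions attested in the examples above for the noun "emazweni". The names below are hypothetical. Applying the "ama-" form to all of two to five is an assumption based on the statement that these numbers are adjectives, and any number not covered by the examples is deliberately left unimplemented rather than guessed.

```python
# Agreement prefixes attested in the examples above for the noun "emazweni".
# Only these cases are covered; anything else is left unimplemented on purpose.
ADJECTIVAL_PREFIX = "ama-"                     # numbers two to five
COPULATIVE_PREFIX = {7: "ayisi-", 10: "ayi-"}  # noun-based numbers


def number_modifier(n):
    """Return the digit-based modifier to place after "emazweni"."""
    if 2 <= n <= 5:
        return f"{ADJECTIVAL_PREFIX}{n}"
    if n in COPULATIVE_PREFIX:
        return f"{COPULATIVE_PREFIX[n]}{n}"
    raise NotImplementedError(f"no construction recorded for {n}")


for n in (3, 7, 10):
    print("emazweni", number_modifier(n))
```

In the project itself this choice selects a function of the abstract syntax rather than a string prefix, so that the concrete syntax stays responsible for the morphology.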
Wikidata and GF for Natural Language Generation
The descriptions for the targeted entities, in particular, were generated by utilising the Wikidata knowledge graph and the language independent information it contains, along with the GF grammar for verbalising the information in idiomatic isiZulu. The NLG system queries the knowledge graph for relevant claims about entities, which allows for the construction of an abstract syntax tree that encapsulates the semantics of a useful and accurate description. The isiZulu concrete grammar is then utilised to linearise the abstract syntax tree into natural language.
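One possible way to query the knowledge graph for such claims is the public Wikidata Query Service. The text does not say which interface the project used, so the following is an assumption-laden sketch rather than the project's method. The function names are hypothetical, and the code only builds the query and request URL without sending anything over the network.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"


def claims_query(item_id, property_ids):
    """Build a SPARQL query for the direct claims of one item."""
    props = " ".join(f"wdt:{p}" for p in property_ids)
    return (
        "SELECT ?property ?value WHERE {\n"
        f"  VALUES ?property {{ {props} }}\n"
        f"  wd:{item_id} ?property ?value .\n"
        "}"
    )


def request_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{ENDPOINT}?{params}"


query = claims_query("Q963", ["P31", "P361"])
print(query)
print(request_url(query))
```

The values returned for P31 and P361 would then be mapped to constants of the abstract syntax before a tree is built.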
Our expectation is that the process could be repeated for a different Bantu language, especially languages in the same language family as isiZulu, with minimal effort. The professional translators for the isiZulu terminology utilised a well-known technique in which the large number of foreign words referring to countries, currencies, capital cities and languages were simply treated as loan words in a direct way. In almost all cases, this involved prefixing the class 5 noun prefix to the foreign words. This process is therefore highly predictable and repeatable for other South African languages, given the lack of existing terminology. Translation of the rest of the vocabulary required in the descriptions will not incur a high cost, and we expect that the linguistic constructions as modelled in the GF grammar would be almost entirely reusable, given the high structural similarity between isiZulu and the other official Bantu languages of South Africa.
The Wikidata knowledge graph contained a number of labels and descriptions at the beginning of the project, although it is challenging to monitor exactly when such contributions were made. Our approach in this project was to contribute new labels and descriptions in cases where isiZulu was lacking, or in cases where the isiZulu label was simply a copy of the English label. In total, a batch script with edits to 745 labels and 1108 descriptions was generated.
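The selection rule described here can be written down as a short predicate. This is a sketch with hypothetical function names. It assumes labels and descriptions are available as dictionaries keyed by language code ("zu" for isiZulu, "en" for English), and it assumes that descriptions were only added where none existed, which the text implies but does not spell out.

```python
def needs_isizulu_label(labels):
    """Decide whether an isiZulu label should be contributed.

    `labels` maps language codes to label strings, e.g. {"en": "Botswana"}.
    """
    zu = labels.get("zu")
    if not zu:
        return True                # no isiZulu label yet
    return zu == labels.get("en")  # isiZulu label merely copies the English one


def needs_isizulu_description(descriptions):
    """Descriptions are only added where none exists (an assumption)."""
    return not descriptions.get("zu")


# "iBotswana" is an illustrative label, not taken from the project's data.
print(needs_isizulu_label({"en": "Botswana"}))
print(needs_isizulu_label({"en": "Botswana", "zu": "Botswana"}))
print(needs_isizulu_label({"en": "Botswana", "zu": "iBotswana"}))
```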
The code for the project is available on GitHub.
Related Work
The gf-wikidata project aims to generate entire Wikipedia articles from Wikidata. However, the current prototype has been tested for European languages that are significantly better resourced.
GF resource grammars (RGs) are key to linearising abstract syntax trees into natural language. An RG exists for isiZulu, and has been used to develop isiZulu language resources[3][4]. In a current project, an RG for Siswati has been completed recently via bootstrapping, with one for isiXhosa planned to be completed within 18 months. RGs for other Bantu languages are also in development[5][6].
GF application grammars have been used to model utterances in a large number of diverse multilingual domains, including mathematics[7], transport[8], weather reports[9], healthcare[10][11], literacy development and language learning[12], technical texts describing properties of places and objects related to accessibility by disabled people[13], etc. Verbalisation of structured data, in particular, has been done for biomedical linked data[14][15], descriptions of museum objects[16] and modular ontologies[17].
Resources
You may be interested in the following:
- The SWiP Phase 2 dashboard.
- The WikiProject Eswatini Wikidata Page.
- The WikiProject Wikipedia Page
- The associated Eswatini Barnstar of Merit
Added by Derek J Moore (talk) 07:17, 26 May 2025 (UTC)
References
- Safwat, H., & Davis, B. (2017). CNLs for the semantic web: a state of the art. Language Resources and Evaluation, 51(1), 191-220.
- Ranta, A. (2011). Grammatical framework: Programming with multilingual grammars (Vol. 173). Stanford: CSLI Publications, Center for the Study of Language and Information.
- Marais, L., & Pretorius, L. (2023a). Extending the usage of adjectives in the Zulu AfWN. In Proceedings of the 12th Global Wordnet Conference, pages 303–314, University of the Basque Country, Donostia - San Sebastian, Basque Country. Global Wordnet Association.
- Marais, L., & Pretorius, L. (2023b). Parsing IsiZulu text using Grammatical Framework. In International Symposium on Distributed Computing and Artificial Intelligence (pp. 167-177). Cham: Springer Nature Switzerland.
- Kituku, B., Nganga, W., & Muchemi, L. (2021, November). Leveraging on cross linguistic similarities to reduce grammar development effort for the under-resourced languages: a case of Kenyan Bantu languages. In 2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA) (pp. 83-88). IEEE.
- Bamutura, D., Ljunglöf, P., & Nebende, P. (2020). Towards Computational Resource Grammars for Runyankore and Rukiga. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 2846-2854).
- Saludes, J., & Xambó, S. (2011). The GF Mathematics Library. CTP Components for Educational Software, 46.
- Bringert, B., Cooper, R., Ljunglöf, P., & Ranta, A. (2005). Multimodal dialogue system grammars. In Proceedings of DIALOR’05, ninth workshop on the semantics and pragmatics of dialogue (pp. 53-60).
- Lobanov, G. (2017). Grammatical Framework For Multilingual Natural Language Generation: The Weather Report Case. [Unpublished master’s thesis], Chalmers University of Technology and the University of Gothenburg.
- Marais, L., Louw, J. A., Badenhorst, J., Calteaux, K., Wilken, I., Van Niekerk, N., & Stein, G. (2020). AwezaMed: A multilingual, multimodal speech-to-speech translation application for maternal health care. In 2020 IEEE 23rd International Conference on Information Fusion (FUSION) (pp. 1-8). IEEE.
- Ranta, A., Angelov, K., Höglind, R., Axelsson, C., & Sandsjö, L. (2017). A mobile language interpreter app for prehospital/emergency care. In Medicinteknikdagarna, Västerås Sweden, October 10-11, 2017.
- Marais, L., Wilken, I., Pretorius, L., & Posthumus, L. C. (2023c). Multimodal, multilingual dynamic stories for literacy development and language learning. In Proceedings of the 5th International Conference on Conversational User Interfaces (pp. 1-5).
- Ranta, A., Unger, C., & Hussey, D. V. (2015). Grammar engineering for a customer: A case study with five languages. In Proceedings of the Grammar Engineering Across Frameworks (GEAF) 2015 workshop (pp. 1-8).
- Marginean, A. (2017). Question answering over biomedical linked data with Grammatical Framework. Semantic Web, 8(4), 565-580.
- Marginean, A. (2017). Question answering over biomedical linked data with Grammatical Framework. Semantic Web, 8(4), 565-580.
- Dannélls, D., Damova, M., Enache, R., & Chechev, M. (2011). A framework for improved access to museum databases in the semantic web. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage (pp. 3-10).
- Davis, B., Enache, R., Van Grondelle, J., & Pretorius, L. (2012). Multilingual verbalisation of modular ontologies using GF and lemon. In Controlled Natural Language: Third International Workshop, CNL 2012, Zurich, Switzerland, August 29-31, 2012. Proceedings 3 (pp. 167-184). Springer Berlin Heidelberg.



