Research:Wikimedia versus traditional biographical encyclopedias

From Meta, a Wikimedia project coordination wiki
Created
19:52, 3 June 2024 (UTC)
Collaborators
Duration:  2024-07 – 2025-06
Grant ID: G-RS-2402-15215

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Wikipedia has become the primary source for biographical information due to its extensive coverage and accessibility. However, Wikipedia and Wikidata inherently rely heavily on the output of research institutions, in part because of the No Original Research policy. This project aimed to analyze the production of traditional biographical dictionaries, examine their relationship to Wikipedia and Wikidata, identify current problems, and propose solutions to improve collaboration between Wikimedia and the creators of traditional biographical dictionaries.

Full grant proposal

Methods

The main objective of the project was to identify language, cultural, gender, socio-economic, geographic, and other representation gaps in the content of Czech Wikipedia, Wikidata, and academic biographical dictionaries.[1] For this analysis, the research team focused mainly on the Biographisches Lexikon zur Geschichte der böhmischen Länder (BLGBL), which the Collegium Carolinum in Munich has published in individual issues since 1974. The older volumes (up to 1999) are available online only as scans, not as a database containing full texts or identifiers. Because the BLGBL has not yet been published as an online database, the flow of its data into the open data ecosystem faces a serious obstacle. The BLGBL was thus a suitable object for exploring the gap between academic biographical dictionaries and Wikidata.

In short, the research team performed in-depth digitization of the first volume of the BLGBL (nine issues, 1974–1979, letters A–H) and transformed the extracted full texts into structured data that can be compared with the relevant Wikidata entities.

During the in-depth digitization process, it was necessary for the research team to not only read the scans and create full-texts, but also to recognize the individual dictionary entries and their subparts (the biogram itself, the person’s work, and the sources). Despite the printed nature of the dictionary, OCR methods proved insufficient for such advanced text segmentation. Consequently, our research team developed an HTR model for both layout segmentation[2] and recognition of printed text in Czech and German.

To facilitate the conversion of digitized biograms into structured data, the research team developed an NLP model (biography2wikidata).[3] This model is a text-to-text transformer based on Google’s mT5 small model. The biography2wikidata model annotates the full text of biograms with Wikidata identifiers so that the subsequently extracted data aligns structurally with Wikidata as closely as possible. This method enables a comparison of the content of a dictionary entry with that of the corresponding Wikidata entity. After a training dataset of more than 5,000 dictionary entries (BLGBL I/1–9) had been created, the model correctly retrieves about 60% of the basic statements and 20% of the qualifiers, or about 50% of basic and qualifier statements combined. Despite the need for manual corrections, the model significantly accelerates the processing of records. The model segments the input text very reliably, so manual editing mostly consists of assigning the correct Wikidata identifier.
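The combined figure can be read as a pooled proportion over both statement types. The following is a minimal sketch of that arithmetic, assuming the combined share is simply all correctly retrieved statements divided by all statements. The function name `pooled_share` and the counts are hypothetical and are not the project's evaluation data.

```python
def pooled_share(correct_basic: int, total_basic: int,
                 correct_qualifiers: int, total_qualifiers: int) -> float:
    """Share of correctly retrieved statements over basic statements and qualifiers together."""
    total = total_basic + total_qualifiers
    if total == 0:
        raise ValueError("no statements to evaluate")
    return (correct_basic + correct_qualifiers) / total


# Hypothetical counts: 180 of 300 basic statements (60%) and
# 20 of 100 qualifiers (20%) retrieved correctly.
share = pooled_share(180, 300, 20, 100)
print(f"{share:.0%}")
```

Under this reading, 60% and 20% combine to about 50% only when basic statements outnumber qualifiers by roughly three to one, as in the hypothetical counts.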

The structured data obtained through this method can be further processed. Based on the extracted name and date of birth and death data, an automated SPARQL query attempts to match the dictionary entry with the corresponding Wikidata entity. The research team then manually verifies the automatic assignment or conducts a manual search for the person. In the final stage, the structured data of the dictionary entries and the relevant Wikidata entities are compared and statistically analyzed.
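The automated matching step might look roughly like the sketch below, which only builds the SPARQL query text and does not send it. It assumes exact label matching in German and Czech and matching on birth and death years; the project's actual query is not documented here, and the function name `build_match_query`, its parameters, and the example person are hypothetical. P31 (instance of), Q5 (human), P569 (date of birth), and P570 (date of death) are standard Wikidata identifiers.

```python
from typing import Iterable


def build_match_query(name: str, birth_year: int, death_year: int,
                      languages: Iterable[str] = ("de", "cs"),
                      limit: int = 10) -> str:
    """Build a SPARQL query for humans with a given label and life years."""
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    labels = " ".join(f'"{escaped}"@{lang}' for lang in languages)
    return f"""SELECT DISTINCT ?person WHERE {{
  VALUES ?label {{ {labels} }}
  ?person wdt:P31 wd:Q5 ;
          rdfs:label ?label ;
          wdt:P569 ?birth ;
          wdt:P570 ?death .
  FILTER(YEAR(?birth) = {int(birth_year)} && YEAR(?death) = {int(death_year)})
}}
LIMIT {int(limit)}"""


# Invented example person, used only to show the shape of the query.
query = build_match_query("Jan Novák", 1850, 1920)
print(query)
```

Exact label matching will miss spelling variants and aliases, which is one reason a manual verification step like the one described above remains necessary.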

Results

Gap Analysis

A comparison of the structured data obtained from the BLGBL with that of Wikidata indicates that, although half a century has passed since the publication of the first BLGBL volume and despite the significance of this source, only 73% of BLGBL entries have corresponding entities on Wikidata.

The analysis further reveals details of the 27% of personalities “lost in the gap”. The majority (75%) of these individuals were born during the 19th century, and nearly a quarter (23%) of all of them were employed in education as teachers, university professors or school principals, with most of these individuals being active at the turn of the 19th and 20th centuries. Other significant professions (each 3–5%) included factory owners, actors, doctors, regional writers, and regional researchers.

These data reveal a significant bias in the reflection of historical reality. Over the course of the 20th and 21st centuries, the notability of some of these professions has declined, so that they may not be considered notable enough from today’s perspective. However, during the period of nationalization and escalating Czech-German national tensions at the turn of the 19th and 20th centuries, teachers, school principals and regional cultural figures played a pivotal role in the dissemination of ideas and the formation of socio-cultural identities and boundaries. Moreover, during the Industrial Revolution, the impact of individual industrialists on society was more significant than in contemporary times, when the decisions of large corporations are made not by individuals but rather by boards of directors. The data demonstrate that evaluation of notability on Wikidata is characterized by a more contemporary (ahistorical) perspective, rather than by reflection of the historical reality of who was truly notable during that period.

A comparison of dictionary entries and their corresponding entities on Wikidata also makes it possible to determine the completeness of the latter and, conversely, the remaining potential of academic dictionaries (the BLGBL in this case). A review of the basic information, such as birth/death date/place (P569, P570, P19, P20), reveals that this information is identical in the dictionary and on Wikidata in 71% of cases (see Figure 1). However, 13% of the dictionary entries contain more detailed information or data that is entirely missing from the corresponding Wikidata entities. Notably, information on the place of birth and death is frequently absent. Another 9% of the basic information contradicts the information on Wikidata, and the validity of these entries should be determined by further historical research. Only 3% of the basic information in the dictionary is less detailed (e.g., only the year instead of the full date). The remaining 5% of the basic claims are entirely absent from the dictionary.
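The five categories used in this comparison can be illustrated for dates with a short sketch. It assumes that each date is stored as a `(year, month, day)` tuple in which unknown parts are `None`, and that a missing claim is `None`. The function names and category labels are hypothetical, and the `no_data` case for claims missing on both sides is an added assumption, not one of the categories reported above.

```python
from typing import Optional, Tuple

PartialDate = Tuple[int, Optional[int], Optional[int]]  # (year, month, day)


def precision(date: PartialDate) -> int:
    """Number of known leading parts: 1 = year, 2 = year-month, 3 = full date."""
    known = 0
    for part in date:
        if part is None:
            break
        known += 1
    return known


def compare_dates(dictionary: Optional[PartialDate],
                  wikidata: Optional[PartialDate]) -> str:
    """Classify a dictionary date against the corresponding Wikidata date."""
    if dictionary is None and wikidata is None:
        return "no_data"
    if dictionary is None:
        return "absent_from_dictionary"
    if wikidata is None:
        return "dictionary_richer"  # claim entirely missing on Wikidata
    shared = min(precision(dictionary), precision(wikidata))
    if dictionary[:shared] != wikidata[:shared]:
        return "contradictory"
    if precision(dictionary) > precision(wikidata):
        return "dictionary_richer"  # e.g. full date versus year only
    if precision(dictionary) < precision(wikidata):
        return "dictionary_coarser"
    return "identical"
```

Comparing only the parts known to both sides means that a year-only value and a full date within the same year count as a difference in detail rather than as a contradiction.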

The potential of the BLGBL remains even more largely untapped in the case of other properties. For instance, BLGBL vol. 1 contains 3,774 claims about the school attended (P69) for 1,398 personalities with a relevant Wikidata entity, and 81% of these claims are missing on Wikidata. A similar situation is evident in properties related to academic degrees (P512, 96%), function or position (P39, 62%), membership in religious orders (P611, 51%), awards received (P166, 86%), or membership in an organization (P463, 86%).
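The percentages above can be understood as the share of dictionary claims, per property, that have no matching value on the matched Wikidata entity. The following is a minimal sketch, assuming claims are held as sets of value identifiers per person and property. The data layout, the function name `missing_share`, and the placeholder identifiers are illustrative assumptions, not the project's code or data.

```python
from collections import defaultdict
from typing import Dict, Set

Claims = Dict[str, Dict[str, Set[str]]]  # person -> property -> set of values


def missing_share(dictionary_claims: Claims, wikidata_claims: Claims) -> Dict[str, float]:
    """Per property, the fraction of dictionary claims not found on Wikidata."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for person, properties in dictionary_claims.items():
        for prop, values in properties.items():
            known = wikidata_claims.get(person, {}).get(prop, set())
            total[prop] += len(values)
            missing[prop] += len(values - known)
    return {prop: missing[prop] / total[prop] for prop in total if total[prop]}


# Placeholder identifiers, not real Wikidata items.
dictionary_claims = {
    "person-1": {"P69": {"school-a", "school-b"}},
    "person-2": {"P69": {"school-c", "school-d"}, "P166": {"award-x"}},
}
wikidata_claims = {
    "person-1": {"P69": {"school-a"}},
    "person-2": {"P166": {"award-x"}},
}
shares = missing_share(dictionary_claims, wikidata_claims)
```

In this toy data, three of the four P69 claims and none of the P166 claims are missing on the Wikidata side.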

Workshop

Workshop, 31 March 2025, Prague

As part of the project, the Institute of History of the Czech Academy of Sciences organized a workshop titled Wikimedia versus Traditional Biographical Dictionaries? in cooperation with Wikimedia Czech Republic. It took place on 31 March 2025 in Prague. One of the main successes of the event was bringing representatives of both worlds – editors of academic biographical dictionaries and active Wikipedians – to the same table. The workshop offered a wide range of contributions dedicated to individual national biographical dictionaries, their current status and their plans in the digital world: Sächsische Biografie (Institut für Sächsische Geschichte und Volkskunde, Germany), Biografický slovník českých zemí (Historický ústav AV ČR, Czechia), Dictionary of Irish Biography (Royal Irish Academy, Ireland), Slovník slovenských historiků (Slovenská akadémia vied, Slovakia), Biographisches Lexikon zur Geschichte der böhmischen Länder (Collegium Carolinum, Germany), Czech literary dictionaries (Ústav pro českou literaturu Akademie věd ČR, Czechia), Database of Czech librarians (Národní knihovna ČR, Czechia), and Polski Słownik Biograficzny (Tadeusz Manteuffel Institute of History, Polish Academy of Sciences, Poland).

The sharing of different perspectives and experiences was particularly valuable, as were the fruitful discussions. For example, Turlough O'Riordan (Royal Irish Academy) presented data on the dramatic increase in traffic to the online version of the Dictionary of Irish Biography following the introduction of open access and the creation of an identifier on Wikidata. Linda Jansová (National Library of the Czech Republic) showed the benefits of integrating biographical data into the Linked Open Data ecosystem. Daniel Baránek (Institute of History of the CAS), Adam Zapała and Konrad Kołodziejczyk (Tadeusz Manteuffel Institute of History, Polish Academy of Sciences) presented practical procedures for converting biographical entries into structured data and for their further use. Kristýna Kysilková, speaking from the perspective of a Wikipedian, emphasized the great importance of academic dictionaries for the creation of quality Wikipedia entries. In addition, she presented some methods and tools for measuring the traffic and quality of Wikipedia entries.

Conclusion

The analysis results demonstrate that, although Wikidata has existed for more than a decade, the potential of academic dictionaries has yet to be fully realized. For 27% of the personalities with an entry in the BLGBL, there is no corresponding Wikidata entity. In the case of entries with an existing Wikidata entity, approximately 20% of the fundamental information, such as birth and death dates and places, is absent from Wikidata. Moreover, the majority (sometimes almost 100%) of the information about school, education, position, membership in religious orders or organizations, parents, partners and children is missing on Wikidata.

It can be reasonably assumed that a comparable untapped potential exists in other academic dictionaries. Once fine-tuned, the machine learning model biography2wikidata developed in this project can help to exploit it.

The project also contributed to increased communication between academics and Wikipedians, facilitated the sharing of experiences, and demonstrated the importance of integrating academic dictionaries into the linked open data ecosystem.

Finally, the project has enabled closer cooperation between the Institute of History of the Czech Academy of Sciences and Collegium Carolinum. The dataset of the digitized first volume of BLGBL will be published as an online database in 2025. This will allow the creation of a BLGBL identifier on Wikidata and enrich the content of Wikidata with a wealth of data from BLGBL.

References

  1. For gap definitions, see Miriam Redi, Martin Gerlach, Isaac Johnson, Jonathan Morgan, Leila Zia, A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft), 2021, pp. 21–24, arXiv:2008.12314.
  2. Segmentation tool: https://doi.org/10.5281/zenodo.10783346.
  3. biography2wikidata model: https://doi.org/10.57967/hf/1898.