Research:Understanding the data gaps on Wikidata concerning heritage structures of West Bengal

From Meta, a Wikimedia project coordination wiki
This page documents a completed research project.


This is a short study identifying the data gaps related to heritage structures in West Bengal on Wikidata, and potential strategies to address them. The report is authored by Bodhisattwa (CIS-A2K), with editorial oversight and support by Puthiya Purayil Sneha and external review by Sumandro Chattapadhyay. It is part of a series of short-term studies undertaken by the CIS-A2K team in 2019-2020.

Wikidata is a free and open repository of structured and linked data, hosted by the Wikimedia Foundation and built collaboratively[1] by human volunteers and bots from all over the world.[2] The platform was initially intended to serve Wikimedia projects as a high-quality secondary database.[1] It began by centrally linking Wikipedia articles about the same topic in different languages,[3][4][5][6][7] but soon started linking to external databases as well.

Introduction to Wikidata

Wikidata is designed around the Resource Description Framework (RDF) model, which describes statements as triples of subject–predicate–object. In Wikidata, subject–predicate–object is termed item–property–value. Items on Wikidata can represent any object, concept or topic in human knowledge that passes a defined threshold of notability, and each is identified by a unique Q number. Properties, identified by unique P numbers, describe what a given value says about an item. The value is the actual data; its form is pre-defined by the property's data type, be it a string, number, date, URL, coordinate, musical notation etc., or even another item. Items, properties and values are language-independent and thus fully machine-readable, although for human comfort and understanding, contributors can describe items in their own languages by adding or translating labels, descriptions and aliases.
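The item–property–value structure described above can be illustrated with a minimal Python sketch. This is only an illustration, not Wikidata's actual storage format: the `Statement` class, the `LABELS` table and the `render` helper are hypothetical names introduced here. The example statement (Q42, P31, Q5) uses well-known Wikidata identifiers.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Statement:
    """One statement: item (subject), property (predicate), value (object)."""
    item: str                      # Q number of the item
    prop: str                      # P number of the property
    value: Union[str, int, float]  # another item's Q number, or a literal


# Hypothetical, hand-written label table: identifiers are language-independent,
# while labels are added per language by contributors.
LABELS = {
    "Q42": {"en": "Douglas Adams"},
    "P31": {"en": "instance of"},
    "Q5": {"en": "human", "de": "Mensch"},
}


def render(stmt: Statement, lang: str) -> str:
    """Show a statement in one language, falling back to the raw identifier."""
    def label(part) -> str:
        if isinstance(part, str):
            return LABELS.get(part, {}).get(lang, part)
        return str(part)
    return " | ".join(label(p) for p in (stmt.item, stmt.prop, stmt.value))


statement = Statement("Q42", "P31", "Q5")
print(render(statement, "en"))  # labels exist in English
print(render(statement, "de"))  # missing labels fall back to Q/P numbers
```

The fallback to the bare Q or P number mirrors the point made above: the data itself does not depend on any language, and a missing label is a gap in presentation rather than in the statement.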

Because of this machine-readable triple structure, Wikidata can be queried to answer questions that would be difficult or impossible to answer from unstructured content such as Wikipedia articles. Retrieving and manipulating RDF triples requires SPARQL, a semantic query language for RDF databases. Through the Wikidata Query Service, one can use SPARQL to retrieve data, identify the prevailing gaps on Wikidata, and visualize the results in different ways.
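As a rough sketch of what such a query can look like, the Python snippet below builds a SPARQL query for items of a given class located in a given place and prepares a request to the public Wikidata Query Service endpoint. It is not the query used in this study. The function names, the User-Agent string and the choice of properties (P31 "instance of", P279 "subclass of", P131 "located in the administrative territorial entity") are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_items_query(class_qid: str, place_qid: str, limit: int = 100) -> str:
    """Build SPARQL for items of a class located, directly or indirectly, in a place."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31/wdt:P279* wd:{class_qid} ;
        wdt:P131+ wd:{place_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,bn" . }}
}}
LIMIT {limit}
""".strip()


def build_request(query: str) -> urllib.request.Request:
    """Prepare a GET request asking the query service for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        # Illustrative User-Agent; replace with one identifying your own tool.
        headers={"User-Agent": "heritage-gap-sketch/0.1 (example script)"},
    )


def run_query(query: str) -> list:
    """Send the query over the network and return the result rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return json.load(response)["results"]["bindings"]
```

For example, `build_items_query("Q44539", "Q1356")` would target temples in West Bengal, assuming Q44539 and Q1356 are the identifiers for "temple" and "West Bengal"; both should be verified on Wikidata before use. Calling `run_query` requires network access, and results will differ from the early-2020 figures reported in this study.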

Wikidata in West Bengal, India

Massive imports of coordinates for places in West Bengal took place on Wikidata between October 2018 and May 2019, as reflected by the map generated using Resemble.js.

Wikidata activities in India have been organized for almost four years under the WikiProject India umbrella. During these years, targeted efforts to fill data gaps on different topics have been pursued through data-thons and campaigns, while workshops and skill-sharing initiatives have aimed to strengthen the community.

As part of that initiative, the Indian state of West Bengal has seen a lot of activity around Wikidata in recent years. Under the WikiProject umbrella, Wikidata volunteers have been working together to build data on different topics related to the state: its demographics, culture, heritage, education, health, politics, language etc. As heritage has been the prime focus of the Wikimedia community members of West Bengal, this essay identifies the data gaps related to the topic through SPARQL queries and explores the reasons for them, if any, through interviews with active volunteers who have been working in this area for years.

Wikimedia community members have been documenting different forms of heritage since 2011, when they organized the first Wikipedia Takes Kolkata photo-walk. Since then, they have organized eight more Wikipedia Takes Kolkata photo-walks, 11 Wiki Exploration projects in 9 districts of the state, the 2018 and 2019 editions of the prestigious Wiki Loves Monuments in India, and several other documentation projects, some organized organically and some single-handedly. Through these, they have uploaded several thousand photographs of heritage structures and GLAM collections to Wikimedia Commons.

In this essay, we will focus on the photo-walks and explorations which were conducted to document heritage structures of West Bengal. We will focus on two basic types of data that should be present in every dataset on heritage structures, i.e. a) location and b) image, and use SPARQL queries to find out whether there is any significant gap in them.

Photo-walks and Wiki Explorations in West Bengal

Map of KMC heritage buildings generated from Wikidata query https://w.wiki/Tir

Let’s start with the series of nine Wikipedia Takes Kolkata photo-walks, which aim to photo-document heritage buildings and structures of Kolkata. To understand the data gap related to these heritage buildings, we examine the presence on Wikidata of the graded heritage buildings and structures listed by the Kolkata Municipal Corporation (KMC),[8] using different SPARQL queries. Wikidata now contains 923 heritage buildings and structures listed by KMC, but only 26.65% of them have images and only 18.53% have coordinates.

Although 81.47% of the heritage structure items were missing coordinates, they still gave a fairly good idea of their location: all of the items had municipal wards and streets connected to them, which photographers and travellers can be expected to use to find the sites. However, on testing the ward items, it was noticed that although all 144 wards have coordinates, they all lack a crucial property that denotes the area they cover, i.e. geoshape data. While coordinates can denote the exact location of a single point, they are misleading for a larger area, which requires a geoshape to describe its location properly. On testing the street data, it was found that both geoshape and coordinate data are lacking for the streets, which makes them extremely difficult to locate.
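One way to list items that lack a particular piece of data is a SPARQL `FILTER NOT EXISTS` clause. The sketch below is an assumption about how such a gap query could be written, not the query behind the figures above. It assumes the KMC-listed items can be selected through the "heritage designation" property (P1435), and the designation identifier passed in is a hypothetical placeholder, since this report does not state how the KMC list is modelled on Wikidata. The `PROPERTIES` table and the function name are likewise introduced here for illustration.

```python
# Properties relevant to the gaps discussed above.
PROPERTIES = {
    "image": "P18",
    "coordinate location": "P625",
    "geoshape": "P3896",
}


def build_missing_query(designation_qid: str, missing_pid: str) -> str:
    """Build SPARQL listing items with a heritage designation but without a property.

    designation_qid is an assumption: the caller must supply the real
    identifier of the heritage designation used for the items of interest.
    """
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P1435 wd:{designation_qid} .
  FILTER NOT EXISTS {{ ?item wdt:{missing_pid} ?anyValue . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,bn" . }}
}}
""".strip()


# "Q0000" is a placeholder, not a real designation identifier.
query = build_missing_query("Q0000", PROPERTIES["coordinate location"])
print(query)
```

The same pattern with `PROPERTIES["geoshape"]` and a different selection triple could be used to check wards or streets. Note that the `wdt:` prefix only matches best-ranked ("truthy") statements, so items whose only values are deprecated would also be reported as missing.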

Map of temples in West Bengal generated from Wikidata query https://w.wiki/Tj7

For the last 3 years, Wikimedia volunteers from West Bengal have also been involved in Wiki Exploration projects to remote parts of the state, documenting temples, mosques, sculptures etc., many of which had not been documented online before. A few hundred heritage structures in 9 districts of the state were documented, and thousands of photographs from this project have been uploaded to Wikimedia Commons. If we now test the Wikidata presence of the temples situated in West Bengal, we find that 435 temples have items, of which only 196 have images and only 79 have coordinates; however, 302 of them have their location pinpointed to the village, ward, town or city level. As in the previous case, although there are 40,359 items for villages located in West Bengal, only 0.017% have coordinates and none have geoshape data.
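To make the temple counts comparable with the percentages quoted for the KMC list, they can be converted into coverage and gap shares. The short Python sketch below does only that arithmetic on the numbers reported above; the helper names are introduced here for illustration.

```python
def coverage(part: int, total: int) -> float:
    """Share of items that have a given piece of data, in percent."""
    return round(100 * part / total, 2)


def gap(part: int, total: int) -> float:
    """Share of items that lack a given piece of data, in percent."""
    return round(100 * (total - part) / total, 2)


# Counts for temple items in West Bengal, as reported in this study.
TEMPLES_TOTAL = 435
temple_counts = {"image": 196, "coordinates": 79, "locality": 302}

for name, count in temple_counts.items():
    print(f"{name}: {coverage(count, TEMPLES_TOTAL)}% present, "
          f"{gap(count, TEMPLES_TOTAL)}% missing")
# image: 45.06% present, 54.94% missing
# coordinates: 18.16% present, 81.84% missing
# locality: 69.43% present, 30.57% missing
```

On these figures, the share of temple items with coordinates (18.16%) is close to the 18.53% found for the KMC-listed structures.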

From the SPARQL queries in the above two scenarios, it can be concluded that there is a significant data gap. Both datasets significantly lack location data and images. In the second scenario, even data on the temples themselves is lacking.

Challenges of Contributing to Wikidata in/from West Bengal

To understand why there are such large gaps in the data, we interviewed four volunteers from West Bengal who are involved in these two kinds of projects. Three of them have been Wikimedia contributors for five to ten years, and one is relatively new to the movement. All of them upload heritage photographs to Wikimedia Commons, and two of them contribute to Wikidata. All of them agreed that, for lack of suitable hardware, they could not record exact coordinates while photo-documenting heritage structures. GPS devices or full-frame cameras with built-in GPS are expensive and unaffordable for many. Interviewees also pointed out that, for lack of proper training on how to document heritage structures, photographers and amateur researchers miss vital points of documentation and thus increase data gaps. Other reasons for data gaps include restricted access to private heritage structures, such as temples maintained by families or private heritage buildings, and to their documents; the lack of proper existing documentation along with analogue and digital metadata; and the rapid destruction of built heritage due to lack of maintenance or improper restoration procedures. Asked why photographs are not fully converted into data, they pointed out that it might be a burden for photographers to learn data entry on Wikidata, as this is outside their area of interest and workflow. As noted by an interviewee, ‘the nature of work for Wikidata does not match with photographers' workflow.’ However, they also stressed the need to conduct training programmes on Wikidata for photographers and other interested people involved in documentation, so that they understand the importance of structured data in heritage documentation.

Recommendations

From the observations of this short study, it is recommended that volunteers working on heritage documentation in West Bengal be supported with suitable hardware to record coordinates. Frequent training programmes should be conducted, preferably by experts, for volunteers on how to document heritage structures in a professional way, so that data gaps remain minimal. Training on Wikidata should be conducted for photographers to help them understand the importance of structured data in the field of heritage documentation. It is also recommended to increase interaction between Wikidata and Wikimedia Commons volunteers, so that they understand each other's workflows and can strategically modify them for optimal results.

References

  1. Vrandečić, Denny (2012). "Wikidata: a new platform for collaborative data collection". Proceedings of the 21st international conference companion on World Wide Web - WWW '12 Companion: 1063. doi:10.1145/2187980.2188242.
  2. Vrandečić, Denny; Krötzsch, Markus (23 September 2014). "Wikidata: a free collaborative knowledgebase". Communications of the ACM 57 (10): 78–85. doi:10.1145/2629489. 
  3. Roth, Matthew (30 March 2012). "The Wikipedia data revolution". Diff. 
  4. Pintscher, Lydia (14 January 2013). "First steps of Wikidata in the Hungarian Wikipedia". Wikimedia Deutschland Blog. 
  5. Pintscher, Lydia (30 January 2013). "Wikidata coming to the next two Wikipedias". Wikimedia Deutschland Blog. 
  6. Pintscher, Lydia (13 February 2013). "Wikidata live on the English Wikipedia". Wikimedia Deutschland Blog. 
  7. Pintscher, Lydia (6 March 2013). "Wikidata now live on all Wikipedias". Wikimedia Deutschland Blog. 
  8. "Graded List of Heritage Buildings Grade I IIA IIB" (PDF). Kolkata Municipal Corporation. 2009.

Notes

  • The query results were generated during early 2020. The results may vary at the time of publication of this article.

Annexure: Interview questionnaire

  1. Could you please talk a little bit about your work on the photo-documentation of heritage structures in West Bengal? (Short description of the types of structures, how old etc. )
  2. What kind of data, do you think, is lacking in all your uploads and data work?
  3. We have found out based on data extraction that exact location, coordinates etc. are missing for significant numbers of heritage structures. How do we understand why they are missing and what can be done to reduce them?
  4. Have you identified any other data gaps in your work on heritage structures?
  5. What could be other possible factors related to the missing data (access to sites, technology, language, archival content, skills etc.)?
  6. We have also found out that Wikidata lacks data about temples of West Bengal, although there have been 11 exploration projects in the last 3 years and thousands of photographs were uploaded to Wikimedia Commons. What, do you think, is the reason for this lower conversion rate from images to data, and how can it be increased?
  7. Have you made any efforts to address some of these issues? Do you see any such efforts in other Wikiprojects?