Research:Recommending Images to Wikidata Items

Created: 23:06, 18 October 2017 (UTC)
Duration:  2017-October — 2017-
commons, wikidata, machine learning

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.


Images help to explain, enrich and complement knowledge without language barriers[1]. They can also help illustrate the content of an item in a language-agnostic way to external data consumers. However, a large proportion of Wikidata items lack images: for example, as of today, more than 3.6M Wikidata items are about humans (Q5), but only 17% of them have an image (sparql query). A wider presence of images in such a rich, cross-lingual repository would enable a more complete representation of human knowledge.
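The share quoted above can be recomputed from two counts. The sketch below is not the query linked from this page; it is a minimal illustration that builds two SPARQL count queries (all instances of a class, and those that also have P18) and turns the two counts into a share. The function names are hypothetical, and counting every human in one query may time out on the public Wikidata Query Service, so the counts may need to be obtained in batches or from a dump.

```python
# Public Wikidata Query Service endpoint (the wd: and wdt: prefixes are predefined there).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_count_query(class_qid: str, require_image: bool) -> str:
    """Build a SPARQL query counting instances of class_qid.

    When require_image is True, only items with a P18 (image) statement are counted.
    Hypothetical helper, written for illustration only.
    """
    image_clause = "  ?item wdt:P18 ?image .\n" if require_image else ""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        f"{image_clause}"
        "}"
    )


def image_share(with_image: int, total: int) -> float:
    """Return the fraction of items that have an image."""
    if total <= 0:
        raise ValueError("total must be positive")
    return with_image / total


if __name__ == "__main__":
    # Print the two queries; running them against WDQS_ENDPOINT is left to the reader.
    print(build_count_query("Q5", require_image=False))
    print(build_count_query("Q5", require_image=True))
```

COUNT(DISTINCT ?item) is used because an item can carry more than one P18 statement and would otherwise be counted several times.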

We want to help Wikidata contributors make Wikidata more “visual” by recommending high-quality Commons images to Wikidata items.


Approach

Finding missing images!

We will suggest a set of high-quality Commons images for items where images are either missing or flagged as being . This recommendation will be performed by a classifier able to (1) identify images relevant to a Wikidata entry and (2) rank such images according to their visual quality.

More specifically, we propose to first design a matching system that evaluates the relevance of an image to a given item, based on usage, location, and contextual data. We will then design a computer vision-based classifier that scores relevant images in terms of quality, based on the operationalisation of existing image quality guidelines [2][3].
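The two stages described above (relevance matching, then quality ranking) can be sketched as a filter followed by a sort. This is only an illustration under stated assumptions: the `Candidate` type, the `relevance` and `quality` scores, the threshold and `top_k` are all hypothetical placeholders, and the models that would produce the scores are not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A Commons file considered for one Wikidata item (hypothetical structure)."""
    file_name: str
    relevance: float  # assumed output of the matching system, in [0, 1]
    quality: float    # assumed output of the quality classifier, in [0, 1]


def recommend(candidates: Iterable[Candidate],
              min_relevance: float = 0.5,
              top_k: int = 3) -> List[Candidate]:
    """Keep candidates judged relevant, then rank them by visual quality."""
    relevant = [c for c in candidates if c.relevance >= min_relevance]
    ranked = sorted(relevant, key=lambda c: c.quality, reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    pool = [
        Candidate("A.jpg", relevance=0.9, quality=0.4),
        Candidate("B.jpg", relevance=0.2, quality=0.95),  # high quality but not relevant
        Candidate("C.jpg", relevance=0.7, quality=0.8),
    ]
    for c in recommend(pool):
        print(c.file_name, c.quality)
```

Filtering before ranking means a visually excellent but unrelated image is never recommended, which matches the order of the two steps in the text.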

Data Collection

  • Image Subject Lists
  • External Image Sources: where can we find image candidates for items without P18 (image)?
    • Images on Wikipedia Pages linked to an item (from globalimagelinks)
    • Page Images on Wikipedia Pages linked to an item (from page_props)
    • Files returned by a free-text Commons search through the API
    • External Sources [TODO]
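The sources listed above can be combined into one candidate pool per item. The sketch below shows one way to do this, under assumptions: it builds the request parameters for a free-text search in the Commons File namespace through the MediaWiki Action API (`list=search`, namespace 6), and merges file names from several sources while remembering where each came from. The helper names and the source labels are hypothetical, and no request is actually sent.

```python
from typing import Dict, Iterable, Set

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def commons_search_params(label: str, limit: int = 10) -> Dict[str, object]:
    """Parameters for a free-text search restricted to the File namespace."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": label,
        "srnamespace": 6,  # namespace 6 is "File:" on MediaWiki wikis
        "srlimit": limit,
        "format": "json",
    }


def normalise_file_name(name: str) -> str:
    """Drop an optional 'File:' prefix and use spaces instead of underscores."""
    name = name.strip()
    if name.lower().startswith("file:"):
        name = name[len("file:"):]
    return name.replace("_", " ").strip()


def merge_candidates(by_source: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each distinct file name to the set of sources that proposed it."""
    merged: Dict[str, Set[str]] = {}
    for source, names in by_source.items():
        for name in names:
            merged.setdefault(normalise_file_name(name), set()).add(source)
    return merged


if __name__ == "__main__":
    print(commons_search_params("Tower Bridge"))
    print(merge_candidates({
        "globalimagelinks": ["Tower_Bridge.jpg"],
        "page_props": ["File:Tower Bridge.jpg"],
        "commons_search": ["File:Another view.jpg"],
    }))
```

Keeping the set of sources per file makes it possible to later use "found in several sources" as one signal of relevance, although that choice is an assumption and not something this page specifies.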

Data Analysis: Feasibility

To understand the extent to which the sources above actually contain potential image candidates, we ran a simple analysis experiment.

  1. We took all monument entities and split them into With P18/Without P18, where P18 is the Wikidata property indicating the presence of an image describing the entity. Of around 100K entities, 2/3 have images and 1/3 don't.
  2. We then looked at how many pages are linked to each entity, and in which languages. Only 20% of entities without images link to a Wikipedia page. In general, entities without an image link to pages in two or fewer languages.
  3. We then checked how many actual images lie in the linked pages: the count is either 0 or more than 1.
  4. We looked at how many Page Images are linked to entities, and the figure is similar to the number of page links.
  5. Finally, we counted the images returned by the Commons free-text search when queried with the entity name: here we find that around 50% of entities without images actually have at least one Commons image matching them.

Overall, more than 60% of entities without an image have at least one image from one of the sources above, making this approach a viable solution to find image candidates to recommend to Wikidata items.

Figure: Number of image sources (monuments)
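The overall coverage figure reported above is the share of entities without P18 that have at least one candidate from any source. A minimal sketch of that computation follows; the entity identifiers, source labels and counts are made-up toy data, not results from this project, and the function names are hypothetical.

```python
from typing import Dict

# entity id -> {source name -> number of candidate images found}
CandidateCounts = Dict[str, Dict[str, int]]


def coverage(counts: CandidateCounts) -> float:
    """Fraction of entities with at least one candidate from any source."""
    if not counts:
        return 0.0
    covered = sum(
        1 for per_source in counts.values()
        if any(n > 0 for n in per_source.values())
    )
    return covered / len(counts)


def coverage_by_source(counts: CandidateCounts) -> Dict[str, float]:
    """Fraction of entities with at least one candidate, per individual source."""
    sources = {s for per_source in counts.values() for s in per_source}
    total = len(counts)
    return {
        s: sum(1 for per_source in counts.values() if per_source.get(s, 0) > 0) / total
        for s in sources
    }


if __name__ == "__main__":
    # Toy data with placeholder identifiers.
    toy = {
        "entity_a": {"page_images": 0, "linked_pages": 0, "commons_search": 2},
        "entity_b": {"page_images": 1, "linked_pages": 3, "commons_search": 0},
        "entity_c": {"page_images": 0, "linked_pages": 0, "commons_search": 0},
        "entity_d": {"page_images": 0, "linked_pages": 0, "commons_search": 1},
    }
    print(coverage(toy))
    print(coverage_by_source(toy))
```

Because an entity can be covered by several sources at once, the overall coverage is generally lower than the sum of the per-source figures.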


Feature Extraction

Evaluations

We will pilot a set of recommendations (powered by tools like the WikiShootMe platform) to evaluate whether our machine learning method can help support community efforts to address the problem of missing images.

Timeline

Q2, Q3

References

  1. Van Hook, S.R. (2011, 11 April). Modes and models for transcending cultural differences in international classrooms. Journal of Research in International Education, 10(1), 5-27. http://jri.sagepub.com/content/10/1/5
  2. https://commons.wikimedia.org/wiki/Commons:Image_guidelines
  3. https://commons.wikimedia.org/wiki/Commons:Quality_images_candidate