Research:Recommending Images to Wikidata Items
Images help explain, enrich, and complement knowledge without language barriers. They can also help illustrate the content of an item in a language-agnostic way to external data consumers. However, a large proportion of Wikidata items lack images: for example, as of today, more than 3.6M Wikidata items are about humans (Q5), but only 17% of them have an image (SPARQL query). More generally, only 2.2M of the roughly 40 million Wikidata items have an image attached. A wider presence of images in such a rich, cross-lingual repository enables a more complete representation of human knowledge.
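Figures like these can be recomputed against the Wikidata Query Service. A minimal sketch follows; the endpoint and the wd:/wdt: prefixes are standard, but this particular query is an illustrative reconstruction, not necessarily the exact one behind the numbers above:

```python
# Count humans (Q5) that carry an image (P18) via the Wikidata Query Service.
# The SPARQL below is an illustrative reconstruction of the query mentioned
# in the text, not necessarily the exact one used for the quoted figures.
HUMANS_WITH_IMAGE = """
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P31 wd:Q5 .   # instance of: human
  ?item wdt:P18 ?image .  # has an image
}
"""

def query_params(sparql: str) -> dict:
    """GET parameters for https://query.wikidata.org/sparql."""
    return {"query": sparql, "format": "json"}

params = query_params(HUMANS_WITH_IMAGE)
```

Sending `params` as a GET request to the endpoint returns the count as JSON.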
We want to help Wikidata contributors make Wikidata more “visual” by recommending high-quality Commons images to Wikidata items.
We will suggest a set of high-quality Commons images for items where images are either missing or flagged as low quality. This recommendation will be performed by a classifier able to (1) identify images relevant to a Wikidata entry and (2) rank such images according to their visual quality.
More specifically, we propose to first design a matching system to evaluate the relevance of an image to a given item, based on usage, location, and contextual data. We will then design a computer-vision-based classifier able to score relevant images in terms of quality, based on the operationalisation of existing image quality guidelines.
- Image Subject Lists
- Image Sources in other Wiki Projects: where can we find image candidates for items without P18 (image)?
- Searching for Flickr Images:
- Images resulting from free-text search (query = entity name in all languages) using the Flickr API, filtering for Creative Commons images only
- This also allows us to discover how many images of this category match the ones above: is there a way to include more CC images from Flickr in Commons for specific categories?
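The free-text search above can be sketched against Flickr's public REST API. The endpoint and the `flickr.photos.search` method are Flickr's own; the specific licence IDs to keep (here 4 = CC BY, 5 = CC BY-SA) are an assumption about which licences the project would accept:

```python
# Sketch of the Flickr free-text search with a Creative Commons licence
# filter. The licence IDs kept here (4 = CC BY, 5 = CC BY-SA) are an
# assumed choice; a real run would also repeat the query per language label.
FLICKR_ENDPOINT = "https://api.flickr.com/services/rest/"

def flickr_search_params(label: str, api_key: str) -> dict:
    """GET parameters for flickr.photos.search restricted to CC licences."""
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": label,       # entity label in one language
        "license": "4,5",    # assumed CC licence IDs
        "format": "json",
        "nojsoncallback": 1,
    }

params = flickr_search_params("Porlock church", "YOUR_API_KEY")
```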
Data Analysis: Feasibility
To understand the extent to which the sources above actually contain potential image candidates, we ran two simple analysis experiments.
- We took all monument entities and split them into those with and without P18, the Wikidata property indicating that an image depicting the entity is present. Of around 100K entities, 2/3 have images and 1/3 don't.
- We then looked at how many pages are linked to each entity, and in which languages. Only 20% of entities without images link to a Wikipedia page. In general, entities without an image link to pages in 2 or fewer different languages.
- We then checked how many actual images lie in the linked pages: linked pages contain either no images or more than one.
- We looked at how many Page Images are linked to entities; this number is similar to the page-link count.
- We counted the images returned by the Commons free-text search when queried with the entity name: around 50% of entities without images have at least one Commons image matching them.
- Finally, we counted the number of images in pages linked to the items, when images are not from Commons (globalimagelinks) but attached within each wiki. We see that 20% of entities have a wiki image matching them.
Overall, more than 60% of entities without an image have at least one image from one of the sources above, making this approach a viable solution to find image candidates to recommend to Wikidata items.
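The with/without-P18 split can be computed directly from the entity JSON. A minimal sketch, where the `claims` layout follows the Wikidata JSON data model and the two sample entities are invented for illustration:

```python
# Split entities into with/without P18 by inspecting the Wikidata JSON
# data model, where entity["claims"] maps property IDs to statement lists.
# The two sample entities below are made up for illustration.
def has_image(entity: dict) -> bool:
    """True if the entity carries at least one P18 (image) statement."""
    return bool(entity.get("claims", {}).get("P18"))

entities = [
    {"id": "Q1234", "claims": {"P18": [{"mainsnak": {}}]}},
    {"id": "Q5678", "claims": {}},
]
with_p18 = [e["id"] for e in entities if has_image(e)]
without_p18 = [e["id"] for e in entities if not has_image(e)]
```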
- We took all people entities and split them into those with and without P18, the Wikidata property indicating that an image depicting the entity is present. Of around 3.5M entities, 1/7 have images and the rest don't.
- We then looked at how many pages are linked to each entity, and in which languages. Only 35% of entities without images do NOT link to a Wikipedia page. In general, entities without an image link to pages in 2 or fewer different languages (much fewer than entities with P18).
- We then checked how many actual images lie in the linked pages: for around 60% of the entities, more than 1 image can be found in linked pages.
- We looked at how many Page Images are linked to entities: in this case, numbers are lower; only 30% of entities are linked to pages with a page image specified.
- We counted the images returned by the Commons free-text search when queried with the entity name: around 30% of entities without images have at least one Commons image matching them.
- Finally, we counted the number of images in pages linked to the items, when images are not from Commons (globalimagelinks) but attached within each wiki. We see that more than 60% of entities have a wiki image matching them.
Overall, more than 70% of entities without an image have at least one image from one of the sources above, making this approach a viable solution to find image candidates to recommend to Wikidata items. Finding good images for this class of entities might also help identify page images for biographies of people.
On Collecting Flickr Images
Here we look at whether Flickr is a good source for collecting CC images for Wikidata items without an image attached.
We first look at the number of images returned by querying the Flickr search API with the following constraints:
- They must have one of the following licences:
- They have to match the free text query corresponding to the entity label in any language
Results show that around 40% of entities without images have one or more matches on Flickr. In the plot below, we can see an artifact in the 10+ bin: a lot of entities have 10 or more images returned by the API. We still have to investigate why this distribution is so skewed; it might be an artifact of the Flickr search API, which returns results for more general queries (a generic church instead of a specific church, for example). Indeed, while the total number of matches for 33K entities stands at around 200K, this number drops to only 35K after removing duplicates.
Detecting duplicates Commons/Flickr
When considering Flickr images for this project, one question arose: what if Flickr CC images are already in Commons? Checking for these kinds of duplicates does not boil down to a simple URL or title match. Some information about the image source may be available in the Commons description, but this is not always the case. The safest way is a computer vision approach that checks for exact (or semi-exact: we have to take into account resolution and compression artifacts) matches. We proceed as follows:
- We download set C: all the Commons pictures returned by the previous experiment for monument entities without P18 specified.
- We download set F: all the Flickr images returned by the free-text search above, for all monument entities without P18 specified.
- We want to represent all pictures in sets C and F with a unique signature. We do this by computing a visual feature summarizing the image content. More specifically, we extract the output of the second-to-last layer of a convolutional neural network (see picture) called Inception-v3, trained to recognize objects in images at scale. Intuitively, the feature we extract from this network is a compact description of the image structure and content. This feature has 2048 dimensions.
- We now want to compare features of set C to features of set F, and look for matches between the two sets. Since the 2048-d features are very high-dimensional, we want to reduce their dimensionality in a way that preserves the information needed to match images. Since we want semi-exact matches (to account for small modifications), we resort to locality-sensitive hashing, a technique that hashes vectors so that similar vectors fall into the same bucket, to perform dimensionality reduction. We index the hash table with all images from set C and then, for each image in F, look for the closest match in C. The algorithm returns a matching image and a similarity score s. To determine a threshold t for s below which we can trust the match (i.e. the match is real), we look at matched pairs at various thresholds.
- For t<5, the matches are 100% exact matches (exact duplicates): either we can see the Flickr image ID in the Commons title (e.g. :St_Dubricius,_Porlock_(2879499692).jpg) or, by visual inspection, we can see, for example, that the file Church_cupolas_-_Nativity_of_the_Blessed_Virgin_Mary_Church.jpg matches the retrieved Flickr image with ID 36641532665
- For thresholds greater than 30, the matches between F and C are wrong. Images still look structurally similar, but the content is different.
- In total, only 25 out of 33K Flickr images have s<20 with their matches: this tells us that there is a lot to do in this direction, and there is a great opportunity to enrich Commons with Flickr data without repetitions, starting from Wikidata: 1) [automatic] for a Wikidata item, retrieve set F and set C; 2) [automatic] match set F with set C; 3) [manual] add to Commons those images from Flickr that are relevant to the item
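The matching procedure above can be sketched with random-projection LSH: hash each feature into a bucket via the signs of a few random projections, then verify bucket collisions with an exact distance, which plays the role of the score s. The features here are random stand-ins for the Inception-v3 activations, and the number of hyperplanes is an illustrative choice:

```python
import numpy as np

# Random-projection LSH over 2048-d image features. Features are random
# stand-ins for Inception-v3 activations; 16 hyperplanes is an illustrative
# setting, not the project's tuned value.
rng = np.random.default_rng(0)
DIM, N_PLANES = 2048, 16
planes = rng.standard_normal((N_PLANES, DIM))

def lsh_key(v):
    """Bucket id from the sign pattern of N_PLANES random projections."""
    return sum(int(b) << i for i, b in enumerate(planes @ v > 0))

# Set C: Commons features. One Flickr image is a near-duplicate, simulated
# here as a tiny global rescaling (which provably preserves the bucket).
set_c = {f"commons_{i}": rng.standard_normal(DIM) for i in range(50)}
set_f = {"flickr_dup": 1.001 * set_c["commons_7"],
         "flickr_new": rng.standard_normal(DIM)}

buckets = {}
for name, feat in set_c.items():
    buckets.setdefault(lsh_key(feat), []).append(name)

matches = {}
for name, feat in set_f.items():
    for cand in buckets.get(lsh_key(feat), []):
        s = float(np.linalg.norm(feat - set_c[cand]))  # exact check = score s
        if s < 5:  # trust threshold t, as in the analysis above
            matches[name] = (cand, s)
```

Only the near-duplicate lands in a match; the unrelated Flickr image either misses every bucket or fails the distance check.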
- Define general categories of interest: for example, people, monuments, species, etc. Defining positive lists of items is important to a) scope the focus of the model and b) target the recommendation.
- Retrieve lists of Wikidata items belonging to the categories of interest. This process can be broken down into 2 pieces:
- Get identifiers for the Wikidata categories. We want to retrieve only items with specific properties corresponding to the categories of interest. For example, all items about people are instances of (P31) Q5 'human'. While homogeneous categories (people) are characterised by a single identifier, heterogeneous categories such as monuments or species span multiple identifiers. There are two ways to get these identifiers in this case.
- Retrieve Wikidata items that are instances of those identifiers: from the JSON dumps, retain IDs and properties of Wikidata items that are instances of (P31) the identifiers collected.
- Collect images from pages linked to the Wikidata items retrieved, from the SQL replicas. From the table of global links from entities to Wikipedia pages, we retrieve all pages linked to the Wikidata items in the list. Then we can collect two types of images:
- Images in pages: from the table of global links from images to Wikipedia pages, we retrieve all images in the pages linked to the Wikidata items in the list.
- Page Image: from the table of page properties, we retrieve the 'page images' of the pages linked to the Wikidata items in the list, namely the main image of each page.
- Collect images from Commons search. We use the Commons API to retrieve images with query = label (B13) + location (B14)
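The dump-filtering step above (retain items whose P31 values fall in a category's identifier list) can be sketched against the Wikidata JSON data model. The allowlist contents and the sample record are invented for illustration:

```python
# Filter Wikidata JSON-dump entities by their P31 ("instance of") values.
# The allowlist is a stand-in: a real category (e.g. monuments) would hold
# many QIDs collected in the previous step.
ALLOWED = {"Q4989906"}  # assumed single QID standing in for a category list

def p31_values(entity: dict):
    """Yield the QIDs this entity is an instance of (Wikidata JSON model)."""
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim["mainsnak"]
        if snak.get("snaktype") == "value":
            yield snak["datavalue"]["value"]["id"]

def in_category(entity: dict) -> bool:
    return any(qid in ALLOWED for qid in p31_values(entity))

sample = {
    "id": "Q100",
    "claims": {"P31": [{"mainsnak": {
        "snaktype": "value",
        "datavalue": {"value": {"id": "Q4989906"}}}}]},
}
```

In practice this predicate would be applied line by line while streaming the (one entity per line) JSON dump.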
- build a quality model
- [PYTHON/Tensorflow] build a deep learning model based on quality images
- [PYTHON] extend the model with features including size, compression quality, and text features (richness, readability)
- select candidates by relevance
- [PYTHON] retain all images (2.2.1) and page images (C3)
- [PYTHON] retain only those images retrieved from commons (D) whose image name (D1) soft matches the wikidata label (B13)
- [PYTHON] filter out images using additional computer vision tools: a face detector for instances of humans, scene/object detectors for other instances
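The soft match between an image file name and a Wikidata label can be sketched with a normalised string-similarity ratio from the standard library; the 0.6 cutoff is an illustrative choice, not a tuned value:

```python
import re
from difflib import SequenceMatcher

# Soft match between a Commons file name and a Wikidata label: strip the
# extension, normalise punctuation/underscores, then compare with a fuzzy
# ratio. The 0.6 threshold is an assumed illustrative cutoff.
def soft_match(image_name: str, label: str, threshold: float = 0.6) -> bool:
    def norm(s: str) -> str:
        s = re.sub(r"\.(jpe?g|png|svg|tiff?)$", "", s, flags=re.I)
        return re.sub(r"[_\W]+", " ", s).lower().strip()
    return SequenceMatcher(None, norm(image_name), norm(label)).ratio() >= threshold

keep = soft_match("St_Dubricius,_Porlock_(2879499692).jpg", "St Dubricius, Porlock")
```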
- sort images by quality
- download images filtered in (G)
- assign quality score from (F) and rank
- domain expertise for e.g. monuments
- distribute process
- there is no GPU?
- pixels are not available internally
We will pilot a set of recommendations (powered by tools like the WikiShootMe platform) to evaluate whether our machine learning method can help support community efforts to address the problem of missing images.