Images are extremely important for free knowledge sharing and dissemination, as they can complement, extend, and explain written information, often in a language-agnostic way. However, a large proportion of knowledge in Wikimedia spaces lacks pictorial representation. Most Wikidata items are missing images: only 4.3% of them (3.14 M) have a value for the P18 (image) property (sparql query). Wikipedia is also missing many images: only 45.6% of articles in English Wikipedia have a 'page image' specified (see quarry query), compared with 50% in Italian Wikipedia, 32% in Swahili Wikipedia, and 41% in Arabic Wikipedia.
In this project, we want to test the feasibility of a framework that, given a Wikipedia article, can suggest relevant images to be included in it. The framework will use recent techniques developed in the Computer Vision field for the purpose of Multimedia information retrieval. In the long term, the ideal outcome of the project is a system that, given an article without an image, can recommend images from Commons that are related to the content of the article, based on tools for automatic content understanding. Such a system would complement existing tools like GLAMify, which helps find images to be used in a given Wikipedia language based on the images' usage in other Wikipedias. This project is carried out as part of a master's thesis for the master's program in Data Science at the Ukrainian Catholic University.
Approach: Coordinated Multimodal Representations
This project aims to take the first step towards a complete system for image recommendations in Wikipedia. We test the feasibility of existing solutions in the field of Multimedia Retrieval for the task of Wikipedia image recommendation. More specifically, we look at existing frameworks for multimodal embeddings or representations, namely techniques that map images and text to the same feature space. Multimodal embeddings make it possible to compute similarities between different types of information: for example, they can help find the pieces of text that are most similar to a given image, and vice versa. This property makes them well suited to our task of finding the right images for a given article.
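To make the idea of a shared feature space concrete, the following is a minimal sketch of ranking candidate images by cosine similarity to an article vector. The vectors are toy values and the function name `rank_images` is hypothetical; it only assumes that text and images have already been embedded in the same space.

```python
import numpy as np

def rank_images(text_vec, image_vecs):
    """Return image indices sorted from most to least similar to the text."""
    # Normalise to unit length so the dot product equals cosine similarity.
    text_unit = text_vec / np.linalg.norm(text_vec)
    image_units = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = image_units @ text_unit
    return np.argsort(-scores), scores

# Toy 3-dimensional embeddings; real ones have hundreds or thousands of dimensions.
article = np.array([1.0, 0.0, 0.0])
images = np.array([
    [0.0, 1.0, 0.0],  # unrelated to the article
    [2.0, 0.1, 0.0],  # closely aligned with the article
    [1.0, 1.0, 0.0],  # partially aligned
])

order, scores = rank_images(article, images)
print(order)
```

The first index in `order` is the image that would be recommended first for the article.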
Choosing the Embedding Technique
Multimodal learning approaches can be divided into three categories: 1) joint representation, which aims to integrate modality-specific features into a common space; 2) coordinated representation, which aims to preserve modality-specific features while introducing a space in which to measure multimodal similarities; and 3) intermediate representation, which aims to encode the features of one modality into an intermediate space, from which the features of another modality are later generated.
We identified Coordinated Representation techniques as the most promising for our task, given their ability to exploit modality-specific (image, text) features fully. More specifically, we use a recent technique called Word2VisualVec, which computes visual features for chunks of text, thus making them comparable to visual features extracted from images. In Coordinated Learning, the general pipeline is to discover the correct feature representation for each modality (images or text) while knowing how to map them into a common space. Word2VisualVec tackles the opposite problem, which is far more computationally efficient: we fix a feature representation for each modality and learn how to map them into a common space correctly. Moreover, since the task is to match images to text, we simplify the model even further by mapping directly from the text space to the image space.
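The idea of fixing both feature representations and learning only the text-to-image mapping can be sketched as below. This is a deliberately simplified linear stand-in fitted by least squares on synthetic data, not the Word2VisualVec architecture itself; all sizes and names (`text_feats`, `image_feats`, `text_to_visual`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 training pairs, 50-d text features, 20-d image features.
n_pairs, text_dim, image_dim = 200, 50, 20

# Fixed (pre-computed) features for each modality; synthetic here.
text_feats = rng.normal(size=(n_pairs, text_dim))
true_map = rng.normal(size=(text_dim, image_dim))
image_feats = text_feats @ true_map + 0.01 * rng.normal(size=(n_pairs, image_dim))

# Learn W minimising the squared error ||text_feats @ W - image_feats||^2.
W, *_ = np.linalg.lstsq(text_feats, image_feats, rcond=None)

def text_to_visual(text_vec):
    """Project a text feature vector into the image feature space."""
    return text_vec @ W

predicted = text_to_visual(text_feats[0])
error = np.mean((predicted - image_feats[0]) ** 2)
print(predicted.shape, error)
```

Once text is projected this way, it can be compared directly with image features, for example by cosine similarity.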
Challenges of our Real World Scenario
The task of this project therefore becomes testing the feasibility of Word2VisualVec, and improving it, in the practical scenario of recommending images for Wikipedia articles. Word2VisualVec was originally trained on a much simpler task, namely caption retrieval for still images. The original dataset is made of 30K Flickr images, and each image is associated with 5 crowd-sourced descriptive sentences. Caption sentences describe generic actions of objects such as "dogs" or "mountains" in the image. The model is trained to associate an image with one of the 5 captions.
Our Wikipedia image recommendation task poses the following challenges:
- Semantics: in a generic caption retrieval task, the model learns associations between text and generic visual objects, for example "cars". Here we need the system to be able to capture the fine-grained semantics of an article, e.g. we want to be able to retrieve images related to the concept "Maserati", rather than to a generic concept "car". This can be solved partially by training on entity-specific data. More importantly, considering image metadata is crucial to solve this challenge.
- Retrieval problem: in a caption retrieval task, an image needs to be associated with one or more sentences. In our Wikipedia image retrieval task, we want to retrieve one or more images that should be assigned to one article, which is made of multiple sentences. In this scenario, we need not only to change the evaluation metric to reflect the goal of our task, but also to evaluate carefully how we represent the notion of "article": are we looking at the article's title, its summary, or the whole text?
As a test bed for this project, we focus on English Wikipedia's Featured Articles, as this allows us to train the model on high-quality data. As a matter of fact, "a featured article has images and other media, where appropriate, with succinct captions and acceptable copyright status. Images follow the image use policy. Non-free images or media must satisfy the criteria for inclusion of non-free content and be labelled accordingly." In order to enhance this dataset with more high-quality data, we also sample from English Wikipedia's Good Articles.
Data Collection Process
For each article, we download:
- Text: The article text (wikitext)
- The image files in the article
- The metadata (image title and description) of the images in the article
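The page does not say how these items were downloaded. One possible way, sketched here as an assumption, is the public MediaWiki Action API. The helper names below are hypothetical; the query parameters (`prop=revisions`, `prop=images`, `prop=imageinfo` with `iiprop=url|extmetadata`) are standard API options.

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def _query_url(**params):
    """Build an Action API query URL that returns JSON."""
    base = {"action": "query", "format": "json", "formatversion": "2"}
    base.update(params)
    return API + "?" + urlencode(base)

def wikitext_url(title):
    """URL for the current wikitext of an article."""
    return _query_url(prop="revisions", rvprop="content", rvslots="main", titles=title)

def image_list_url(title):
    """URL for the list of files used in an article."""
    return _query_url(prop="images", imlimit="max", titles=title)

def image_metadata_url(file_title):
    """URL for a file's download link and descriptive metadata."""
    return _query_url(prop="imageinfo", iiprop="url|extmetadata", titles=file_title)

def fetch(url):
    """Download and decode one API response (requires network access)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "image-rec-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

print(wikitext_url("Maserati"))
```

Calling `fetch(image_list_url("Maserati"))` would return the parsed JSON response; large articles may need continuation handling, which is omitted here.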
We preprocess the data as follows.
- Text: We extract the core text of the article from the wikitext, thus removing irrelevant internal and external links, using the MWParserFromHell library.
- Images: We remove all images with "SVG" extension -- this is to remove default pictures like Wikimedia project logos, which are present in most Wikipedia articles. We also use a RedditScore word tokenizer on the image title, to extract meaningful terms from the image name string.
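The preprocessing steps above can be sketched roughly as follows. The text step uses `mwparserfromhell`'s `parse(...).strip_code()`. The title-splitting step is a simple regex stand-in and is not the RedditScore tokenizer used in the project; the function names and the example file names are hypothetical.

```python
import re

def extract_core_text(wikitext):
    """Strip templates, link markup, and formatting from wikitext."""
    # Third-party dependency, imported here so the helpers below run without it.
    import mwparserfromhell
    return mwparserfromhell.parse(wikitext).strip_code().strip()

def drop_svg(file_names):
    """Remove SVG files, which are mostly logos and other default pictures."""
    return [name for name in file_names if not name.lower().endswith(".svg")]

def split_title(file_name):
    """Split an image file name into words (regex stand-in for a tokenizer)."""
    stem = re.sub(r"^File:", "", file_name)           # drop the namespace prefix
    stem = re.sub(r"\.[A-Za-z0-9]+$", "", stem)       # drop the file extension
    stem = re.sub(r"[_\-]+", " ", stem)               # underscores and hyphens to spaces
    stem = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)  # split glued camelCase words
    return stem.split()

files = ["File:Commons-logo.svg", "File:MaseratiGranTurismo_2008.jpg"]
kept = drop_svg(files)
words = split_title(kept[0])
print(kept, words)
```

A statistical word segmenter would also handle titles that are glued together entirely in lower case, which this regex does not attempt.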
Data Collection Summary
Featured Articles Dataset
- 5,638 articles
- 57,454 total images
- 45,643 unique images available from Commons
[TODO] Good Articles Dataset
- 36,476 articles
- 216,463 total images
- ??? unique images available from Commons
Please note that the "Good Articles Dataset" contains both good and featured articles, since the latter is a subset of the former. The datasets were uploaded to Kaggle, where you can either process them in the cloud or download them and work with them locally.
Here we will add the different frameworks that we tried.
|#||Term||Description|
|1||Text Representation||describes what data was used to generate the textual representation of an image|
|2||BoW||minimum number of times a word must appear in the training corpus to be included in the Bag of Words vocabulary|
|3||GRU||the output size of the Gated Recurrent Unit, which is one of the models used to extract text features|
|4||Epoch||number of epochs for which the model was trained|
|5||Precision||caption ranking precision formatted as R@1, R@3, R@10|
|6||image description||description of an image, if present; otherwise, its title|
|7||image description (parsed)||description of an image, if present; otherwise, its parsed title, that is, the title (which often has a few words glued together without spaces) split into separate words, with the image extension removed|
|8||article summary||first 1000 characters of an article. This is an approximation of the article summary, because extracting the summary precisely would require considerable non-trivial work|
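The BoW threshold described in the table can be illustrated with a small sketch: a word enters the vocabulary only if it appears at least `min_count` times in the training corpus. The whitespace tokenization, the function names, and the toy corpus are assumptions for illustration only.

```python
from collections import Counter

def build_vocabulary(texts, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return sorted(word for word, n in counts.items() if n >= min_count)

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

corpus = ["red sports car", "red car on a road", "mountain road"]
vocab = build_vocabulary(corpus, min_count=2)
vector = bow_vector("red red car", vocab)
print(vocab, vector)
```

A higher threshold gives a smaller vocabulary and shorter vectors, at the cost of dropping rare but possibly informative words.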
|#||Text embedding||Article text||Image text||Precision (R@1, R@3, R@10)|
|1||word2vec||article summary||image description (parsed)||1.3, 2.5, 5|
|2||word2vec||article title||image title (parsed)||3.5, 10.3, 17.8|
|3||word2vec||article title||image description (parsed)||3.8, 9.4, 18.6|
|4||inferText||article summary||image description (parsed)||0.9, 1.4, 2.5|
|5||inferText||article title||image title (parsed)||2.9, 6.3, 12.9|
|6||wikipedia2vec||article summary||image description (parsed)||0.5, 1.5, 2.|
|7||wikipedia2vec||article title||image title (parsed)||1.5, 3.1, 6.7|
|8||wikipedia2vec||article title||image description (parsed)||1.5, 3.2, 6.5|
|9||co-occurrence||article title||image description (parsed)||4.9, 11.8, 25.7|
|#||Text Representation||BoW||GRU||Epoch||Precision|
|1||article first sentence + article title||5||32||10||3.6, 12.4, 19.3|
|2||article first sentence||5||32||10||5.1, 16.0, 25.9|
|3||article first sentence + image description||5||32||10||8.3, 25.1, 37.8|
|4||article first sentence + image description||20||32||10||7.6, 23.8, 36.2|
|5||article summary + image description||5||32||10||11.8, 29.8, 40.9|
|6||article summary + image description||5||32||10||11.8, 29.8, 40.9|
|7||article summary + image description (parsed)||5||32||10||13.9, 31.9, 42.7|
|8||article summary + image description (parsed)||5||100||10||7.2, 20.2, 28.3|
|9||article summary + image description (parsed)||5||32||24||18.2, 38.4, 47.2|
|#||Text Representation||BoW||GRU||Epoch||Precision|
|1||article summary + image description (parsed)||5||32||14||5.2, 14.6, 22.5|
|2||article summary + image description (parsed)||5||32||24||7.2, 18.0, 28.0|
|3||article summary + image description (parsed)||5||32||38||8.4, 20.1, 29.6|
|4||article summary + image description (parsed)||20||100||14||4.7, 14.9, 22.3|
|5||article summary + image description (parsed)||10||100||14||1.4, 5.6, 10.9|
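The precision columns in the tables above are formatted as R@1, R@3, R@10. As a rough sketch, assuming R@K means the percentage of queries whose correct item appears among the top K ranked candidates (the usual convention in caption retrieval), it could be computed as follows. The similarity matrix is a toy example, and the sketch assumes exactly one correct candidate per query.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate is ranked in the top k.

    similarity[i, j] is the score of candidate j for query i, and
    candidate i is assumed to be the single correct answer for query i.
    """
    correct = np.diag(similarity)
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    rank = (similarity > correct[:, None]).sum(axis=1)
    return 100.0 * np.mean(rank < k)

sim = np.array([
    [0.9, 0.1, 0.2],  # correct candidate ranked first
    [0.8, 0.5, 0.6],  # correct candidate ranked third
    [0.3, 0.2, 0.7],  # correct candidate ranked first
])

for k in (1, 2, 3):
    print(k, recall_at_k(sim, k))
```

For an article with several valid images, the definition would need to be adapted, for example by counting a hit when any of its images appears in the top K.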
Qualitative Results (Visualizations)
- This paper is accompanied by a GitHub repository with all experiments
- The full paper is available at ResearchGate
- W. Guo, J. Wang, and S. Wang. "Deep Multimodal Representation Learning: A Survey". In: IEEE Access 7 (2019), pp. 63373–63394.
- J. Dong, X. Li, and C. G. M. Snoek. "Predicting Visual Features from Text for Image and Video Caption Retrieval". In: IEEE Transactions on Multimedia (2018).