Research:Automated Categorization of Wikipedia Images
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
Project goals
Besides their regular text, Wikipedia articles are rich multimedia documents containing video, audio, and above all, images. The volume of Wikipedia images is very large: English Wikipedia articles alone contain more than 5 million unique images. As observed in recent work, images are an essential component of Wikipedia readers' experience and attract significant user engagement [1]. At the same time, from a data perspective, the geographical and cultural diversity of Wikipedia makes its image data unprecedented and extremely valuable for the research community. Despite this volume and value, navigating, retrieving, and re-using visual content on Wikipedia is hard because of the lack of labels, categories, and metadata. Classification of this content for research and editing purposes is becoming increasingly important. Unfortunately, the uniqueness that makes the data valuable also means that common off-the-shelf classification models based on ImageNet give unsatisfactory results, so a custom solution is required.
This project is inspired by its textual counterpart ORES, and the goal is two-fold: 1) develop a classification taxonomy to label images on Wikipedia and 2) develop a model for image classification and embedding. The first part requires familiarity with semantic network data such as the Wikipedia/Commons category network, and it aims to identify the best way to label images on Wikipedia based on existing metadata (e.g. Wikipedia/Commons templates, categories, and tags). The second part focuses on training and evaluating a deep learning model that predicts, for each label in a predefined set, whether that label is relevant to an image.
The source code is maintained here: https://github.com/epfl-dlab/wiki_image_classification
Taxonomy
Classification
Abstract
Wikipedia is full of articles... and images! With over 53 million articles in 299 languages containing 11.5 million unique images, there is a great need for automated organization of all this data. Inspired by ORES, an ensemble of machine learning systems in Wikipedia that provides, among other things, automated labeling of articles, this project aims at automated topic labeling of images in Wikipedia. In this report, experiments are run using images labeled with the ORES labels of the articles in which they appear, and with the custom labels generated by a heuristic in the taxonomy part of this semester's project. Two different models (EfficientNetB0 and EfficientNetB2) are trained on this data using 10 or 20 labels. The main insights are:
- the custom labels were inferior to ORES labels according to our metrics;
- the network with more parameters, EfficientNetB2, yielded higher prediction values and thus greater average recall, but it does not outperform EfficientNetB0 with regard to the ROC curves;
- the labels with better performance are those that are most present in the dataset used in pre-training.
I. Introduction
Wikipedia is the largest encyclopedia in history, containing over 53 million articles and having around 1 billion page views per day. Besides text, images play an important role in readers' interaction with Wikipedia articles, as shown in a recent study by Rama et al. [2]. With the number of unique images on Wikipedia surpassing 11 million, labeling these images with broader topics (rather than with the specific objects in the image) is becoming increasingly important for tasks such as visual vandalism detection (is the topic of the image related to the topic of the article?), finding visual knowledge gaps (what topics of images is Wikipedia missing the most?), and explaining reader patterns (do readers interact with images differently depending on the topic of the image?).
Problem formulation. The lack of standard metadata describing the broader topic of images in Wikipedia hinders exploration of the full potential of this visual data. Today there is no way to perform a topic-based search of images. An object-based search, which to the best of our knowledge does not exist in Wikipedia either, could easily be implemented using networks pre-trained on standard image datasets.
Prior solutions. For automated topic labeling of Wikipedia images, off-the-shelf networks trained on e.g. ImageNet do not yield satisfactory results because of the variety and uniqueness of images in Wikipedia, as mentioned by Redi in her research [3]. In the same article, Redi develops a taxonomy of labels by pairing the 6.7 million Commons categories to the 160 COCO [4] categories of visible images and then fine-tunes a deep learning model pre-trained on ImageNet to classify images. This solution, though, still falls short in classifying images by topic. Moreover, Huang [5] sets out to classify chart images on Wikimedia Commons, obtaining the best overall accuracy when fine-tuning a pre-trained model.
Proposed solution. Our solution to the problem of automated classification of Wikimedia images is to develop a customized taxonomy of topic labels based on the Commons categories (the work done by Salvi in the taxonomy part) and then to fine-tune a pre-trained deep learning model on the Wikipedia image data labeled with the customized topic labels. More specifically, the deep learning model is given an image and a set of predefined labels and outputs the subset of labels that are relevant to describe the image. Each label is assigned independently of the others, so the network can be seen as an ensemble of several binary classifiers. Note that the terms class and label are used interchangeably.
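The independent per-label decision can be sketched as follows. This is a minimal illustration rather than the project's code: the function name `predict_labels` and the example probabilities are hypothetical, and the 0.5 cut-off is the threshold mentioned in the discussion section.

```python
def predict_labels(probabilities, label_names, threshold=0.5):
    """Return every label whose predicted probability surpasses the threshold.

    Each label is judged on its own, so the result may contain several
    labels, one label, or none at all.
    """
    return [name for name, p in zip(label_names, probabilities) if p > threshold]


# Illustrative values only: three labels and made-up model outputs.
labels = ["Nature", "Places", "Sports"]
print(predict_labels([0.81, 0.62, 0.07], labels))  # ['Nature', 'Places']
print(predict_labels([0.30, 0.10, 0.05], labels))  # []
```

The second call shows that an image can end up with no label when no probability surpasses the threshold, which is why the mean number of predicted labels per image can be below 1.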
II. Related work
ORES. ORES [6] is an ensemble of machine learning techniques in Wikipedia whose goal is to help editors and content moderators deal with the immense work of administrating this gigantic encyclopedia. Functions offered by ORES include vandalism detection, article quality assessment, and article topic prediction.
WIT dataset. The Wikipedia-based Image Text (WIT) dataset [7] is a large multimodal and multilingual dataset containing 37.6 million image-text entries, with 11.5 million unique images across 108 Wikipedia languages. From the English Wikipedia, we gathered 3.9 million unique images by reading the dataset segments and removing duplicate images. Each entry contains an image, the textual context in the article where that image appears, and the metadata of the image itself, e.g. caption and image name. Within the time frame of this project, only the image data was used.
ImageNet. ImageNet [8] is a dataset of 1.4 million images, each classified with one single label out of 1000 possible labels. For over a decade, it has been the benchmark dataset for the training of image classification models.
EfficientNet. EfficientNet [9] is a family of deep learning networks that have been shown to achieve better accuracy while requiring fewer parameters on ImageNet compared to other convolutional networks. It utilizes a rule for scaling the width, depth, and resolution of the network for better performance.
Transfer learning. Transfer learning is a machine learning method that aims to reuse the knowledge learned on one problem for another, similar problem. In the field of image classification, the transferred knowledge consists of image features such as corners, shapes, and backgrounds. Networks pre-trained on ImageNet are widely used with great success, for reasons studied by Huh et al. [10].
Importance of classifying images in Wikipedia. Images play an important role in reader understanding and engagement, as highlighted in a large body of literature from educational psychology, e.g. [11]. Wikipedia is no different: as shown in the work by Heald et al. [12], images coming from Commons have a high monetary and societal value. Given the value of these images and the more than 11 million unique images in Wikipedia, an automated classification of these images according to a standard set of labels is vital for unleashing the potential of the visual content.
III. Method and Data
Method. The method used in this classification part of the semester project is to fine-tune a deep learning model pre-trained on ImageNet, and then compute several metrics to assess the quality of the model. Fine-tuning here means that the base model's last layer is replaced by a dense layer followed by an output layer whose size equals the number of classes. The weights of these two added layers are then trained, while all other weights are kept unchanged. To assess the quality of the model, the chosen metrics are precision, recall, and the area under the receiver operating characteristic curve (ROC AUC).
Data. The images come from the WIT dataset, and each image was assigned a subset of labels derived from the Commons categories of the articles in which the image appears. The finite set of 42 labels was generated by Salvi in a tree-search manner in the first part of this semester's project. See the class distribution figure for the number of images per label.
In the scope of this semester's project, the goal is to develop prototypes of the described classifier rather than a fully packaged solution. Thus, the data was pre-processed rather strictly to reduce the training time and the sources of possible errors. First, the image data was taken only from the English articles on Wikipedia, which left us with 3.9 million of the 11.5 million unique images in the WIT dataset. Next, only the 2 million images with a non-empty label set were kept. Finally, only the 1.6 million images in the .jpg and .jpeg formats were kept to avoid problems with the conversion of .png files. In this process, a few thousand further images were also removed: images that were missing from the files downloaded from the WIT dataset, and images whose names had an encoding unreadable to the operating system.
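The label and file-format filters can be sketched as a simple predicate over image records. This is only an illustration under stated assumptions: the record fields `file_name` and `labels`, the function name `keep_image`, and the case-insensitive extension check are hypothetical and not taken from the project's code.

```python
from pathlib import PurePosixPath

# Formats kept by the filtering step described above.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg"}


def keep_image(record):
    """Return True if a record survives the label and file-format filters."""
    has_labels = len(record["labels"]) > 0
    # Lower-casing the extension (so ".JPG" also passes) is an assumption.
    extension = PurePosixPath(record["file_name"]).suffix.lower()
    return has_labels and extension in ALLOWED_EXTENSIONS


# Hypothetical records; the field names are illustrative only.
records = [
    {"file_name": "Eiffel_Tower.JPG", "labels": ["Places"]},
    {"file_name": "Bar_chart.png", "labels": ["Science"]},
    {"file_name": "Unlabelled_photo.jpeg", "labels": []},
]
kept = [r["file_name"] for r in records if keep_image(r)]
print(kept)  # ['Eiffel_Tower.JPG']
```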
IV. Implementation
This section covers the different facets of the implementation in more detail.
Fine-tuning. The final network used in the experiments is built on an EfficientNet base model pre-trained on ImageNet. The base model's last layer is replaced by a dense layer of 128 neurons and an output layer with as many neurons as there are labels. During fine-tuning, the weights of the base model are left unchanged, so only the weights of the two added layers are updated. See the network figure for a scheme of the assembled network.
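The idea of training only the added layers can be illustrated with a small framework-free toy. It is a sketch under loud assumptions and not the project's implementation: the random matrix `base_w` merely stands in for the frozen pre-trained base model, the ReLU and sigmoid activations are assumed, the layer sizes are tiny for readability, and the class name `FineTunedNet` is hypothetical.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def affine(weights, bias, x):
    """Compute weights @ x + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class FineTunedNet:
    def __init__(self, input_dim, feature_dim, hidden_units, num_labels):
        # Frozen stand-in for the pre-trained base model (never updated).
        self.base_w = rand_matrix(feature_dim, input_dim)
        # Trainable head: a dense layer followed by the output layer.
        self.w1 = rand_matrix(hidden_units, feature_dim)
        self.b1 = [0.0] * hidden_units
        self.w2 = rand_matrix(num_labels, hidden_units)
        self.b2 = [0.0] * num_labels

    def forward(self, x):
        zero_bias = [0.0] * len(self.base_w)
        feats = [max(0.0, v) for v in affine(self.base_w, zero_bias, x)]
        pre = affine(self.w1, self.b1, feats)
        hidden = [max(0.0, v) for v in pre]            # ReLU (assumed)
        probs = [sigmoid(z) for z in affine(self.w2, self.b2, hidden)]
        return feats, pre, hidden, probs

    def train_step(self, x, y, lr=0.1):
        """One gradient step on binary cross-entropy; only the head changes."""
        feats, pre, hidden, probs = self.forward(x)
        dz = [p - t for p, t in zip(probs, y)]         # gradient w.r.t. the logits
        # Back-propagate through the output layer and the ReLU before updating w2.
        dh = [sum(self.w2[j][k] * dz[j] for j in range(len(dz))) if pre[k] > 0 else 0.0
              for k in range(len(hidden))]
        for j, g in enumerate(dz):
            self.w2[j] = [w - lr * g * h for w, h in zip(self.w2[j], hidden)]
            self.b2[j] -= lr * g
        for k, g in enumerate(dh):
            self.w1[k] = [w - lr * g * f for w, f in zip(self.w1[k], feats)]
            self.b1[k] -= lr * g
        return probs


net = FineTunedNet(input_dim=6, feature_dim=8, hidden_units=4, num_labels=3)
frozen_copy = [row[:] for row in net.base_w]
probs = net.train_step(x=[0.2, -0.1, 0.4, 0.0, 0.3, -0.5], y=[1, 0, 1])
print(net.base_w == frozen_copy)  # True: the base weights were not touched
```

In a deep learning framework the same effect is usually obtained by marking the base model as non-trainable, so that the optimizer only receives the parameters of the two added layers.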
Loss function. The loss function chosen for this multi-label image classification problem, where each image can be assigned several labels, is the binary cross-entropy. The idea is that each label is treated as a binary classifier, independent of the probabilities of the other labels. The formula of the loss function is: $$ L(\textbf{p}) = -\frac{1}{N} \sum_{i=1}^{N}\sum_{j=1}^{M}\left[y_{ij}\log(p_{ij}) + (1-y_{ij})\log(1-p_{ij})\right], $$ where $N$ is the number of images, $M$ is the number of labels, $y_{ij} \in \{0,1\}$ is the ground truth on whether the $i^{th}$ image is labeled with the $j^{th}$ label, and $p_{ij} \in [0,1]$ is the probability given by the model that the $i^{th}$ image is labeled with the $j^{th}$ label.
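As a worked illustration, the binary cross-entropy over $N$ images and $M$ labels can be computed as below. This is a sketch, not the project's code: the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions, and deep learning libraries may additionally average over the labels.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy summed over labels and averaged over images.

    y_true: list of N lists with M ground-truth values in {0, 1}.
    y_prob: list of N lists with M predicted probabilities in [0, 1].
    """
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            # The y term rewards probability on present labels; the (1 - y)
            # term penalises probability assigned to absent labels.
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# One image, two labels, an undecided model: the loss is 2 * ln(2).
print(round(binary_cross_entropy([[1, 0]], [[0.5, 0.5]]), 4))  # 1.3863
```

A confident correct prediction such as `[[0.9, 0.1]]` for the same ground truth gives a lower loss than the undecided one, while a confident wrong prediction gives a higher one.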
Training. Training ran for 15 epochs, with class weights used to compensate for the unbalanced class distribution. A decreasing learning rate was tested at an early stage of the project but brought no improvement, so the learning rate is kept constant throughout. All experiments used the same image data: 570 thousand images for training and 30 thousand images for evaluation. Each epoch took 30 minutes on average on a machine with 48 cores and 250 GB of RAM.
Experiments. To study the performance of the network for the given data and labels, different setups of the network and the labels were tested:
- number of classes: 10 or 20;
- base model: EfficientNetB0 or EfficientNetB2;
- label set: ORES labels or the custom labels generated by Salvi.
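The text does not say how the class weights were computed. One common heuristic, shown here purely as an assumption, weights each label inversely to its frequency so that rare labels contribute more to the loss. The function name and the label counts are hypothetical.

```python
def inverse_frequency_weights(label_counts):
    """Weight each label by total / (num_labels * count): rare labels weigh more.

    This "balanced" heuristic is an assumption for illustration; it is not
    confirmed to be the weighting used in the project.
    """
    total = sum(label_counts.values())
    num_labels = len(label_counts)
    return {label: total / (num_labels * count)
            for label, count in label_counts.items()}


# Made-up counts for three labels in an unbalanced training set.
weights = inverse_frequency_weights({"Places": 6000, "Nature": 3000, "Sports": 1000})
print({label: round(w, 2) for label, w in weights.items()})
# {'Places': 0.56, 'Nature': 1.11, 'Sports': 3.33}
```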
V. Evaluation
ORES vs. Custom labels. In this first experiment, we compare the separability of the ORES labels with that of our custom labels. To do that, the EfficientNetB0-based network was fine-tuned on data labeled with 10 ORES labels, and then on data labeled with 10 of our custom labels.
For the custom labels, the 10 labels with the most images were taken, while for ORES, 10 labels were hand-picked out of the top 20 labels. Note that in this hand-picking, we left out the Geography labels specific to a region (e.g. Geography.Regions.NorthernEurope and Geography.Regions.Asia) and kept the more general Geography.Geographical.
The table "Evaluation metrics when using ORES labels" shows the evaluation metrics after training the EfficientNetB0-based network on data labeled with ORES labels. The table "Evaluation metrics when using our custom labels" shows the same metrics for data labeled with our custom labels.
Evaluation metrics when using ORES labels
Label | Precision | Recall | ROC AUC |
---|---|---|---|
Media | 0.58 | $\frac{429}{1360}=$ 0.32 | 0.85 |
Music | 0.64 | $\frac{124}{614}=$ 0.20 | 0.86 |
Sports | \textbf{0.87} | $\frac{700}{1790}=$ 0.39 | 0.88 |
Visual arts | 0.68 | $\frac{1204}{3289}=$ 0.37 | 0.84 |
Geographical | 0.66 | $\frac{509}{2267}=$ 0.23 | 0.82 |
Military and warfare | 0.64 | $\frac{481}{1924}=$ 0.25 | 0.81 |
Society | 0.18 | $\frac{7}{877}=$ 0.01 | 0.66 |
Biology | 0.80 | $\frac{1138}{1939}=$ \textbf{0.59} | \textbf{0.93} |
S.T.E.M. | 0.81 | $\frac{2126}{4203}=$ 0.51 | 0.84 |
Space | 0.85 | $\frac{52}{254}=$ 0.21 | 0.83 |
Micro average | 0.74 | 0.37 | 0.87 |
Macro average | 0.67 | 0.31 | 0.83 |
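The micro and macro averages of the recall can be reproduced from the per-label fractions in the ORES table. The sketch below assumes that each fraction is the number of correctly predicted images over the number of images carrying that label in the evaluation set; the variable and function names are illustrative.

```python
# (correct predictions, images with the label), copied from the Recall column.
recall_counts = {
    "Media": (429, 1360),
    "Music": (124, 614),
    "Sports": (700, 1790),
    "Visual arts": (1204, 3289),
    "Geographical": (509, 2267),
    "Military and warfare": (481, 1924),
    "Society": (7, 877),
    "Biology": (1138, 1939),
    "S.T.E.M.": (2126, 4203),
    "Space": (52, 254),
}


def micro_recall(counts):
    """Pool all labels first, then divide: frequent labels dominate."""
    hits = sum(hit for hit, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return hits / total


def macro_recall(counts):
    """Average the per-label recalls: every label counts equally."""
    return sum(hit / n for hit, n in counts.values()) / len(counts)


print(round(micro_recall(recall_counts), 2))  # 0.37
print(round(macro_recall(recall_counts), 2))  # 0.31
```

Both values agree with the Micro average and Macro average rows of the table. The macro average is lower because weak but small labels such as Society pull it down as much as any large label would.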
Evaluation metrics when using our custom labels.
Label | Precision | Recall | ROC AUC |
---|---|---|---|
Culture | 0.64 | $\frac{263}{9355}=$ 0.03 | 0.62 |
Entertainment | 0.21 | $\frac{11}{795}=$ 0.01 | 0.72 |
History | 0.54 | $\frac{511}{7216}=$ 0.07 | 0.65 |
Nature | 0.53 | $\frac{1937}{5166}=$ 0.38 | 0.77 |
Objects | 0.16 | $\frac{34}{937}=$ 0.04 | 0.64 |
People | 0.60 | $\frac{35}{2042}=$ 0.02 | 0.78 |
Places | \textbf{0.66} | $\frac{5558}{13288}=$ \textbf{0.42} | 0.70 |
Politics | 0.29 | $\frac{158}{1074}=$ 0.15 | 0.76 |
Society | 0.52 | $\frac{71}{6555}=$ 0.01 | 0.65 |
Sports | 0.45 | $\frac{353}{1023}=$ 0.35 | \textbf{0.86} |
Micro average | 0.59 | 0.19 | 0.81 |
Macro average | 0.46 | 0.15 | 0.72 |
Evaluation metrics for custom labels, 20 labels, EfficientNetB0. 4.7M total parameters, 658K trainable parameters. Mean number of predicted labels per image: 0.11.
Label | Precision | Recall | ROC AUC |
---|---|---|---|
Animals | 0.08 | $\frac{42}{94}=$ 0.45 | \textbf{0.95} |
Biology | 0.03 | $\frac{3}{49}=$ 0.06 | 0.82 |
Culture | 0.50 | $\frac{1}{9355}=$ 0.00 | 0.57 |
Entertainment | 0.00 | $\frac{0}{795}=$ 0.00 | 0.70 |
Events | 0.10 | $\frac{5}{458}=$ 0.01 | 0.61 |
History | 1.00 | $\frac{1}{7216}=$ 0.00 | 0.53 |
Language | 0.00 | $\frac{0}{215}=$ 0.00 | 0.73 |
Literature | 0.00 | $\frac{0}{81}=$ 0.00 | 0.75 |
Music | 0.07 | $\frac{6}{85}=$ 0.07 | 0.76 |
Nature | 0.54 | $\frac{105}{5166}=$ 0.02 | 0.73 |
Objects | \textbf{1.00} | $\frac{1}{937}=$ 0.00 | 0.59 |
People | 0.44 | $\frac{8}{2042}=$ 0.00 | 0.76 |
Physics | 0.00 | $\frac{0}{35}=$ 0.00 | 0.64 |
Places | 0.71 | $\frac{1134}{13288}=$ 0.09 | 0.68 |
Plants | 0.40 | $\frac{177}{387}=$ \textbf{0.46} | \textbf{0.94} |
Politics | 0.37 | $\frac{38}{1074}=$ 0.04 | 0.73 |
Science | 0.00 | $\frac{0}{622}=$ 0.00 | 0.60 |
Society | 0.33 | $\frac{1}{6555}=$ 0.00 | 0.59 |
Sports | 0.48 | $\frac{131}{1023}=$ 0.13 | 0.83 |
Technology | 0.00 | $\frac{1}{675}=$ 0.00 | 0.56 |
Micro average | 0.49 | 0.03 | 0.86 |
Macro average | 0.30 | 0.04 | 0.70 |
Evaluation metrics for custom labels, 20 labels, EfficientNetB2. 8.5M total parameters, 723K trainable parameters. Mean number of predicted labels per image: 0.26.
Label | Precision | Recall | ROC AUC |
---|---|---|---|
Animals | 0.09 | $\frac{49}{94}=$ 0.52 | \textbf{0.96} |
Biology | 0.29 | $\frac{2}{49}=$ 0.04 | 0.80 |
Culture | 0.00 | $\frac{0}{9355}=$ 0.00 | 0.55 |
Entertainment | 0.00 | $\frac{0}{795}=$ 0.00 | 0.71 |
Events | 0.06 | $\frac{4}{458}=$ 0.01 | 0.68 |
History | 0.00 | $\frac{0}{7216}=$ 0.00 | 0.54 |
Language | 0.00 | $\frac{0}{215}=$ 0.00 | 0.74 |
Literature | 0.00 | $\frac{0}{81}=$ 0.00 | 0.80 |
Music | 0.00 | $\frac{0}{85}=$ 0.00 | 0.79 |
Nature | 0.49 | $\frac{210}{5166}=$ 0.04 | 0.74 |
Objects | 0.00 | $\frac{0}{937}=$ 0.00 | 0.58 |
People | 0.10 | $\frac{4}{2042}=$ 0.00 | 0.75 |
Physics | 0.00 | $\frac{0}{35}=$ 0.00 | 0.63 |
Places | \textbf{0.67} | $\frac{3911}{13288}=$ 0.29 | 0.68 |
Plants | 0.40 | $\frac{215}{387}=$ \textbf{0.56} | \textbf{0.95} |
Politics | 0.00 | $\frac{0}{1074}=$ 0.00 | 0.75 |
Science | 0.00 | $\frac{0}{622}=$ 0.00 | 0.55 |
Society | 0.00 | $\frac{0}{6555}=$ 0.00 | 0.60 |
Sports | 0.53 | $\frac{143}{1023}=$ 0.14 | 0.85 |
Technology | 0.13 | $\frac{1}{675}=$ 0.00 | 0.62 |
Micro average | 0.59 | 0.19 | 0.86 |
Macro average | 0.14 | 0.15 | 0.71 |
VI. Discussion
ORES vs. Custom labels. As can be seen by comparing the metrics in the 10-label tables for custom labels and for ORES labels (see also the ROC curve figures for the 10 custom labels and the 10 ORES labels), the network performs substantially better with ORES-labeled data. The difference is most remarkable in the average recall: the network trained and evaluated on the custom-labeled data yields lower prediction values and is thus less confident. The reason is believed to be the quality of our method for assigning the custom labels to the images.
EfficientNetB0 vs EfficientNetB2. Comparing the micro-average recalls in the two 20-label tables, we see that the recall of the EfficientNetB2-based model is greater by a factor of about 6 (0.19 vs 0.03). This means that the EfficientNetB2-based model yields higher-valued predictions, which surpass the threshold of 0.5 more often. This is also reflected in the greater mean number of predicted labels per image (0.26 vs 0.11). Notice, though, that the average precisions have closer values and that the ROC AUCs are very close. This means that the EfficientNetB0-based model only needs a smaller threshold to achieve an average recall similar to that of the EfficientNetB2-based one.
About the labels. The labels with the greatest number of assigned images were expected to have the best performance metrics given the heavily unbalanced dataset, since the network can learn more varied features for these labels. Notice, though, that the labels with the best performance metrics in the 20-label case are Plants and Animals. This is believed to be caused by the pre-training: ImageNet has a substantial number of plant and animal classes. This trend is also observed in the network that uses ORES labels: the Biology label has the greatest ROC AUC.
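The relation between prediction values, threshold, recall, and ROC AUC can be made concrete with a toy example. The scores below are made up and do not come from either model: `timid` ranks the images exactly like `confident` but with halved values, which leaves the ROC AUC unchanged while the recall at a threshold of 0.5 collapses, and lowering the threshold restores it.

```python
def recall_at(scores, truths, threshold):
    """Share of truly labelled images whose score surpasses the threshold."""
    positives = [s for s, t in zip(scores, truths) if t == 1]
    return sum(s > threshold for s in positives) / len(positives)


def roc_auc(scores, truths):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, truths) if t == 1]
    neg = [s for s, t in zip(scores, truths) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


truths = [1, 1, 1, 0, 0, 0]
confident = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]   # illustrative scores
timid = [s / 2 for s in confident]           # same ranking, lower values

print(round(roc_auc(confident, truths), 2), round(roc_auc(timid, truths), 2))  # 0.89 0.89
print(round(recall_at(confident, truths, 0.5), 2))  # 0.67
print(round(recall_at(timid, truths, 0.5), 2))      # 0.0
print(round(recall_at(timid, truths, 0.25), 2))     # 0.67
```

Because ROC AUC depends only on the ranking of the scores, two models with similar AUCs can still differ widely in recall at a fixed threshold.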
VII. Future work
There are several facets of the image classification part of the project to be further explored.
To begin with, the model could be extended to be multi-modal (image and text), so that the text related to the image is also used as model input. Examples of usable text input are the image name and the image caption. C-Tran [13] is one model that could be tried. A further study of which kind of textual data related to an image in the WIT dataset (e.g. image name, caption, attribute name) yields the best results would be insightful.
Next, studying the impact of training all network parameters, including those of the base model, would be interesting to discover how this impacts the evaluation metrics. As mentioned before, only the added final two layers' parameters were updated.
Furthermore, relaxing the image filtering so that .png files can also be handled is a necessary extension in order to use as much data as possible.
1. Rama D, Piccardi T, Redi M, Schifanella R (2022). "A Large Scale Study of Reader Interactions with Images on Wikipedia". EPJ Data Science.
2. Rama D, Piccardi T, Redi M, Schifanella R (2021). "A Large Scale Study of Reader Interactions with Images on Wikipedia". CoRR.
3. Redi M (2020). "Prototypes of Image Classifiers Trained on Commons Categories". Wikimedia.
4. Caesar H, Uijlings JRR, Ferrari V (2016). "COCO-Stuff: Thing and Stuff Classes in Context". CoRR.
5. Huang S (2020). "An Image Classification Tool of Wikimedia Commons".
6. Halfaker A, Geiger RS (2019). "ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia". CoRR.
7. Srinivasan K, Raman K, Chen J, Bendersky M, Najork M (2021). "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning". CoRR.
8. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009). "ImageNet: A large-scale hierarchical image database". IEEE Conference on Computer Vision and Pattern Recognition.
9. Tan M, Le QV (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks".
10. Huh M, Agrawal P, Efros AA (2016). "What makes ImageNet good for transfer learning?". CoRR.
11. Guo D, Zhang S, Wright KL, McTigue EM (2020). "Do You Get the Picture? A Meta-Analysis of the Effect of Graphics on Reading Comprehension". AERA Open.
12. Heald P, Erickson K, Kretschmer M (2015). "The Valuation of Unprotected Works: A Case Study of Public Domain Photographs on Wikipedia". SSRN Electronic Journal.
13. Lanchantin J, Wang T, Ordonez V, Qi Y (2020). "General Multi-label Image Classification with Transformers". CoRR.