Research:Expanding Wikipedia articles across languages/Intra language approach
Today Wikipedia contains more than 40 million articles, is actively edited in about 160 languages, and its article pages are viewed 6000 times per second. Wikipedia is the platform for both access and sharing encyclopedic knowledge. Despite its massive success, the encyclopedia is incomplete and its current coverage is skewed. To address these gaps of knowledge and coverage imbalances (and perhaps for other reasons), Wikipedia needs to not only maintain its current editors but also help new editors join the project. Onboarding new editors, however, requires breaking down tasks for them (either by machines or humans who help such editors learn to become prolific Wikipedia editors).
Up until recently, the task of breaking down Wikipedia contributions and creating a template for contributions has been done for the most part manually. Some editathon organizers report that they do template extractions to help new editors learn, for example, what the structure of a biography on Wikipedia looks like. Sometimes, extracting templates manually means going across languages, especially in cases where the language the editors being onboarded in is a small Wikipedia language (in terms of the number of articles already available in that language). Manual extraction of templates is time consuming, tedious, and a process that can be heavily automatized, to help reduce the workload for editathon organizers and help more newcomers to be onboarded.
This research aims at designing systems that can help Wikipedia newcomers identify missing content from already existing Wikipedia articles and gain insights on basic facts or statistics about such missing components, using both the inter-language and intra-language information already available in Wikipedia about the structure of the articles.
A well-expanded Wikipedia article contains sections that help the reader navigate more smoothly through the different parts of the article. Specific Wikipedia topics within a given Wikipedia language can have a common set of sections that reoccurs within the same topics across multiple articles as well as more unique and uncommon sections. For example, it is generally understood that biography articles on English Wikipedia contain Early Life and Career section while many may not contain Presidency section. Learning what sections belong to what type of articles is a hard task which requires substantial experience editing Wikipedia in a given language. This creates a heavy cost for onboarding newcomers to Wikipedia (for the newcomers themselves or for the editathon organizers, especially in smaller Wikipedia languages where there is not a lot of examples to look into and learn from).
In this part of our research we attempt to address the following questions: can we use the current structure of Wikipedia in a given language to learn what sections can be added to a Wikipedia article? This is what we call "Intra-language Recommendations". And, if the Wikipedia language is small for automatic learning of structure, is there reliable signal in other languages we can learn from to extract structures (sets of sections) to be recommended to an article in a given language? This is what we call "Inter-language Recommendations". Below, we explain the two approaches.
Intuitively, articles about similar topics would share a common set of sections. A cursory look though, for instance when comparing the sections' titles of the English articles on World War II and the Vietnam War, shows that this is often not the case.
In this research projects, we explored different recommendation methods. Figure 1 summarizes the different approach and shows a high-level overview of the system for generating ranked lists of section recommendations (right) for a given input Wikipedia article (left). The different methods are represented as a path of arrows from left to right. At the broadest level, we consider two paradigms: article-based recommendation (top, shaded green) and category-based recommendation (bottom, shaded blue). The figure exemplifies how a section ranking is produced for the input article "Stanford, California" (the town, not the university).
This approach works directly with article features. Intuitively, articles about similar concepts should have a similar section structure. This intuition directly translates into a strategy for recommending sections for articles: given an article A for which sections are to be recommended, detect articles similar to A and suggest sections that appear in those similar articles but that do not appear in A yet.
Similar articles are discovered in two ways:
- A topic modeling method (LDA) leverages articles that are similar to the input article in terms of textual content.
- A collaborative-filtering method (ALS) which leverages articles that are similar to the input article with respect to already-present sections, rather than mere textual content.
For the first case (topic modeling), for each article in the corpus, we map its list of sections over 200 topics, weighted by the probability that the given article belongs to a certain topic.
The output of this procedure is a set of ranked lists L (one per topic) containing the most-frequently occurring sections. We can then generate section recommendations for an article by combining its probability distribution over the topic space (i.e., how representative is a certain topic for the article) and the rankings from L.
In the second case (collaborative-filtering) aims to recommend sections from articles that are similar to the input article with respect to the sections that already exist in the body.
This is a typical collaborative filtering setup. Usually, CF is used to recommend items to users; given a user to make recommendations for, it finds similar users and suggests items those users have liked. In our setting, articles correspond to users, sections to items, and liking an item to the presence of a section.
This approach models the topic of the articles by exploiting the category network that provides already a natural way to organize the different subjects. Since the current Wikipedia categories graph is not well structured, before applying our methods, we had to clean the network. Ideally, links in this network would represent IS-A relationships and the network would be taxonomical, i.e., a tree-structured hierarchy grouping of articles into a coherent ontology by subject. Unfortunately, this is not the case in practice because like in the case of the article this structure is user-contributed and different people interpret the parent relation between category in different ways (IS-A vs. generic topic similarity).
Cleaning the category network The Wikipedia Category network is not purely ontological -- i.e., ontological categories connected by IS-A relationships are interspersed with categories which the only aim is to cluster articles by different criteria. For this reason, we devised a strategy to purify the network while still retaining the vast majority of its categories. This allows us to apply a transitive closure on the network, thus assigning articles not only to the specific categories they are labeled with but also to their ancestors. For example, in our approach we want to consider the article about Alan Turing not only part of its immediate category "English inventors", but also included in "British People", "People_by_continent" and so on.
Given the high noise present in the categories graph, to extract a purified network, we proceed in multiple steps:
- We removed all the cycles to transform the graph into a directed acyclic graph (DAG) and allow to extract for each category, a set of non-cyclic paths with the parent/sub-category relation. We used Algorithm GR, a heuristic implementation of solving the Feedback Arc Set (FAS) problem. Through this process, we removed a negligible number of edges (0.1%) from English Wikipedia's linked category graph.
- We defined the concept of purity as a score that describes if a category contains articles of the same type. In this context, we assigned the class of the articles based on the RDF property 'dbo:type' defined in DBpedia. To avoid assigning too specific types, we followed the transitive parent relation, and we used as labels the top-level classification for a total of 55 different classes. This mapping allows us to generate the types frequency distribution for each category that represents the variety of articles assigned to the concept.
- We pruned the network by keeping only the categories that pass the "purity check". The algorithm proceeds in a recursive bottom-up approach and from each pure node (category), it propagates to all the parents the set of articles. A category that does not pass the purity check is discarded and is removed from the graph. To measure the homogeneity of the types frequencies distribution of one category, we used the Gini coefficient, and we generated multiple networks using different thresholds of the Gini coefficient.
- To select the best Gini coefficient threshold, we followed the intuition that if a category is not pure (contains heterogeneous articles), the relation IS-A with its articles is not valid. Moreover, since the goal is to have an ontological structure, this property must also be valid for its the transitive closure.
We manually annotated 800 samples as a positive or negative example of IS-A relation by taking random pairs of articles and categories (with parent transitive closure). This dataset was used to compute precision and recall on the network pruned with different Gini coefficient and to select the best threshold value.
Computing the section relevance
Once the category network is cleaner, we generate the recommendations with 2 methods:
- Using category-section counts
- Collaborative filtering
For the first case, we assume that given a category C as the entry point for generating recommendations, a simple and effective way to measure the relevance of a section S is to count the number of times S appears in all the articles in C.
Under the assumption that the category network provides a clear ontological structure, we include all the articles in C and its corresponding subcategories and count the frequency of sections appearance in the categories' set generated by this transitive closure (all the articles assigned to categories in the subtree). This approach provides a ranking of the section given a category.
Since the category sizes may differ significantly, we normalize the ranking of the section given a category by the number of articles used in the counting. This step sets the ranking score of the sections between 0 and 1. We interpret such scores as P(S|C), which is the probability of observing a section S in category C.
The second approach is based on collaborative filtering using the popular Alternating Least Squares Method (ALS) algorithm. The rationale behind this strategy is that the previous approach does not allow to transfer information between similar categories that reside in two different subtrees. For instance, while Swiss scientists and American scientists are categories at the same level of the hierarchy and may share common sections, the counting based approach cannot extract information from these two disjoint subtrees. Therefore, we are interested in grouping similar categories to learn about missing sections by transferring information between them. This is especially important for the smaller categories, such as "Nauruan scientists", that may present patterns similar to other categories but do not have enough content to contribute to the generation of recommendations alone. As described above, given one category we represented the "rating" of a section with the probability of observing the title across all the articles in the categories subtree.
Generating the recommendations Since one article can be part of multiple categories, and its topic can be described by taking into consideration all of them, the main idea is to merge the recommendations produced by each category to find a set of relevant sections for the article. A simple way to merge the recommendations while promoting the most representative sections, is to sum their scores when they are present in more than one list. As effective as this strategy could be, it does not take into account other interesting features of the categories, e.g., number of articles assigned, Gini coefficient, etc. Intuitively, such features capture how diverse are the recommendations generated by a category, i.e., the larger is a category and the lower is its Gini coefficient, the more heterogeneous will be the recommended sections.
This problem setting is very similar to the one described in the vast body of literature on Learning-to-Rank (L2R). L2R was originally conceived to improve the performance of information retrieval systems, but its supervised machine learning models can be applied directly also to our scenario.
We developed a simple, interpretable model with features engineered from the size of a category and its Gini coefficient. After a round of feature selection, and a gradient descent optimization, we obtained a model capable of effectively merging the recommendations coming from multiple categories.
This will allow us to build a novel article editor, where section names will be recommended both from new categories that the user will use to label the article and already existing ones.
We evaluate the section recommendation system using both offline and online mechanisms.
Dataset We test our methods using the English Wikipedia dump released on September 1st, 2017, which encompasses 5.9M articles and 1.6M different sections. Parsing the dump, we also observed that 37% of all pages are flagged as stubs.
Taking a closer look at the data, we notice that the most frequent sections include titles that are content-independent, e.g., "See Also", "References", or "Links". These sections add useful metadata to the article but do not contribute to the expansion of the content itself. If we discard a list of 14 section titles of this kind, over 54% of all articles have none or only one section.
In the dataset, we can also observe the effect of a lack of indication provided to editors by the editing tool: more than 1.3M section titles appear only one single time because often the editors chose too specific titles.
For instance, the one time the section "Monuments, images and cost" appeared, it would have been more consistent with the rest of Wikipedia to simply use "Monuments" instead. Before evaluating our methods, we clean the dataset to remove irrelevant sections that we do not want to include in our training phases. First, we remove all the unique sections, which do not represent a relevant signal for the recommendation. Then, we remove a list of 14 sections (References, External links, See also, Notes, Further reading, Bibliography, Sources, Footnotes, Notes and references, References and notes, External sources, Links, References and sources, External Links), which we manually selected from among the most common titles. The rationale behind this is that these sections are generic enough to be recommended by default for nearly every article, so they should not occupy spots in the short and concise, context-specific rankings we aim to provide.
In this case, we compute precision and recall when the outcome is the section recommended by the algorithm and the ground truth is whether the section already exists in a given Wikipedia article.
Precision and recall scores will create a lower bound for the performance of the recommendation system. This is because: i) some of the recommended sections might be indeed relevant for the article but they have not been added by editors to the article, yet , ii) we currently do exact string matching between the section recommended and the sections listed in the article. As a result, if the section doesn't perfectly match a section in the table, we capture it as to-be-recommended.
The Precision/Recall plot summarises the performance of the different methods at different k (number of items recommended). The plots include upper-bounds because when the recommendation has more elements than the number of available sections in the testing article, the maximum precision/recall must be adapted accordingly. For example with k=10 and a testing article with 3 sections, the precision/recall of a perfect system would be 0.33. The upper-bound curves are computed by averaging the maximum precision/recall for each article of the testing set.
Article centric - Topic Modeling Figure (a) summarizes the performance achieved by the topic-modeling approach. Precision and recall are plotted in blue, the theoretical upper bounds in red. The method achieves a maximum recall (at k=20) of 30%. Precision is considerably lower, at 20% for the single top recommendation, and it decreases to less than 5% for k=20. Inspecting the recommended sections manually, we found that the major shortcoming of this method is that it can only capture textual, not taxonomic, similarity. A typical example of how the topic-based method fails is the input article "Stanford, California", where we get similar recommendations as "Stanford University". These two articles are very similar in terms of words contained, and therefore in terms of the topics extracted via LDA. In fact, however, the two articles call for widely different sections, one being a town, the other, a university.
Article centric - Collaborative filtering This method makes hard to generate recommendations for articles that were not part of the training set. For this reason, we designed a custom evaluation experiment that is based on the reconstruction of half of the sections for the articles of the testing set.
As described in the previous section, we trained the model by factorizing the resulting rating matrix with ALS. We treat the problem as an explicit feedback setup in which the article gave a favorable rating to its sections. Although the experiment is intentionally designed to favor the accuracy of the results (i.e., leaving half of the sections in the training set, ignoring articles with only 1 section - for having a strong baseline), the performance of the recommendation is unsatisfactory.
In particular, the precision is always below 0.2%, and the recall with 20 titles recommended does not reach 1.5%. Despite these values are significantly better than the random case, where the precision is below 0.002% for all k, and the recall for k=20 is 0.008%, this method showed to be not suitable to be used in a real-world scenario.
Using the categories - Section counts Differently from the previous experiment, using the categories as an entry point, allows us to design a different test and evaluate the recommendations on all the sections' titles. Indeed, the performance metrics of this experiment are based on the reconstruction of all the sections present in the articles of the testing set. In our evaluation, we generate the recommendations for an article by merging the relevant sections of it categories both with simple sum and with a weighted sum based on Learning To Rank. The figure shows that with this method the precision for the unweighted sum is above 57% for the first element recommended and with 20 recommendations, the method reaches a recall of 80%.
Using the categories - Collaborative filtering To model the relevance of a section in a category with collaborative filtering is important to choose an adequate rating representation. We found that the best setup to training the model is to use the probabilities generated by the counting base approach with a limit to 100 for each category. Like in the case of article-based recommendations, we factorize the matrix with ALS, and we modeled the CF ratings (relevance of a section) as an implicit feedback. The Precision/Recall figure shows that collaborative filtering on the category setup performs much better than the article-based approach, but it does outperform the recommendations generated by the counting method.
We intend to run two online evaluations as described below.
Experienced editor evaluation
The best judgment of whether a section belongs to an article is done by experienced Wikipedia editors. We evaluted the quality of the recommended results by asking experienced editors to label recommendations for a given category as relevant or not.
The recommendations for this step of the evaluation are generated from the two methods that we are comparing: our custom probability-based approach and standard Collaborative Filtering (ALS).
A group of 11 experienced editors evaluated 20 different section recommendations on a total of 88 Wikipedia articles. 89% of the sections that our system recommends as the best candidate for the given article were considered as relevant (i.e., precision@1). Similarly, the evaluators found that the quality of the recommendations gracefully degrades (i.e., precision@10 == 77%).
General editor evaluation
While experienced editors' assessment of the quality of the recommendations is key to our evaluation, it is important to learn about the impact of such recommendations in practice. The set up for this evaluation is not finalized but it can take one or more for the following forms:
An extension to GapFindercan be built where the editors can receive recommendations on which (random, high pageview) articles they can expand, and they can also search for articles of their choice via the search feature. In such a setting, once the editor clicks on the article, a recommended set of section(s) will be shown to the editor and the editor can choose to work on one or more of them.
There are alternative approaches for testing this recommendation system. Part of the answer as which approach we choose is also the language communities preferences to specific approaches versus the others.
Section recommendation in practice
This is not the first time that the need for section recommendation is recognized. For example, Ma Commune, does section recommendation. That initiate also goes a bit deeper and offers further insight on what section expansion can entail, for example, by offering statistics related to the average length of the section if it were to be created. WikiDaheimalso experimented with hardcoding History and Politics section and encouraging contributions towards these missing but important sections in articles about geographical locations.
The figures below shows the results on English Wikipedia as the target language. As expected, given the amount of unique elements, the precision and recall are very low. However, when using English as source for targeting languages with less unique items, such as Portuguese, we obtain recall and precision comparable with the intra-lingual recommendations, suggesting that the inter-language approach has big potential for languages with small amount of unique items.
Data and Code
We released all the datasets and the code:
- SIGIR Paper repository: Structuring Wikipedia Articles with Section Recommendations
- [Code]: A framework to clean the Wikipedia category network
- [Code]: Implementation of "A fast and effective heuristic for the feedback arc set problem"
- [Dataset]: Wikipedia Category network Sept 17. Articles types (DBpedia). Output if the pruning algorithm.
- [Dataset]: Generated recommendations
See Data for data related topics.
- Eades, Peter; Lin, Xuemin; Smyth, W.F. "A fast and effective heuristic for the feedback arc set problem" (PDF). Information Processing Letters 47 (6): 319–323. doi:10.1016/0020-0190(93)90079-o.