Grants:TPS/User:とある白い猫/Presenting at PAN Lab of CLEF 2011/Report

From Meta, a Wikimedia project coordination wiki

Cross Language Evaluation Forum 2011 (CLEF 2011), Amsterdam


Since their creation, Wikipedia and other Wikimedia projects have relied on volunteers to handle all tasks, including mundane ones. Today, with improvements in artificial intelligence and information retrieval, we are able to delegate such mundane tasks to machines to a certain degree.

This report gives an overview of the conferences, labs and other activities related to the CLEF 2011 conference held in Amsterdam on 19-22 September. I attended primarily to present my own experimental automated tool (VandalSense 2.0), but also to exchange ideas on newer approaches to automated vandalism detection. Secondarily, I reviewed research activities unrelated to automated vandalism detection that could have an impact on Wikimedia-related projects. Lab organizers are on the lookout for new ideas and challenges for the conferences of the coming years, and offering them a task that would benefit Wikimedia projects would create a symbiotic relationship benefiting all parties involved. I will explain the current research I observed and give examples of how it could be of use to Foundation-sponsored projects.

A significant majority of researchers, as well as keynote speakers, stated that they had made use of Wikimedia projects as a source of raw data for research at some point, if not for their current topic of research. Such research can produce innovative tools that handle mundane tasks automatically or semi-automatically, leaving human editors more time to work on other tasks.

PAN Lab

Structure of PAN

The PAN lab had several parts, all of which involved processing large amounts of text. The tasks covered plagiarism detection, author identification, and vandalism detection.

The first task was plagiarism detection, which had 30 participants from 21 countries, although only 11 submissions were made. For the task, 61,000 cases of plagiarism were hidden in around 27,000 documents in 3 languages.[1] The task was sponsored by Yahoo! Research, which offered a 500 Euro prize to the highest-scoring team: the team from the University of Chile, Chile (Gabriel Oberreuter, Gaston L'Huillier, Sebastián A. Ríos, and Juan D. Velásquez). In essence, this task could also be viewed as a copyright violation detection task, even though this is not currently the intention of the organizers. The task could be altered to assist in the automated detection of copyrighted content on Wikipedia, a problem with legal implications that has plagued the project since its early days.

The task was divided into two sub-tasks: External Detection and Intrinsic Detection. In external detection, given a suspicious document and a set of potential source documents, the task is to find all the plagiarized passages in the suspicious document and their corresponding source passages in the source documents. Such a tool could help Wikimedia projects detect copyright violations, and also identify material copied from known freely licensed sources. Intrinsic detection, on the other hand, was intended to detect plagiarism based on clues in the suspicious document itself. This can be used to identify fragments that were copied into a document without having the source documents available. In essence, a user's contributions (such as talk page posts) could be analyzed to determine how the person writes, and on that basis it would be possible to identify which main space edits fall outside the user's stylistic norms. This kind of analysis is already in use to detect academic plagiarism.
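To make the idea of external detection more concrete, here is a minimal sketch in Python. It assumes word 4-gram overlap as the matching criterion, which is a common baseline and not necessarily what any PAN participant used. The function names (`word_ngrams`, `shared_ngrams`) and the toy texts are hypothetical and made up for this illustration.

```python
import re
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 4) -> List[NGram]:
    """Lowercase the text, split it into words, and return all word n-grams."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_ngrams(suspicious: str, sources: Dict[str, str], n: int = 4) -> Dict[str, Set[NGram]]:
    """Map each source name to the n-grams it shares with the suspicious text.

    Sources that share no n-gram with the suspicious text are left out.
    """
    suspicious_set = set(word_ngrams(suspicious, n))
    matches: Dict[str, Set[NGram]] = {}
    for name, text in sources.items():
        common = suspicious_set & set(word_ngrams(text, n))
        if common:
            matches[name] = common
    return matches


# Hypothetical toy data, not taken from the PAN corpus.
sources = {
    "source_a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "source_b": "Vandalism detection relies on features of each edit and its author.",
}
suspicious = "Yesterday the quick brown fox jumps over the fence, or so I was told."

for name, common in shared_ngrams(suspicious, sources).items():
    print(name, sorted(" ".join(gram) for gram in common))
```

A real system would also have to cope with paraphrased or translated passages and would merge neighbouring matches into passage boundaries; exact n-gram overlap only catches verbatim copying.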

The second task was author identification, which had 31 participants from 23 countries, although only 8 submissions were made. The corpus provided 12,000 documents from 118 authors in a single language. The winning team came from Universidad Autónoma de Nuevo León, Mexico, and Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico (Hugo Jair Escalante, Manuel Montes). This task could be used to identify sockpuppets in long-term abuse cases as well as to counter lobbying efforts on Wikipedia. The techniques used are similar to those applied in the Enron case, where emails were used as evidence to identify who was responsible for which decision, which had a lasting impact on the entire trial.

The last task was automated vandalism detection, which had 18 participants from 14 countries, although only 3 submissions were made. The winner, the University of Pennsylvania team (Andrew West and Insup Lee), managed to score better than the best results of the previous year even though this year's task was “more difficult”. Compared with last year, participation in this task diminished significantly, as the lab's vandalism track remained obscure this year. Last year the lab was mentioned on Slashdot, which boosted participation.[2] I suggested restructuring the vandalism detection track so that its goal is to identify more specific cases of vandalism rather than vandalism in general. I also suggested that attention be given to building a bridge between vandalism detection and author identification.

One observation I made was that greater interest was shown in the plagiarism task because of the 500 Euro prize offered by Yahoo! Research. Researchers have far too many labs to consider, as they are all equally challenging. Even though the amount is very small, it is enough to motivate researchers to pick a specific topic.

ImageCLEF

As its name implies, ImageCLEF deals with analyzing text and/or images to classify images. The lab was divided into four sub-tasks: Medical Image Retrieval, Photo Annotation, Wikipedia Image Retrieval, and Plant Identification. The most intriguing part is the multilingual nature of the text mining phase, which makes the algorithms just as useful for non-English languages and suits the multilingual mission of the Wikimedia projects.

The Medical Image Retrieval task focused on identifying what kind of medical image is at hand (an X-ray, a CT scan, an ultrasound, etc.) and then identifying various features in it. In essence, the features determined this way could lead to early identification of various illnesses from routine check-ups, which may even save lives. 55 groups registered, of which 17 groups from 9 countries submitted results. The winner was DEMIR/Dokuz Eylul University, Turkey (Tolga Berber). While the medical task may have little direct benefit to Wikimedia, the algorithms developed could be incorporated for automated identification: the task involves using images from various medical diagnostic instruments and the related reports to learn features, which can then be used to identify other, unrelated images.

The Photo Annotation task attempted to identify what an image conveys, that is, its topic or emotion. 48 groups registered, of which 18 groups from 11 countries submitted runs. Results were similar in terms of accuracy and quantity, leaving no clear winner. What is impressive about this task is the identification of the emotion an image conveys, which is generally viewed as a human-only domain. This could create an entirely new type of categorization based on “emotion” on top of “topic”.

In the Wikipedia Image Retrieval task, Commons images were used to train and classify. 45 groups registered, of which 11 groups from 9 countries submitted results. Researchers were able to identify objects within images in a more general manner. This could be used to categorize images on Commons semi-automatically during upload, with category suggestions offered to the uploader. Furthermore, the same algorithms can be used to identify copyrighted images that are frequently uploaded to Commons by well-meaning but under-informed users. Examples include the Eiffel Tower at night, sculptures (which are not covered by Freedom of Panorama under US copyright law), and screenshots of movies. In fact, after I mentioned that deleted images are not really deleted, the community was very interested in the deleted content of Commons, which in essence is a collection on which machines can be trained so that such images are afterwards detected automatically. One researcher referred to the deleted content as a “gold mine”.

In the Plant Identification task, botanic images (pictures of leaves) were used to identify which plant species they belong to. This part of the ImageCLEF lab had 40 registered groups, of which 8 teams from 7 countries submitted runs. Identification was based on the features of the leaves. This task is new to ImageCLEF and more research will be put into it. What is most remarkable about this task is that it can be used to broadly categorize existing botanic images on Commons based on the characteristics of leaves, as well as on metadata such as GPS coordinates, to narrow down the species. The image can then be re-categorized by human botanists into the relevant subcategory. Species identification in botany is very difficult, as leaves, for example, look nearly identical to non-experts. Furthermore, botanists specialize in one sub-field of botany, dealing with plants in a specific geography or climate, and would not be able to identify certain species without considerable effort. Wikispecies would also benefit from this kind of research, as the Commons repository would serve as an indispensable tool for botanists worldwide.

While not currently proposed, identifying and tagging images automatically for the image filter (already approved by the Board of Trustees and by the community through the referendum) could be an additional task for ImageCLEF. We would presumably not run into point of view (POV) issues with machine-generated tagging. Humans would of course still be able to tag or untag on top of such automatic tagging.


Wikipedia’s infoboxes[3] were used to disambiguate which name refers to which person in a cross-lingual manner (21 languages). Quite often people have identical first and last names. This could be a great asset in identifying which person a passage that links to a disambiguation page refers to. In turn, it could be used to find such links automatically and disambiguate them semi-automatically across multiple language editions, all from something as “insignificant” as infoboxes. Furthermore, such research would help Wikimedia projects be more Web 2.0 compatible.

A new addition to CLEF was MusicCLEF, whose pilot attempted to identify the authorship of music. OGG uploads to Commons could perhaps be monitored for copyrighted content using tools created by this lab.

Conclusions and Suggestions

It is my belief that, with little effort, CLEF could become an indispensable asset for Wikimedia Foundation related projects. Researchers working for CLEF already use Wikimedia projects. The PAN and ImageCLEF labs in particular could help with issues that wikis face, such as automated identification of copyrighted material (text and images), automated tagging of images for the image filter already approved by the Board of Trustees and by the community through the referendum, and semi-automated categorization of images on Commons. This in turn would leave human editors more time for other tasks.

One key thing to note is the international makeup of the conference. With 174 registered participants and 52 students, in other words 226 people from 29 countries on 5 continents, CLEF draws on scientists worldwide even though it is known as more of a European conference. Unlike its more business-oriented counterparts, CLEF is more research-oriented, which makes its goals compatible with non-profit projects and organizations.

The Foundation had practically no presence at the conference, even though Foundation-run projects dominated discussions in practically all of the tracks. For the next CLEF, the Foundation could provide moral support, such as mentioning the conference in the Foundation newsletter to draw more attention to the labs relevant to Foundation-sponsored projects. Content from the Foundation newsletter would probably end up on various tech and scientific portals, generating even more attention. Furthermore, researchers are often not experienced Wikimedians, so they do not realize the potential of tools that Wikimedians know and take for granted. This is where Wikimedia Laboratories could establish a presence, perhaps by offering suggestions or providing means for new and existing challenges alike. Reporters from Wikinews could also be asked to interview winners as well as other participants. CLEF 2012 will be held in Rome, so a Wikimedian living in or close to Rome would be a logical choice. This would boost publicity for the Wikimedia-related tasks at the next CLEF conference, which in turn would put better-developed tools at the disposal of Foundation projects.