Grants:Project/Finding References and Sources for Wikidata
What is the problem you're trying to solve?
Explain the problem that you are trying to solve with this project. What is the issue you want to address? You can update and add to this later.
As Wikidata is used more and more, ensuring that its content is of high quality becomes increasingly important. Trust in, and evidence of, the quality of Wikidata is crucial to motivate continued participation from the community and to encourage other parties to reuse the knowledge base in new areas. And yet our understanding to date of how good the quality of Wikidata is, or indeed of what this notion of quality looks like, is very limited. While there is research to suggest that Wikidata outperforms other systems such as DBpedia along a number of dimensions, inconsistencies have also been highlighted, which cast doubt upon its overall usefulness.
One way to improve confidence in the value of Wikidata would be to make sure that the data it includes comes from reliable, reputable sources - the quality of these (primary) sources would then act as proxies for the quality of Wikidata statements quoting them. According to Wikidata's own help pages:
"The majority of statements on Wikidata should be verifiable insofar as they are supported by referenceable sources of information such as a book, scientific publication, or newspaper article." (https://www.wikidata.org/wiki/Help:Sources)
However, according to the bi-weekly snapshots of Wikidata, 50% of all statements were unreferenced as of July 2016. While there is a certain portion of statements for which a reference is not required, this figure does give an indication of how many Wikidata statements still lack a reference. Furthermore, it is estimated from https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel-references?panelId=9&fullscreen&from=1468404365106&to=1476180365106 that around 34% of the references that are present are not considered acceptable under Wikidata's own policy, as they refer to other Wikimedia projects.
Questions around the definition and assessment of quality have already started to be discussed in the community (https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Quality_is_measurable#Sources_and_quality). As a starting point, a call has been made for tools to help harvest sources, in particular from Wikipedia, from which about half of the existing Wikidata sources originate. Our aim is to investigate this problem beyond Wikipedia and to develop methods and technology to find new, relevant sources online automatically. To do so, we will apply a combination of machine learning and crowdsourcing. We will derive observable features of reliable, useful sources from the existing knowledge base, while at the same time using crowd input to understand how people perceive source quality (for example, in terms of aspects such as 'authority', 'bias', and 'reliability') and to create a gold standard against which machine learning algorithms can be tested. This will allow us to rank each source based on the criteria settled upon, and therefore to use the best source for each statement, ensuring that a certain threshold of quality is maintained at all times. A tool will then be developed to locate potential sources and determine the best-quality option, based on these rankings.
What is your solution?
Our solution is based around using machine learning and the wisdom of the crowds to collect information on source quality, identify sources of varying quality, and produce rankings of such sources for editors to use. We will elicit a set of criteria for source quality, including measures such as authority and trust that we can use to assess each source. Using the crowd, we will build a corpus of sources ranked according to their usefulness in Wikidata, so that data is built up about which sources are the best or most reliable for informing future claims and statements. This corpus will also be used to test machine learning algorithms that will infer how an existing source, already used in Wikidata, or available online, performs on this scale.
Explain what you are trying to accomplish with this project, or what you expect will change as a result of this grant.
The main goal of this project is to discover a way to find sources automatically, based on known rankings of source quality. The sub-goals are therefore to:
- Evaluate the authority or quality of different sources using crowdsourcing
- Create a ranking of sources based on their scores for different quality dimensions
- Develop a machine learning approach that identifies features of sources that people would like to see used
- Develop an application to locate potential sources using text analysis and choose the best source based on the quality dimensions.
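As an illustration of the second sub-goal only, the following is a minimal sketch of how crowd ratings on several quality dimensions might be combined into a ranking. It is not the project's method: the dimension names are taken from the examples mentioned earlier in this proposal, while the 1-5 rating scale, the weights, the function names, and the convention that a higher "bias" rating means less biased are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical weights per quality dimension (assumption: they sum to 1).
# Ratings are assumed to be on a 1-5 scale where higher is better, so a
# high "bias" rating here means the source is perceived as LESS biased.
WEIGHTS = {"authority": 0.4, "reliability": 0.4, "bias": 0.2}


def aggregate_ratings(ratings):
    """Average the crowd ratings per dimension for a single source."""
    return {dim: mean(r[dim] for r in ratings) for dim in WEIGHTS}


def quality_score(ratings):
    """Weighted sum of the per-dimension averages for a single source."""
    per_dim = aggregate_ratings(ratings)
    return sum(WEIGHTS[dim] * per_dim[dim] for dim in WEIGHTS)


def rank_sources(crowd_ratings):
    """Return (source, score) pairs sorted from best to worst score."""
    scored = [(src, quality_score(r)) for src, r in crowd_ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up example input: two sources, each rated by one or more workers.
example_ratings = {
    "journal": [
        {"authority": 5, "reliability": 4, "bias": 4},
        {"authority": 4, "reliability": 5, "bias": 4},
    ],
    "blog": [
        {"authority": 2, "reliability": 2, "bias": 3},
    ],
}
ranking = rank_sources(example_ratings)
```

A weighted average is only one possible aggregation; in practice the weights, or a different scheme altogether, would have to come out of the criteria review and the crowdsourcing experiments.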
Our proposal is just one step in the right direction. Our ultimate goal is to help improve public confidence in Wikidata by implementing a systematic way to greatly increase the number of statements that have reasonable references.
We will begin by reviewing existing work on which criteria are best suited to assessing source quality. This will provide us with a set of criteria against which we can judge each source. We will then design and develop a crowdsourcing experiment in which the community ranks sources that we find automatically for particular statements using text mining. After identifying the discriminative features of a ‘good’ source, we will devise a machine learning algorithm that learns a model of reliable, relevant sources from the corpus created via crowdsourcing, alongside a tool that locates new sources matching these criteria.
- Staff costs, including a full-time software developer for 6 months, and researcher time: $29706
- Travel and accommodation for up to 2 researchers to attend local Wikimedia events, e.g. Wikimania 2017: 2 trips x 2 researchers = $5600
Total budget requested: $35306
Because we take a crowdsourcing approach to ranking source quality, community participation is integral to our project. We will therefore put in place an extensive community engagement plan, including online communication with relevant parties and offline engagement at events such as conferences, meetups and workshops.
We will engage with the community of editors to encourage them to use our tools, so that sources can be added automatically and reliably to their content. We will create an active online presence to discuss this issue and emphasise the need to provide well-referenced content to improve the long-term quality of Wikidata in general. By attending relevant events such as Wikimania, OpenSym and WWW, we hope to engage directly with communities around open collaboration and crowdsourcing. We also plan to engage with local Wikimedia chapters, initially in the UK, France, the Netherlands and Germany, and then with more once the project has gained traction. This will help us raise awareness of the project goals and progress, and of the tools that we produce. We will also consider arranging hackathon events, for example to enable people to gather and perform source assessment.
Following the conclusion of the project, we will publish an online service based on the machine learning models developed in the project. This will allow the continued collection of data about source quality, so that a persistent and continually updated ranking can be maintained. We will also publish our experiences and results from the crowdsourcing experiments, so that this work can be taken forward by third parties. The aim is that, in the future, sources can be added based on the same table of rankings, ensuring that statements are referenced with high-quality, reliable sources.
Measures of success
- Number of sources assessed
- Number of new sources added
- Number of sources improved
- Number of people engaged in crowdsourcing experiments
- Accuracy of machine learning algorithm
- Number of people using the tool/software
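For the accuracy measure in particular, one simple reading is the fraction of sources for which the algorithm's label agrees with the crowd-derived gold standard. The sketch below assumes a single categorical label per source (the labels "good" and "poor" and the function name are invented for this example); the project may well settle on a different metric, such as a rank correlation.

```python
def accuracy(predicted, gold):
    """Fraction of sources whose predicted label matches the gold label.

    Both arguments map a source identifier to a label. Only sources
    present in both mappings are compared.
    """
    shared = predicted.keys() & gold.keys()
    if not shared:
        raise ValueError("no sources in common between prediction and gold")
    hits = sum(predicted[src] == gold[src] for src in shared)
    return hits / len(shared)


# Made-up example: the algorithm disagrees with the crowd on source "b".
predicted_labels = {"a": "good", "b": "poor", "c": "good", "d": "poor"}
gold_labels = {"a": "good", "b": "good", "c": "good", "d": "poor"}
example_accuracy = accuracy(predicted_labels, gold_labels)
```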
Prof. Elena Simperl (f)
Elena is a Professor in the Web and Internet Science (WAIS) research group in the department of Electronics and Computer Science (ECS) at the University of Southampton, United Kingdom. Her primary domain of research is at the intersection between knowledge technologies, social computing, and crowdsourcing. In particular she is interested in socially and economically motivated aspects of creating and using semantically enabled content on the Web, and in paradigms, methods, and techniques to incentivize collaboration and participation at scale. Recently she has been looking at particular classes of Web systems that aim at massive crowd engagement: citizen science platforms; crowdsourcing and open innovation initiatives creating economic and social good through the use of open data; or universal knowledge bases such as DBpedia, Wikidata, and the Linked Open Data Cloud. She has had the opportunity to investigate these topics in over twenty European and national research projects, often as project coordinator or technical lead.
Dr. Chris Phethean (m) : Chrisphethean
Chris is a research fellow in the Web and Internet Science (WAIS) research group at the University of Southampton. His research has focused on understanding online communities, particularly around social media, online collaborative systems and citizen science projects. He now works on the EU H2020 Stars4All project, leading the work package on community awareness and engagement. Previously, he has been curriculum manager for the European Data Science Academy project (EU H2020) and led the work package on dissemination and community building. His current research is around the effects of gamification in citizen science. Recently he has also been carrying out work looking at the communities of users working to maintain large scale knowledge bases such as Wikidata, comparing the community dynamics on this site to others such as Wikipedia, and how this in turn affects the quality of data stored within the platform.
Alessandro Piscopo (m) : Alessandro_Piscopo
Alessandro is a PhD student in the Web and Internet Science (WAIS) research group at the University of Southampton. His current research interests focus on exploring community dynamics behind the collaborative creation of structured knowledge. In particular, he is looking into how data quality is influenced by community processes -- especially in Wikidata. Previously, he has carried out research on the use of open data from diverse governmental sources to produce information about social characteristics of local communities. His other past experiences include working in the publishing field, in the production and editing of scientific textbooks.
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).
- Automated work on finding sources is valuable for Wikidata. Finding ways to score source quality is also valuable to validate the quality of Wikidata. ChristianKl (talk) 20:35, 16 February 2017 (UTC)
- Harsh Thakkar, Kemele M Endris, Jose M Gimenez-Garcia, Jeremy Debattista, Christoph Lange, and Sören Auer. 2016. Are linked datasets fit for open-domain question answering? A quality assessment. In Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics (WIMS16). ACM.
- Freddy Brasileiro, João Paulo A Almeida, Victorio A Carvalho, and Giancarlo Guizzardi. 2016. Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 975–980.