User:Charles Matthews/Draft proposal

From Meta, a Wikimedia project coordination wiki
status: please add a status
summary: Semi-automatic enhancement of Wikidata from the biomedical literature
target: Wikidata and WikiProject Medicine
type of grant: tools and software
amount: 100k USD
contact: peter.murray.rust at googlemail dot com
created on: 10:03, 30 January 2018 (UTC)

Proposed title[edit]



Make a formal description and algorithm out of the MEDRS guideline on reliable sources, and apply it to the referencing of biomedical facts on Wikidata and elsewhere.

Type of project[edit]


Target projects[edit]

Wikidata, WikiJournal community, Wiki Med (i.e. WPMEDF and medicine WikiProjects); Wikipedias using infoboxes in the medical area that invoke Wikidata statements.

Project idea[edit]

What is the problem you're trying to solve?[edit]

Explain the problem that you are trying to solve with this project

“Find an effective and rigorous approach to the automated creation of well-referenced biomedical content within Wikimedia.”

Wikimedia’s medical content is popular: medical pages on English Wikipedia were visited 2.2 billion times in 2017. Complete automation is not feasible, so humans need to stay in the loop for content creation and checking. There are two key quality issues: finding appropriate references, and verifying that the statements made are actually supported by those references.

A systematic approach is available, with current technology. The use of Wikidata means that all content is held in a structured and multilingual form, easing translation of basic facts; and can be displayed on Wikipedias in all language versions. What is still needed is a more serious form of source criticism. We need to record and then apply the expertise of the Wikimedian medical editors, and this is the bottom line for improving clinically significant content.

In this area, “reliable sources” cannot always be judged by any simple or intuitive criterion: in colloquial terms, there is no good “duck test”. The MEDRS guideline on the English Wikipedia has been developed for exactly that reason. It could potentially be handled by a machine, operating on metadata gathered into Wikidata. The appropriate secondary literature is on the scale of tens of thousands of papers, rather than the hundreds of thousands published every year.
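The kind of machine screening envisaged here can be sketched in a few lines. The following is a minimal illustration only: the field names, the five-year window, and the MEDLINE-indexing check are assumptions standing in for criteria that would have to be worked out from MEDRS with the medical editing community, and the “grey” result is where human annotation would take over.

```python
from datetime import date

# Illustrative sketch of a MEDRS-style screen operating on metadata held
# in Wikidata for a paper. Field names and thresholds are placeholders,
# not the project's actual criteria.
SECONDARY_TYPES = {"review article", "systematic review", "meta-analysis"}

def medrs_screen(paper: dict, today: date = date(2018, 1, 30)) -> str:
    """Return 'pass', 'fail', or 'grey' (needs human review)."""
    if paper.get("publication_type") not in SECONDARY_TYPES:
        return "fail"        # MEDRS prefers secondary sources
    pub = paper.get("publication_date")
    if pub is None:
        return "grey"        # missing metadata -> route to annotation forum
    if (today - pub).days > 5 * 365:
        return "grey"        # older than ~5 years: case-by-case judgement
    if not paper.get("journal_is_medline_indexed", False):
        return "grey"
    return "pass"

example = {
    "publication_type": "systematic review",
    "publication_date": date(2016, 6, 1),
    "journal_is_medline_indexed": True,
}
print(medrs_screen(example))  # -> pass
```

The point of the three-way result is that the algorithm never silently rejects a borderline source; everything it cannot decide is surfaced for human discussion.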

An algorithm may, however, encounter grey areas where straightforward criteria break down. Human inputs can then be channelled into an annotation system, used as a forum: annotations have had a W3C standard, the Web Annotation Data Model, since 2017. To develop insight, the analysis of edge cases has to be scaled up into fuller case studies. Other forms of semi-automation are used on Wikidata, but for this application to succeed, detailed reasoning about the quality of referencing is central.
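For concreteness, a minimal annotation under the W3C Web Annotation Data Model is a small JSON-LD document; a human comment on a grey-area paper might look like the sketch below. The target URL and body text are placeholders, not real project identifiers.

```python
import json

# Minimal annotation per the W3C Web Annotation Data Model (2017).
# The target IRI and comment text are illustrative placeholders.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "Review is >5 years old but remains the best available source.",
        "format": "text/plain",
    },
    "target": "https://sciencesource.example.org/paper/12345",
}
print(json.dumps(annotation, indent=2))
```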

What is your solution?[edit]

The combination of a new platform, ScienceSource, a community working on it, and software development.

  • The ScienceSource platform will collect around 30,000 of the most useful open medical and bioscience articles from open repositories and convert them to a uniform syntactic HTML form (e.g. Wikimedia Parsoid). These are supported by Wikidata entries for the bibliography (authors, journals, date, etc.) developed in WikiFactMine with the Fatameh tool. Selection will initially be through (i) existing citations of open articles on Wikipedia pages; (ii) the most accessed of the Wikidata-Fatameh entries; (iii) lists of key entries added by Wikimedia editors (medical and scientific).
  • We will work with two Wikimedia communities (WikiMed and WikiJournal) to develop machine-assisted human-reviewing (where machines suggest, but do not dictate or hide the decision-making process). ScienceSource will automate routine document conversion tasks.
  • Articles will be semantically enhanced by annotation with terms from WikiFactMine dictionaries (e.g. diseases, drugs). This gives editors and readers a tool to (i) enhance reading, (ii) suggest synonyms and preferred usage, and (iii) check against Wikidata values (“facts”). Features may include (i) metadata - date of publication, source organization… (ii) presence of key components (e.g. CONSORT metadata) (iii) values (patient numbers, length of study, drugs used, etc.). We’ll use agile approaches to respond frequently to the experience of editors.

Project goals[edit]

The overarching goal of the project is to create a high-quality corpus of Wikidata-annotated biomedical articles that will be used by Wikimedians (e.g. Wiki Med, WikiJournal) and the wider world. In other words, the aim is good coverage on the ScienceSource wiki and Wikidata of recent biomedical review literature. Such a corpus will assist Wikimedians in writing and referencing medical content, to a high standard, and closely linked with Wikidata's science and metadata content. It will also add to the prominence of Wikidata and the WikiCite initiative. Most importantly, infobox content in medical areas drawn in from Wikidata will be more reliable and of improved quality.

To that end, there are three subgoals:

  • Import new referenced facts into Wikidata, conforming with the medical references guideline; and improve the quality of referencing of existing biomedical statements in Wikidata, by adding references, or replacing existing references.
  • Metadata improvement for Wikidata's items on biomedical papers.
  • Build a working ScienceSource platform, and a community around it.

Meeting those goals[edit]

  • Process 30,000 downloaded papers, including all WikiJournal papers within the biomedical scope.
  • Metadata import to Wikidata: 15,000 statements.
  • Annotations on the platform: 3,000 contributions.

Once the project is complete, the improvement in quality of Wikidata content will continue to have positive outcomes for all Wikipedias that draw on it through infoboxes.

Participation/content goals[edit]

  1. Total participation at project events/on platform/circulated: 150
  2. Content pages improved or created: 500 + 500

Project plan[edit]


Technical activities[edit]

  • Set up the wiki platform, and download HTML papers from EPMC into it.
  • Update quickscrape to download papers from other sources.
  • Update dictionaries: diseases, genes, d.
  • Mine the papers.
  • Co-occurrence: an RDF triple linking ‘fact’ and paper is uploaded to Wikidata.
  • Check each paper against the medical guidelines using an algorithmic review, and auto-annotate it using the dictionaries.
  • Human review of the auto-annotations, and manual addition of other annotations.
  • Approved annotations are stored on Wikimedia Labs as RDF triples.
  • Annotation RDF files are uploaded into Wikidata.
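The co-occurrence step above reduces to serializing triples ready for upload. A minimal sketch, in which the Q-identifiers are placeholders and P2176 (drug or therapy used for treatment) is used purely as an example of the kind of property such a ‘fact’ might assert:

```python
# Serialize one co-occurrence 'fact' as an N-Triples line for a
# Wikidata upload pipeline. The Q-ids below are placeholders; P2176
# ("drug or therapy used for treatment") is only an example property.
WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"

def fact_triple(subject_qid: str, property_pid: str, object_qid: str) -> str:
    """Return one N-Triples statement linking two Wikidata entities."""
    return f"<{WD}{subject_qid}> <{WDT}{property_pid}> <{WD}{object_qid}> ."

# disease (placeholder) -- treated with -- drug (placeholder)
print(fact_triple("Q123", "P2176", "Q456"))
```

In practice the upload would go through the usual Wikidata bot channels, with the source paper attached as a reference rather than encoded in the triple itself.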



Community engagement[edit]


ContentMine has extensively engaged the Wikimedia community in the past four years. Peter Murray-Rust delivered a keynote talk at Wikimania 2014 and at the Wikipedia Science Conference 2015, where we also ran a workshop. We have introduced Wikidata to audiences in over 50 presentations at UK and international meetings, highlighting it as a key resource for those searching for scientific knowledge and as the future of large-scale scientific data curation. Through our WikiFactMine project grant, our Wikimedian-in-Residence Charles Matthews ran training sessions at Cambridge University for researchers and librarians to teach them about Wikidata and the Wikimedia universe. We also participated in the Wikidata 4th Birthday Party, the MediaWiki Dev Summit, the European Wikimedia Hackathon, WikiCite 2017, the pre-Wikimania 2017 Hackathon and Wikimania 2017. The research results were presented at WikidataCon (October 2017).

We contributed to linking Wikimedia with mining of the scientific literature, online and at events such as the July 2017 Cambridge text and data mining conference. We produced a regular weekly blogpost series aimed at librarians and the Cambridge research community, and a monthly newsletter delivered by English Wikipedia MassMessage. We also produced and maintained the central WikiFactMine ‘hub’ on Wikidata and thoroughly revamped the Wikidata:Sources page to help support WikiCite. ContentMine has contributed resources to the regular Cambridge Wikimedia meetups.

Participation in the ScienceSource project would be promoted by the same mixture of regular communications, training and workshops, and participation in conferences and meetups.

Community notifications

Early forms of this proposal have been trailed widely to Wikimedians, and significant discussions held. Advance notifications of the proposal were:

  • WPMEDF contacted via their Board meeting of November 2017, leading to consultation in December, particularly on medical referencing and metadata/tagging imports from repositories. This would be a continuing advisory relationship, through RexxS.
  • WikiJournal, contact via wikiversity:Talk:WikiJournal User Group thread started 8 January 2018, also face-to-face meeting with User:Evolution and evolvability in Cambridge.
Wikimania 2018 Workshop

We aim to have a beta system by Wikimania 2018 that can be tried out with many Wikimedians in a workshop run by project team members. The primary activities and aims of the workshop will be:

  • Demo and hands-on user workshop with active Wikipedia and Wikidata editors to get feedback and ideas for useful features.
  • Developer session with others looking at semi-automated information flow into Wikidata and other related tools.

We chose Wikimania because it brings together many of the most active Wikimedia volunteers and contributors and we will be ready for broader feedback on the tools by August 2018.



The grant would be to ContentMine Limited, a UK non-profit for extracting facts from the scientific literature.

Primary contact[edit]

Peter Murray-Rust (Founder and Director of ContentMine)

Peter has been a Wikimedian since 2006 and delivered a keynote talk at Wikimania 2014 and the Wikipedia Science Conference 2015, where ContentMine also ran a hands-on workshop. Peter founded ContentMine as a Shuttleworth Foundation Fellow, and is the main software pipeline architect. He received his Doctor of Philosophy from the University of Oxford and has held academic positions at the University of Stirling and the University of Nottingham. His research interests have focused on the automated analysis of data in scientific communities. In addition to his ContentMine role, Peter is also Reader Emeritus in Molecular Informatics at the Unilever Centre, in the Department of Chemistry at the University of Cambridge, and Senior Research Fellow Emeritus of Churchill College in the University of Cambridge. Peter is renowned as a tireless advocate of open science and the principle that the right to read is the right to mine.

Other participants[edit]

Jenny Molloy (Director of ContentMine)

Jenny is a molecular biologist by training and manages ContentMine collaborations and business development. She spoke on synthetic biology at the Wikipedia Science Conference 2015 and has been a long-term supporter of open science. She is also a Director of Biomakespace, a non-profit community lab in Cambridge for engineering with biology.

Charles Matthews (Wikimedian in Residence, ContentMine)

Once a lecturer at DPMMS, Cambridge, Charles has been a Wikipedian since 2003, and is active also on Wikisource and Wikidata. He has been a staff member and contractor for Wikimedia UK, and is co-author of “How Wikipedia Works”. He was employed by ContentMine in 2017 as Wikimedian in Residence at the Moore Library, and on their WikiFactMine project, and continues to work with them as a volunteer.

Wikimedian advisors
