Grants:Project/ContentMine/ScienceSource

status: selected
ScienceSource
summary: ScienceSource makes a formal description and algorithm out of the MEDRS guideline on reliable sources, and applies it to the referencing of biomedical facts on Wikidata, and elsewhere.
target: Wikidata; Wikipedias using infoboxes in the medical area that invoke Wikidata statements.
type of grant: software
amount: US$99,640
type of applicant: organization
grantee: ContentMine Ltd.
contact: cesar@contentmine.org
affiliate: n/a
created on: 11:42, 31 January 2018 (UTC)
ScienceSource is a proposed project of ContentMine, and part of the WikiFactMine initiative

Project idea

What is the problem you're trying to solve?

“Improve biomedical content within Wikimedia, by building an algorithmic version of the medical references guideline.”

Wikimedia's medical content is hugely popular, with medical pages on English Wikipedia being visited 2.2 billion times in 2017. This makes it imperative that the information is kept accurate and current. However, creation of well-referenced biomedical content within Wikimedia is challenging due to the volume of appropriate secondary literature, totalling tens of thousands of papers per year. This greatly complicates finding appropriate references, and verifying that the statements made are actually supported by those references.

We aim to solve two subproblems: (a) specialist editors cannot currently access and screen potentially relevant papers easily, because of their volume; and (b) once papers are identified, they cannot easily be connected to multiple articles in Wikipedia and entries in Wikidata in ways that would help verify the relevance of the content and claims.

We believe these problems can mostly be resolved through the combination of machine automation and input by editors that we propose here in the ScienceSource platform.

What is your solution?

We propose the combination of a new platform, ScienceSource, a community working on it, and software development.

Finding and critiquing “reliable sources” requires a systematic approach that combines automation with manual content creation and checking by a human, specifically the community of Wikimedian medical editors (abbreviated as Wiki Med) whose input is critical for improving clinically significant content. Reliable sources cannot always be judged by any simple or intuitive criterion but there are existing Wikimedian community guidelines such as w:Wikipedia:Identifying reliable sources (medicine) (MEDRS) that we propose to turn into an algorithm to screen metadata gathered in Wikidata, and human inputs.
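As a rough illustration of what an algorithmic screen over paper metadata might look like, the sketch below applies three simple checks: whether the paper is a secondary source, whether it is recent, and whether it has been retracted. The field names, the list of publication types and the five-year threshold are assumptions made for this example; they are not the rules the project will adopt, which are to be derived from MEDRS together with the community.

```python
from dataclasses import dataclass

# Publication types treated as secondary sources in this sketch (an assumption).
SECONDARY_TYPES = {"review", "systematic review", "meta-analysis"}


@dataclass
class PaperMetadata:
    """Hypothetical metadata record for one paper."""
    title: str
    publication_type: str
    year: int
    retracted: bool = False


def screen(paper, current_year, max_age_years=5):
    """Return (passes, reasons); reasons lists every check that failed."""
    reasons = []
    if paper.publication_type.lower() not in SECONDARY_TYPES:
        reasons.append("not a secondary source")
    if current_year - paper.year > max_age_years:
        reasons.append("older than {} years".format(max_age_years))
    if paper.retracted:
        reasons.append("retracted")
    return (not reasons, reasons)


recent_review = PaperMetadata("Example review", "Systematic review", 2016)
old_case_report = PaperMetadata("Example case report", "case report", 2005, retracted=True)

print(screen(recent_review, 2018))
print(screen(old_case_report, 2018))
```

Returning the reasons alongside the verdict keeps the rationale visible, so a human reviewer can disagree with an individual rule rather than with an opaque score.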

To develop insight into where the algorithm is valid and improve a semi-automated approach, edge cases require analysis, and there must be detailed reasoning on the quality of referencing. The human editors will therefore annotate papers, and those annotations will be shared according to the W3C annotation standards.
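To make the annotation format more concrete, here is a minimal sketch of one annotation expressed in the W3C Web Annotation Data Model, linking a span of article text to a Wikidata item. The article URL, the surrounding text and the helper name make_annotation are hypothetical; how the platform will actually store and serve annotations is not decided here.

```python
import json


def make_annotation(source_url, exact, prefix, suffix, qid):
    """Build a Web Annotation (JSON-LD) that identifies a text span with a Wikidata item."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "identifying",
        # The body is the IRI of the Wikidata entity the span refers to.
        "body": "http://www.wikidata.org/entity/{}".format(qid),
        "target": {
            "source": source_url,
            # A TextQuoteSelector locates the span by its text and context.
            "selector": {
                "type": "TextQuoteSelector",
                "exact": exact,
                "prefix": prefix,
                "suffix": suffix,
            },
        },
    }


annotation = make_annotation(
    "https://example.org/articles/123",  # hypothetical article URL
    exact="aspirin",
    prefix="low-dose ",
    suffix=" reduced",
    qid="Q18216",
)
print(json.dumps(annotation, indent=2))
```

Because the annotation is plain JSON-LD, the same record can be produced by a machine tagger or by a human editor, which fits the aim of putting both kinds of input on the same footing.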

Summarising the approach:

  • The ScienceSource platform will be a collaborative MediaWiki site. It will collect up to 30,000 of the most useful Open Access medical and bioscience articles and convert them.
  • We will work with two Wikimedia communities (Wiki Med and WikiJournal) to develop machine-assisted human-reviewing. The wiki platform will facilitate the decision-making process, driven by the human reviewers.
  • Articles will be annotated with terms in WikiFactMine (WFM) dictionaries. In this project, those dictionaries will include, for example, diseases, drugs, genes. This not only means that the useful terms are highlighted, but they are also linked to entries in Wikidata and therefore to any relationship that is described in Wikidata. Thus “aspirin” links to d:Q18216 with synonyms, disease targets, chemistry, etc.
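The dictionary annotation step summarised above can be sketched as a simple term lookup. This is only an illustration: the two-entry dictionary and the function name annotate are assumptions, and the real WikiFactMine dictionaries and tooling are considerably larger and may work differently.

```python
import re

# Toy dictionary mapping lower-case terms to Wikidata item IDs.
# Both entries point at the aspirin item mentioned in the text.
WFM_DICTIONARY = {
    "aspirin": "Q18216",
    "acetylsalicylic acid": "Q18216",
}


def annotate(text, dictionary):
    """Return one record per dictionary term found in the text."""
    # Try longer terms first so multi-word terms are not split up.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        {
            "term": match.group(0),
            "start": match.start(),
            "end": match.end(),
            "qid": dictionary[match.group(0).lower()],
        }
        for match in pattern.finditer(text)
    ]


sentence = "Low-dose Aspirin (acetylsalicylic acid) reduced the risk."
for hit in annotate(sentence, WFM_DICTIONARY):
    print(hit)
```

Each record keeps the character offsets of the match, so a highlighted term in the article can be tied back both to its position in the text and to the linked Wikidata entry.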

Advantages of this route to semi-automation (putting humans in the loop) include full documentation of the rationale for decisions, discussions in annotation form rather than prose, and the combination of human and machine inputs on the same footing. Features will include (i) metadata based on d:Wikidata:Source Metadata, i.e. the rapid development of WikiCite, (ii) potential incorporation of topic, experimental method, and key components relating to medical trials, e.g. w:Consolidated Standards of Reporting Trials (CONSORT) metadata.

We will use agile approaches to respond frequently to the experience of editors, so wiki values will be to the fore.

Project goals

The overarching goal of the project (outcome) is to create a high-quality corpus of Wikidata-annotated biomedical articles that will be used by Wikimedians in the Wiki Med and WikiJournal communities and the wider world. In other words, the aim is good coverage on the ScienceSource wiki and Wikidata of recent biomedical review literature. Such a corpus will assist Wikimedians in writing and referencing medical content, to a high standard, and closely linked with Wikidata's science and metadata content. It will also add to the prominence of Wikidata and the WikiCite initiative. Most importantly, infobox content in medical areas drawn in from Wikidata will be more reliable and of improved quality.

Specific Goal Description | Wikimedia Project benefit | Wikimedia community benefit
Import new referenced facts into Wikidata, conforming with the medical references guideline; and improve the quality of referencing of existing biomedical statements in Wikidata, by adding references, or replacing existing references. | Wikidata | Wiki Med, WikiJournal
Metadata improvement for Wikidata's items on biomedical papers. | Wikidata | Wiki Med, WikiCite
Build a working ScienceSource community, as a participatory technical platform. | Wikidata | Wiki Med, WikiJournal

Project impact

How will you know if you have met your goals?

Specific Goal Description | Measurement criteria | Actions taken
Import new referenced facts into Wikidata, conforming with the medical references guideline; and improve the quality of referencing of existing biomedical statements in Wikidata, by adding references, or replacing existing references. | Process 30,000 downloaded papers, including all WikiJournal papers within the biomedical scope. | Acquire open access papers by downloading to the platform. Export from the annotations structure to RDF, and upload to Wikidata. Track by wiki histories.
Metadata improvement for Wikidata's items on biomedical papers. | Metadata import to Wikidata, 15,000 statements. | Acquire open information on e.g. publication type, retractions, topics. Import to Wikidata via bot, track by edit summary.
Build a working ScienceSource community, as a participatory technical platform. | Annotations on platform: 3,000 contributions. | Store annotations, in a format compliant with the W3C standard. Use wiki norms to develop community control, wiki tools to track community work.
Continuing impact
  • The algorithm developed on the ScienceSource platform will have longer-term impact, for example in automated checking of the referencing of Wikipedia articles, to exclude unsuitable citations and the content depending on them. Screening out pseudoscience will become easier.
  • The finished corpus will be a body of open secondary biomedical sources that can be used as an approximation to the state-of-the-art in the science underlying clinical medicine. It will be a resource in uniform format, and with enriched metadata. It would be possible to link directly into it, at anchor points, to make Wikipedia references as links with paragraph targets.
  • We will apply for whitelist status for a Wikibase version of the annotation RDF, so that it could be searched with query.wikidata.org.

Goals around participation or content?

Metrics | Numeric target | Tools & documentation
Total participants | On the platform: 25, measured by account creation. Workshops and meetups: 150, measured by attendance list. Newsletter circulation: 100 individuals. | Number of accounts. Attendance lists. Mailing list.
Number of content pages created or improved, across all Wikimedia projects | Number of pages improved: 500. Number of pages created: 500. | Wikidata history analytics

Project plan

Activities

Our project scope is to develop machine-assisted human-reviewing software based on MEDRS guidelines to assess 30,000 high-impact open medical and bioscience articles from open repositories. We have divided the work into five work packages. The table below gives the objectives and outputs of each work package, and the Gantt chart that follows gives its duration.

Work package | Objectives | Outputs
Project management | Ensure that the action runs smoothly, that there is excellent communication among all the project participants, volunteers and community, and that action outputs are delivered on time to deliver a high quality output. | Project Plan. Progress report. Risk register. Final report.
Dictionaries management | Select and enhance a set of WikiFactMine dictionaries that are valued by Wiki Med and WikiJournal as daily assets in their work. | Revised and expanded WFM dictionaries, committed to Github. Updated statements in Wikidata entries for dictionary creation.
Corpus and bibliographies management | Select and annotate a core set of biomedical articles, that are valued by Wiki Med and WikiJournal. | Corpus of 30,000 WFM-annotated articles (Parsoid format) on Wikimedia Labs.
Software development | To create a complete toolchain for ingesting Open biomedical articles into WMFLabs, converting to Parsoid, and annotating with WFM dictionaries. | Software (such as Quickscrape, XSLT stylesheets, Hypothes.is viewer; most already exists or is based on standard tools) in public repos. Tutorial material.
Dissemination and community engagement | To promote and increase the awareness of the project's benefits. To disseminate the key project results to the community and wider partners. | Project website section. Dissemination plan. Conferences and community engagement report. Workshops report.
Project Gantt chart (by month 1 to 12)
Work Package 1 2 3 4 5 6 7 8 9 10 11 12
WP1 Project management X X X X X X X X X X X X
WP2 Dictionaries management X X X X X X
WP3 Corpus and bibliography X X X X
WP4 Software development X X X X X X X X X
WP5 Dissemination and community X X X X X X X X

Budget

Position/Expense | USD
Senior SW Developer (37.5 hours per week for 12 months) | $43,500.00
Project Manager (10 hours per week for 12 months) | $21,230.00
Senior Wikipedian-in-Residence (25 hours per week for 12 months) | $21,730.00
Travel (WikiCite, Wikimania) | $12,600.00
Dissemination activities (4x workshops, classroom hire, coffee break, marketing material) | $580.00
Total | $99,640.00

Community engagement

WikiFactMine was present at Wikimania 2017 with talks, Hackathon workshop and stall
Past engagement

ContentMine has extensively engaged the Wikimedia community in the past four years.

Peter Murray-Rust delivered a keynote talk at Wikimania 2014 and at the Wikipedia Science Conference 2015, where we also ran a hands-on workshop. We have introduced Wikidata to audiences in over 50 presentations at UK and international meetings, highlighting it as a key resource for those searching for scientific knowledge and as the future of large-scale scientific data curation.

We have contributed to linking Wikimedia with the mining of the scientific literature, both at events, for example the July 2017 Cambridge text and data mining conference, and online. We produced a weekly blogpost series aimed at librarians and the Cambridge research community, and a monthly newsletter delivered by English Wikipedia Mass message. We also produced and maintained the central WikiFactMine "hub" on Wikidata and totally revamped the Wikidata:Sources page to help support WikiCite. ContentMine has contributed resources to the regular Cambridge Wikimedia meetups.

Participation in the ScienceSource project would be promoted by the same mixture of regular communications, training and workshops, and participation in conferences and meetups.

Wikimania 2018 Workshop

We aim to have a beta system by Wikimania 2018 that can be tried out by Wikimedians. We will run at least one introductory workshop in Cape Town.

Wikidata 6th birthday event

We will have support from WMUK in organising a local event for Wikidata’s 6th birthday in October, working title “Data modelling for Wikidata”.

Summary table for community engagement activities
ID | Title | Description | Month | Effort (person months)
C1 | Meetups (Cambridge) | Organise and deliver a three-monthly meetup, at or near ContentMine offices (10-20 people per session). | 1 every quarter | 0.2
C2 | Workshops | Organise and deliver four workshops for users and developers, e.g. at Wikimania 2018, WikiCite, Mozfest and FORCE11, and follow up on results (20-50 people per session). | 1 every quarter | 0.5
C3 | Conference presentation | Deliver presentations during Wikimania 2018 and WikiCite (20-100 people per session). | M1-M12 | 1
C4 | Newsletter | Deliver a monthly newsletter, reaching Wikimedians, volunteers and community members (100 people on newsletter list). | 1 every month | 0.2
C5 | Project webpages and social media | Ensure development work and results are communicated through ContentMine’s own site, wiki page and social media to interested communities at each step in the process. | M1-M12 | 0.3

Get involved

Participants

Peter Murray-Rust (second from left) talks at Biovision 2017
Peter Murray-Rust (Founder and Director of ContentMine)

Peter has been a Wikimedian since 2006 and delivered a keynote talk at Wikimania 2014 and Wikipedia Science Conference 2015, where CM also ran a hands-on workshop. Peter founded ContentMine as a Shuttleworth Foundation Fellow, and is the main software pipeline architect. He received his Doctor of Philosophy from the University of Oxford and has held academic positions at the University of Stirling and the University of Nottingham. His research interests have focused on the automated analysis of data in scientific communities. In addition to his ContentMine role, Peter is also Reader Emeritus in Molecular Informatics at the Unilever Centre, in the Department of Chemistry at the University of Cambridge, and Senior Research Fellow Emeritus of Churchill College in the University of Cambridge. Peter is renowned as a tireless advocate of open science and the principle that the right to read is the right to mine.

Jenny Molloy (Director of ContentMine)

Jenny is a molecular biologist by training and manages ContentMine collaborations and business development. She spoke on synthetic biology at Wikipedia Science Conference 2015 and has been a long term supporter of open science. She is also a Director of Biomakespace, a non-profit community lab in Cambridge for engineering with biology.

Charles Matthews (Wikimedian in Residence, ContentMine)

Once a lecturer at DPMMS, Cambridge, Charles has been a Wikipedian since 2003, and is active also on Wikisource and Wikidata. He has been a staff member and contractor for Wikimedia UK, and is co-author of “How Wikipedia Works”. He was employed by ContentMine in 2017 as Wikimedian in Residence at the Moore Library, and on their WikiFactMine project, and continues to work with them as a volunteer.

Wikimedian advisors (provisional)
  • Daniel Mietchen, technical and scientific
  • Evolution and evolvability, WikiJournal
  • Magnus Manske, technical and scientific
  • RexxS, technical and Wiki Project Med
  • T Arrow, technical and WikiCite

Community notification

Early forms of this proposal have been trailed widely to Wikimedians, and significant discussions held. Advance notifications of the proposal were: Facto Post Issue 5, 17 October 2017 (see w:User:Charles Matthews/Facto Post/Issue 5 – 17 October 2017), editorial on annotations, link to d:Wikidata:WikiFactMine/Annotation for fact mining and invitation to comment. This mass message is delivered to seven WikiProject talk pages, listed at w:Wikipedia:Facto Post mailing list, as well as individual editors. It was supported by:

Ongoing contacts: Facto Post, Issue 9 – 5 February 2018 has been sent out, with a link here, including delivery to the talk pages of the following WikiProjects on English Wikipedia: Biology, Chemistry, Genetics, Molecular and Cell Biology, Neuroscience, Psychology, Tree of Life. A notification has been made to Project Chat on Wikidata, and a further announcement made on English Wikisource on the Scriptorium, leading to a discussion thread. Notifications have also been left on d:Wikidata talk:WikiProject Source MetaData and w:Wikipedia talk:WikiProject Medicine.

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  1. Support Petermr (talk) 11:45, 31 January 2018 (UTC)
  2. Support This will enrich the value of biomedical data and literature on Wikidata, and serve as a go-to example for linked data with Wikidata. --Magnus Manske (talk) 10:07, 1 February 2018 (UTC)
  3. Support Editing within the field of medicine is a challenging, albeit rewarding, process. Medical articles on Wikipedia remain among the best because of the quality of sourcing that we insist on, i.e. top quality secondary sources. But there is such a wide variety of topics in the field, nobody can hope to be familiar with all of the sources and so I welcome enthusiastically an opportunity to make use of the sophisticated tools now becoming available to help editors find the best sources for a given topic. This proposal is laying out a path to accomplish that goal and I support it wholeheartedly. --RexxS (talk) 18:56, 1 February 2018 (UTC)
  4. Support This project can have a very positive impact on the metabolomics content in Wikipathways. --77.173.238.191 13:42, 2 February 2018 (UTC)
  5. Support I've had fantastic conversations with Charles and thought a lot about the kind of research data collection and subsequent analysis that one could do with a combination of WikiFactMine, semi-automatic data import and Wikidata Queries. I actually have a research idea that could serve as a showcase of Wikidata usefulness for research - but I need the import mechanism to be more smooth than it currently is. This is a huge step forward :-) Vojtěch Dostál (talk) 09:48, 5 February 2018 (UTC)
  6. Support yes - increase quality, by building out CC sources. Slowking4 (talk) 17:03, 5 February 2018 (UTC)
  7. Support As someone who's talked to multiple members of this team, I'm excited about the potential - and particularly glad that Petermr and his team recognise the importance of supporting sources in multiple languages (indeed, some significant medical research happens in non-English contexts but doesn't always get amplified because of language barriers; Wikidata helps us there). Medical science literally can and does save lives, and our communities around the world would benefit from accessing this metadata, and contributing to the process of collation and curation. This comment was added by User:Anasuyas at 17:20, 6 February 2018 UTC.
  8. Support This project has great potential to bridge between content and references — in both directions — through mining, metadata and annotations. Its focus on automating support to manual workflows around medicine and WikiJournal are promising to be of use to the community well before the project ends and can serve as a basis for similar reference-focused initiatives within other parts of the Wikimedia ecosystem. -- Daniel Mietchen (talk) 18:12, 7 February 2018 (UTC)
  9. Support I think that this is a great way to continue adding value to Wikidata, and to sister projects (espec. WikiJournals) by automating creation of associated machine-readable data. T.Shafee(Evo﹠Evo)talk 07:00, 8 February 2018 (UTC)
  10. Support very worthwhile and useful Ozzie10aaaa (talk) 13:38, 9 February 2018 (UTC)
  11. Support I'd be very happy to see further projects build upon the work done by WikiFactMine last year and can see the benefits this project would bring to the community. T Arrow (talk) 16:03, 9 February 2018 (UTC)
  12. Support. YULdigitalpreservation (talk) 18:15, 9 February 2018 (UTC)
  13. Support Wikimedia projects are a leader in providing open content in new models. In this grant we have a team with a history of Wikimedia community engagement and long credentials in the traditional sciences and open access movement who are presenting cheap and fast ideas and experiments for a new publishing style. What I see here is a system whereby they will import a huge amount of text content into the Wikimedia platform and enrich it with Wikimedia's own native style of publishing, including interwiki links to more information. This program is significant for the content that it will import, the plan it has for remixing it, the media attention for documenting and reporting all of this as a model for anyone else to follow, and for the social connections it brings to Wikipedia in a controversial space of activism for great public access to science information. Everything that I expect to see in a grant request is here - proven team, history of success, high understanding of wiki nuance, off-wiki media plan, combination of human engagement with state of the art bot engagement, and a practical need served to deliver information and content of high contemporary relevance to a massive audience. Blue Rasberry (talk) 16:34, 10 February 2018 (UTC)
  14. Support a fantastic idea. LT910001 (talk) 00:11, 11 February 2018 (UTC)
  15. Support This is a well thought through proposal and this project would assist our efforts to improve and maintain medical evidence in WP articles. JenOttawa (talk) 03:37, 12 February 2018 (UTC)
  16. Support As a member of the Metadata2020 community, I think this would be a great project to help improve and enrich metadata at Wikidata. I know the people involved in the project and I am highly confident that they have the skills and wiki-specific knowledge required to make this proposal a success. I support it wholeheartedly. Metacladistics (talk) 11:51, 13 February 2018 (UTC)
  17. Support Development of new pipelines to capture and share open knowledge - facts - in the robust and trusted home of Wikidata, is an essential part of building an inclusive and accessible knowledgebase, fit for the global community this century. This project will support this through community and platform, and by adding many new referenced facts, and is well aligned with the goals of Wikidata and Wikimedia. 131.111.5.141 16:46, 15 February 2018 (UTC)
  18. Support I was a founding board member of Wikiproject Med and am very familiar with the needs of Wikipedia’s medicine editors. Well designed and well managed, this project will be a great help. (A note to those implementing this, though: reliable sources are published by publishers with a reputation for accuracy. A good reputation is hard-won over time. WikiJournal is way too young to be considered reliable. If you expect to be taken seriously, when deeming a publisher reliable you’ll hold the bar at least as high as English Wikipedia’s overarching policy, Identifying reliable sources, which insists on a reputation for accuracy.) —Anthonyhcole (talk) 05:28, 16 February 2018 (UTC)
  19. Support --Cameronneylon (talk) 09:46, 16 February 2018 (UTC) I am an irregular wikipedian but involved a lot in discussion of Open Access and how to achieve its maximum promise. The coming together of text mining, a decent corpus of content and the semantic backing of Wikidata offers a real opportunity to achieve this kind of knowledge availability for the first time at scale. That's tremendously exciting and the team has the experience and technical skills to make it happen. Once medium-scale success can be demonstrated this will likely grow rapidly to a large scale.
  20. Support I'd like to support the proposal from the Wikidata dev team side. Not having enough good references for the data in Wikidata continues to be a problem that is holding back adoption of Wikidata in Wikipedia. I'd like to see a focus on adding references and verifying existing data in Wikidata and getting that data used in Wikimedia projects. I'd like to see a little less focus on adding a lot of new data (since we already have _a lot_ of that around scientific articles that isn't used enough). --Lydia Pintscher (WMDE) (talk) 11:44, 16 February 2018 (UTC)
  21. Support As an open access publisher (at Ubiquity Press) I strongly support this proposal. Broadening access to current and accurate scientific information through as many channels as possible is extremely important, and wikidata and wikipedia are key. The team behind this project have the capabilities, track record and commitment to make it a success. --Briankh (talk) 17:12, 16 February 2018 (UTC)
  22. Support Sounds great! 2001:630:12:1061:4C15:278C:9912:8DCE 11:55, 19 February 2018 (UTC)
  23. Support This is such a project that, even should it fail - though I don't see why it would, - will make valuable contributions for acting on important issues of both the Wikimedia and the Open Science movements. Namely, the sustainability and scalability of the former's communities and the unlocking of the latter's potentials. --Solstag (talk) 01:00, 20 February 2018 (UTC)
  24. Support I would love to see more work done around the medical domain and to build upon the work done with WikiFactMine Stefankasberger (talk) 20:37, 25 February 2018 (UTC) (late support)
  25. Support This project has clear objectives on what it wants to achieve, and I totally support the initiative. This is a very useful project.--Jamie Tubers (talk) 10:43, 29 April 2018 (UTC)
  26. Support A good project. How does this benefit editors who contribute much medical content to WP? It's good to see so much collaboration on a project like this, but would it benefit any medical editor build articles supported by well-referenced topics? How would someone creating article tap into this project? Best Regards, Barbara (WVS) (talk) 13:23, 29 May 2018 (UTC)
  27. Support A brilliant and needed bridge. –SJ talk  19:45, 3 June 2018 (UTC)