This project is funded by a Project Grant

Report accepted

This report for a Project Grant approved in FY 2016-17 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:Project/ContentMine/WikiFactMine.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's final report! This report shares the outcomes, impact and learnings from the grantee's project.

Part 1: The Project

Summary

WikiFactMine was a 12-month project to extract facts from the scientific literature. It did so through an advanced search mechanism that uses sets of search terms called "dictionaries", allowing massively parallel thematic search. Dictionaries can be built from queries on Wikidata, and that procedure was made much slicker over the course of the project. WikiFactMine also played a part in the WikiCite initiative, which builds up bibliographical content within Wikimedia, by providing a software component that was then applied millions of times on Wikidata.

Project Goals

Please copy and paste the project goals from your proposal page. Under each goal, write at least three sentences about how you met that goal over the course of the project. Alternatively, if your goals changed, you may describe the change, list your new goals and explain how you met them, instead.

Promote Wikidata to the world as the first place to go to or use for reliable identification and re-use of scientific concepts. The overarching goal of the project is to enhance the bioscientific coverage in Wikidata so it becomes a visible and used resource both for Wikipedians and in general scientific research and discourse.

Specific goals for the 12 months funding requested are:

Goal A: Use dictionaries derived from Wikidata to index the daily literature

Dictionaries were in use throughout the project. "Legacy" dictionaries were used initially, compiled by various methods. In summer 2017 more methodical use was made of Wikidata SPARQL queries, and the main focus of dictionaries moved over to flowering plant species.
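
As an illustration of the SPARQL approach, here is a minimal sketch of a query builder for a dictionary of all items that are instances of a given Wikidata class or of its subclasses. It is not the project's actual query (those are collected on d:Wikidata:WikiFactMine/Core SPARQL); the function name, the parameters and the QID used in the example are assumptions made for illustration only.

```python
import re


def build_dictionary_query(class_qid: str, language: str = "en", limit: int = 1000) -> str:
    """Return a SPARQL query listing items that are instances of a class or its subclasses."""
    # Reject anything that is not shaped like a Wikidata item ID (Q followed by digits).
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        # P31 is "instance of", P279 is "subclass of"; the * follows subclass chains.
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}\n'
        "}\n"
        f"LIMIT {limit}"
    )


# "Q42" is only a placeholder QID here; substitute the class of interest.
print(build_dictionary_query("Q42", limit=5))
```

The resulting string can be pasted into the Wikidata Query Service; each returned label is a candidate dictionary term.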

Goal B: Create a feed of scientific facts in context with associated citations for Wikipedia and Wikidata editors

The feed was set up via the WikiFactMine API. Facts were offered to Wikidata editors via a JavaScript tool that could be installed in the sidebar. The emphasis was on Wikidata, rather than Wikipedia.

Goal C: Promote a combination of the scientific literature, Wikidata and Wikipedia as powerful resources for the scientific community, build and support the community of editors who facilitate interlinking between these three resources.

The Fatameh tool developed by WikiFactMine in early summer 2017 played a key role in the WikiCite initiative, in terms of the creation of millions of Wikidata items on individual scientific papers. In turn, those items began to host citation data, at scale, realising the broader goals of the w:Initiative for Open Citations. These advances fed into other projects, such as Scholia.

Project Impact

Targets

The following WikiProjects were nominated as of interest to WikiFactMine: WikiProject Biology, WikiProject Chemistry, WikiProject Genetics, WikiProject Molecular and Cell Biology, WikiProject Medicine, WikiProject Neuroscience, WikiProject Psychology, WikiProject Taxonomy, WikiProject Tree of Life (includes WikiProject Species).

Planned measure of success (include numeric target, if applicable)	Actual result	Explanation
Dictionaries: 4-8 dictionaries successfully deployed, creating an index of facts from up to 10k papers on a daily basis.	Over 200 deployed	Creation of dictionaries via SPARQL and the aaraa tool simplified the process of creation.
Facts: We anticipate reliably indexing around 200k terms, generating upwards of 10k facts per day by M10.	4.5 million extracted	Despite ongoing difficulties with the Euro PMC API, and some downtime, the fact extraction pipeline was productive and robust enough for the intended purpose.
Use of feeds: Feeds actively used on a weekly basis by at least 10 Wikipedia and Wikidata editors by M12; contributions to 1000 Wikidata entries by M12; contributions to 100 Wikipedia entries by M12.	Fell short of targets	Feedback suggests that it was a mistake not to develop a user-friendly interface and route for adding facts to Wikidata. The technical demands were too high for most of the science WikiProjects in the scope to become deeply involved. The initial proposal under-estimated this "software readiness" issue as a dimension of project planning.
Outreach: 1-2 members of each active WikiProject actively contributing to feedback on dictionaries and tools; 250 people attending in-person or online events relating to linking the scientific literature, Wikidata and Wikipedia.	Active approach reached multiple broad audiences	See Activities section below. By early summer it became clearer that the proposed networking to librarians and scientists in Cambridge was not finding the required traction. The Facto Post newsletter was started, to inform Wikipedian editors and WikiProjects. The emphasis was on putting Wikidata in a broad context. The mailing list has seen steady growth.

Story

In the end the dictionaries were SPARQLy. That is not how things started, and it is not the whole story.

Text mining that goes on inside a virtual machine, hosted on Wikimedia Labs, is not the easiest activity to visualise; or indeed in this case to set up. The ElasticSearch technology is essentially the same as you would use by typing into Wikipedia's search box. The WikiFactMine application, however, stops short of ranking search results: that is not the goal. The compensation is to use thousands of terms in parallel (say a list of known carcinogens) and identify each occurrence in thousands of scientific papers. Daily.

That list would be an example of a dictionary, in the project's sense. Compiling a good list is a more encyclopedic activity, one that most Wikimedians can follow with greater understanding. A search with a custom dictionary reflects a personal take on what might be interesting in the vast, apparently trackless scientific literature.
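
The idea of matching a whole dictionary against many papers at once, without ranking, can be sketched in a few lines. This is a simplified stand-in using a single regular expression, not the project's Elasticsearch implementation; the function names and the two sample "papers" are hypothetical.

```python
import re
from collections import defaultdict


def compile_dictionary(terms):
    """Build one case-insensitive regex that matches any term as a whole word or phrase."""
    # Longest terms first, so a multi-word term is preferred over a shorter one inside it.
    ordered = sorted(set(terms), key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def find_occurrences(matcher, papers):
    """Map each matched term (lower-cased) to a list of (paper_id, character offset) pairs."""
    hits = defaultdict(list)
    for paper_id, text in papers.items():
        for m in matcher.finditer(text):
            hits[m.group(0).lower()].append((paper_id, m.start()))
    return dict(hits)


# A tiny dictionary of known carcinogens and two invented snippets of paper text.
carcinogens = ["benzene", "asbestos", "vinyl chloride"]
papers = {
    "paper-1": "Exposure to benzene and vinyl chloride was measured.",
    "paper-2": "No asbestos was found; benzene levels were low.",
}

matcher = compile_dictionary(carcinogens)
for term, places in find_occurrences(matcher, papers).items():
    print(term, places)
```

Every occurrence is reported with equal weight, which mirrors the point above: the output is an index of where terms occur, not a ranked list of results.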

The WikiFactMine team of two headed for Montreal and Wikimania in August 2017, looking out for those who might adopt the outputs of the core technology. At the 75% point of the project, the pipeline for papers was running relatively smoothly, after a period when the tech onus had been elsewhere (the creation of Wikidata items on papers, enabling referencing by link). At the hackathon, a concrete topic was brought up by a gardening and plant fan: soil types.

By July the construction of dictionaries had switched over from ad hoc methods to SPARQL. That is a slick solution, based on Wikidata content, and with a WikiFactMine tool it can often provide a good dictionary in a matter of minutes. To begin with, though, Wikidata's coverage of soil types was not of the required standard. What was actually involved was looking through the category systems of Wikipedias, particularly in German, Polish and Russian, for soil types (reflecting geography). And that meant checking out articles illustrated in an unexpected way: with photographs of holes in the ground.

Wikipedia is a marvel. Numerous states in the USA have their official state soil type! There can hardly be a better way to get your feet back on the ground, from the cloud, than working with all the different types of clay you can find, worldwide. There was a question after the conference talk, about the possible impact of this type of text mining on school work. An appropriate answer seemed to be, yes, a search for something like soil types mentioned in literature could potentially set off project work for school students. Yes, with caveats of course: enough open literature, slick dictionary creation for a broader range of topics, and this technology as a commodity.

Towards the end of the conference, the classic type of application of WikiFactMine technology was requested by a molecular biologist: cell lines. They are both a standard experimental tool in life sciences and, when a line is corrupt, a source of unreliable science. A dictionary of cell lines wasn't hard to produce. A later Wikidata debate led to the creation of a new property that allows items on papers to carry statements about experimental methodology. That assured a home for hard-to-find scientific information, and provided an instance supporting the original "project idea".

Later in August, there occurred a highlight for the latter part of the project. About midday on a Saturday, as we were gathered for the first day of a ContentMine strategy weekend, the number of scientific paper items on Wikidata passed 5 million. Six months before, this development could not have been foreseen, and WikiFactMine had played a part in pushing along science bibliography on Wikidata.

Survey(s)

N/A

Other

See Project resources, Facto Post section.

Methods and activities

Software development

The main software effort was to construct the pipeline in Wikimedia Labs, involving a number of virtual machines and tools, chained together (see diagram, right). They implemented the download and conversion of recent open access papers in XML format. The fact mining itself was an implementation of w:Elasticsearch. Further software allowed the results to be exposed in an API.

The secondary software activity was the dictionary side, involving in the end a combination of SPARQL queries, and the aaraa conversion tool.
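
To make the conversion step concrete, here is a rough sketch of turning SPARQL JSON results into a term list. It is not the aaraa tool itself: the output layout (a name plus entries with "term" and "wikidata" keys), the function name and the sample QIDs are assumptions for illustration, and the real dictionary format is documented at d:Wikidata:WikiFactMine/New dictionaries and aaraa tool.

```python
def bindings_to_dictionary(name, sparql_json):
    """Turn SPARQL JSON results with ?item and ?itemLabel into a simple term list."""
    entries = []
    seen = set()
    for row in sparql_json["results"]["bindings"]:
        # Entity URIs end in the item ID, e.g. .../entity/Q123.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        label = row.get("itemLabel", {}).get("value", "")
        # The label service returns the bare QID when an item has no label; skip those,
        # along with duplicate items.
        if not label or label == qid or qid in seen:
            continue
        seen.add(qid)
        entries.append({"term": label, "wikidata": qid})
    return {"name": name, "entries": entries}


# Invented results in the standard SPARQL JSON layout; the QIDs are placeholders.
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q111"},
         "itemLabel": {"type": "literal", "value": "podzol"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q222"},
         "itemLabel": {"type": "literal", "value": "Q222"}},
    ]},
}

print(bindings_to_dictionary("soil types", sample))
```

Filtering out unlabelled items matters in practice, because a bare QID is useless as a search term in paper text.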

A third strand was software to create Wikidata items for scientific papers. This tool, Fatameh, was enthusiastically adopted by other developers. Its maintenance required significant developer time.

Use of outreach channels

ContentMine networks across open science, academic research groups, and the tech industry, particularly from our base in Cambridge. For the WikiFactMine project, the main Wikimedia outreach events are covered by the project resources section, with some videos of presentations. Other examples included:

Personal contact with Wikidata developers, in Berlin and elsewhere
d:Wikidata:Property proposal/cell line used, debate led to the creation of the new Wikidata property "describes a project that uses" (P4510)
Vienna hackathon, Wikimania hackathon: workshop presentation on the JavaScript tool
IFLA conference 2017 librarian community presentation, CopyCamp Presentation
Virginia Tech students workshop.

Advocacy

ContentMine has been promoting Wikidata's benefits for science to numerous academic, NGO and industrial organizations. They include:

UCL, University of Illinois at Urbana-Champaign, Virginia Tech, Tilburg University
UNESCO, Cochrane, PlantImpact, Lhasa, Citrine, Ginko, E-life, MDPI, CognitiveChem

Wikimedia chapter

We have also developed strong contacts with Wikimedia UK, which was involved in the hiring process for the Wikimedian in Residence.

Project resources

Videos

Tom Arrow at WikiCite, 23-25 May, 2017.
Charles Matthews at Cambridge Text and Data Mining Symposium, July 2017
Charles Matthews, participant at that symposium in roundtable discussion, speaking from 6:39
Tom Arrow at WikidataCon, October 2017

Blogposts

Series of 12 weekly blogposts on the Moore Library website Engaging With Data, April to July 2017

WikiCite

d:Wikidata:Sources, WikiCite collation page, created 18 May 2017 from redirect

Facto Post

w:User:Charles Matthews/Facto Post, monthly mass message on English Wikipedia, first four issues from June. Signpost op-ed w:Wikipedia:Wikipedia Signpost/2017-06-23/Op-ed on 23 June on the overall role of Wikidata.

Training sessions

w:User:Charles Matthews/Training supporting links, "Reliability of Wikipedia … and how you can help", Moore Library, 22 June 2017
d:User:Charles Matthews/Training supporting links2, "Introduction to Wikidata", Moore Library, 29 June 2017
d:User:Charles Matthews/Training supporting links3, "The Many Faces of the Wikimedia Universe", Moore Library, 6 July 2017

These events were hosted by the Moore Library in their Research series. The format was a lunchtime lecture, followed by a two-hour training session.

Wikimania 2017 images

(above) Tom Arrow at the Hackathon

(below) Charles Matthews in the Community Village

Early project documentation on discuss.contentmine.org

Posts by "T" (Tom Arrow), in particular

Proposing paper on Wikidata dictionaries for content mining at 1st Workshop on Scholarly Web Mining (SWM 2017)
WikiFactMine Schematic, December 2016
Deploying WikiFactMine Pipeline on WMF Tool Labs, February 2017

Early dictionaries

d:Wikidata:WikiFactMine/Legacy dictionaries with GitHub links

Project documentation on GitHub

The WikifactMine API Endpoint, development of access to the mined facts. It refers to the Swagger User Interface.
ContentMine dictionaries upload.

Project documentation on Wikidata

d:Wikidata:WikiFactMine, around 20 pages, especially the following subpages:

d:Wikidata:WikiFactMine/Core SPARQL, warehouse of SPARQL queries developed for dictionary creation.
d:Wikidata:WikiFactMine/Dictionary list, listing of dictionaries, with visualisations etc. for the botanical series.
d:Wikidata:WikiFactMine/New dictionaries and aaraa tool, documentation for the conversion of SPARQL queries to dictionaries.
d:Wikidata:WikiFactMine/Zenodo, final destination for mined facts as link.
d:Wikidata:WikiFactMine/Fatameh tool in context
d:Wikidata:WikiProject Source MetaData/fatameh, tool homepage

Dashboard

Fatameh dashboard (access required)

Cambridge Wikimedia meetups

Meetup/Cambridge/34 held in Makespace, 16 Mill Lane, Cambridge, 3 June 2017 — ContentMine video among presentations.
Meetup/Cambridge/35 held in Makespace, 16 Mill Lane, Cambridge, 3 September 2017 — software workshop on a food science theme, as demo of the new WikiFactMine tools. Workshop links page: d:User:Charles Matthews/Food science workshop links.

Management

d:User:Charles Matthews/WikiFactMine, details of tasks

Learning

What worked well

Learning patterns/Go the extra mile, and then search for treasure

What didn’t work

Too much reliance on end users to operate the output mechanism.
Similarly, too much initial reliance on their ability to create their own dictionaries.
The emphasis on very recent papers didn't reflect scientists' interests, which go back a few years at least.
For medical applications, selective concentration on the secondary literature would have brought more results.
Search for two kinds of terms (co-occurrence) needed more technical attention, to outclass existing databases.
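
For readers unfamiliar with co-occurrence search, the following is a deliberately naive sketch: it reports pairs of terms from two dictionaries that appear in the same sentence. It is not what the project ran; the function name, the sentence splitter and the example text are assumptions, and the simple substring test would also match terms embedded inside longer words.

```python
import re


def cooccurrences(text, terms_a, terms_b):
    """Return (term_a, term_b, sentence) triples where both terms appear in one sentence."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    found = []
    for sentence in sentences:
        lowered = sentence.lower()
        hits_a = [a for a in terms_a if a.lower() in lowered]
        hits_b = [b for b in terms_b if b.lower() in lowered]
        found.extend((a, b, sentence) for a in hits_a for b in hits_b)
    return found


# Invented example: one dictionary of cell lines, one of substances.
text = "HeLa cells were treated with cisplatin. Controls received saline."
for triple in cooccurrences(text, ["HeLa"], ["cisplatin", "saline"]):
    print(triple)
```

Only the first sentence yields a pair here, since "saline" occurs in a sentence with no cell line. A production version would need proper tokenisation and a wider window than a single sentence.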

Next steps and opportunities

Grants:Project/ScienceSource takes the WikiFactMine idea to the next level.
On the WikiCite front, there are a number of opportunities, such as identifying authors of papers, and building up other metadata (e.g. topics) essential to a better understanding of the scientific literature.

Part 2: The Grant

Finances

Actual spending

Expense	Approved amount	Actual funds spent	Difference
Software development (37.5 Hrs per week for 12 months)	$40000	$44590 (Nov 2017)	-$4590
Wikimedian in Residence (37.5 Hrs per week for 6 months)	$19500	$15139 (Nov 2017)	$4361
Travelling and meetings (including workshops, meetups, project meetings and dissemination activities)	$3000	$5698 (Jun 2017)	-$2698
Wikimania & WikiCite conference (2 people, including flights, accommodation and subsistence)	$1200	$11620 (Nov 2017)	-$10420
Total	$63700	$77047 (Nov 2017)	-$13347

Remaining funds

Do you have any unspent funds from the grant? No. ContentMine supported the overspend on the project.

The following costs were incurred, according to ContentMine's internal management accounting (USD, current exchange rate to GBP used):

Total spend $104,347.60

The proposed amount was $86,200.00, grant received was $63,700.00.

Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin@wikimedia.org, according to the guidelines here?

Yes.

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement?

Yes.

Is your project completed?

Yes.

Grantee reflection

This wonderful grant came at just the right time for us (and Wikimedia) to show that Wikidata was a world-class resource. There was a lot of innovation and besides following the plan we originally set out, we also were able to develop Wikidata as a bibliographic resource. The Fatameh tool which Tom Arrow developed is now run on WMFLabs (thanks to the cooperation of all involved). It's now got over 10 million Wikidata entries and I and others have been able to blog and tweet this new resource. We enjoyed going to WikiCite and Wikimania and being able to feel part of a world community. We had a Wikimedian in Residence, Charles Matthews, as part of the team; and important support and encouragement from WMF staff.

We are completely sold on Wikidata as the central point for structured resource discovery and use. The SPARQL endpoint is the most convincing advert anywhere for the value of SPARQL and as Wikidata continues to grow - both in entry number and numbers per entry - it will start to convince the world that there is a new exciting tool that we, the community, own and develop and we can base new community knowledge engines on it.

We are now promoting Wikidata in all our projects and grant applications. The dictionary structure we developed in WikiFactMine turns out to be ideal for use by many communities and we are working closely with Thomas Shafee in WikiJournal. We have applied for another project grant, Science Source, to create a central resource of high quality annotated biomedical articles and we are also building WikiFactMine into our own (Open) products.

Thank you!! (Peter Murray-Rust, User:Petermr)