

From Meta, a Wikimedia project coordination wiki

Timeline for ContentMine

We are assigning dates to tasks/milestones. See Grants:Project/ContentMine/WikiFactMine/Planning for tasks without dates.

Milestone | Target date
Advertise for Wikimedian in Residence | 31/01/2017
Port Canary to Tool Labs (T5 from proposal) | 28/02/2017
1000 new dictionary entries from WiR outreach | 30/04/2017
Graphical tool to ingest weekly feeds and present to editors | 15/05/2017
Gadget to suggest journal articles relevant to a Wikipedia article/Wikidata item | 30/06/2017
100 people attend in-person or online events on linking the scientific literature, Wikidata and Wikipedia | 30/07/2017
Gadget to suggest references for unreferenced Wikidata statements | 30/08/2017
Suggested "main topic" statements via the Primary Sources Tool | 30/10/2017

Monthly updates


This month we've just been getting started:

  • Development of the API: an initial version of date querying with dummy data is now available on tools [1].
  • Getting permission to use the Elasticsearch (ES) cluster on Tool Labs; see T149709.
  • Working out the details of the Wikimedian in Residence (WiR) position.


Progress on the software development front:

  • Using the ES cluster on Tool Labs as the back end for the API (sample data is available for the date 2016-12-09)
  • Searching by 'ingestion date' via the API is now supported
  • A basic GUI for date searching is now available in the 'FactVis' demo (see https://tarrow.github.io/factvis/)
  • Begun porting Canary to standalone commands that run on Tool Labs

You can follow this progress in more detail at https://github.com/contentmine/wikifactmine-api

Wikimedian in Residence: good progress made on arranging details such as desk space, insurance, and internet access.


We attended the WMF Developer Summit. We started work on the Karoo tool, a port of Canary to a command-line application suitable for running on WMF Tool Labs. Code was written to implement:

  • Correctly mapping and clearing the Elasticsearch indices for papers and facts
  • Loading papers from disk into the Elasticsearch paper index
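The two steps above can be sketched roughly as follows. The index name, field names, and on-disk paper layout here are illustrative assumptions, not Karoo's actual schema:

```python
import json
import os

# Hypothetical mapping for the paper index: one keyword ID field,
# full-text title, and the ingestion date used for date queries.
PAPER_MAPPING = {
    "mappings": {
        "properties": {
            "pmcid": {"type": "keyword"},
            "title": {"type": "text"},
            "ingestion_date": {"type": "date"},
        }
    }
}

def bulk_actions(paper_dir, index="papers"):
    """Yield action/document pairs in the shape expected by the
    Elasticsearch bulk API, one pair per JSON paper file on disk."""
    for name in sorted(os.listdir(paper_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(paper_dir, name)) as f:
            doc = json.load(f)
        # The action line names the target index and document ID...
        yield {"index": {"_index": index, "_id": doc["pmcid"]}}
        # ...and is immediately followed by the document itself.
        yield doc
```

The pairs from `bulk_actions` would then be streamed to the cluster's `_bulk` endpoint; clearing and recreating the index with `PAPER_MAPPING` before a load gives the "mapping and clearing" step.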

We also fixed a bug in Canary that produced fewer facts than expected because we were not using the Elasticsearch scroll API correctly.
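To illustrate the class of bug: the scroll API returns results one page at a time, and each further page must be requested with the scroll ID from the previous response; reading only the first response silently drops everything after the first page. A minimal sketch of a correct scroll loop (the client interface and names are assumptions, not Canary's actual code):

```python
def scroll_hits(client, index, query, scroll="2m", size=100):
    """Yield every hit matching query, paging through results
    with the Elasticsearch scroll API."""
    resp = client.search(index=index, body=query, scroll=scroll, size=size)
    while resp["hits"]["hits"]:
        for hit in resp["hits"]["hits"]:
            yield hit
        # Ask for the next page with the returned scroll ID.
        # Omitting this loop is exactly the kind of mistake that
        # makes extraction return fewer facts than expected.
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll=scroll)
```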


The news this month is that we have just advertised the WiR position! Please see the ContentMine website (http://contentmine.org/jobs) for further details and how to apply.

Applications closed at the end of this month and we're now preparing for interviews.



Advertising for the WiR was a great success: we received a large number of very well-qualified candidates. We've conducted two rounds of online interviews and have now made an offer to the most suitable candidate. We're just organising the administrative details before work can start. It's a very exciting time indeed!


Development work has also made good progress after stalling for a while due to problems getting everything running on Labs.

We had problems using the Elasticsearch cluster available for Tool Labs. Unfortunately, it seems likely that the repeated queries we were making to extract facts occasionally caused the cluster to fall over. You can read about this on Phabricator [2].

To debug this, an Elasticsearch node was set up under the wikifactmine Labs project. Although this took quite a while, it turned out that a single small node on our own Labs project did not fall over during fact extraction.

Thankfully, this means the pipeline is currently working, and you can get the facts that we extracted today (from all open-access papers published on EuropePMC exactly 30 days ago).

You can see this via the API's Swagger interface [3] and by using the date selector at the bottom of the fact visualiser [4].



The excellent Charles Matthews started working for the project this month. He wasted no time in starting outreach at the Moore Library in Cambridge. He wrote a nice introductory blog post.


The main advance in development work this month is the ability to search for facts by related Wikidata item. Again, this can be seen on the API's Swagger interface [5]. The next step is to use this to create a gadget that displays a selection of facts as you browse different items (for which we have facts) on Wikidata.

We also now have a dashboard where you can see the number of facts extracted each day [6].



A user script for Wikidata has been developed to showcase the facts we have for each item. It is backed by the WikiFactMine API. You can find both the code and instructions for installing it in your common.js on GitHub.

We also attended the WMF Hackathon and WikiCite in Vienna. An important proof of concept was put together there with help from User:Tobias1984. Fatameh is an OAuth tool that can create Wikidata items, on demand and in a very frictionless fashion, for academic papers that have a PMID (along with associated items such as authors where appropriate). WikiFactMine hopes to use this tool to create items to which we can then offer statements for addition via the Primary Sources Tool. The code is available on Phabricator.


Five blog posts were written.

Arranged a Cambridge Wikimedians' meetup on Meta at Meetup/Cambridge/34.

Helped with some WikiCite preparations closely related to the WFM project; you can explore the tasks on Phabricator. The portal-style page d:Wikidata:Sources was created.



Continued work on Fatameh: bug fixing, enabling its use with PMCIDs as well, and enabling command-line usage through API keys so it can be used in a highly automated way. By around the end of June it had made (or attempted to make) more than 50,000 Wikidata items. Its landing page on Wikidata is here.


Four blog posts were written.

Ran Cambridge Meetup 34

Delivered issue 1 of FactoPost by mass delivery to around 50 stakeholders on en-wiki.

We also had a meeting with Marti Johnson from the WMF. The slides are here.



Most development work this month focused on producing aaraa, a tool to help Wikimedians create their own WikiFactMine dictionaries. More documentation can be found by following the link; the tool itself can be found here. Work also went into supporting the Fatameh tool, which saw increased usage.


Two blog posts were written, including the July issue of FactoPost.

Much work then focused on supporting the development of aaraa, as well as using it to produce a wide range of dictionaries from SPARQL queries. The queries can be seen here, and the dictionaries made from them can be seen on GitHub.
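As a rough illustration of the dictionary-building idea (the binding names and output shape here are assumptions, not aaraa's actual format), the standard JSON result of a Wikidata SPARQL SELECT can be turned into a term-to-item mapping like this:

```python
def results_to_dictionary(sparql_json):
    """Map each item label to its Wikidata QID, given a SPARQL
    SELECT JSON response with ?item and ?itemLabel variables."""
    entries = {}
    for row in sparql_json["results"]["bindings"]:
        # Item values are full entity URIs; keep only the QID.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        entries[row["itemLabel"]["value"]] = qid
    return entries
```

A dictionary built this way pairs each surface term with the Wikidata item it should link to, which is what the fact-extraction pipeline needs to annotate papers.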


August saw us send two people to Wikimania in Montreal, as well as to the hackathon beforehand. We ran a workshop at the hackathon and gave a talk at the main conference. We also ran a stand for most of the conference, giving Wikimedians the chance to find out about the project and to have a go at making a dictionary for a topic that interests them.

We also went to the IFLA World Library and Information Congress to present the project to librarians.


Time not spent travelling was used for bug fixing in the WFM API/pipeline, aaraa, and Fatameh.


Lots of documentation was written under our Wikidata 'landing page', and more dictionaries were created.

The August issue of Facto Post was written and can be read here.
