Grants talk:Project/ContentMine/WikiFactMine

From Meta, a Wikimedia project coordination wiki
Latest comment: 6 years ago by Nemo bis in topic Outputs

Change status to 'proposed' to submit for 8/2/16 Project Grants deadline

Dear petermr,

Please note that if you intend to submit this proposal for the August 2 deadline of the current round of Project Grants, you must change the status from draft to proposed. If you have any questions, our final proposal clinic is from 1600-1700 UTC on August 2.


--Marti (WMF) (talk) 05:11, 2 August 2016 (UTC)

Thanks - changed proposal status to PROPOSED. Petermr (talk) 17:12, 2 August 2016 (UTC)

Staff pay

I note that in the budget section the proposed pay for the software developer is USD 40,000/year and the proposed pay for the Wikipedian is 38,000 for the year.

One report says that the average salary in Cambridge is GBP 46,000, or USD 61,000. The work proposed in this project seems technical. Can you explain why and how you plan to recruit for a technical position in this city while offering much less than average pay for the area? Blue Rasberry (talk) 14:02, 2 August 2016 (UTC)

The average pay is across all sectors and various levels of seniority. We will be recruiting for a junior to mid-level technical role in a non-profit, and the salary offered is in line with technical roles at this level at the University of Cambridge and with our existing rates. Based on our experience with previous recruitment activities in Cambridge, we do not foresee problems in attracting suitable candidates. --Jcmolloy (talk) 23:14, 20 September 2016 (UTC)

Relationship between grantee and ContentMine

I see here that petermr founded and seems to head ContentMine.

The grant is proposed to go to Peter. Can you clarify how much of the grant work will be managed by Peter as an individual, and how much will be a part of ContentMine? It seems like ContentMine will be involved in everything, correct? Blue Rasberry (talk) 14:10, 2 August 2016 (UTC)

petermr has been funded by the Shuttleworth Foundation to start and run ContentMine - this pays for staff and expenses. petermr himself runs CM on a voluntary basis. The project would be managed by petermr and staff as appropriate. ContentMine will provide in-kind support for the WikiFactMine project throughout its funded lifetime. — The preceding unsigned comment was added by Petermr (talk) 17:01, 2 August 2016 (UTC)

Other funding and in-kind contributions

Can you describe what kind of support ContentMine or the University of Cambridge might contribute to this project? State if other funding support is likely, and also mention any in-kind contributions. Blue Rasberry (talk) 14:15, 2 August 2016 (UTC)

The servers on which the code will be run have been provided through university funds which petermr held. We are working with the other departments to see whether additional funds can be provided. — The preceding unsigned comment was added by Petermr (talk) 17:01, 2 August 2016 (UTC)

ContentMine and commercial interests

I see that ContentMine embraces open content and licensing, including CC-by licensing for text, CC0 data licensing, and that all the software uses MIT's open source license.

Is ContentMine a nonprofit organization, or a company? Either way, how likely is it that ContentMine could put any part of the output of this project into its own closed product? Blue Rasberry (talk) 14:10, 2 August 2016 (UTC)

ContentMine is a UK company limited by guarantee with a specific OpenLock in its articles that prevents acquisition by predatory closed companies (e.g. commercial publishers). CM deliberately does not create closed products and all material is made available in Open repositories such as GitHub. — The preceding unsigned comment was added by Petermr (talk) 17:01, 2 August 2016 (UTC)

Precedent from similar projects

Briefly, and with links to further reading if possible, can you point to any project similar to WikiFactMine which has been done before? Briefly, can you say what WikiFactMine might do that has not been done before? Blue Rasberry (talk) 14:12, 2 August 2016 (UTC)

Usefulness for Medical Wikipedians

This "informing... Wikipedia editors of relevant citation-supported facts" is not really the problem we have. We are lacking people interested in slowly going through high quality review articles and other secondary sources and paraphrasing them in simple language.

We actively discourage the use of primary sources, which appears to be what this project will be mostly working on. The example given is for the type of primary source that we as Wikipedians should never be using for anything medical. It is a study of yeast in cell culture and therefore has no current application to humans.[1] The summary left out the most important word, "potential", before "use of lemon essential oil-based products as natural remedies against candidiasis". This is bench top research of the earliest form.

Doc James (talk · contribs · email) 08:32, 3 August 2016 (UTC)

Thanks. These comments are useful. The example given had a more chemical/bioscience emphasis and wasn't meant to emphasize medicine. A more typical medical result would be to alert to mentions of clinical trials - we are currently part of the Open Trials group.
We are more interested in systematic reviews than single trials. While we are supportive of single trials being open and of the work of OpenTrials and AllTrials generally, single trials just are not good sources for WP. We want experts to review all the available research and come to an overall conclusion. Doc James (talk · contribs · email) 23:08, 7 August 2016 (UTC)


Just to reinforce what Doc James has written above, Wikipedia prefers secondary sources, not only for medical content (see MEDRS), but for all content (see SCIRS and secondary). Also, while the example sources you have given are from en:Europe PubMed Central and hence not hidden behind a paywall, it is more important that the sources are secondary than that they are open. Fortunately this is easy to do in the advanced search in PubMed by restricting the publication type to "review".

The stated aim of this project is "To make Wikidata the primary resource for identifying objects in bioscience". As already stated by others elsewhere, the biggest challenge is not finding high quality sources (en:PubMed already does an excellent job of this), but paraphrasing these sources to produce Wikipedia content. Boghog (talk) 06:24, 27 August 2016 (UTC)

This is exactly the type of feedback that will be useful for developing appropriate information feeds that are useful to editors, so thanks. We could filter for reviews. As more of these are closed access relative to primary articles we may uniquely be able to offer a feed of snippets and facts from them, particularly those that don't exist as full-text in Europe PubMed Central. The project will not help find more editors but we hope it will make it easier for existing volunteers by extracting papers and parts of papers that are likely to be relevant, thereby reducing the amount of time they need to spend searching for relevant information to paraphrase. --Jcmolloy (talk) 23:28, 20 September 2016 (UTC)
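
As a rough illustration of the review filter mentioned above, the sketch below builds a PubMed search URL against the public NCBI E-utilities `esearch` endpoint and restricts it with the publication-type field tag `[pt]`. The function name `review_search_url` and the example topic are assumptions made for illustration; this is not how ContentMine's pipeline is known to query the literature.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint for PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def review_search_url(topic: str, max_results: int = 20) -> str:
    """Build an esearch URL whose results are limited to review articles.

    `review_search_url` is a hypothetical helper, not part of any library.
    """
    # "[pt]" is PubMed's field tag for publication type.
    term = f"({topic}) AND review[pt]"
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative topic only; no network request is made here.
url = review_search_url("candidiasis essential oil")
print(url)
```

Fetching the URL would return a list of PubMed IDs, which a feed could then use to prefer secondary sources over primary ones.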

As far as Wikidata goes, the primary sources tool manages to get actual factual claims like Katō Kiyomasa is male and is from Japan. It provides not only the sources where the claims are made but it provides also the claims themselves. For WikiFactMine to be useful for Wikidata it would not only have to say "paper X speaks about protein Y" but it would also have to parse the claims into the structure of Wikidata. ChristianKl (talk) 15:09, 21 September 2016 (UTC)
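
To make the distinction above concrete, here is a minimal sketch contrasting a bare mention ("paper X speaks about Y") with a statement shaped like a Wikidata claim (item, property, value, reference). The class and function names and the identifiers `PAPER-1` and `Q_SUBJECT` are hypothetical placeholders; only `P21` (sex or gender) and `Q6581097` (male) are real Wikidata identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """What a term tagger alone yields: a paper mentions an entity."""
    paper_id: str   # hypothetical identifier of the source paper
    entity_id: str  # Wikidata item matched in the text


@dataclass(frozen=True)
class Claim:
    """A statement in the item / property / value / reference shape."""
    subject: str
    prop: str
    value: str
    reference: str


def to_claim(mention: Mention, prop: Optional[str], value_id: Optional[str]) -> Optional[Claim]:
    """Promote a mention to a claim only when a property and value were extracted."""
    if prop is None or value_id is None:
        # A bare mention carries no statement a reviewer could approve.
        return None
    return Claim(mention.entity_id, prop, value_id, mention.paper_id)


mention = Mention(paper_id="PAPER-1", entity_id="Q_SUBJECT")  # placeholders
claim = to_claim(mention, "P21", "Q6581097")  # sex or gender: male
print(claim)
print(to_claim(mention, None, None))
```

The second call returns `None`, which is the case described above: a mention without a parsed property and value leaves the whole statement for the user to write.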

Relevant Wikidata communities

I suggest you also notify relevant Wikidata communities, most notably WikiProject Molecular biology, WikiProject Chemistry, and WikiProject Medicine. Best, Andrew Su (talk) 17:44, 3 August 2016 (UTC)

have already done so.
A note on d:Wikidata:Project chat might reach some more people. --Lydia Pintscher (WMDE) (talk) 12:51, 5 August 2016 (UTC)

Some thoughts

Sounds great, and these comments are intended to provoke discussion, not criticise:

  • Please notify the Wikispecies community
  • I'd like to see more detail of what you think a Wikimedian in Residence would contribute to the project, and the reasons for offering that person a salary below the upper end of the range suggested by WMUK (disclosure: I have worked as a WiR (not least in the Royal Society of Chemistry role mentioned), hope to do so again, and am currently pro-bono WiR at ORCID)
  • You say "[a] paper would be automatically offered to editors." Please explain what this "offer" would entail. For instance, a button to "create Wikidata item"?
  • "W3C annotation standards" - which standard? Please add a link.
Web annotation project. Petermr (talk) 18:33, 21 September 2016 (UTC)

Please let me know if I can help in any way. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:45, 4 August 2016 (UTC)

Primary sources tool?

Do you plan to feed the data into the Primary Sources Tool? Together with the work done as part of StrepHit and upcoming improvements this might just be what is needed to push it over the edge and make it the great import tool it should be. --Lydia Pintscher (WMDE) (talk) 12:53, 5 August 2016 (UTC)

@Lydia Pintscher (WMDE): completely agree. I believe the planned T6 and T7 technical activities can be leveraged for direct collaboration with the StrepHit team. The shared goal would thus target the primary sources tool usability. Since ContentMine and StrepHit have already started joint work at WikiCite 2016, it would be really useful to follow up on this. Specifically, T6 would match the back-end part of the StrepHit T1 task, while T7 would match the front-end part. --Hjfocs (talk) 13:32, 28 September 2016 (UTC)

Some comments

  • The first section does not clearly state the goal(s) of this project.
  • It is not clear to whom the grant will be made. In one place it is stated that "The grant would be to ContentMine Limited". On the other hand, in the infobox User:petermr is listed as the grantee.

Peter Murray-Rust is a Director of ContentMine Ltd which is a non-profit company limited by guarantee and the company would administer the grant. The application form infobox is set up in such a way that it links directly to user pages so it made most sense to put Peter as the grantee as he is the primary applicant. --Jcmolloy (talk) 23:46, 20 September 2016 (UTC)

  • As I understand it, the project is going to use the existing software, which has already been developed by ContentMine, plus 4-8 Wikidata-based custom dictionaries to mine the literature on a daily basis, produce ~10k "facts" per day, and then deposit them in Zenodo. If this is true, what kind of development work will the proposed full-time developer be tasked with?

ContentMine has core software that we have already developed to scrape and extract facts, but there are enhancements that need to be made for it to function in a useful way for this project. You can see the technical tasks and time allocation [[2]]. The facts currently go to Zenodo but the intention is to create automated feeds. --Jcmolloy (talk) 23:46, 20 September 2016 (UTC)

  • I think you should better define what you mean by a "fact" in the context of this project.
  • The project lists three participants but it is not clear what their responsibilities are. It is also not clear if the full time software developer and full time Wikipedian-in-Residence will be selected from these three participants or not? Who will be responsible for dictionary creation, data mining, server operations, data publication?
  • The sustainability section states that this is a pilot project. I think that this should be stated at the beginning of the proposal (in the solution or goals section). In addition, if it is a pilot project, you should probably state the criteria (in the Measures of success section) by which the project will be judged sufficiently successful for continuation.
  • A more general question: How does the proposed article data mining compare with existing search tools like Google Scholar? I can already type a combination of words (a "fact") and find the references that are relevant to it.

Ruslik (talk) 15:48, 14 August 2016 (UTC)

It's not entirely clear how these "dictionaries" will be available going forward. So the proposal is to fund ContentMine to develop software to build themselves the dictionaries from Wikidata? --Jura1 (talk) 17:10, 14 August 2016 (UTC)
The software is all Open (Apache2) and so are the dictionaries (CC 0). ContentMine is a non-profit with an OpenLock commitment so cannot be bought by commercial enterprises. Petermr (talk) 05:55, 21 September 2016 (UTC)

Example Facts

Some example facts (with some connections to Wikidata IDs) can be seen here: factvis to give a more visual idea of the data we've put on Zenodo. Bomarrow1 (talk) 18:47, 21 September 2016 (UTC)

The data in that link doesn't seem to be directly useful to Wikidata or Wikipedia users. In contrast to the Primary Source tool where a Wikidata user just has to press "approve", in this case the Wikidata user still has to write the whole statement himself. ChristianKl (talk) 20:00, 29 September 2016 (UTC)

How about Claims?

As said above, I think the use of the word "fact" here is pretty confusing. Perhaps this is related to legal terminology? I think it would be useful to add some content to the proposal about what Wikidata claims will be created by the various workflows you describe. The most obvious one, I think, is the claim that links an item about a journal article to items that describe the article through a property like 'primary topic'. Apart from that, I guess that you would hope to feed systems that resemble the Wikidata game with the extracted snippets and allow the users to define new claims? --I9606 (talk) 23:24, 23 September 2016 (UTC)

@I9606: I am of the same opinion. It would be nice to explicitly clarify WikiFactMine input (i.e., free text) and output (i.e., Wikidata claims), maybe by expanding the two running examples. --Hjfocs (talk) 13:55, 28 September 2016 (UTC)


In the biomedical domain there are a number of sophisticated systems for doing NER. For some of the types you describe (especially genes), non-dictionary (e.g. machine learning) based systems like BANNER have much higher accuracy (though of course are much slower). I think the fundamental concept of building tools that identify unique concepts in the scientific literature and anchoring these to Wikidata items is fantastic, but I am wondering whether it would make some sense to explore taggers that have been evolving specifically in this space for some time as you develop your pipeline. Some good open examples are at PubTator http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#RESTfulIntroduction --I9606 (talk) 23:24, 23 September 2016 (UTC)

@I9606: thanks for the pointers to domain-specific NER tools. I would like to add that StrepHit has already implemented a domain-agnostic NER facility [3] (actually more powerful than NER, since it also identifies abstract concepts, besides real-world entities). Hence, I believe WikiFactMine could definitely benefit from StrepHit NLP technology, which is already stated in the proposal by the way: Grants:Project/WikiFactMine#Technical_Activities. --Hjfocs (talk) 14:09, 28 September 2016 (UTC)
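
For readers unfamiliar with the dictionary-based tagging discussed in this thread, the sketch below shows the general idea: look up known terms in free text and anchor each hit to an item ID. It is a toy illustration only, not ContentMine's, BANNER's, or StrepHit's implementation; the `tag` function, the dictionary contents, and the `Q_...` identifiers are placeholders, not real Wikidata items.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical dictionary: surface term -> item ID (placeholder IDs, not real items).
DICTIONARY: Dict[str, str] = {
    "carvone": "Q_CARVONE",
    "candida albicans": "Q_CANDIDA_ALBICANS",
    "lemon essential oil": "Q_LEMON_OIL",
}


def tag(text: str, dictionary: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (offset, matched text, item ID) for every dictionary term found."""
    # Longest terms first, so multi-word terms win over shorter overlapping ones.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


hits = tag("Lemon essential oil inhibited Candida albicans in culture.", DICTIONARY)
for offset, term, item in hits:
    print(offset, term, item)
```

Exact string lookup like this is fast but cannot recognise spelling variants or ambiguous names, which is the accuracy gap relative to machine-learning taggers raised above.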

Eligibility confirmed, round 1 2016

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for round 1 2016 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 1 2016 begins on 24 August 2016, and grants will be announced in October. See the schedule for more details.

Questions? Contact us.

--Marti (WMF) (talk) 18:06, 23 August 2016 (UTC)

Broken Link

Just a minor comment: the link in Grants:Project/WikiFactMine#What_is_your_solution.3F, second paragraph, appears to be broken. I think it should point to wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements. --Hjfocs (talk) 13:45, 28 September 2016 (UTC)

WikiFactMine and StrepHit

Here is a list of key points that can lead WikiFactMine to tighten the collaboration with StrepHit:

  1. T6 and T7 (Grants:Project/WikiFactMine#Technical_Activities) could focus on the primary sources tool, as pointed out in #Primary_sources_tool.3F;
  2. C2 (Grants:Project/WikiFactMine#Summary_of_Community_Engagement_Activities) is definitely a shared task, meant to rethink the whole primary sources tool workflow;
  3. the StrepHit team is happy to actively contribute to the Wikimania 2017 workshop, as per C5 and Grants:Project/WikiFactMine#Wikimania_2017_Workshop;
  4. WikiFactMine would benefit from StrepHit entity linking facility, as discussed in #Dictionaries;
  5. the integration of Hypothes.is (mentioned in Grants:Project/WikiFactMine#Carvone) is also foreseen by StrepHit for the primary sources tool, cf. wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements#Highlight_text_snippet_used_to_extract_the_claim.

Hjfocs (talk) 14:46, 28 September 2016 (UTC)

Aggregated feedback from the committee for Project/WikiFactMine

Scoring rubric Score
(A) Impact potential
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Community engagement
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
(C) Ability to execute
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient?
  • Do the participants have the necessary skills/experience?
(D) Measures of success
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
Additional comments from the Committee:
  • The proposed technology is likely to result in significant positive impact through the generation of Wikidata content, and appears to be sufficiently open to allow scaling and future enhancement.
  • The proposal presents an innovative approach to generating new Wikidata content that effectively leverages the current wave of public and institutional interest in Wikidata as well as the emerging interest in machine-generated and machine-augmented content.
  • Good impact but extensive project; in my opinion it should be split into a pilot project to check feasibility, followed by an extensive project, in order to decrease risk and costs.
  • Good potential impact by suggesting sources to be used in articles. Can be scaled to other topics and thus be used by more WikiProjects.
  • Will be useful for Wikipedians who are active in adding citations. But, hard to scale and adapt without similar paid Wikimedians in Residence. As Doc James and many others rightly point out in the talk page, lack of resources is not the fundamental issue. We need more editors who can curate them.
  • The proposal builds on the methods and tools developed by ContentMine. It mentions some new development work but does not specify the nature of this work. It has clear project goals and measures of success. However its long term impact is unclear.
  • I worry about sustainability after the grant ends.
  • Very innovative project with accurate and reasonable plan and realistic measures of success.
  • Good innovation. Instead of suggesting articles based on keyword density, they may have an algorithm to give weightage for citation index too. That can give more relevant suggestions.
  • The elements of the proposal that directly relate to the creation of the proposed Wikidata dictionary and automatic content generation model, such as the proposed software development and community engagement efforts, are reasonably scoped and budgeted. However, the proposed Wikipedian-in-Residence position, which constitutes a significant portion of the proposed budget, does not appear to be a critical element of the proposed work; I would suggest removing the funding associated with this position (or requesting it under a separate grant application) to reduce the cost of the project.
  • I have concerns about how this grant will be executed. It is not clear who is the grantee: ContentMine or Peter Murray-Rust in his personal capacity? The role of ContentMine in this project should be stated clearly. In addition, the responsibilities of project participants should be better defined.
  • I have little doubt that this will be done appropriately as Magnus is probably the best advisor for Wikidata one can find.
  • They have a clear track record and have a working software already. At the same time, that they have left many questions unanswered here - https://meta.wikimedia.org/wiki/Grants_talk:Project/WikiFactMine#Some_comments - is a concern.
  • The proposal presents a compelling approach for building and maintaining community engagement. However, the responses from community members to date indicate a lack of community support for the proposed approach, particularly among key community stakeholders in the medical topic areas.
  • The community is engaged but the authors of the proposal should still expand their outreach activities. It would help to better define what assistance and databases Wikipedia editors really need.
  • They have already started discussing these proposals with target communities, and there is rather significant community support from enWiki. Still, there is some criticism (most notably from Doc James) that has to be addressed.
  • I would have loved to see more signups from Wikidata / Wikipedia community willing to add the data they mine. Right now, it totally seems dependent on the paid Wikipedian in residence.
  • In principle, this is an excellent proposal for creating an innovative and scalable model for creating and enhancing Wikidata entries. However, the concerns raised regarding the proposed use of primary sources to generate medical content are significant and cannot readily be resolved without substantial changes to the scope of the proposed work. I would suggest revising the proposal to focus on a different topic area where such concerns are not present or can be more readily mitigated, and resubmitting it in a future round.
  • I would ask to apply for a limited timeline as a pilot project, and re-apply for a second phase afterwards, analyzing the impact and the feasibility of these results.
  • I would recommend a close collaboration with editors and Wikidata people to understand the types of sources they want. As long as that happens, I think this will be successful.
  • The role of ContentMine should be clearly stated: how it will benefit from the grant; what work its staff will do.
  • The long term future of the proposed approach (taking into account that it is a pilot project) should be better defined as well as its sustainability. I would encourage the proposers to re-submit the revised proposal in the next round.
  • It is ambiguous if they are a for-profit company. They claim they are a https://en.wikipedia.org/wiki/Private_company_limited_by_guarantee which does allow profits to shareholders. In their website, they haven't used the explicit word non-profit anywhere. It is found only in the grant proposal. Their answer regarding this in the talk page to this - https://meta.wikimedia.org/wiki/Grants_talk:Project/WikiFactMine#Some_comments - is not clear. I would recommend they also explore funding from other sources like Google which have an interest in Wikidata.

This proposal has been recommended for due diligence review.

The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.

Next steps:

  1. Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
  2. Following due diligence review, a final funding decision will be announced on Thursday, May 27, 2021.
Questions? Contact us at projectgrants (_AT_) wikimedia  · org.

petermrj, our interview with you was part of the due diligence process (I'm getting these comments posted late). You are still welcome to record any response you have to committee comments on your talkpage to make your feedback publicly accessible, but I think we gathered the information we needed for the committee during our talk with you. Best regards, Marti (WMF) (talk) 18:46, 2 October 2016 (UTC)

Round 1 2016 decision

Congratulations! Your proposal has been selected for a Project Grant.

WMF has approved partial funding for this project, in accordance with the committee's recommendation. This project is funded with $63,700.

Comments regarding this decision:
The committee is pleased to support your project with partial funding to automatically mine scientific literature for data to add to Wikidata. We value the opportunity to bring the innovation of ContentMine to the Wikidata community. We appreciate your sensitivity and responsivity to concerns raised by some members of the Wikipedia community of editors, and we’re looking forward to working with you to make sure the project design sufficiently addresses these concerns before moving forward. We believe that can be done, in part, by designing sound volunteer curation processes that can then be coordinated by the Wikipedian in Residence. Because of the need to establish these processes, however, we expect this position to have a gradual start, so funding has consequently been reduced to halftime. Along with many others, we are excited to see what new possibilities WikiFactMine will open up for Wikidata!

Next steps:

  1. You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
  2. Review the information for grantees.
  3. Use the new buttons on your original proposal to create your project pages.
  4. Start work on your project!

Upcoming changes to Wikimedia Foundation Grants

Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.

Outputs

I've read Grants:Project/ContentMine/WikiFactMine/Final and clicked a number of links but I didn't manage to find a concrete example of something that was produced. I found a list of SPARQL query links, some kind of viewer which might be akin to resonator and an Elasticsearch dump on Zenodo, but no use case for them.

Is it fair to say that the main output is fatameh, a bot originally written in the Vienna hackathon and used for millions of edits by the Research Bot? So, essentially, a bulk import of citations?

If not, it would really be useful to add at least an example or story of at least one user using one of your products to actually do something. Thanks, Nemo 08:48, 19 May 2018 (UTC)