Project Grant proposal submissions due 30 November!
Thanks for drafting your Project Grant proposal. As a reminder, proposals are due on November 30th by the end of the day in your local time. In order for this submission to be reviewed for eligibility, it must be formally proposed. When you have completed filling out the infobox and have fully responded to the questions on your draft, please change status=draft to status=proposed in the Probox template on your grant proposal page to formally submit your grant proposal. Importantly, proposals that are submitted after the deadline will not be eligible for review during this round. If you're having any difficulty or encounter any unexpected issues when changing the proposal status, please feel free to e-mail me at cschilling@wikimedia.org or contact me on my talk page. Thanks, I JethroBT (WMF) (talk) 23:17, 29 November 2018 (UTC)
Eligibility confirmed, round 2 2018
We've confirmed your proposal is eligible for round 2 2018 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through January 2, 2019. Questions? Contact us.
Endorsement
I used the Endorse button, but it hasn't given an icon as it used to, and it splits paragraphs. This can make it a little difficult to count rapidly. Petermr (talk) 17:56, 13 December 2018 (UTC)
Clarifying the goals of the project
Two ContentMine grants have already been funded (Grants:Project/ContentMine/ScienceSource and Grants:Project/ContentMine/WikiFactMine), so it may be worth looking back at what they have achieved. Can someone point me to the Text and Data Mining outcomes of these projects? I think it's important to be honest about the sort of work carried out in these grants: it seems to me that the previous two proposals significantly overstated this aspect. Creating Wikidata items about scientific articles is not TDM. It seems to me that the main activity in the past ContentMine projects was organizing and attending events. Community outreach is important, but that seems to be quite different from the stated goal and outcomes.
Why do you want to create a dedicated Wikibase site for this project? Why is Wikidata not suitable for this work? At the moment only two users have edited the ScienceSource wiki over the past 30 days; does that not show that it is very hard to build a community on a new wiki? Why would it be any different for this project?
- @Pintoch: Thanks for these questions, which are indeed natural ones. I can answer on behalf of ScienceSource.
- Issues with WikiFactMine were addressed in its final report. Progress with ScienceSource can be gauged in its midpoint report. The grant for ScienceSource was made in May, and was conditional on (a) a detailed communication plan to be submitted to the WMF, and (b) a chance for WMF staff to give input on the user experience (UX).
- In that context I can make these points:
- Text-mining has gone on in a pilot run of about 100 papers on the ScienceSource wiki, and the annotations produced can be seen in these bot contributions (about 7K). Conversions of pages to HTML can be seen in this Special page.
- While the data schema could be adapted to Wikidata by around 20 requests for property creation, and the text-mining bot ScienceSourceIngest could be adapted to run on Wikidata as it does on http://sciencesource.wmflabs.org/, there have to be questions about all that. Rightly. The potential scale of this sort of text-mining is maybe two orders of magnitude greater than the WikiCite effort that now has contributed about one third of all Wikidata items.
- The HTML pages could possibly be hosted on the English Wikisource, but that community does not support the posting of large numbers of open access papers at this point in time. They are in the Article: namespace on the ScienceSource wiki, which is set up with its own configuration that includes the ability to post raw Parsoid HTML.
- In terms of UX, developer time will be given shortly by ScienceSource to making a convenient user interface (UI) for its wiki. The intended workflow on the site could be carried out right now, just about, by someone familiar with the data schema, but it is not so practical. Models under consideration are the front end of the Hypothesis client, with radio buttons rather than free text in the annotation box; and the functionality of the TABernacle tool available for Wikidata.
- With these considerations made explicit, perhaps it is clearer why ScienceSource would surely face community issues if hosted on Wikimedia projects. There is now a generic tool for text-mining into a Wikibase site. Working with an adequate annotation model appears to need custom UI development, which is in hand. Contractually, ScienceSource will involve WMF people in its UX development process, and in its outreach it is following a comms plan sent to the WMF grant-runner.
- Going back again to WikiFactMine, I had a look at the actual dump of data at Wikimania, with a Wikidata developer. See http://sciencesource.wmflabs.org/wiki/WikiFactMine for some information and comments, and a link for finding where the data is warehoused on Zenodo. It is usable for certain kinds of searches, and that could possibly help ScienceSource fill gaps in coverage. That page could be expanded.
- Turning now to Diversitech: it will benefit from the text-mining and front end software developed for ScienceSource. The outcomes of ScienceSource (after 11 months say) would be taken into account in Diversitech planning. It is quite correct to say that developing a community can be hard, but the ScienceSource project can make it somewhat easier in this particular field.
- The big-picture "goals of the project" for Diversitech are really for someone else to comment on.
- I hope I have filled in enough technical information as background. Using a Wikibase site as a "holding area" is actually not an easy option, since the Wikibase technology is not yet slick. A detailed rationale has been given for the ScienceSource case. Charles Matthews (talk) 12:30, 17 December 2018 (UTC)
- @Charles Matthews: thanks for the detailed reply. This confirms my reservations about the project so far. I still do not think you are doing any text mining at all. You are looking for occurrences of keywords in articles. There is a world of difference between doing that and "mining facts" from text. I just do not see how items like https://sciencesource.wmflabs.org/wiki/Item:Q6798 can be useful in any way, honestly. I am not even sure how you can call that an "annotation", given that no extra information is added beyond the simple description of the occurrence. How are you expecting the community to engage with data like that? As a Wikipedia editor, how is it going to help me write articles? − Pintoch (talk) 15:11, 17 December 2018 (UTC)
- I'd like to remark first that https://sciencesource.wmflabs.org/wiki/Item:Q6798 is an anchor point item, underlying an actual annotation https://sciencesource.wmflabs.org/wiki/Item:Q6799. These are machine-readable items, and "annotation" is justified in the W3C sense of doing stand-off annotation, based on phrases that appear on the anchor point item. The schema is designed to allow useful information to be extracted by SPARQL queries. To go back to the history, an API that required developer skills to use has been replaced by a Wikibase instance where SPARQL, now a common enough skill for Wikidatans, can be used, including visualisations from the standard default views.
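The anchor-point/annotation split described above can be pictured as two linked records. The following Python sketch is purely illustrative: the field names, the PMC identifier, and the term QID are hypothetical, not the actual ScienceSource schema.

```python
# Hypothetical sketch of the two-item stand-off annotation model: an anchor
# point fixes a phrase by its character offset in the source text, and a
# separate annotation item points at the anchor and carries the
# machine-readable claim.
from dataclasses import dataclass

@dataclass
class AnchorPoint:
    item_id: str   # e.g. "Q6798" on the ScienceSource wiki
    article: str   # source article in which the phrase occurs
    phrase: str    # the matched text
    offset: int    # character offset from the start of the text

@dataclass
class Annotation:
    item_id: str         # e.g. "Q6799"
    anchor: AnchorPoint  # the anchor point item it annotates
    term_qid: str        # identifier the phrase is grounded to

anchor = AnchorPoint("Q6798", "PMC1234567", "dopamine", offset=2048)
note = Annotation("Q6799", anchor, term_qid="Q170304")
print(note.anchor.phrase, note.anchor.offset)
```

Because both records are data items rather than free text, SPARQL queries over the wiki can join annotations to their anchors and filter on offsets.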
- The basic idea there is to control co-occurrence, using the difference of offsets from the beginning of the text. That allows "proximity search" to be done, for example for drug and disease terms, which is what the SS project is concentrating on. According to Elasticsearch: The Definitive Guide, proximity search can slow down Elasticsearch by a factor of up to 20 (and Elasticsearch is what WFM used). Well, not for us.
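As a minimal sketch of that offset-difference idea (not the ScienceSource code; the terms, categories, and window size here are invented for illustration):

```python
# Proximity search over stand-off annotations: report drug/disease term
# pairs whose start offsets in the article text differ by less than a
# fixed character window.
def proximity_pairs(annotations, window=200):
    """annotations: iterable of (term, category, offset) tuples."""
    drugs = [(t, o) for t, cat, o in annotations if cat == "drug"]
    diseases = [(t, o) for t, cat, o in annotations if cat == "disease"]
    return [
        (d_term, s_term)
        for d_term, d_off in drugs
        for s_term, s_off in diseases
        if abs(d_off - s_off) < window
    ]

annotations = [
    ("metformin", "drug", 120),
    ("type 2 diabetes", "disease", 165),
    ("aspirin", "drug", 900),
]
print(proximity_pairs(annotations))  # [('metformin', 'type 2 diabetes')]
```

Because the offsets are already stored in the data items, this check is simple arithmetic over query results rather than a full-text search.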
- I don't know how far to take these technical discussions here. I think I should say, though, that the role in SS of the mined "triples" is to provide real data for a MEDRS algorithm, which is in a different field (I would say expert systems, i.e. rather traditional AI). The claim is that a good formalisation of the WP:MEDRS algorithm would be an advantage for medical editing, and the medical editors believe this (archived discussion from 2016, when this first came up). To be more precise, the difficulty with formalising MEDRS practice is in catching enough edge cases to refine "toy" versions of the basic algorithm that can be taught in five minutes. We hope, in brief, that the type of proximity search I'm talking about will produce enough test cases to take back to the medics, and learn the good practice in the field.
- The techniques for Diversitech, operating in the humanities, are likely to be rather different, indeed. Sentiment analysis may be one of them. The "pipeline" approach to analysis of a corpus of texts has to start with texts, and process them into a form which allows for explorations. The challenges there are to simulate more literary techniques to get interpretations. I can't accept that a simple rejection of the "atomic" nature of a term occurrence is valid, just because a single entry seems rather mute. Analytics can be done on such things. Charles Matthews (talk) 22:04, 17 December 2018 (UTC)
- Thanks for the details, I understand your approach a bit better. It seems that the basic technical foundations behind your proposed text mining are still to be designed; that basically amounts to solving open problems in natural language processing (both in the case of Science Source and Diversitech). I think the ContentMine team is great at communicating and gathering momentum around its projects, but is it really well equipped to conduct basic natural language processing research? Even if you eventually managed to get a decent research prototype, the remaining work to get it working to a reasonable quality standard, and in a way that is useful to the community, is immense! I think it is really important to be honest about this, given that this is all funded by the WMF, so by donations. − Pintoch (talk) 22:45, 17 December 2018 (UTC)
- I should say that w:natural language processing approaches have been considered for ScienceSource, and have been rejected. What you see is a 12-month project in month 7; at the end of November we were given access to federate with the Wikidata Query Service, which allows for important advances. Thank you for your input. Charles Matthews (talk) 04:57, 18 December 2018 (UTC)
- Thank you for your comments and requesting clarification on the way in which the project might help you or other Wikipedia editors write articles. We agree that the problem statement and summary of the solution were not clear enough on this issue so we have restructured them to add more detail on that point. Jcmolloy (talk) 02:03, 19 December 2018 (UTC)
- Specifically on w:Text mining. Quoting that article: "the process of deriving high-quality information from text"; "Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities)"; "Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics." I have been involved in text mining for over 25 years, with frequent peer review, and much of the work involves named entities (words and stock phrases). These are used for information retrieval (document classification), semantic tagging (part-of-speech tagging), and semantic grounding via dictionaries. Wikidata is particularly important as it provides unique global identifiers linked to precise terms; in effect the terms are the handle by which parts of a document are interpreted through Wikidata. Petermr (talk) 17:25, 20 December 2018 (UTC)
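A minimal sketch of the dictionary approach described here, assuming a toy two-entry dictionary whose QIDs are illustrative rather than checked against Wikidata:

```python
# Dictionary-based tagging: each dictionary term carries a Wikidata-style
# identifier, and every occurrence in the text is recorded with its
# character offset (the "semantic grounding" step described above).
import re

dictionary = {
    "malaria": "Q12156",      # QIDs illustrative only
    "artemisinin": "Q426756",
}

def tag(text, dictionary):
    """Return (term, qid, offset) for each dictionary hit, in text order."""
    hits = []
    for term, qid in dictionary.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((term, qid, m.start()))
    return sorted(hits, key=lambda h: h[2])

text = "Artemisinin remains a first-line treatment for malaria."
print(tag(text, dictionary))
```

Real dictionaries run to thousands of terms with synonyms and stemming, but the principle is the same: the matched term becomes a handle pointing at a global identifier.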
ContentMine is great and I want them to have an opportunity to demonstrate their abilities
I want to give this team every opportunity to demonstrate their talent and capability. If I had a wish in this, it would be that ContentMine gets judged primarily on what they accomplish at the end of the term of their current grant, and less on what metrics they can show for on-wiki edits to date. I recognize that is not possible due to the WMF grant award calendar, so the team has my early support right now.
I recognize that this team's previous 2016 project fell short of goals as listed at Grants:Project/ContentMine/WikiFactMine/Final#Targets. There were some amazing successes. The project's coordinator, Tom Arrow, has proven to be a Wikidata and Wikibase superstar and has recruited and trained many top Wikimedia contributors. Had ContentMine not been the center of Wikidata community engagement at a time when very little of this was touching the English-speaking world, I think that engagement would have been delayed by at least a year. That said, WikiFactMine did not result in the actual large-scale structured data content integration into Wikimedia projects that was expected. I and many others are accepting of this because, while the project promised miracles and did not technically deliver, the budget was small compared to what typical universities spend in this sector, the communication was excellent, and the entire project was managed with some of the best Wikimedia community consultation and participation that any Wikimedia research or data project has ever had.
Currently the next-generation team at ContentMine is engaged in Grants:Project/ContentMine/ScienceSource. This is a 12-month project, and a 3-month midpoint report has been published. This project also has not resulted in on-wiki edits yet; these are promised at Grants:Project/ContentMine/ScienceSource#Goals_around_participation_or_content? and Grants:Project/ContentMine/ScienceSource#Activities. If at 12 months this project actually achieves what it promises, then it would be a great success in itself and a model for others. The Wikimedia community needs more data science projects now because all such projects create momentum for other data science projects. Wikidata is the focus of attention for the future of Wikimedia platform development, and I feel there is wide agreement that the sooner we set precedents for institutional partnerships, the sooner Wikimedia projects will attract ever more institutional partnerships in this space.
I recognize that the Wikimedia Foundation grants team has changed its schedule of grantmaking and now reviews project grants only every six months. This means that anyone who would like to do a project starting from about September 2019 needs to request funds now, in this grant cycle. There was not much notice of this; if there had been, then probably more organizations would request funding on a 15-month plan instead of a 12-month plan for projects which might be ongoing or have a second phase. Another way to say this is that if a project like ContentMine's ScienceSource were to continue with lessons learned before September 2019, and it should, then its only option is to request funding right now. I really, really appreciate ContentMine's attempt to show the general applicability of its data processing techniques by developing a process in the sciences and then applying the methods to the humanities, and especially to the LGBT+ space.
In the midst of all this the Wikimedia community needs to be judicious. With the 2016 WikiFactMine project not meeting objectives, the community is justified in demanding, as a condition of this grant, the fulfillment of ScienceSource objectives before awarding additional funding. ScienceSource's funding covers through May 2019, as per Grants:Project/ContentMine/ScienceSource/Timeline. Perhaps early reporting can start even sooner to confirm progress, because with sooner on-wiki publication, community support for more such contributions can grow. I recognize that if there is to be a continuation, then planning needs to start now. I would like to be able to show off the promised outcomes of ScienceSource to help me recruit better community understanding and engagement in a next phase of LGBT+ content editing.
I have had conversations with various organizers for ContentMine over the years by email, video chat, in person, and through their newsletters. I know that the team is capable of outputting lots of content for Wikimedia projects, and to do excellent recruitment of research partners and other institutional support including external funding for Wikimedia projects. I have been an organizer in Wikimedia LGBT+ since its inception. I have supported and reviewed both ContentMine WMF-grant funded projects. It would make me happy to match data science research with the LGBT+ space as I think these projects would be great for each other. Blue Rasberry (talk) 17:31, 17 December 2018 (UTC)
A little worried
I generally support the idea ... but I also feel that this is different from previous similar ContentMine proposals I commented on and supported. It goes way out of the factual and into the political and cultural. I am not convinced the implications of this have been properly recognised and addressed in the proposal.
In a way, and in my view the global political context of recent years strongly suggests this, simply exploring "a multilingual corpus of scholarly papers on LGBT+ topics", with goals of vocabulary harmonization and content multiplication, risks stimulating the production of content that will be perceived as propaganda by a large number of readers, while simply alienating others. This only gets worse if you take into account language disparities, no matter how multilingual the corpus is.
I don't see anything in the proposal that recognizes and addresses this, thus I do not feel comfortable supporting it as it is. The fact that the proposal is essentially technical on such a contentious political subject tells me that this is a tool looking for a problem.
This proposal should be conceived the other way around, as a statement of a problem by a significant community of multilingual volunteers, with a qualitative explanation of their objectives and procedures, including their quality criteria and choice of sources, and then a description of the tool as a support for that.
I wouldn't take this proposal at face value unless there were clear examples of how, and around what, these multilingual communities would be formed; what their standards of "high quality" are, including the choice of sources and the rationale behind them; and how particular linguistic environments would be taken into account by the procedures. Furthermore, what would be done to avoid, across languages, imposition by volume or the temptation to equate things that are not equivalent, and to avoid language that does not communicate to the general public, with clear examples, currently missing, for each aspect. Finally, a discussion of how the technical choices of the tool could and would afford it behaviour compatible with these concerns, or at the very least more compatible with them than current alternatives like the simple use of search engines.
- Thanks for the thoughtful comments. ContentMine is the technology provider as you identified, the proposal came about after we were contacted by and worked with Grand Valley State University on examining a similar corpus and between us we approached other partners such as Wiki Project LGBT to find out if the general approach could address problems they faced, then co-designed a proposal in response to their needs and ideas. We agree that the proposal would be strengthened by the points you mention and will look to expand on them in collaboration with our community partners. Jcmolloy (talk) 14:09, 30 December 2018 (UTC)
- Hi, and great to hear that. I'm at your disposal in case someone within ContentMine, the community partners, or WMF wants to discuss any of these points. This could be used as a starting point for a basic ethics procedure for funding text-mining of political and cultural content for use in Wikimedia projects. I propose to recognize text-mining as already a hard departure from the usual people-powered, semi-democratic content production of Wikimedia projects. Text-mining lies somewhere in between content creation by users and employing AI to create content. Thus, when funding text-mining of political and cultural subjects, it would greatly benefit Wikimedia's mission to have a public ethical review process of the biases inevitably associated with the choices of corpus, analysis and interface. Happy 2019! --Solstag (talk) 15:32, 2 January 2019 (UTC)
Other projects by the same grantee (ContentMine)
Which other projects have already been funded, and what tools were developed under these projects? How are tools developed with previous grants different from the ones used here? Could volunteers use these tools directly? --Jura1 (talk) 08:47, 31 December 2018 (UTC)
- @Jura1: Referring to "Clarifying the goals of the project" above, where there is already some discussion:
- Under ScienceSource the text-mining is being done by the ScienceSourceIngest tool at https://github.com/ContentMine/ScienceSourceIngest, and auxiliary SPARQL queries are being made available on d:User:Charles Matthews/ScienceSource queries and on its Wikibase site http://sciencesource.wmflabs.org as they are written.
- ContentMine undertakes to release its software under an open license. ScienceSource software will also include a tool for PubMed and PubMed Central metadata, and a GUI for the wiki. The ScienceSource pipeline as a whole will therefore be very different in terms of its code from that used for WikiFactMine, though the functionality is closely related.
- It is expected that Diversitech, while not comparable in detail with ScienceSource, will build on its software.
- The ScienceSourceIngest tool could be used and adapted by others, certainly. It enables HTML conversion and text-mining from the OpenXML standard, and the bot posting of results to a Wikibase site with suitable data schema (using the MediaWiki API). By modification of stylesheets and schema, it can be treated as a fairly generic tool. So the transition from WFM to SS is from ElasticSearch and a special configuration over four virtual machines, with data accessible via an API, to a more portable piece of software that exports to a Wikibase site. By federating the SS wiki with the Wikidata query service, we have allowed for third-party reuse of results by SPARQL authors.
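For readers unfamiliar with bot posting to a Wikibase site through the MediaWiki API, the following Python sketch shows roughly what such a submission looks like. Only the payload construction for action=wbeditentity is shown; a real bot such as ScienceSourceIngest must also log in and obtain a CSRF token, and the labels here are invented.

```python
# Build the POST parameters for creating one new item on a Wikibase site
# via the MediaWiki API's wbeditentity action. Actually sending the request
# is elided; the default token "+\" is only accepted for anonymous edits.
import json

def new_item_payload(label, description, token="+\\"):
    data = {
        "labels": {"en": {"language": "en", "value": label}},
        "descriptions": {"en": {"language": "en", "value": description}},
    }
    return {
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(data),  # wbeditentity takes the entity as JSON
        "token": token,
        "format": "json",
    }

params = new_item_payload("anchor point 1", "stand-off anchor in a paper")
print(params["action"], json.loads(params["data"])["labels"]["en"]["value"])
```

Swapping the endpoint URL is what makes this kind of pipeline portable between a project Wikibase and Wikidata itself.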
- The Diversitech pipeline is not the same as that for ScienceSource, but is closely related. To state the obvious, these are all research projects aiming to innovate and improve the techniques used by iteration. Charles Matthews (talk) 06:22, 2 January 2019 (UTC)
- Tools from WikiFactMine. It usually takes several years for a scientific community to adopt the dictionary approach and mining philosophy, and acceptance is partly based on workshops and advocacy. As an example, I am running a workshop in Delhi in March on Food Security, and 50 delegates will be using WikiFactMine-based dictionaries to mine the scientific literature. I have given ca. 20 scientific presentations since WikiFactMine started, and most of them highlight Wikidata as a key, emerging tool. See for example "Scientific search for everyone" and others on the same site. Tom Arrow, Daniel Mietchen and ContentMine continue to promote and develop the tools, and to advocate for them, after the formal end of the project.
- Creation and maintenance of dictionaries. (I have been involved with scientific dictionary creation for over 30 years: Int Union of Crystallography, Int Union of Pure and Applied Chemistry, Gene Ontology, WHO Int Classification of Disease (ICD-10), CODATA and others.) Dictionaries are central to semantic science, and I believe that all authoritative scientific dictionaries will ultimately be integrated into Wikidata (barring political/legal problems). We created some hundreds of dictionaries from Wikidata; see the WikiFactMine dictionaries. Two of us, Tom Arrow and Peter Murray-Rust, attended the three-day Wikidata Workshop in Berlin and presented the WikiFactMine outputs.
- Retrieving the scientific literature. A major, novel part of WikiFactMine was Tom Arrow's tool Fatameh for ingesting bibliographic records into Wikidata. Daniel Mietchen has done impressive work in adding over 15 million items to Wikidata, based on WikiFactMine / Fatameh.
- Indexing full-text. As mentioned, this has been deployed and tested in the WMLabs Elasticsearch resource. It proved that high-throughput indexing works, but the output needs customising for a community, and that is what is being developed in ScienceSource.
- In ContentMine we are continuing to develop tools pro bono based on WikiFactMine. In many/most disciplines data in Wikidata is sparse, and we have developed a method (at alpha level) for creating dictionaries from Wikipedia pages, especially lists and categories. (All code is Open Source.) This will be prototyped in Delhi (see above) and is also available, if required, for Diversitech. Petermr (talk) 14:00, 2 January 2019 (UTC)
- Wouldn't it be better to let them finish work on the previous grants and then, once the tools are actually available, let interested parties work with them directly? --Jura1 (talk) 13:26, 26 January 2019 (UTC)
- @Jura1: Yes, the current project from 2018 will mostly be over before this next project for mid-2019 fully begins. Yes, the team has made some outcomes from the 2018 project available, as described in the midpoint report in November 2018. Yes, the team commits to completing the 2018 project in good standing as a condition of going on to the 2019 project.
- The WMF grantmaking strategy invites more frequent, smaller, and overlapping grant proposals for anyone who wants continuity between one project and the next. WMF only takes project grants twice a year, so I expect the next round will be in May 2019 to award funds in September 2019. The published timeline for the current project ends in May 2019, so it is not possible to complete the current project, propose the next project, and continue work without a break unless there is some overlap between the current project and planning the next one. It would be much easier for this team to start the next project as the current one ends, which is why this proposal is happening now. Blue Rasberry (talk) 17:14, 26 January 2019 (UTC)
- It's important to evaluate that the previous grants were well spent. If there is no clear outcome in terms of available tools, it's not clear why even more resources should be channeled into this. Jura1 (talk) 05:29, 27 January 2019 (UTC)
- @Jura1: I'd be happy to explain exactly where ScienceSource is in terms of software development and testing. The major difference between WikiFactMine and ScienceSource has actually been in terms of planning and project management; and, while I take your point about outcomes, I also think it is the planning and project management side that will carry over directly to Diversitech. In any case I believe the software outcomes of ScienceSource will become apparent before the grant process ends: this diff is a metadata tool testing run on Wikidata, from yesterday, the sharp end of getting the data modelling right before scaling up. Things are not being done in a hasty fashion, while the inferences from WikiFactMine that have been brought up here, inevitably, create inconvenient pressure to do just that. Well, that's life. Charles Matthews (talk) 08:33, 30 January 2019 (UTC)
Learning from past ContentMine projects
From my perspective the past projects by ContentMine that were funded by grants didn't have the effects that I hoped for. The "facts" produced by ContentMine weren't Wikidata statements, and thus it wasn't possible to integrate them directly into Wikidata. It's unclear to me why ContentMine doesn't try to do the ML work that would be required to actually get statements about biomedical items.
A learning opportunity for knowledge equity
I'm encouraged by both the endorsements for this grant proposal, and some of the thoughtful concerns that have been expressed - they all feel part of the best traditions of Wikimedian transparency and accountability. I do feel that this proposal shifts direction to an extent, from previous proposals by ContentMine, and from my pov, it demonstrates a willingness to learn from their previous/ongoing experiences: first, large scale content mining needs a very intentional inclusion of context, and the shifts of qualitative discourse over time, including (and especially!) in academic literature. Secondly, focusing on a significant research need within the Wikimedia movement around diversity and inclusion, with a focus on LGBT+ communities and content, will support knowledge equity. Thirdly, the integration of this work into our projects needs Wikimedians willing to make that happen.
For us at Whose Knowledge?, we continue to recognise that we need more and better research into Wikimedia knowledge gaps, and we also need better ways to integrate that research into improving Wikipedia, Wikidata, and sister projects. This is an effort that feels like an important learning opportunity for all of us. It will need significant support from the Wikimedia LGBT+ communities (which is why I'm glad for Blue Rasberry's endorsement) and from Wikimedians who realise the complexities of analysing different socio-political and cultural contexts, including in multiple languages (which is why I'm also glad for Solstag's willingness to engage with this effort). In particular, I'm encouraged by some of the academic and research advisors that the ContentMine team has gathered, including Elizabeth Jay Friedman who has deep expertise in some of the key research questions being posed, and hopefully answered, by this effort. Anasuyas (talk) 16:49, 25 January 2019 (UTC)
This project has great potential with proven results in LGBT research
In 2017 I began a project at Grand Valley State University that aims to map bibliometric associations within the field of transgender literature. We took a corpus from a psychological database, mapped textual associations within it, and used these to draw conclusions about the field of trans studies throughout history. This research would not have been possible without access to an existing corpus of papers grounded in LGBT history, and it’s possible that our analyses could have been even more complete with access to a larger and more comprehensive corpus of data as this project would enable. ContentMine worked with us to streamline our text-mining and data visualization process, which directly helped us build a thorough and intersectional analysis of the resulting maps. Their work on this project can absolutely facilitate further research in this area. LGBT studies is a relatively new field, particularly with respect to trans issues, and data processing in this area can help to build it up in such a way that its content is readily accessible for researchers like myself.
As a result of ContentMine’s help with our project, we received the Joan Nestle Award from the Committee on LGBT history. Our goal in submitting our paper for consideration was ultimately to encourage further scholarship in the area of LGBT history, particularly with respect to drawing attention to transgender-centric dialogues. Endeavors like this one to make LGBT history and scholarship more accessible to researchers like myself can ultimately help us expand our current knowledge-base and add to a growing corpus of historical research. Since the field of academia has some gatekeeping with regard to accessibility, working specifically with Wikidata can enable research from marginalized populations like queer and trans folk by making this information more accessible to everyone. Archival work like that of my project is particularly important to the LGBT community, since so much of our history has been lost to the AIDS crisis, violence, and pathologizing narratives. Reclaiming and retelling that history through LGBT-specific content is of great importance, and the proposed project would make that information much easier to track down, organize, and analyze. ContentMine’s project has great potential, and this technology has worked for real research already; I hope to see it helping more projects in the future!
Aggregated feedback from the committee for ContentMine/Diversitech
(A) Impact potential
(B) Community engagement
(C) Ability to execute
(D) Measures of success
Additional comments from the Committee:
This proposal has been recommended for due diligence review.
The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.
- Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
- Following due diligence review, a final funding decision will be announced on March 1st, 2019.
Answers to Committee Feedback:
- The project fits Wikimedia's strategic priorities and has a potential for online impact. However its sustainability is not clear. The contents of the planned Wikibase instance may quickly fall into disuse.
Thanks for your comment; we are happy to hear this project is aligned with Wikimedia's strategic priorities, and we strongly believe DiversiTech has great potential for online impact. We are aware that sustainability is always a concern when building these types of resources. In this instance we acknowledge the risk and are mitigating it in three ways:
1) A strong community engagement plan. We have spent the last 12 months engaging with organizations in four regions (Movilh, Chile; Fundacion Arcoiris, Mexico; Arab Foundation for Freedom and Equality, Middle East; Whose Knowledge, USA) that are very interested in co-designing the tool during the development stage and using it once available. We have also engaged with the Wiki LGBT project and will bring part of their community to collaborate with us. Wikimedia UK is on board to help us with communication and dissemination activities on this project.
2) Opportunities for additional funding once the resource is in place. DiversiTech is shaped to the needs of Wikimedians and external organisations working on LGBT topics, and we aim to seek additional funding through their networks. It follows on from two small funded projects brought to ContentMine by Grand Valley State University and McGill University, one of which won the Joan Nestle Award for LGBT history, demonstrating the value of the research and the potential for academic grant funding once the platform is established. Diversifying the communities involved not only helps meet the project aims but also broadens the sources of funding that could supplement future volunteer effort.
3) In-kind support from ContentMine. As demonstrated on previous projects, we are committed to supporting DiversiTech for an additional 12 months after the conclusion of the formal project. This volunteer support will include: a) technology maintenance, keeping the tool up to date for anybody to use; b) training and workshops for new editors and groups of interest, so that we continue to train the community even after project funding ends; c) community engagement activities, presenting the project at congresses and conferences to extend its outreach.
- Quite good fit with strategic priorities by enriching content on minority topics. However, significant concerns about sustainability and scalability as ContentMine projects have not proved that yet.
Thanks for your comment; we are happy to hear this project is aligned with Wikimedia's strategic priorities. Enriching content on minority topics is one of the reasons we pursued this project idea and began engaging with all the communities involved at the beginning of 2018. In terms of sustainability, we are adding three essential components to this project: A) Community: we are working directly with thriving communities, including international organizations dedicated to LGBT rights, the Wiki LGBT Project, and LGBT Studies research groups from two universities, to ensure our technology addresses real problems and provides a solution that will be valuable and maintained. B) Funding: in collaboration with Grand Valley State University and McGill University we are actively looking for new sources of follow-on funding to build on the resource created during the project. C) In-kind contribution: as part of ContentMine's mission, and in line with our commitment to previous projects such as WikiFactMine and ScienceSource, we will continue supporting the project as volunteers for 12 months after completion. This includes keeping the tool up to date, keeping the community informed, and carrying out engagement activities. In terms of scalability, we plan to engage not only the organizations currently supporting this project but also their parent organizations, other regional offices, and their partners. For example: A) Fundacion Arcoiris is part of ILGA, which in Latin America alone has more than 100 LGBTQ+ related organizations; we are actively engaging this organization and additionally offering training at Fundacion Arcoiris's partner offices in Latin America. B) AFE, through its Social Challenge Programme and the NEDWA conference, will give us a great opportunity to reach new volunteers and editors; additionally, links with the M-Coalition will help us increase our outreach and volunteer engagement.
- The approach is iterative. It is based on the two previous projects by the same team. However, the impacts are not clear - the results of previous projects were underwhelming at best. Success can be measured, but the proposed measures of success are either trivial or unrealistic, in my opinion.
Thanks for your comments. We worked hard with our advisors and the organizations and community around this project to make sure its measures of success are clear, realistic, and measurable both quantitatively and qualitatively. However, we are happy to improve them if you think they are not the right measures. We believe the project constructively builds upon previous projects and will deliver far beyond a simple iteration on their results. DiversiTech has a great deal of community engagement, not only within Wikimedia projects and among volunteers but also with organizations across the globe with significant communities (for example, Movilh has 193k social media followers, and Fundacion Arcoiris is part of ILGA, which has more than 100 similar organizations in Latin America alone) and outreach (mostly in non-English-speaking areas). They are all very interested in the outputs of this project and in continuing to use them on an ongoing basis. This is a big step toward a real sustainability plan and is the reason we started our community engagement plan early last year, allowing us to have formal support from those organizations now. We also tested this idea with a pilot project with Grand Valley State University and McGill University during 2018, which gives us confidence in the technical execution of the project. The resulting research was awarded the Joan Nestle prize for contribution to LGBT history. Regarding previous projects, we would like to clarify that we are currently eight months into the ScienceSource project and have thus far met all proposed measures of success for this time point.
- Iterative project which is not well-positioned to create long-term impact. ContentMine have already had two proposals that did not result in the addition of content to Wikimedia projects beyond source addition to Wikidata. While from a research point of view the project works, the content generation part does not work as planned. Unless there is a good plan to prioritise it, the risks would be unacceptably high for this project.
Thanks for your comments. We fully agree with prioritizing content generation, and it will be at the centre of our project priorities. The fact that we engaged the community ahead of starting this project and tested the technology during pilot projects will allow us to focus our resources on content generation, as you suggest. We have also carried out extensive due diligence with our partner organizations, discussed the project with the community, and brought on board a great team of expert advisors to ensure this project is well positioned to create long-term impact. Our aim is to provide impactful outputs not only for the Wiki LGBT community but also for a large community of organizations working in languages other than English, such as Movilh, Fundacion Arcoiris and the Arab Foundation for Freedom and Equality. They, and the larger organizations they form part of such as ILGA and the M-Coalition, are very interested in using DiversiTech for their day-to-day activities. We would like to clarify that we are currently eight months into the ScienceSource project and have thus far met all proposed measures of success for this time point. So far we have achieved the following milestones on ScienceSource: a) tech development completed; b) all communication and engagement plan KPIs achieved to date; c) starting to bring on board the first group of editors this month, as planned.
- The ability to execute the project within its 12-month timeline and the necessary skills are present. The budget is realistic. However, whether the stated goals will ever be reached is an open question.
Thanks for your positive comments. We spent a good part of our project planning making sure we can execute this project within the 12-month time frame. Additionally, as part of ContentMine's mission, we always provide an in-kind time contribution to our projects after the grant is finished. We understand projects need a certain level of maintenance, and we are happy to provide that support to ensure volunteers and organizations can keep using DiversiTech long after this project is completed. We also spent a lot of time assembling the right team for this project. Our advisors include:
- Elisabeth Jay Friedman, Professor, University of San Francisco, author of Interpreting the Internet: Feminist and Queer Counterpublics in Latin America (University of California Press, 2016).
- Anasuya Sengupta, co-founder of Whose Knowledge?, Indian poet and activist, and an authority on representation of marginalized voices on the Internet.
- Myra Abdallah, Middle East and North Africa regional manager of the Women in News program of the World Association of Newspapers and News Publishers (WAN-IFRA) and Director of the Gender and Body Rights Media Center of the Arab Foundation for Freedoms and Equality (AFE).
- Jason Moore of WikiProject LGBT studies and the Wikimedia LGBT+ User Group.
- Lane Rasberry (User:Bluerasberry), Wikimedian-in-residence at the Data Science Institute at the University of Virginia, who coordinates projects between the university and Wikipedia, Wikidata, and other Wikimedia projects, and is also a member of the Wikimedia LGBT+ User Group.
We also appreciate your comment that our budget is realistic. With the experience from previous projects we can more accurately predict the resources needed for a project like this. As on previous projects, ContentMine will also provide an in-kind contribution, usually unpaid time from our Directors.
Before setting the goals for this project we had the chance to deliver a pilot project with Grand Valley State University and McGill University, which was a great success and one of the reasons we decided to expand this idea to non-English-speaking communities from a variety of environments, aiming for a truly diverse project. Having tested the project idea, and with a great team of advisors, well-established international organizations, and a research group with a successful track record in the field, we are confident we will meet and exceed the project goals.
- While the team is qualified and has a very good set of advisors, there are concerns about feasibility on time, as previous ContentMine initiatives did not deliver the major part of their work for Wikimedia on time.
Thank you for your positive comments about our team's capabilities and our board of advisors. We are continuously improving our internal capabilities and project processes to make sure we deliver on time and in a lean manner. Regarding your concerns about our consortium's ability to deliver on time, we would like to highlight the improvements we have made on our current project. We are currently eight months into ScienceSource, have thus far met all milestones in the original proposal, and have proactively reported progress on a monthly basis to the Wikimedia Projects Team. The ScienceSource content generation pipeline and interface were scheduled for completion in February 2019, and we are on track to deliver on time. This improvement in our time management and goal achievement is due to the opportunity the Wikimedia Foundation gave us to learn from previous experience, improvements in our internal capabilities (technical as well as managerial), and processes and procedures that allow us to measure our progress and manage risk in a timely manner. For these reasons, and all the planning work we did before submission, we strongly believe we are in a great position to deliver this project on time and as planned.
- Communication efforts are significant and the project as a whole promotes diversity. There is significant community support.
Thank you for your support and for recognizing the effort we have put into the communication and engagement around this project. For us, it was very important to bring the right network of organizations into this project, and this engagement process took months of discussions and preparation. We understood that having organizations and communities that can use DiversiTech even after the project is finished is one of the best ways to ensure the sustainability of the project in the long run. We also feel this is much more a community-driven project than a technology project. This time we have engaged great organizations such as Wikimedia UK, which will help us extend the outreach of this project even further, and we have had great advice from Jason Moore of Wiki LGBT, among others. We also appreciate your comment that the project as a whole promotes diversity; by including organizations from different parts of the global south, we believe we can achieve a truly global, multilingual, and diverse project. Finally, we are overwhelmed by all the support we received this time, way beyond expectations, and we are very happy that people who are completely new to the Wikimedia movement are supporting the idea and are one step closer to becoming part of this wonderful community.
- Good community engagement, both from Wikidata and from LGBT communities.
Many thanks for your positive comments about our community engagement work. We have prioritised communication and community because this project is focused on needs brought to us by communities both within and outside of Wikimedia. We have been great supporters of Wikidata for a long time, not only through Wikimedia projects but also in all our research projects, conferences, workshops, and outreach activities, and we believe great things can be achieved through Wikidata. We also spent a large part of our preparation time liaising with LGBT-related organizations from around the world, including Movilh, Chile; Fundacion Arcoiris, Mexico; Arab Foundation for Freedom and Equality, Middle East; and Whose Knowledge, USA. We also discussed the project with the Wiki LGBT community, making sure it would be of great value to a wide audience.
- The project is big, and it claims impact in other languages, but this is not clear in the description. If it will only produce content on English projects, the impact will be smaller, because coverage of this specific topic (LGBT) is already good there. There should be a clearer statement about the impact in other languages.
Many thanks for your comments. We believe this project is “big” in community engagement, support received, quality of advisors, and organizations supporting it, but realistic in scope, goals, and budget. To clarify, one of the key parts of this project is to create semantic links between terms in multiple languages to assist in building content across different language Wikipedias. We plan to index content in English and Spanish in the first instance, working with WikiProject LGBT and Latin American LGBT groups such as Movilh (Chile) and Fundacion Arcoiris (Mexico). However, the platform is extendable to texts and translations in other languages. Linking terms through Wikidata will help in completing infoboxes across different language Wikipedias, even where further human editing effort is still required. We are currently carrying out the planning needed to include Portuguese sources as well.
- This is a difficult case for me. The previous two projects by the same team produced underwhelming results. They did not actually deduce any useful "facts" from the literature corpus that they mined - only keywords and refs. Moreover, those underwhelming results have largely failed to find their way into actual Wikipedia articles. For the reasons stated above, I cannot support this project, which is predestined to the same fate as the two previous ones.
Thanks for your comments; we regret our project was a difficult case for you. We would like to clarify that we have four months remaining until the completion of ScienceSource and have thus far met all proposed measures of success for the eight-month time point. We are about to move into phase two of our community engagement with editors to ensure that content is turned into article edits and improvements, in line with the project timings and milestones laid out in the original proposal. We would love to show you what we are doing, meet the new team behind this project, get your feedback on our proposed measures of success, and use it to improve our project so that we deliver an impactful result.
- Not in its current state. I would support funding only if the grantee showed that they either have everything ready to actually generate content for Wikimedia articles, or make that the first priority and receive no further funding until it is done. Research-wise the results are good - there should be an interesting algorithm at work - but content-generation-wise it is a failure, as the project did not bring good results and this was not properly addressed.
Thanks for your comment. We believe it is fair to ask us to show we have everything ready to generate content, or to make it our first priority, before supporting this project. We would like to confirm that this is the case: it is a top priority for us and for the organizations around DiversiTech. Some of the actions we have taken towards this include: A) Community: we have spent the last 12 months engaging with organizations in four regions (Movilh, Chile; Fundacion Arcoiris, Mexico; Arab Foundation for Freedom and Equality, Middle East; Whose Knowledge, USA), some of which are very interested in co-designing the tool during the development stage and using it once available. We have also engaged with the Wiki LGBT project and will bring part of their community to collaborate with us, and Wikimedia UK is on board to help us with communication and dissemination activities. B) Technology: this idea has been tested on two small funded projects brought to ContentMine by Grand Valley State University and McGill University, one of which won the Joan Nestle Award for LGBT history, demonstrating the viability of the technology on a larger scale as well as its potential for the global community. C) Access to material: we have a corpus of 10,000 LGBT-related articles, of which 3,000 have already been converted into a format ready to load into the proposed platform. As pointed out by other commenters, this project builds on two research projects funded by GVSU which provide a ready corpus and part of the software framework required.
Round 2 2018 decision
This project has not been selected for a Project Grant at this time.
We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding. This was a very competitive round with many good ideas, not all of which could be funded in spite of many merits. We appreciate your participation, and we hope you'll continue to stay engaged in the Wikimedia context.
Comments regarding this decision:
We will not be funding your project this round. The committee appreciates the strong community support and outreach efforts from ContentMine around Diversitech, and recognizes its fit with movement strategy. However, the most significant concern expressed by the committee was that outcomes from past ContentMine projects funded by the Wikimedia Foundation have yielded unclear impact and benefit for Wikidata, even when targets were met, and did not meet the expectations of community members.
Next steps: Applicants whose proposals are declined are welcome to consider resubmitting in the future. You are welcome to request a consultation with staff to review any concerns with your proposal that contributed to the decline decision, and to help you determine whether resubmission makes sense for your proposal.
Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees, and we'd appreciate it if you could help us spread the word to strong candidates - you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.