Grants talk:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine

From Meta, a Wikimedia project coordination wiki

Proposal Clinics

Dear @SFauconnier: and @Afkbrb:, Thank you for taking time to prepare your proposal for the Project Grants open call for software and research projects. I wanted to bring your attention to the optional proposal clinics offered by our team as a means to help interested applicants get feedback from the Program Officer and the subject area expert co-facilitating the clinic. If you would like to attend and ask questions, you can find the dates, times, and videoconference links posted on this page. Please feel free to inquire in case you have any questions. Warm regards! RSharma (WMF) (talk) 16:27, 7 March 2021 (UTC)

Thanks for the heads up, RSharma (WMF). I joined the first clinic and we received extremely helpful feedback there! SFauconnier (talk) 20:40, 11 March 2021 (UTC)



@SFauconnier, Afkbrb, and Pintoch: I'm wondering, could this proposal be beneficial for Lexemes (which are another big part of Wikimedia structured data not yet supported by OpenRefine), or even include work on functionalities specific to Lexemes? I'm not a dev, but I'm guessing some building blocks of the code could be similar or shared; I'd love to hear your point of view on that.

Cheers, VIGNERON * discut. 10:00, 16 March 2021 (UTC)

@VIGNERON: Yes, a significant chunk of the work will consist in adding the required machinery to support different entity types (instead of just items at the moment) and this would clearly be a prerequisite for lexicographical data too. But then, support for that domain would clearly not come for free, there are a few things that need to be figured out. We would need to think about how to deal with the nested structure of forms and senses and how users would generally want to represent them in their tables. Also, we would need to have a good think about reconciliation in this domain (since the matching requirements would be quite different from those of items). This is a lot of high-level work that can be done without touching a line of code, so if people from the community are motivated to design what this should look like, we will welcome them with open arms! − Pintoch (talk) 10:38, 16 March 2021 (UTC)
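
To make the table-representation question above more concrete, here is a minimal sketch of one possible layout: one row per form or sense, with the lexeme columns repeated on every row. The record shape, the field names, and the `LX` identifiers are all made up for illustration; this is not how OpenRefine or Wikibase represent lexemes, just one option that a design discussion could start from.

```python
# Hypothetical, simplified lexeme record (NOT the real Wikibase JSON shape).
LEXEME = {
    "id": "LX",
    "lemma": "mouse",
    "forms": [
        {"id": "LX-F1", "representation": "mouse", "features": ["singular"]},
        {"id": "LX-F2", "representation": "mice", "features": ["plural"]},
    ],
    "senses": [
        {"id": "LX-S1", "gloss": "small rodent"},
        {"id": "LX-S2", "gloss": "computer pointing device"},
    ],
}


def lexeme_to_rows(lexeme):
    """Flatten the nested forms and senses into flat table rows."""
    rows = []
    for form in lexeme["forms"]:
        rows.append({
            "lexeme": lexeme["id"],
            "lemma": lexeme["lemma"],
            "kind": "form",
            "sub_id": form["id"],
            "text": form["representation"],
            "features": ", ".join(form["features"]),
        })
    for sense in lexeme["senses"]:
        rows.append({
            "lexeme": lexeme["id"],
            "lemma": lexeme["lemma"],
            "kind": "sense",
            "sub_id": sense["id"],
            "text": sense["gloss"],
            "features": "",  # senses carry no grammatical features here
        })
    return rows


if __name__ == "__main__":
    for row in lexeme_to_rows(LEXEME):
        print(row)
```

The trade-off in this layout is that forms and senses share one `text` column, which keeps the table narrow but mixes two kinds of values; separate tables per kind would be another reasonable option.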

Commons template values

In my support !vote on the main page, I mentioned that it would be very cool if the reconciliation service could serve the values of particular template fields, as well as structured data values, for particular M-ids.

This is something I think that could be hugely beneficial: harvest selected template-field values for a set of images into OpenRefine, work on them in OR, then write back structured data statements.

Is the serving of template-field values in this way a realistic further goal (or stretch-goal) for the reconciliation service proposed? Jheald (talk) 09:16, 17 March 2021 (UTC)

Hello Jheald, thanks so much for your endorsement and for this great piece of input. I agree it would be very valuable to have - I personally know of quite a few collections of files on Commons where I'd like to take the data in the infobox template, polish and wrangle it, and translate it to structured data. That said, I also see where it may be hard to implement: information (and fields) in templates can vary wildly. I'm taking a mental note of it, I think this deserves either a Phabricator or GitHub ticket so that it doesn't get lost. @Pintoch: any quick first thoughts on this? Spinster (talk) 13:10, 17 March 2021 (UTC)
@Jheald and Spinster: It would be nice indeed! Yes the workflow you mention would be very neat. It seems intuitively doable. For instance if the service is written in Python, libraries such as mwparserfromhell work very well for that. If that's really too hard, it should at least be possible to retrieve the full wikitext from the reconciliation service, and then do the parsing in OpenRefine. Not as nice, but doable. Also thanks for the reminder about the issue of ignored reconciliation results (#3369), I bumped its priority. − Pintoch (talk) 13:50, 17 March 2021 (UTC)
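
As an illustration of the template-field harvesting discussed in this thread, here is a minimal sketch that pulls field values out of a wikitext template. It deliberately uses only the Python standard library and a naive brace-depth scan, not mwparserfromhell as suggested above; a real service would probably prefer a proper parser, since this sketch ignores many wikitext edge cases (triple braces, comments, `nowiki`, template names that merely start with the same prefix). The function names and the sample file description are made up for the example.

```python
def split_top_level(text, sep="|"):
    """Split on sep, ignoring separators nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def template_fields(wikitext, template_name):
    """Return {field: raw value} for the first use of template_name."""
    start = wikitext.find("{{" + template_name)
    if start == -1:
        return {}
    # Scan forward to the "}}" that closes the template opened at `start`.
    depth, i, end = 0, start, -1
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                end = i
                break
        else:
            i += 1
    if end == -1:
        return {}  # unbalanced braces: give up rather than guess
    body = wikitext[start + 2:end - 2]
    fields = {}
    for part in split_top_level(body)[1:]:  # first part is the template name
        if "=" in part:
            name, value = part.split("=", 1)
            fields[name.strip()] = value.strip()
    return fields


# Made-up file description page, for illustration only.
SAMPLE = (
    "{{Information\n"
    "|description={{en|1=A runestone in [[Uppsala]]}}\n"
    "|date=2021-03-17\n"
    "|author=[[User:Example|Example]]\n"
    "}}"
)

if __name__ == "__main__":
    print(template_fields(SAMPLE, "Information"))
```

The values come back as raw wikitext (nested templates and links included), which matches the fallback described above: return the text as-is and leave further cleaning to OpenRefine.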

WD image "consistency check" with Wikicommons depicts supported?

Will this proposal make it possible to find Wikidata items with an image (P18) and easily see whether the file used has, on Wikimedia Commons, a depicts (P180) statement pointing back to that Wikidata item? - Salgo60 (talk) 13:34, 17 March 2021 (UTC)

@Salgo60: Not directly - I think this will likely be solved by improvements to the Wikidata and Commons Query Services, for instance with phab:T258776. Perhaps OpenRefine could help with adding those missing depicts statements once the list has been extracted, but I don't think OpenRefine itself would be the best tool to generate that list in the first place. (But personally I am not really up to date about the status of the query services so I might be off base) − Pintoch (talk) 14:01, 17 March 2021 (UTC)
Thanks for the answer... Today I am doing some data round-trip tests with Swedish runestones, and one challenge is finding all the pictures of runestones on Wikimedia Commons so I can add what they "depict"; that is how I learned that this function would be useful... I got help with a SPARQL query, but I think it might also be useful in a tool like OpenRefine. One problem right now is that WCQS is updated only once a week - Salgo60 (talk) 18:46, 17 March 2021 (UTC)
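
The consistency check asked about in this thread boils down to a simple comparison once the two lists have been extracted (for example from the query services). The sketch below assumes that extraction has already happened and works on plain in-memory mappings; the function name, the item identifiers, and the file names are placeholders invented for the example.

```python
def missing_depicts(item_images, file_depicts):
    """List (item, file) pairs where the file lacks a depicts link to the item.

    item_images:  {item id: file name used as the item's image (P18)}
    file_depicts: {file name: set of item ids the file depicts (P180)}
    """
    missing = []
    for item_id, file_name in sorted(item_images.items()):
        if item_id not in file_depicts.get(file_name, set()):
            missing.append((item_id, file_name))
    return missing


# Placeholder data, for illustration only.
ITEM_IMAGES = {
    "Q_stone_a": "Runestone A.jpg",
    "Q_stone_b": "Runestone B.jpg",
    "Q_stone_c": "Runestone C.jpg",
}
FILE_DEPICTS = {
    "Runestone A.jpg": {"Q_stone_a"},
    "Runestone B.jpg": {"Q_other"},  # depicts something else
    # "Runestone C.jpg" has no depicts statements at all
}

if __name__ == "__main__":
    for item_id, file_name in missing_depicts(ITEM_IMAGES, FILE_DEPICTS):
        print(item_id, file_name)
```

The resulting list is the kind of input that, as suggested above, OpenRefine could then help turn into the missing depicts statements.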

Questions from Joalpe

Hey all. Really interesting proposal --and good to hear from you, SFauconnier :) ! I am writing here for some clarifications, for a better assessment of the grant request (wearing my hat of project grants committee member).

  1. I have some uncertainty about the current status of OpenRefine. This is particularly important as OR is not part of Wikimedia, though it is heavily used by our community.
    1. How big/active is its development team?
    2. Are the functionalities proposed here part of a broader vision for the tool, or should they be considered as an ad-hoc increment to the current code?
    3. Are there any risks for the continuity and sustainability of OR we should consider? If so, how can they be mitigated?
  2. I find the expected role of the WMF team quite modest in this proposal (pinging here FRomeo (WMF)), "support[ing] this project by coordinating pilots and documenting work flows and case studies". I understand SDC is a central part of the work being done by the GLAM & Culture team, so:
    1. How does the development of the OR functionalities proposed here fit the overall, current work on SDC by the WMF?
    2. Could any mentorship for the development of the tool be provided from the WMF tech team?
  3. Fiscal sponsorship to Code for Science & Society appears quite high, ~US$ 14,000. I don't remember working with such a high rate of fiscal sponsorship --I know that in my context we normally negotiate much lower rates.
    1. Could you please specify what actual service it will provide that impacts the development being proposed here?
    2. How possible is it for this value to be reduced?

I hope these questions make sense, and I look forward to learning more about this proposal. Thanks for all you do! --Joalpe (talk) 15:27, 31 March 2021 (UTC)

Hi Joalpe, thanks for these questions! As a member of OpenRefine's dev team I'll comment on your first few points and let others chime in for the rest.
  1. Our contributors community is growing quite fast at the moment - we had 47 contributors in 2020 (compared to 20 in 2019). That includes translation and documentation contributions. And we have a stable core team of about 5-7 members who have been in the project for quite some years. Of course we'd be happy for more, but it's already quite healthy I would say. (Perhaps our website gives a misleading impression of the state of the project - I would really like to find the time to revamp it)
  2. As mentioned in the reply to VIGNERON above, a significant part of this project will lay the groundwork for further Wikibase integration: we will add support for Wikibase federation (which might get used outside of Commons at some point - if not already?) and it will be a good first step towards support for lexicographical data. Of course our Wikibase integration is just one of the application domains of the tool and we aim to serve other communities too, but this is an important area for us (given the tool’s close connection and history with collaborative knowledge graphs).
  3. Given the current growth of the project I am cautiously optimistic. The project has been supported by other funders (Google News Initiative in the past, Chan Zuckerberg Initiative currently). So it's unlikely that the project becomes a ghost town soon. − Pintoch (talk) 08:13, 2 April 2021 (UTC)
Dear Joalpe,
You are correct that this project is very important to our plans for structured data adoption, as it will allow non-technical users to make large content contributions to Commons. We expect it to be very useful for staff within GLAMs, especially librarians who were the largest segment in OpenRefine's most recent user survey. While OpenRefine supports other open knowledge communities, 45% of OpenRefine users connecting to a reconciliation service use Wikibase. We are confident in the OpenRefine team's ability to execute on this project after they successfully evolved their tool to support any Wikibase instance. I will certainly make sure they get the advice they need if any technical questions or issues arise during development.
FRomeo (WMF) (talk) 10:27, 5 April 2021 (UTC)
Hello Joalpe, thank you so much for these great questions. I think Pintoch and FRomeo (WMF) already responded quite well to your first two questions; I will mainly respond to the third one (regarding fiscal sponsorship), although the answer to that question is also strongly related to your first one (regarding sustainability).
Code for Science and Society (CS&S) provides comprehensive fiscal sponsorship of projects, which is more in depth than most models. CS&S indeed charges 15% of cash received to its fiscally sponsored projects to cover the costs associated with financial management of grants to the standards of a US 501(c)(3) nonprofit public charity. This covers (1) the usual, basic overhead - financial and administrative support (examples: management of contracting, legal services, insurance, bookkeeping, grant compliance, tax compliance, and maintaining audited financials). But in addition CS&S also provides (2) strategic guidance and in-depth collaboration on technical development, community growth, fundraising (including help with grant writing and reporting) and support of other project activities.
Since OpenRefine joined CS&S, OpenRefine now has a governance model with both a steering committee (running the project on a day-to-day basis) and a strategy-oriented advisory committee (I'm a member of the latter). Both are actively supported by CS&S. In addition, through CS&S, OpenRefine is part of networks like Invest in Open Infrastructure and the Research Software Alliance; both are groups that work on sustainability of open software and infrastructure.
While writing this specific application, I have actively worked with CS&S; their feedback has been incredibly helpful and their team is already jumping in to assist us in navigating some more complex administrative issues behind the scenes. It’s really valuable to have an additional group of knowledgeable people on board who can take care of that and who keep an eye on strategy at large, so that we, the community/developers team, don’t need to worry about that and can focus on the things we do best: which is taking care of good new software features. So I really think their fee (which can't be modified - it's the same, 15%, for all funded projects) is worth it and is not excessive for a full year of such impactful support.
As a small addition to Fiona’s answer to question 2: I’m also confident that, via the WMF GLAM and Culture Team, we’ll be able to get technical advice if needed. It’s probably good to mention that we are not dependent on any work from WMF or WMDE technical teams - no code review from their side will be needed, as our code will be independent from the Wikidata / Commons code bases.
Warmly, SFauconnier (talk) 18:31, 6 April 2021 (UTC)
@Pintoch, FRomeo (WMF), and SFauconnier: Thank you for such comprehensive answers. The number of stakeholders that have joined together to work on this proposal is admirable. I am grateful for all the work you do for the movement. --Joalpe (talk) 03:06, 7 April 2021 (UTC)

Eligibility provisionally confirmed, Round 2 2021 - Research and Software proposal

This Project Grants proposal is under review!

We've provisionally confirmed your proposal is eligible for review in Round 2 2021 for Research and Software projects, contingent upon:

  • confirmation that the project will not depend on staff from the Wikimedia Foundation for code review, integration or other technical support during or after the project, unless those staff are part of the Project Team.
  • compliance with our COVID-19 guidelines.

Schedule delay

Please note that due to unexpected delays in the review process, committee scoring will take place from April 17 through May 2, instead of April 9-24, as originally planned.

  • Please watch your talkpage, which will be the primary method of communication about your proposal. We appreciate your timely response to questions and comments posted there.
  • Please refrain from making changes to your proposal during the scoring period, so that all committee members score the same version of your proposal.
  • After the scoring period ends, you are welcome to make further changes to your proposal in response to committee comments.

COVID-19 planning for travel and/or offline events

Proposals that include travel and/or offline events must ensure that all of the following are true:

  • You must have reviewed, and be able to comply with, the guidelines linked above.
  • If necessary because of COVID-19 safety risks, you must be able to complete the core components of your proposed work plan _without_ offline events or travel.
  • You must be able to postpone any planned offline events or travel until the Wikimedia Foundation’s guidelines allow for them, without significant harm to the goals of your project.
  • You must include a COVID-19 planning section in your activities plan. In this section, you should provide a brief summary of how your project plan will meet COVID-19 guidelines, and how it would impact your project if travel and offline events prove unfeasible throughout the entire life of your project.

Community engagement

We encourage you to make sure that stakeholders, volunteers, and/or communities impacted by your proposed project are aware of your proposal and invite them to give feedback on your talkpage. This is a great way to make sure that you are meeting the needs of the people you plan to work with and it can help you improve your project.

  • If you are applying for funds in a region where there is a Wikimedia Affiliate working, we encourage you to let them know about your project, too.
  • If you are a Wikimedia Affiliate applying for a Project Grant: A special reminder that our guidelines and criteria require you to announce your Project Grant requests on your official user group page on Meta and a local language forum that is recognized by your group, to allow adequate space for objections and support to be voiced.

We look forward to engaging with you in this Round!

Questions? Contact us at projectgrants (_AT_) wikimedia  · org.

Marti (WMF) (talk) 07:00, 17 April 2021 (UTC)

Questions about your proposal

Dear SFauconnier and Afkbrb,

Thank you for submitting this proposal. I have a couple of questions for you:

  • The project specifies creating "tools", but doesn't specify how or where. Can you provide clearer language about where the code will live, where the tool will be hosted, and how you foresee it being maintained over time? One of the things we review for is the likelihood that a software project will generate unplanned maintenance tasks for the Wikimedia Foundation's technical staff down the road after the project is completed. Anything you can share to help us understand your thinking about future support for the tool is helpful.
  • The proposal mentions "Wikimedia development." Can you clarify what this means?

(Also... I wanted to pass along that one of the technical staff who reviewed the proposal noticed the phrase "M numbers" or "Mids" about halfway through the proposal document, and suggested changing it to "M-ids" to avoid confusion.)

Warm regards,

--Marti (WMF) (talk) 07:07, 17 April 2021 (UTC)

@Mjohnson (WMF): I think I am responsible for this imprecise wording, sorry! The plan is to host two tools as Toolforge projects: the Commons reconciliation service and the batch upload tool (if we decide to do one). This should not generate any overhead for WMF (beyond the fact that WMF runs the Toolforge). I personally think the reconciliation service ought to be supported closer to the core (as a MediaWiki extension, just like Wikidata's reconciliation should also be a MediaWiki extension) but going down this route would require a much closer collaboration with WMF: we would for instance be bound by the security review and other deployment processes. (I am still trying to figure out what the best strategy is to make such a closer integration happen, but that's probably off-topic here.) − Pintoch (talk) 12:05, 19 April 2021 (UTC)
@Mjohnson (WMF): Re-reading my reply above I realize I could have been clearer about your second question: in this context "Wikimedia development" means development of the Toolforge tools mentioned above (so, without any coordination or support from WMF required). − Pintoch (talk) 05:33, 22 April 2021 (UTC)

Aggregated feedback from the committee for Structured Data on Wikimedia Commons functionalities in OpenRefine

Scoring rubric Score
(A) Impact potential
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Community engagement
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
(C) Ability to execute
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient?
  • Do the participants have the necessary skills/experience?
(D) Measures of success
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
Additional comments from the Committee:
  • The project likely fits with Wikimedia's strategic priorities, and it can probably be sustained in the long term, although this will require some effort.
  • In line with the priorities, has a community supporting OpenRefine, and can be scaled to similar projects as it uses the same Wikibase. But I am so afraid, as it is a third-party project and we are investing so much money and such expectations into it. It is like building the whole MediaWiki engine dependent on a small library developed and supported by a third party. I believe that, to improve the impact, WMF needs either to fork and incorporate it or do something else to ensure sustainability.
  • This proposal is directed to building necessary infrastructure for the movement, as it improves how reconciliation -- a key step in knowledge service -- is provided. The proposal is building strong partnerships that could have lasting impacts.
  • The project is scalable and can be adapted by individuals and GLAM institutions seeking to mass upload files to Wikimedia Commons. The project will provide continuous support to community members who want to learn how to upload files to Wikimedia Commons using OpenRefine.
  • The project is more iterative than innovative - it builds on the existing tool and expands its capabilities. The potential benefits are high but there are some risks as in any complex project. The main risk is that the development is not finished within one year.
  • Great idea, working with SDC, such a potential, lots of things could be built upon.
  • This solves a real problem for the community.
  • This tool, when developed, will be the next big tool for Wikimedia Commons. It is innovative and the proposers are quite clear on its objective. It will go a long way to support GLAM institutions as well as individuals who want to upload a larger amount of files to Wikidata. Almost all the steps involved to execute this project have been outlined. There have been a couple of similar projects in the past. The target for the project is quite clear.
  • The project can be accomplished in 12 months although there is a large amount of work. The budget seems realistic and the participants seem to have the necessary skills.
  • Seems to be, already works for Wikidata and the idea is the same. The budget is high but reasonable.
  • The team is experienced and strong. Sandra Fauconnier has been a program officer in the WMF and has a consistent track record of delivering.
  • I think this project can be completed in 12 months. The project is realistic with a clear budget. Also, the project participants have the necessary skills/experience to execute this project successfully. Each team member has a solid understanding of what the project entails.
  • There is significant community support and the project will support diversity by enhancing structured data on Commons.
  • I haven't seen anyone on Commons being against SDC.
  • This is probably the highest level of endorsements I have ever seen on a proposal: as of today, there are ~50 endorsers, including some very active Wikimedians. This is a clear indicator of how badly our community needs this!
  • The project is likely to engage a lot of people from the Wikimedia community. I believe participants will continue to support this project even after it ends due to its impact. Comments from the endorsements also suggest that the project is likely to succeed. The project participants are from diverse backgrounds.
  • I am willing to give this project a chance although a clarification about future maintenance costs is needed.
  • I am saying "yes" but I am so afraid about various sides of the project. And I want WMF, rather than the applicants, to think more about mitigating these risks:
    • It's a huge amount of funding; are we sure that the Grants department will be able to follow the project and its milestones during the whole term?
    • We will be hugely dependent on a third-party project (despite it sharing the same ideas as WMF and working in the same direction): what will we do if something goes wrong there? Will we be able to fork it?
  • Very strong proposal. Happy that this has been put forward. I recommend the team to look for a Wikimedia developer from the Global South, as to have someone from an emerging community be part of such a focused project could lead to more diversity in how technical skills are spread in the movement.
  • I am voting for this project because of the impact and clarity as outlined. The project participants have a solid understanding of the project and are likely to succeed. I think the project should receive full funding.

Mercedes Caso (platícame) 18:26, 10 May 2021 (UTC)

This proposal has been recommended for due diligence review.

The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.

Next steps:

  1. Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
  2. Following due diligence review, a final funding decision will be announced on Thursday, May 27, 2021.
Questions? Contact us at projectgrants (_AT_) wikimedia  · org.

Mercedes Caso (platícame) 21:21, 13 May 2021 (UTC)

Scheduling interview

Dear Sandra,

Your project has been selected by the Project Grants Committee to advance to the next stage of review. Consequently, I would like to meet with you for up to one hour to discuss your Project Grant proposal. I am reaching out to you here to kindly ask you to check your email and get back to me at your earliest convenience regarding your availability to meet.

Warm regards,

Mercedes Caso (platícame) 18:28, 10 May 2021 (UTC)

Thanks for the nice conversation we had yesterday, Mercedes! SFauconnier (talk) 11:39, 13 May 2021 (UTC)

Clarification on the "Wikimedia development" mentioned in the grant

Following a call with Mercedes Caso I wanted to give more clarity on the scope of the development effort proposed in this project. The development will happen in two places:

  • in the OpenRefine tool itself (whose code base is at ...). We intend to maintain this code in the years to come, but in the unlikely event that the project becomes a ghost town, this is something anyone can fork easily (since this is not a web service but a tool that users run locally).
  • in tools hosted on the Wikimedia Toolforge (that is what we call "Wikimedia development"). Those will be open source and will also be maintained by the community (similarly to QuickStatements, for instance). Using Toolforge makes it easy to share deployment responsibilities with other community members.

So, none of the features developed in this project will fall under the remit of the WMF's technical teams. In particular, no MediaWiki extension will need to be deployed on Wikimedia wikis. − Pintoch (talk) 18:31, 12 May 2021 (UTC)

Team adjustments in our proposal

After a conversation with Mercedes Caso and following earlier email exchanges with the WMF grants team, I have updated our project proposal, in which (with great sadness) I have removed Lu Liu as prospective OpenRefine developer for this project. It turns out that it is legally not possible for us to work with someone based in China (some background here). So the position of OpenRefine developer is, at the moment, still vacant as well.
We're quite confident that we'll be able to recruit someone who can do this work very well. However, hourly rates for a developer based in another part of the world (for instance in Africa, as suggested by the grants committee above) may be higher. After consultation with Mercedes, we will revise our budget accordingly (and reasonably within the allowed margins of this grants program), and send a revised version to the WMF grants team for review before updating our budget here on Meta as well. SFauconnier (talk) 11:50, 13 May 2021 (UTC)

I have just updated our budget accordingly, and have added a note about this change as well. I want to extend many thanks to everyone who has helped us to figure this out! SFauconnier (talk) 15:31, 27 May 2021 (UTC)


Round 2 2021 decision

Congratulations! Your proposal has been selected for a Project Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $99,412.

Comments regarding this decision:
The committee is pleased to fund this project as it builds necessary infrastructure for the movement: it improves how reconciliation, a key step in knowledge service, is provided. The committee especially appreciated seeing the wide community endorsement. The proposal is building strong partnerships that could have lasting impacts. The project has co-funders and a diversified source of funding, which shares and lowers the risk of the Wikimedia Foundation acting as a sole funder.

NOTE: Funding of any offline activities (e.g. travel and in-person events) is contingent upon compliance with the Wikimedia Foundation's COVID-19 guidelines. We require that you complete the Risk Assessment Tool:

  • 14 days before any travel and/or gathering event
  • 24 hours before any travel and/or gathering event

Offline events may only proceed if the tool results continue to be green or yellow.

Next steps:

  1. You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
  2. Review the information for grantees.
  3. Use the new buttons on your original proposal to create your project pages.
  4. Start work on your project!

Upcoming changes to Wikimedia Foundation Grants

Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees and we'd appreciate it if you could help us spread the word to strong candidates--you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.

Mercedes Caso (platícame) 21:30, 1 June 2021 (UTC)