Wikicite/grant/Improving Wikidata-Wikisource Integration

From Meta, a Wikimedia project coordination wiki

This Wikimedia Foundation grant has a fiscal sponsor. Wikimedia Österreich administered the grant on behalf of Open Heritage Foundation.

Project summary

Project Name
Improving Wikidata-Wikisource Integration (Metadata and Copyright Information)
Start/End dates
Nov/Dec 2020 - May 2021
Amount requested (and the currency you wish to receive it in)
₹1,055,590.25 (Indian rupees)
Amount requested (in US$ equivalent)
$14,300[1]
Note: Amount approved for project is $10,000 + fiscal sponsor fee. LWyatt (WMF) (talk) 14:27, 9 October 2020 (UTC)

The people

Contact person name/Wikimedia username
KCVelaga
Contact person e-mail address
kcvelaga(_AT_)openheritagefoundation.org
Organisation (optional)
Open Heritage Foundation
Project participants
#  Name                          Role
1  Krishna Chaitanya             Project Coordinator/Product Manager
2  hire to be made post funding  Developer
3  hire to be made post funding  Community Coordinator
4  Satdeep Gill                  Advisor
5  Sam Wilson                    Technical Advisor

The project

Description

The project aims to improve Wikidata’s integration with Wikisource, and vice versa. It has two major components and one minor component. The major components are exploration and experimentation, and product development; the minor component is community engagement on a couple of Wikisources. If you consider this book on English Wikisource and its respective Wikidata item, the data presented on the Wikisource Index page is already present on the Wikidata item. The idea of the project is to explore and develop ways in which the data already on Wikidata can be used on Wikisource.

The project will develop a technical product (or a set of products: modules, templates, bots, tools) that enables users of a Wikisource to avoid duplicating data and effort. There are several ways in which this can be done, and we do not yet have a definite solution, as it depends heavily on how the project progresses. Before the potential solutions are detailed, a problem with Wikisource-Wikidata linking that strongly affects this project needs to be described.

Currently, Wikisource Index pages are where the metadata of a book is presented, and the general workflow is to create the Index page first and then create the Main Page for the book. The latter step might be done immediately after the first, or later. However, Wikidata items are linked only to the Main Page of the book, not to the Index page. This poses a problem: if the Main Page is not created immediately after the Index page, which happens often, no linking between Wikidata and Wikisource takes place, even if the two are related, and all the data on the Index page is added manually. Even when the Main Page is created and linked to a Wikidata item, the link is not put to much use.

Through this project we will try to address this. Some of the solutions we are considering are:

  • Adding a Wikidata QID field to the Index page form, i.e. in the Proofread extension. When a user creates an Index page, the respective Wikidata ID can be filled in. After that, the other fields will be matched and filled from the respective properties on Wikidata. This second step, filling the other fields once the Wikidata ID is added, will either be done directly by the Proofread extension, or by a bot deployed locally on the Wikisources that opt in.
  • The first point requires discussion with the community, which we will hold in the first month, as it requires changes to the Proofread extension, which is used globally. If the discussions do not go in favour, we will focus on locally deployable solutions: bots and modules. In that case, by the end of the project we will have a comprehensive set of technical tools that can be deployed individually, based on each community’s interest.
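As a rough illustration only, the field-filling step could look like the following Python sketch. It assumes the Wikidata item is available in the standard Wikibase JSON entity shape (`claims` → property ID → `mainsnak` → `datavalue`). The Index field names and the function names are hypothetical, since each Wikisource configures its own Index form, and no decision on the actual design has been made.

```python
# Hypothetical mapping from Index-page field names to Wikidata property IDs.
# The field names are illustrative; each Wikisource configures its own form.
FIELD_TO_PROPERTY = {
    "Title": "P1476",     # title
    "Author": "P50",      # author
    "Publisher": "P123",  # publisher
    "Year": "P577",       # publication date
}


def first_value(entity, prop):
    """Return the first concrete datavalue for a property, or None."""
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]
    return None


def render(datavalue):
    """Turn a Wikibase datavalue into a plain string for a form field."""
    kind, value = datavalue["type"], datavalue["value"]
    if kind == "monolingualtext":
        return value["text"]
    if kind == "time":
        return value["time"][1:5]  # "+1925-00-00T00:00:00Z" -> "1925"
    if kind == "wikibase-entityid":
        return value["id"]  # a real tool would resolve the ID to a label
    return str(value)


def prefill_index_fields(entity, existing=None):
    """Fill empty Index fields from Wikidata; never overwrite manual input."""
    fields = dict(existing or {})
    for field, prop in FIELD_TO_PROPERTY.items():
        if fields.get(field):
            continue  # keep what the user already typed
        datavalue = first_value(entity, prop)
        if datavalue is not None:
            fields[field] = render(datavalue)
    return fields
```

Keeping manually entered values untouched is a deliberate choice in this sketch: it lets a user spot a disagreement between the Index page and Wikidata instead of having one silently replace the other.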

Another part of the project is engaging the community around this, again by creating a tool that lets them find the books on Wikisource that are not currently linked to Wikidata and suggests possible existing items for them. Though this can partly be done via Quarry, not everyone can use it. We will first work with the Punjabi Wikisource community on this, and the same project will also be used for pilot testing, in addition to the test Wikisource.
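To make the idea concrete, here is a minimal Python sketch of the suggestion step. It assumes three inputs that would have to come from elsewhere (for example a Quarry query or an API call): the list of Main Page titles, the set of titles that already have a Wikidata item, and a dictionary of candidate item IDs with their labels. The function names and the similarity cutoff are assumptions for illustration, not a description of the tool that will be built.

```python
import difflib


def normalise(title):
    """Lower-case a title, treating underscores and repeated spaces alike."""
    return " ".join(title.replace("_", " ").casefold().split())


def suggest_links(works, linked_titles, candidate_labels, cutoff=0.8):
    """Suggest Wikidata items for works that have no item yet.

    works            -- iterable of Main Page titles on a Wikisource
    linked_titles    -- set of titles already connected to an item
    candidate_labels -- dict mapping item ID -> label
    Returns a dict mapping each unlinked title to a list of candidate IDs.
    """
    by_label = {}
    for qid, label in candidate_labels.items():
        by_label.setdefault(normalise(label), []).append(qid)

    suggestions = {}
    for title in works:
        if title in linked_titles:
            continue  # already linked, nothing to suggest
        close = difflib.get_close_matches(
            normalise(title), list(by_label), n=3, cutoff=cutoff
        )
        suggestions[title] = [qid for label in close for qid in by_label[label]]
    return suggestions
```

The output is only a list of candidates; in this sketch a community member would still confirm each match before any link is made.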

Apart from the tools mentioned above, a bot is also envisioned to automate the presentation of copyright information on Wikisources where this is not already the practice. For example, at the end of this page there is a PD template which conveys the copyright information of the book. However, it is added manually, and many of the small Wikisources do not follow this practice. The bot will fetch the information from the licensing section on Wikimedia Commons, add it to the Wikidata item if not already present, and then add the relevant template to the book page on Wikisource.
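The bot's decision logic could be sketched in Python as below. This is an assumption-laden outline rather than a design: the table mapping Commons licence templates to Wikisource templates is invented for illustration, the function only plans actions instead of editing any wiki, and it handles only the public-domain case using the Wikidata property P6216 (copyright status) with the value Q19652 (public domain).

```python
COPYRIGHT_STATUS = "P6216"  # Wikidata property: copyright status
PUBLIC_DOMAIN = "Q19652"    # Wikidata item: public domain

# Illustrative mapping only: Commons licence template -> local Wikisource
# template. A real deployment would need a table agreed with each community.
COMMONS_TO_WIKISOURCE = {
    "PD-old-70": "PD-old",
    "PD-India": "PD-India",
}


def plan_copyright_actions(commons_templates, item_claims, page_text):
    """Return the edits a bot would make, without performing them.

    commons_templates -- template names found on the Commons file page
    item_claims       -- dict of property ID -> values on the Wikidata item
    page_text         -- current wikitext of the book page on Wikisource
    """
    actions = []
    match = next(
        (name for name in commons_templates if name in COMMONS_TO_WIKISOURCE),
        None,
    )
    if match is None:
        return actions  # unknown licence: leave it for a human

    if COPYRIGHT_STATUS not in item_claims:
        actions.append(("add_claim", COPYRIGHT_STATUS, PUBLIC_DOMAIN))

    template = "{{" + COMMONS_TO_WIKISOURCE[match] + "}}"
    if template not in page_text:
        actions.append(("append_template", template))
    return actions
```

Separating planning from editing means the proposed changes can be reviewed, or logged for the community, before a bot account applies them.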

Motivation

Currently there is a lot of duplication of data between Wikidata and Wikisource. The duplication is not only of data but also of effort, because the same data is being added in two places. Maintaining the same data independently in several places leads to inconsistencies, and this project will address this issue. For example, when linking an Index page to Wikidata, if the user finds an error in the data coming from Wikidata, it can be fixed, whereas it might otherwise be left as it is. The project will improve the integration between Wikidata and Wikisource, and serves one of the primary goals of WikiCite, that is, to put the bibliographic data on Wikidata to use.

Also, the project will help us understand the implicit problems and practices on both Wikidata and Wikisource that will impact their integration. It can be a good learning experience and be considered a starting point for further work.

Activities

The broad timeline of the project is as follows:

  • Nov-December 2020: Scoping
      • Research and in-depth understanding of the problem in focus.
      • Facilitating a community discussion around the Proofread extension.
      • Working with the developer and the advisors to decide on the best possible solution for metadata and copyright information automation.
  • January 2021: Phase 1 Development
      • Deciding on the functionalities and features of the tools to be developed.
      • Beginning the technical development of the tools.
  • February 2021 - March 2021: Phase 2 Development
      • Continuation of technical development work.
      • Working with the Wikisource community to test the features on a rolling basis.
      • Community engagement to create content, i.e. linking Wikidata items to Wikisource.
  • April 2021: Final Development & Testing
      • Wrapping up the technical development.
      • Community engagement for content creation and testing.
  • May 2021: Wrap-up
      • Integrating feedback from testing on the technical front.
      • Communication of results and final outcomes to the community.
      • Final project report.
      • Webinar and/or training with the community.

Apart from the broad timeline, the aspects each member will work on are:

  • The Project Coordinator/Product Manager will be in charge of the overall project: liaising between the developer and the Wikisource community, passing on feedback, keeping track of the development phase, creating toolkits for setup (if necessary), writing technical documentation for the tools built, and finally documenting the whole process and the learnings as a report. In addition, the person will contribute to the code wherever possible, for minor tweaks.
  • The Developer will mainly do the programming of the tools that we decide upon, and will also engage in conversations with the community as needed.
  • Community Coordinator will be liaising with the Punjabi Wikisource community to mobilise users for testing and content creation purposes.

Measures of success

Since this isn’t an outreach program, there won’t be any content-related metrics. However, as there is a community engagement component, we would like at least 100 works on Punjabi Wikisource to be connected to Wikidata. We would consider this project a success if:

  • Technical tools have been developed for Wikidata-Wikisource integration. Even if not everything that ideally should be integrated is integrated, the project will at least make some progress towards reducing metadata duplication between Wikisource and Wikidata for the respective works.
  • All attempts and processes are clearly documented, to show what worked well on the technical front and what did not, so that this helps further work in this area.
  • Toolkits (technical documentation) are developed on how to use the tools, and on the workflows involved for communities to deploy something locally on their Wikisources, if needed.
  • Areas where more integration can happen between Wikisource and Wikidata are outlined, along with a direction in which such work can be continued.
  • A webinar is hosted with the community towards the end of the project, showcasing the tools and, if necessary, providing training on them.

Community

The Wikisource community will be the primary target audience. We will start engaging with the community through a discussion around the Proofread extension, and then work with the Punjabi Wikisource community for the initial testing and pilot phase. People from other communities who are interested are also welcome. Once there is substantial development, probably from the end of February 2021, we will send regular updates on the work to the community, primarily via Tech News and other channels.

The Budget

  • Project Coordinator / Product Manager - 200 hrs over 6 months - 200 hrs * $17.5/hr[2] = $3500
  • Developer - 400 hrs over 6 months - 400 hrs * $17.5/hr[2] = $7000
  • Community Coordinator - 100 hrs * $15/hr[2] = $1500
  • Contingency and miscellaneous expenses - $1000
  • Total: $13,000
  • Fiscal sponsor charges (10% of total): $1,300
  • Grant total: $14,300
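The figures above can be cross-checked with a few lines of Python. The variable names are ours, and the rupee-per-dollar rate is not stated anywhere in the application; it is simply what the two requested amounts in the project summary imply.

```python
from decimal import Decimal

# Line items as listed in the budget: hourly rate * hours, in US dollars.
line_items = {
    "Project Coordinator / Product Manager": Decimal("17.5") * 200,
    "Developer": Decimal("17.5") * 400,
    "Community Coordinator": Decimal("15") * 100,
    "Contingency and miscellaneous expenses": Decimal("1000"),
}

subtotal = sum(line_items.values())       # 13,000
sponsor_fee = subtotal * Decimal("0.10")  # 10% fiscal sponsor charge
grant_total = subtotal + sponsor_fee      # 14,300

# Derived, not stated in the application: the INR/USD rate implied by the
# rupee amount requested in the project summary.
implied_rate = Decimal("1055590.25") / grant_total

print(f"Subtotal:     ${subtotal:,.2f}")
print(f"Sponsor fee:  ${sponsor_fee:,.2f}")
print(f"Grant total:  ${grant_total:,.2f}")
print(f"Implied rate: {implied_rate} INR per USD")
```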

Note: We have put forward this application even though the budget is above the stated upper limit, as the project is important for Wikidata-Wikisource integration and is clearly aligned with the objectives of WikiCite. The paid time is well below part-time work: about 8 hrs / week for the Project Coordinator and about 16 hrs / week for the developer. We haven’t listed weekly hours above because the time may not be equally distributed in practice; the Project Coordinator's work will be more concentrated towards the beginning and end of the project than during the period of technical development. We feel these are the minimum hours required to materialise this project, so we would like to request the committee to make an exception considering the importance and scope of the project.

COVID risk assessment (for in-person events)

Not applicable.

Notes

  1. We acknowledge that US$10,000 is the upper limit for the budget; please see the budget section for further clarification.
  2. According to Indian standards.

Feedback

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc.
Please provide links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.

Endorsements

Optional: Community members are encouraged to endorse your proposal and leave a rationale here.

Questions

Any questions about this proposal and feedback from reviewers should be placed on the associated discussion page.

Report