Grants:IEG/StrepHit: Wikidata Statements Validation via References/Renewal/Midpoint


Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 6 months.

Summary

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

  1. the first feedback loop is complete: we reached agreement with the community and are in sync with the Wikidata team, see #Community_outreach;
  2. the Wikidata primary sources tool has become a candidate standard for third-party data releases: uplift proposal;
  3. the back end is implemented as a Wikidata Query Service fork: [1], [2];
  4. the front end is implemented as a MediaWiki extension: code (work in progress);
  5. the major reference preview feature is deployed in the gadget version of the tool;
  6. the statement browser sidebar utility is also deployed in the gadget.

Methods and activities

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

Project management

An old Phabricator work board

Dissemination

  • The project leader regularly interacts with key people in the Wikimedia movement:
    • via written communication (e.g., e-mails, Phabricator comments, Gerrit code review replies) for planned tasks and conversation follow-ups;
    • in person during the events;
  • the team is always encouraged to reach out to relevant community members, especially in case of technical and MediaWiki-related challenges;
  • Dissemination material takes the form of talks with slides, demos, and informal conversations.

Technical infrastructure

  • The user interface (i.e., front end) development follows quite subjective practices, probably due to the lack of standard ones:
    • Afnecors tests his code on a MediaWiki Vagrant instance with the wikibase_repo and wikidata roles activated (not without pain, and without data): mw:MediaWiki-Vagrant;
    • Kiailandi prefers to directly hack the wikidata.org site using the console of his favorite Web browser;
    • both use a JavaScript IDE to write their code;
    • Hjfocs is an old-school guy, and uses an enhanced text editor;
    • the common.js wiki page in the user namespace is useful for testing new features and for refactoring version 1 of the tool, which is a Wikidata gadget. See for instance d:User:Kiailandi/common.js;
  • The back-end development tries to stick with standard Web services implementation guidelines, namely:
    • implementation of integration tests, which can be run at compile time on the local machine;
    • real-world test deployment on a third-party server;
    • production deployment will happen on a dedicated Wikimedia VPS machine, see phab:T180347.

Midpoint outcomes

Slides of the talk given at WikiCite 2017

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

We have fully devoted the first half of the StrepHit renewal to the primary sources tool (PST), with two principal goals:

  1. to work out the ideal software architecture together with the relevant community;
  2. to take control over the first version of the tool.

Community outreach

The Italian community at ItWikiCon 2017

Primary sources tool uplift proposal

General architecture proposal for the Wikidata Primary sources tool version 2

Back end

The back end is a Wikidata Query Service fork, with an additional module.

The main novel outcome is a facility to standardize the dataset release workflow for third-party providers. It includes:

Front end

Screenshot of the statement browser sidebar utility

The front end is a MediaWiki extension, see mw:Manual:Extensions. It has two main components:

  1. item-based curation, where a user receives suggestions when looking at an item page;
  2. filter, a modal window where a user can dive into the PST data through different facets.
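The faceted filter can be pictured as an exact-match selection over suggestion records. The following Python sketch is only an illustration under stated assumptions: the `Suggestion` record, its field names (`item`, `prop`, `value`, `dataset`), and the `filter_suggestions` helper are hypothetical, and the real front end is a MediaWiki extension whose code is not shown in this report.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical suggestion record; field names are assumptions."""
    item: str      # Wikidata item identifier, e.g. "Q42"
    prop: str      # Wikidata property identifier, e.g. "P69"
    value: str     # suggested value
    dataset: str   # name of the dataset the suggestion comes from


def filter_suggestions(suggestions, **facets):
    """Keep the suggestions whose fields equal every requested facet value."""
    known = {f.name for f in fields(Suggestion)}
    unknown = set(facets) - known
    if unknown:
        raise ValueError(f"unknown facets: {sorted(unknown)}")
    return [
        s for s in suggestions
        if all(getattr(s, name) == wanted for name, wanted in facets.items())
    ]


# Illustrative data only.
suggestions = [
    Suggestion("Q42", "P69", "Q691283", "strephit"),
    Suggestion("Q42", "P106", "Q36180", "other-dataset"),
    Suggestion("Q1", "P69", "Q2", "strephit"),
]
by_dataset = filter_suggestions(suggestions, dataset="strephit")
by_both = filter_suggestions(suggestions, dataset="strephit", prop="P69", item="Q42")
```

Combining facets narrows the result: each extra keyword argument adds one more equality condition.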

Reference preview

Screenshot of the reference preview buttons, item-based view
Screenshot of the reference preview facility

The major original outcome is a utility that eases the curation workflow through reference previews. It is currently implemented on the item-based view. When a user is on an item page for which the PST has suggestions:

  1. a preview button appears whenever a reference suggestion exists;
  2. the user can click on that button to see a preview of the reference content without leaving Wikidata;
  3. if the reference originates from a dataset with a known corpus of Web pages (e.g., StrepHit), that corpus is used directly to display the preview. Otherwise, a best-effort scraper is employed;
  4. the item, property, and value labels get highlighted with a yellow marker: this allows the user to quickly grasp the reference;
  5. finally, the user can approve or reject the referenced statement.
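The highlighting in step 4 can be illustrated with a short Python sketch that marks label occurrences in a plain-text snippet. This is not the tool's actual code (the gadget runs as JavaScript in the browser): the function name `highlight_labels`, the use of the HTML `<mark>` element, and the sample snippet are all assumptions made for the example.

```python
import html
import re


def highlight_labels(text, labels):
    """Return HTML where every occurrence of a label is wrapped in <mark>.

    Matching is case-insensitive; longer labels are tried first so that
    a long label is not split by a shorter one it contains.
    """
    ordered = sorted((label for label in labels if label), key=len, reverse=True)
    if not ordered:
        return html.escape(text, quote=False)
    pattern = re.compile("|".join(re.escape(label) for label in ordered),
                         re.IGNORECASE)
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between two matches.
        pieces.append(html.escape(text[last:match.start()], quote=False))
        pieces.append("<mark>" + html.escape(match.group(0), quote=False) + "</mark>")
        last = match.end()
    pieces.append(html.escape(text[last:], quote=False))
    return "".join(pieces)


# Illustrative snippet and labels (item, property, value).
snippet = "Douglas Adams was educated at St John's College, Cambridge."
labels = ["Douglas Adams", "educated at", "St John's College"]
preview = highlight_labels(snippet, labels)
```

Escaping the surrounding text before inserting the markup keeps scraped page content from being interpreted as HTML in the preview.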

Side project: QuickStatements to Wikidata RDF converter

qs2rdf is a standalone tool written in Python. It translates Wikidata statements serialized as QuickStatements into Wikidata RDF. QuickStatements has a concise syntax, which makes it easier for data providers to use. However, it is not a standard. Wikidata RDF is more verbose and more complex to use, but relies on a long-established World Wide Web standard: https://www.w3.org/TR/PR-rdf-syntax/

Code repository: https://github.com/marfox/qs2rdf
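To give a feel for the difference in verbosity between the two formats, here is a minimal Python sketch of the idea. It is not the qs2rdf implementation: the function name `qs_line_to_ntriple` is hypothetical, only tab-separated lines with an item or a quoted string value are handled, qualifiers and sources are ignored, and the output is a single simplified "truthy" triple in N-Triples rather than the full Wikidata RDF statement structure.

```python
import re

# Wikidata namespaces for entities and for direct ("truthy") properties.
WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"

ITEM = re.compile(r"^Q\d+$")
PROPERTY = re.compile(r"^P\d+$")


def qs_line_to_ntriple(line):
    """Translate one 'item<TAB>property<TAB>value' line into an N-Triples line.

    Assumption: fields after the third one (qualifiers, sources) are ignored.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        raise ValueError("expected at least 3 tab-separated fields")
    subject, prop, value = parts[:3]
    if not ITEM.match(subject):
        raise ValueError(f"not an item identifier: {subject!r}")
    if not PROPERTY.match(prop):
        raise ValueError(f"not a property identifier: {prop!r}")
    if ITEM.match(value):
        obj = f"<{WD}{value}>"
    elif len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        # Plain string value: escape backslashes and quotes for N-Triples.
        inner = value[1:-1].replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{inner}"'
    else:
        raise ValueError(f"unsupported value in this sketch: {value!r}")
    return f"<{WD}{subject}> <{WDT}{prop}> {obj} ."


# Identifiers below are for illustration only.
triple = qs_line_to_ntriple("Q42\tP69\tQ691283")
```

A three-field QuickStatements line becomes three full IRIs in the output, which is the trade-off described above: more verbose, but readable by any standard RDF tool.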

Finances

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far?

Yes. We replaced the Wikimedia Developer Summit budget item in the initial proposal with WikiCite, since the renewal of this project was accepted after that summit.

Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

The overall cost of dissemination activities (i.e., Wikimedia Developer Summit + Wikimania) is likely to be lower than planned, as the rescheduled events took place near the grantee's physical location. Still, we do not foresee a remarkable change; we will balance any difference with the project leader or the training set budget items.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

The following barriers have considerably impacted the whole project roadmap and schedule, in descending order of magnitude:

  1. the learning curve for developing a MediaWiki extension is huge;
  2. there is an abundance of non-standard software development practices, coupled with scattered documentation, neither of which helps newcomers;
  3. the previous PST front end was written in a monolithic fashion, preventing us from proper refactoring;
  4. the Wikidata Query Service is a complex Java project, which entails higher:
    • implementation times for relatively simple Web services (but this is Java);
    • memory requirements for deployment, preventing us from just relying on a Toolforge machine (but this is Java);
    • package size due to dependency requirements (but this is Java).

Therefore, the main lesson learnt may sound rather sad, but it would save the vast majority of the implementation effort in future projects: it is probably more reasonable to develop a new version of a piece of software from scratch than to try to rework existing code that was built in a totally undocumented, quick-and-dirty fashion.

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

Learning_patterns/Working_with_developers_who_are_not_Wikimedians.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points.

We will fully concentrate our efforts on the two main targets.

Primary sources tool

  • Complete the development of new features, with priority to:
    • Web interface for data providers to upload, update, and delete their datasets;
    • filter based on domain of interest;
    • filter based on arbitrary queries;
  • release version 2;
  • announce the release to the community;
  • write a tutorial for data providers;
  • encourage past users to play with the new version;
  • promote the tool to engage additional users.

StrepHit

  • Investigate practical solutions to address the main research challenges, specifically:
    • knowledge modeling, or how to translate facts extracted from natural language to Wikidata statements;
    • entity reconciliation, or how to understand if that John Smith extracted from that source corresponds to the same John Smith in Wikidata;
  • rethink the lexical database;
  • release version 2 of the datasets;
  • request and develop a bot for direct upload of confident statements.

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 6 months?

  • At the time of writing the renewal proposal, the IEG program was being merged into the more comprehensive Project Grants (PG) one. This had an expected temporal impact on the overall process, maybe because the proposal was naturally not bound to any scheduled call. On the other hand, we actually felt very hugged, since we underwent a separate review just for us!
  • We especially took pleasure in rethinking the whole proposal, thanks to the very productive feedback on the discussion page;
  • the organizers gave us the opportunity to kick off the project renewal at WikiCite 2017: this was a crucial starting point, as we had the chance both to meet the core primary sources tool developer in person and to organize a structured meeting with the Wikidata development team.