Grants:IEG/StrepHit: Wikidata Statements Validation via References/Renewal/Timeline

This project is funded by an Individual Engagement Grant.

This Individual Engagement Grant is renewed.

Timeline for StrepHit: Wikidata Statements Validation via References

Timeline | Date
Back-end redesign: phab:T166497 | August 2017
Back-end version 2 | February 2018
Front-end redesign: phab:T166495 | February 2018
Tool documentation | May 2018
Data release tutorial | April 2018
StrepHit lexical database version 2 | May 2018
StrepHit standard datasets version 2 | May 2018
StrepHit direct inclusion dataset | May 2018
StrepHit unresolved entities dataset | May 2018


Monthly updates

Each update will cover a 1-month time span, starting from the 22nd day of the previous month. For instance, June 2017 means May 22nd to June 22nd 2017.
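The reporting convention above can be sketched as a small helper. This is only an illustration of the rule stated in the text; the function name `reporting_period` is hypothetical and not part of the project.

```python
from datetime import date


def reporting_period(year: int, month: int) -> tuple:
    """Return (start, end) dates for the update labelled `month` of `year`."""
    # The period opens on the 22nd day of the previous month,
    # rolling back to December of the previous year for January updates.
    if month == 1:
        start = date(year - 1, 12, 22)
    else:
        start = date(year, month - 1, 22)
    # The period closes on the 22nd day of the labelled month.
    end = date(year, month, 22)
    return start, end
```

For example, `reporting_period(2017, 6)` yields May 22nd 2017 and June 22nd 2017, matching the June 2017 case given above.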

June 2017

Community outreach: WikiCite 2017

  • The project leader was accepted to and attended WikiCite_2017 (see WikiCite_2017#Attendees);
  • kick-off talk given at the main conference track: WikiCite_2017/Program#May_23.2C_2017:_Conference;
  • conference third (hack) day: face-to-face meeting with the Wikidata development team to understand the next implementation steps for the tool uplift;
  • connected in person with Tpt, core developer of the primary sources tool version 1;
  • synchronized with T_Arrow, member of the WikiFactMine team, a strategic partner. When the WikiFactMine project ends, we expect our dataset upload service to be ready to accept the WikiFactMine dataset.
Slides of the talk given at WikiCite 2017


The team has focused on task U1 of the project work package and has published a set of mock-ups that integrate known requirements.

The set is visible to anyone in Phabricator, and a high-priority subset was shown to the WikiCite audience as part of the presentation.

Tool uplift proposal

The team has come up with an official uplift proposal, which replaces the old tool page: d:Wikidata:Primary_sources_tool. It is based on:

  • feedback collected from the WikiCite audience;
  • outcomes of the meeting with the Wikidata development team;
  • investigation of technical solutions for both the back end and the front end. @Afnecors and Kiailandi: great work so far in diving into the MediaWiki world!


  • First implementation steps towards the solution proposed for the back end: phab:T167810;
  • submitted the first patch to an active Wikimedia project! Currently under review: gerrit:360376;
  • started the front-end refactoring: phab:T168243, phab:T168239.

July 2017

Community outreach

Front end

Focused on a major refactoring of the existing code to fit the new architecture, i.e., a MediaWiki extension. More specifically, we worked on:

  • refactoring HTML templates, see phab:T168247;
  • writing unit tests for the HTML templates, see phab:T168248;
  • refactoring functions that interact with the user, see phab:T168251;
  • writing unit tests for the functions that interact with the user, see phab:T168254.

Back end

The first development iteration of the Ingestion API (responsible for the interaction with the data providers) is over. More specifically, we worked on:

August 2017

Community outreach: Wikimania 2017

Front end

Efforts still focused on the major refactoring needed for the new architecture. The development of a MediaWiki extension built upon Wikibase is non-trivial and particularly time-consuming, due to:

  • a lack of exhaustive documentation, which entails the following workflow:
    • diving into the (vast) codebase;
    • identifying relevant pieces of code;
    • understanding their behavior;
    • interacting with the authors and integrating any feedback they provide;
  • the absence of a Wikidata user interface toolkit, preventing direct access to relevant objects;
  • the profusion of non-standard development practices.

Back end discussion outcomes

We digested the discussion with User:Smalyshev_(WMF) at Wikimania 2017. The main action points have become Phabricator tickets:

Other technical details:

  • having the primary sources tool code in a separate repo implies less overhead in terms of Gerrit patches, e.g., web.xml for new Web services;
  • the actual code review is independent from patches;
  • the interaction between the new Web services and Blazegraph is now implemented via HTTP, but can probably be done through Servlet filters;
  • data normalization is probably not needed;
  • ranking can be safely ignored, i.e., always set to best.

Things external to the code:

  • RDF is probably a complex format for data providers;
  • we need a tool to generate Wikidata-compliant RDF out of tabular data;
  • the most challenging step to generate a Wikidata-compliant dataset is probably entity reconciliation, i.e., mapping between internal IDs and Wikidata QIDs;
  • we may propose to use Open Refine, where a plug-in for Wikidata reconciliation is being developed by d:User:Pintoch.

September 2017

Front end

Back end

October 2017

Front end

  • The following known features are deployed to the gadget (version 1), for users to test:
    • de-duplication;
    • suggested properties browser menu;
  • started migrating the browser menu to the new architecture: phab:T175164;
  • the development environment for the filter is ready: phab:T176641;
  • started the implementation of the preview facility: phab:T160332;
    • implemented a best-effort renderer for generic Web sources (current target: Freebase datasets);
    • generic rendering is too error-prone: implemented direct usage of source corpora when datasets make them available (current target: StrepHit);
    • requested a ToolForge account to deploy the rendering services;
    • implemented highlighting of relevant source content.

Back end

November 2017

Community outreach: itWikiCon 2017

  • Afnecors co-organized itWikiCon, the first Italian local chapter Wiki conference;
  • Hjfocs attended and disseminated StrepHit.

Side project: qs2rdf

The first version of the QuickStatements-to-RDF converter is complete: phab:T173749.

Front end

  • adapted the front end to the back end version 2: phab:T168244, phab:T168255;
  • migrated the filter to the new architecture: phab:T176642;
  • added reference column to the filter: phab:T178299, phab:T148165;
  • implemented the Web service for Web sources corpora;
  • indexed the StrepHit corpus with ElasticSearch for effective retrieval performance;
  • integrated approve/reject buttons into the preview facility.

Back end

  • First test deployment of the full back end, version 2: phab:T178585;
  • translated current StrepHit datasets into RDF: phab:T178795;
  • the front end heavily relies on the QuickStatements format:
    • added support for QuickStatements output to the curation API;
    • adapted integration tests;
  • started working on the search service: phab:T180486.

VPS machine

December 2017

Front end

Back end

  • Requested support for specific time zones in WDQS: phab:T179068. Addressed by Smalyshev_(WMF);
  • the search service is ready: phab:T180486;
  • the random service is ready (phab:T180483):
    • implemented a caching mechanism with separate threads for efficient response;
    • dataset-specific cache gets updated when a new dataset is uploaded;
    • global cache gets updated through a fixed schedule;
    • resolved the non-randomness issue of the random button: phab:T148180;
  • the datasets service is ready: phab:T182192.
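The caching behavior described for the random service can be sketched as follows. This is a minimal illustration in Python, not the project's actual back-end code: the class `SuggestionCache`, its method names, and the `loader` callable are all hypothetical. It only shows the two refresh paths mentioned above, a dataset-specific refresh triggered by an upload and a global refresh on a fixed schedule, each running in its own thread.

```python
import threading


class SuggestionCache:
    """Hypothetical cache: one global entry (key None) plus one entry per dataset."""

    def __init__(self, loader, interval_seconds=3600.0):
        self._loader = loader  # assumed callable: dataset name or None -> list
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._entries = {}
        self._timer = None

    def refresh(self, dataset=None):
        # Run the expensive load outside the lock, then swap the result in.
        fresh = self._loader(dataset)
        with self._lock:
            self._entries[dataset] = fresh

    def on_dataset_uploaded(self, dataset):
        # Dataset-specific refresh in a separate thread,
        # so that the upload request does not wait for it.
        worker = threading.Thread(target=self.refresh, args=(dataset,), daemon=True)
        worker.start()
        return worker

    def start_schedule(self):
        # Global refresh on a fixed schedule: refresh now, then re-arm the timer.
        self.refresh(None)
        self._timer = threading.Timer(self._interval, self.start_schedule)
        self._timer.daemon = True
        self._timer.start()

    def stop_schedule(self):
        if self._timer is not None:
            self._timer.cancel()

    def get(self, dataset=None):
        # Readers only ever see a fully built entry, never a partial one.
        with self._lock:
            return list(self._entries.get(dataset, []))
```

Serving requests from the cached entries keeps response time independent of the cost of the underlying query, which is the stated goal of the mechanism.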

VPS machine

  • worked on the setup: phab:T182789;
  • the back-end production deployment is ready;
  • requested and obtained access to shared storage (Wikimedia data dumps): phab:T183229;
  • the front-end v2 testing deployment is currently blocked by phab:T183274.

January 2018

Happy new year! Back to work after the Christmas break.

Front end

  • The software developers were not available due to university exam sessions;
  • the project leader started the integration of the back-end version 2 production services (i.e., those deployed in the VPS machine) into the front-end gadget code.

Back end

  • Implemented the statistics service: phab:T183364;
  • dataset statistics are ready and use a scheduled caching mechanism with a separate thread for efficient response: phab:T183367;
  • user statistics are ready: phab:T183370;
    • they are computed on the fly through queries to a specific named graph in the back-end storage engine: phab:T170820;
    • user name sanity check: phab:T182185;
    • the curate service updates user activities count in a live fashion, i.e., whenever a user curates a Wikidata statement.

VPS machine

  • Worked with BDavis_(WMF) to resolve phab:T183274, which was blocking the import of Wikidata XML dumps into a MediaWiki Vagrant instance;
  • the import process has started: phab:T182989.

February 2018

The main outcome of this month is the alpha release of version 2: phab:T185571. The back end services are now switched to version 2, while the front end is the gadget version with major features implemented.

The team has also intensively worked on the beta release of version 2: phab:T185572.

Front end

  • Handled duplicate suggested statements with date values: phab:T177226;
  • implemented version 2 of the configuration dialog, i.e., the dataset selection window, as per phab:M218: phab:T187043;
  • fixed an issue in the reference preview affecting the source: phab:T186698;
  • added the text box to input generic SPARQL queries: phab:T178306;
  • started working on the entity of interest input box: phab:T178303.

Back end

  • improved the search service performance: phab:T185576;
  • added an optional dataset description parameter in the upload service: phab:T187221;
  • output the dataset description in the statistics service: phab:T187220.

VPS machine

  • Finished loading a full copy of Wikidata XML dumps into a MediaWiki vagrant instance: phab:T182989;
  • worked with BDavis_(WMF) to resolve phab:T185637: the resolution is to upgrade the dependency in WDQS, rather than to bypass a bug in the Apache HTTP client through a specific NGINX configuration.



  • handle multiple reference-value pairs in QuickStatements as a single RDF reference node.
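The grouping mentioned in the item above can be sketched as follows. This is a hedged illustration, not the qs2rdf code: the function names are hypothetical, the input is assumed to be one tab-separated QuickStatements command in which source pairs use an S prefix in place of P, and values are passed through untouched rather than converted into proper RDF terms. The predicate names follow the prefixes used by the Wikidata RDF model (`prov:wasDerivedFrom`, `pr:`), and deriving the reference node name from a SHA-1 digest is an assumption made here for illustration.

```python
import hashlib


def parse_quickstatement(line: str) -> dict:
    """Split one tab-separated QuickStatements command into its parts."""
    columns = line.rstrip("\n").split("\t")
    # Expect item, property, value, then zero or more property-value pairs.
    if len(columns) < 3 or len(columns) % 2 == 0:
        raise ValueError("expected item, property, value, then property-value pairs")
    subject, prop, value = columns[:3]
    pairs = list(zip(columns[3::2], columns[4::2]))
    return {
        "subject": subject,
        "property": prop,
        "value": value,
        # Source pairs use an S prefix in place of P (e.g. S854 for P854).
        "references": [("P" + p[1:], v) for p, v in pairs if p.startswith("S")],
        "qualifiers": [(p, v) for p, v in pairs if p.startswith("P")],
    }


def reference_triples(statement_node: str, references: list) -> list:
    """Attach all reference pairs to one reference node, not one node per pair."""
    if not references:
        return []
    # Hypothetical node naming: a digest of the sorted reference pairs.
    digest = hashlib.sha1(repr(sorted(references)).encode("utf-8")).hexdigest()
    ref_node = "wdref:" + digest
    triples = [(statement_node, "prov:wasDerivedFrom", ref_node)]
    triples += [(ref_node, "pr:" + prop, value) for prop, value in references]
    return triples
```

With a command carrying both an S854 and an S813 pair, the statement gets a single `prov:wasDerivedFrom` link, and both pairs hang off the same reference node.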

March 2018

The team focused its development efforts on the beta release of version 2. The highest priority was given to the improvement of the filter component.

Front end

  • Filter module:
    • added property autocompletion: phab:T178305;
    • added entity value autocompletion: phab:T178307, phab:T178301;
    • started work on the baked filters (i.e., all properties, all item values, frequent StrepHit properties and item values): phab:T189123;
    • user workflow: disable mutually exclusive filters depending on user input;
    • first working version of statements search with SPARQL;
    • worked on properly displaying search results in a table;
    • worked on showing action buttons (i.e., preview, approve, reject) with search SPARQL query results;
    • connected the reference preview facility to search results;
  • finalized the Web interface for data providers, in the form of a special page: phab:T170821;
  • finalized the HTML templates module;
  • improved the statement suggestions browser:
    • ported the code to generate it into the MediaWiki extension: phab:T175164;
    • now it also shows properties when only references are new (not full claims): phab:T177231;
    • fixed a bug that prevented it from reappearing after a page reload on Firefox and Safari: phab:T186604.

Back end

  • Implemented the service for property autocompletion: phab:T188002;
  • implemented the service for entity value autocompletion: phab:T188004.

VPS machine

  • Extended the maximum size of a dataset to be uploaded to 256 MB: phab:T186731.

April 2018

Front end

  • Fixed curation button labels in the reference preview: phab:T186611;
  • resolved a conflict with the ESC key that was closing both the filter and the reference preview windows: phab:T189579;
  • picked different methods to retrieve labels via the Wikidata API depending on the number of results;
  • optimized the labels cache;
  • finalized the dataset selection dialog;
  • implemented a tool-specific portlet that stands out on the left sidebar;
  • optimized the automatic blacklist of reference URLs;
  • filter module:
    • completed the implementation of the baked filters dropdown menu: phab:T189123;
    • enabled curation actions on baked filters statements;
    • optimized the autocompletion filters;
    • forced a limit of 100 results in arbitrary SPARQL to avoid heavy queries;
  • rewrote log handling with fine-grained message levels;
  • completed work on the upload/update special page to comply with the MediaWiki style.
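The 100-result cap on arbitrary SPARQL queries mentioned in the filter module items can be sketched as follows. This is a simplified illustration, not the tool's actual code: the function name `cap_results` is hypothetical, and the sketch assumes that any existing LIMIT is the final clause of the query. A query with a trailing OFFSET, or with a LIMIT only inside a subquery, would need a real SPARQL parser.

```python
import re

MAX_RESULTS = 100

# Matches a LIMIT clause only when it is the last thing in the query.
_TRAILING_LIMIT = re.compile(r"\bLIMIT\s+(\d+)\s*$", re.IGNORECASE)


def cap_results(query: str, max_results: int = MAX_RESULTS) -> str:
    """Return `query` with its trailing LIMIT clamped to `max_results`."""
    query = query.strip()
    match = _TRAILING_LIMIT.search(query)
    if match is None:
        # No trailing LIMIT: append one.
        return f"{query}\nLIMIT {max_results}"
    # Existing trailing LIMIT: keep it if smaller, otherwise clamp it.
    capped = min(int(match.group(1)), max_results)
    return f"{query[:match.start()]}LIMIT {capped}"
```

A query that already asks for fewer than 100 results keeps its own limit, so the cap only affects heavy queries.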

Back end

  • Major refactoring of the whole codebase:
    • moved shared pre/post-processing logic to static methods in a utility class;
    • moved SPARQL queries, API parameters, settings read from environment variables into standalone classes;
  • improved logging with fine-grained messages;
  • started work on documentation;
  • started work on compliance with MediaWiki coding conventions;
  • technical details:
    • ensured that parameters are not shared among requests;
    • upload service: respond with a bad request if the dataset files are not RDF;
    • ingestion API: full check of required request parameters;
    • validate user name wherever it is passed as a request parameter;
    • implemented a filter to add CORS headers in each service.

VPS machine

  • Loaded the URL blacklist and whitelist for more comprehensive testing.

May 2018

This is the final month of the project. Closed as many issues as possible in Phabricator: phab:project/board/2788/. Some are still open due to third-party assignees, some are out of scope. See the final report for more details: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal/Final.

Front end

Complete refactoring and clean-up of the codebase.

Back end

VPS machine

  • Deployed the latest version of the back end;
  • installed the latest version of the MediaWiki extension front end.


  • Finalized the confident and supervised datasets version 2.