Grants:IEG/StrepHit: Wikidata Statements Validation via References/Renewal
This project is requesting a 12-month renewal of the grant, to continue work in the areas described below.
The Whys and Wherefores
StrepHit was born to improve the quality of Wikidata.
Serving as an artificial intelligence, its mission is to read authoritative Web sources, understand them, and ultimately guarantee reliable references that corroborate Wikidata content.
So far, StrepHit has read 1.8 million pages across 53 sources, and has produced more than 2.6 million Wikidata facts.
Of course, StrepHit is not as intelligent as a human, and its understanding should be validated by the community.
In principle, such validation can be carried out via the primary sources tool: not only does it offer a data curation facility to Wikidata editors, but it also plays a crucial role in the inclusion of external data releases.
Therefore, StrepHit and its sister datasets are of little use if the tool is unusable!
The primary sources tool was first built by Google to facilitate the eventual migration of Freebase to Wikidata.
The tool is actually conceived to host more datasets and ultimately aims at standardizing the data release flow from third-party providers. StrepHit has pioneered this direction via its Web Sources Knowledge Base and by engaging more data donors, as part of its measures of success.
We believe the tool has huge potential, which is however hampered by the following critical issues.
- It is frustrating and almost unusable from an end-user perspective;
- its maintenance and improvement processes are fairly sub-optimal, as:
Hence, the following requirements are essential for the tool's sustainability:
- understanding of the back-end code, written in C++;
- access to the WMF Labs machine to deploy the back-end;
- a Wikidata administrator to deploy the front-end;
- centralized and exhaustive documentation.
As part of StrepHit's initial goals, the grantees have already worked on:
- requirement #1, as demonstrated by the merged pull requests;
- requirement #3, since the project leader has been granted access to the machine.
The remaining ones still need to be addressed.
Collaboration with ContentMine
It is worth mentioning that ContentMine, an Open Data organization, has explicitly stated its interest in a data release on Wikidata via the primary sources tool. Hence, we plan to make the most of this opportunity under the umbrella of an IEG. Moreover, ContentMine is clearly a strategic partner, since it is currently proposing a project for Wikidata in which the collaboration with StrepHit is specifically mentioned.
We first focus on a major restructuring of the primary sources tool. Then, we reserve a set of activities for the quality of StrepHit datasets, which are closely intertwined with the testing phase of the tool, as they represent its current data backbone (besides Freebase).
- Primary sources tool:
- interact with the community to improve the mapping between extracted facts and Wikidata properties;
- improve the performance of the supervised extraction;
- fix open technical issues.
The activities stated above can be grouped into 4 high-level categories:
- end-user experience;
- technical requirements;
- data provider experience;
- StrepHit datasets.
They are broken down into the following critical tasks, prefixed with U, T, D, and S respectively. Those related to the primary sources tool stem from known user requests, and are referenced accordingly.
Primary Sources Tool
|U1||Tool workflow||Sketch the complete flow by integrating all the requirements collected so far, go back to the community with mock-ups, redesign if needed||M1-M2||10%|
|U2||Filtering facility||Allow users to curate items based on their interests||M3-M4||2.5%|
|U3||Primary Sources list||Make this sub-tool usable||M3-M5||5%|
|U4||Effective curation flow||All the components needed for curation should be shown in one single step||M3-M6||7.5%|
|U5||Data de-duplication||Avoid showing duplicate statements||M3-M4||2.5%|
|U6||Community dissemination||Promote the tool and engage its key stakeholders||M1-M12||7.5%|
|T2||Back-end redesign||Decide with the community whether the C++ implementation is sustainable or should be replaced with a more widely adopted language||M2||2.5%|
|T3||Front-end redesign||Work with strategic partners to create a suitable front-end, based on U1 outcomes||M3-M8||10%|
|T4||Sources whitelist||Build a list of reliable sources||M6||2.5%|
|T5||Back-end documentation||Improve technical documentation of the back-end module||M6-M12||2.5%|
|T6||Front-end documentation||Improve technical documentation of the front-end module||M6-M12||2.5%|
|T7||Developer community||Engage volunteer contributors||M3-M12||10%|
|D1||Data provider workflow||Work with ContentMine to consolidate the flow for third-party data providers||M1-M6||5%|
|D2||Data release tutorial||Write an exhaustive walkthrough for third-party data releases||M7||2.5%|
|S1||Lexical database||Coordinate with the community to improve the mapping of extracted facts into Wikidata properties||M7-M12||10%|
|S2||Training set||Keep adding training data via crowdsourced annotation jobs||M7-M12||2.5%|
|S3||Classification performance||Train and test the supervised classifiers to improve their precision||M7-M12||2.5%|
|S4||Datasets for direct inclusion||Isolate with the community the best StrepHit datasets to be uploaded through a bot||M10-M12||5%|
|S5||Unresolved entities dataset||Decide with the community whether to mint new Wikidata Items based on the unresolved entities||M10-M12||5%|
N.B.: overlaps between tasks (in terms of timing) are needed for iterative planning.
The total amount requested is 50,630 USD.
|Project Leader||Responsible for the whole work package||Full time (40 hrs/week), 12 PM(1)||41,379 €|
|Front-end Developer||Assistant for the user interface implementation||Part time (20 hrs/week), 6 PM(2)||Provided by the hosting research center|
|Wikimedia Developer Summit 2017||Travel, board & lodging||One-off||1,196 €|
|Wikimania 2017||Registration, travel, board & lodging||One-off||2,079 €|
|Training Set||Crowdsourced job payment for the annotation of training sentences||One-off||550 €|
(1) Person Months
(2) The need for an assistant was also suggested by a community member
The item costs are computed as follows.
- The project leader's gross salary is estimated from the standard salaries of the hosting research center (i.e., Fondazione Bruno Kessler), namely those of a "Quadro direttivo" (project manager). The salary complies with both (a) the provincial collective agreement as per provincial law n. 14, and (b) the national collective agreement as per national law n. 240. These laws regulate research and innovation activities in the area where the research center is located (i.e., Trentino, Italy) and at the national level, respectively. More specifically, the position is set to a gross labor rate of 21.55 € per hour. The rate is in line with other national research institutions, such as the universities of Trieste, Firenze, and Roma;
- the front-end developer will be either chosen from available human resources or directly hired by the hosting research center, thus minimizing the impact on the requested budget;
- Wikimedia Developer Summit 2017 has been suggested by LZia (WMF), and is the sum of the sub-items below:
- Wikimania 2017 is the sum of the sub-items below:
- 1,002 €, i.e., the average cost of 1 round-trip flight from Milan to Montréal, retrieved from Skyscanner;
- 877 €, i.e., the average cost of 1 single room for 5 nights in a 3-star hotel, retrieved from Trivago;
- 200 €, i.e., the Wikimania registration cost for 3 conference days, based on Wikimania 2016;
- the training set construction job has an average cost of 4.35 ¢ per annotated sentence: the item is an estimate to reach a threshold of 500 sentences for each of the current 49 frames.
The total budget expressed in Euros is approximately equivalent to the requested amount in U.S. Dollars, given an exchange rate of 1.12 USD = 1 €.
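As a quick cross-check of the figures above, the following minimal Python sketch sums the four budgeted items and applies the stated exchange rate. The variable names are illustrative only; the amounts and the rate are the ones given in this proposal.

```python
from decimal import Decimal

# Budget items in euros, as listed in the budget table of this proposal.
ITEMS_EUR = {
    "Project Leader": Decimal("41379"),
    "Wikimedia Developer Summit 2017": Decimal("1196"),
    "Wikimania 2017": Decimal("2079"),
    "Training Set": Decimal("550"),
}

# Wikimania 2017 sub-items (flight, hotel, registration), in euros.
WIKIMANIA_SUB_ITEMS_EUR = [Decimal("1002"), Decimal("877"), Decimal("200")]

# Exchange rate stated in the proposal: 1.12 USD = 1 EUR.
USD_PER_EUR = Decimal("1.12")

total_eur = sum(ITEMS_EUR.values(), Decimal("0"))
total_usd = total_eur * USD_PER_EUR

# The Wikimania sub-items should add up to the Wikimania budget line.
wikimania_sum = sum(WIKIMANIA_SUB_ITEMS_EUR, Decimal("0"))
assert wikimania_sum == ITEMS_EUR["Wikimania 2017"]

print(f"Total: {total_eur} EUR = {total_usd} USD")
```

The euro total is 45,204 €, which converts to 50,628.48 USD at the stated rate, in line with the requested amount of 50,630 USD after rounding.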
N.B.: Fondazione Bruno Kessler will be physically hosting the human resources, but it will not be directly involved in this proposal with respect to the requested budget: the project leader will serve as the main grantee and will appropriately allocate the funding.
We target the following list of partners to maximize favorable outcomes of the work package.
|ContentMine||U1, U6, T2, T3(3)|
|Wikidata Developers team||U4, T2, T3|
|WMF Design Research team||U1, T3|
|WMF Discovery (User Experience) team||U1, T3|
(3) These tasks respectively match C2, T6, T7, C5 in Grants:Project/WikiFactMine#Activities
Measures of Success
The self-sustainability of the primary sources tool codebase should have the highest priority. Hence, the most important measure of success is the number of technical contributors, which should reach 4 people.
All the numerical success metrics apply to the primary sources tool and can be assessed via the status page.
- 25,000 new curated statements, computed as the sum of approvals and rejections. Currently, 231,707 statements have been curated out of a total of 16,547,445;(4)
- full engagement of 2 Open Data organizations, namely ContentMine and Openpolis, in the form of dataset releases via the primary sources tool;
- 30 new primary sources tool users as a stretch goal, as suggested by LZia (WMF).
From a qualitative perspective, we will keep collecting feedback through the currently open request for comment.
(4) The displayed numbers were looked up on July 19th, 2016.
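To put the curation target in perspective, the minimal sketch below relates the 25,000 new curated statements to the figures reported for July 19th, 2016. It assumes, for illustration only, that the total number of statements in the tool stays constant; in practice, new dataset releases would change it. The variable names are hypothetical.

```python
# Figures reported in this proposal (looked up on July 19th, 2016).
CURATED = 231_707      # approvals + rejections so far
TOTAL = 16_547_445     # statements available in the primary sources tool
TARGET_NEW = 25_000    # new curated statements targeted by this renewal

# Share of statements curated so far.
current_share = CURATED / TOTAL

# Assumption: TOTAL does not change during the renewal period.
target_curated = CURATED + TARGET_NEW
target_share = target_curated / TOTAL

# Growth of the curated set relative to its current size.
relative_increase = TARGET_NEW / CURATED

print(f"Current curated share: {current_share:.2%}")
print(f"Curated share at target: {target_share:.2%}")
print(f"Relative increase in curated statements: {relative_increase:.2%}")
```

Under this assumption, the target corresponds to a bit less than an 11% increase in curated statements, while the curated share of the whole dataset would remain below 2%.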
The following global metrics are inherently connected to the local ones:
- the "number of articles created or improved in Wikimedia projects" measure will be based on Wikidata statements;
- the number of active editors involved can be determined from the primary sources tool user log database;
- the number of individuals involved will include a heterogeneous set of people, ranging from attendees of dissemination activities, to hackathon participants, all the way to software contributors.
- https://tools.wmflabs.org/wikidata-primary-sources/status claims only a few active users (cf. the topusers field), out of 427 total users (cf. the
- Pellissier Tanon, T., Vrandečić, D., Schaffert, S., Steiner, T., and Pintscher, L. (2016, April). From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web (pp. 1419-1428). ACM.
- cf. the list of maintainers for
- http://hr.fbk.eu/sites/hr.fbk.eu/files/ccpl_28set07_aggiornato_2009.pdf - page 82, Tabella B
- https://www.units.it/intra/personale/tabelle_stipendiali/?file=tab.php&ruolo=RD - ninth item of the central dropdown menu
The following list of links includes both notifications and discussions, and is sorted in descending order of relevance.
- WMF Discovery (User Experience): https://lists.wikimedia.org/pipermail/discovery/2016-October/001325.html
- WMF Design Research: https://lists.wikimedia.org/pipermail/design/2016-October/002519.html
- ContentMine: Grants_talk:Project/ContentMine/WikiFactMine#WikiFactMine_and_StrepHit
- Freebase: https://groups.google.com/d/msg/freebase-discuss/3quRFI2f1rU/ajAr9dJgBQAJ
- Wikiresearch: https://lists.wikimedia.org/pipermail/wiki-research-l/2016-July/005296.html
- Wikimedia: https://lists.wikimedia.org/pipermail/wikimedia-l/2016-July/084807.html
- Open Knowledge Foundation: https://lists.okfn.org/pipermail/okfn-discuss/2016-July/011098.html
- DBpedia: https://sourceforge.net/p/dbpedia/mailman/message/35239186/
- Spaghetti Open Data: https://groups.google.com/d/msg/spaghettiopendata/BjQfEtqnNnw/kPcuZyoGBwAJ (in Italian)
Statements of Support
Do you think this project should be continued for another 6 months? Please add your name and comments here. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.
- The work done as part of this IEG has been great for us and I'd like to recommend it for an extension to further boost this important area of Wikidata. --Lydia Pintscher (WMDE) (talk) 15:36, 25 July 2016 (UTC)
- The StrepHit project should be continued in order to improve the Primary Sources Tool. --tomayac (talk)
- There is a strong need for enhancements to usability of the current primary sources tool. As part of the prior Freebase Experts community, I want to give my recommendation for further IEG extension. Thadguidry (talk) 13:14, 26 July 2016 (UTC)
- Magnus Manske (talk) 13:35, 26 July 2016 (UTC)
- When considering the long (or short!) term, adding sources manually is untenable, and unscalable - as evidenced by the current status. I urge wikimedia/data to compromise on its sources policy, or invest in an automated solution like this. The 'google-the-fact' plan is of questionable value, and is a considerable drain on the long-term viability of this project. Spencerk (talk) 13:53, 26 July 2016 (UTC)
- It would be fantastic if the Primary Sources Tool could be updated. The tool is vastly underutilized, and I think it is a great way to increase the quality and quantity of references in Wikidata, and thus improve the reliability of Wikidata. --denny (talk) 16:59, 26 July 2016 (UTC)
- This work is not just critical for Wikidata but also more broadly for the various initiatives to improve the quality, volume and machine-readability of sources we started under the WikiCite umbrella. --Dario (WMF) (talk) 17:36, 26 July 2016 (UTC)
- I too would like to see a renewed commitment and funding of this work. In my opinion we not only gain useful tools but also more experience in this most interesting area of computing. Especially usability and accessibility seem important, so more people can participate in these new developments. --Tobias1984 (talk) 21:17, 26 July 2016 (UTC)
- A very useful and helpful tool and an updated version would be great! --Todrobbins (talk) 22:13, 26 July 2016 (UTC)
- I would like to see more tools being developed to help Wikidata end-users make the best possible use of their time (for those who have little time to spare) and user-friendly (for those who have plenty of time to spare, like my Wikitherapy participants). As my new plans include "breaking into" Wikidata especially in the medical field, I endorse this renewal.--Saintfevrier (talk) 04:37, 27 July 2016 (UTC)
- This tool is very important for the future of Wikidata and is not currently used as much as it should because of usability issues. Tpt (talk) 07:40, 27 July 2016 (UTC)
- I wasn't able to use this tool in a successful way, but it's very interesting. I support the endorsement so it could be improved and implemented like a normal beta feature.Coentor (talk) 18:13, 27 July 2016 (UTC)
- I strongly endorse this project. I think that improvements to the Primary Sources Tool will facilitate the addition of new data into Wikidata - including the data I am currently working with. Epantaleo (talk) 13:39, 28 July 2016 (UTC)
- Yiyi (talk) 14:52, 28 July 2016 (UTC)
- Mark Hurd (talk) 15:06, 28 July 2016 (UTC)
- Scott_WorldUnivAndSch (talk) 21:26, 28 July 2016 (UTC)
- I have tried the tool and I have found it useful! --Jaqen (talk) 11:03, 29 July 2016 (UTC)
- I'm glad to endorse this project and I hope more tools like this are developed and supported --Steko (talk) 13:50, 29 July 2016 (UTC)
- As I can see, this tool improves the abilities of wikidata significantly. Please go on! Lantus (talk) 11:44, 31 July 2016 (UTC)
- Epìdosis 18:05, 31 July 2016 (UTC)
- Very useful --Davidpar (talk) 23:36, 31 July 2016 (UTC)
- Pretty much what everyone else said; improving the primary sources tool's usability is likely the most effective way to increase the user base of the tool itself and, ultimately, of Wikidata --Edorigatti (talk) 10.14, 1 August 2016 (UTC)
- The tool is a big help. It's not perfect and some of the usability issues mentioned by others above need to be addressed but I think it's a great start and does a lot of good. Reguyla 22.214.171.124 17:02, 2 August 2016 (UTC)
- Love to see the result. I hope that'll improve Wikidata a lot.-Biyanto Rebin (WMID) (talk) 05:05, 3 August 2016 (UTC)
- I find the project very interesting. It would be even more useful if it helped the insertion of new data in an automated fashion. Nvitucci (talk) 09:06, 3 August 2016 (UTC)
- I strongly endorse the continuation of this project and its new focus on improving the primary sources tool. This could be one of the most important gateways for gathering and validating new data and references for Wikidata --i9606 (talk)
- I'm particularly keen to endorse this project and its work on improving primary sources tool because it's the best way to curate the addition of machine generated data, both from StrepHit and other textmining pipelines Bomarrow1 (talk) 10:53, 4 August 2016 (UTC)
- Needed work. TomT0m (talk) 13:02, 4 August 2016 (UTC)
- I strongly support this effort. - PKM (talk) 17:13, 4 August 2016 (UTC)
- --Ermanon (talk) 23:50, 4 August 2016 (UTC)
- --Melderick (talk) 10:50, 5 August 2016 (UTC)
- Lymantria (talk) 12:24, 5 August 2016 (UTC)
- Primary Sources Tool is important and needs to be improved.
Examples of bugs:
The link to GeoHack for geographical coordinates is often disabled when I use Primary Sources Tool;
The "Reject" and "Approve" buttons often swap places (the Reject button can appear in place of the Approve button);
The Primary Sources statement list needs to refresh automatically when I use it;
Many results are exactly redundant with Wikidata's data and need to be automatically rejected or handled;
When I add a reference with PST, it doesn't add it to the statement but creates a new statement;
(New bug) The statements are sometimes displayed empty, which forces me to refresh the page if I want to use PST;
etc... Nouill (talk) 16:55, 6 August 2016 (UTC)
- The Primary Sources Tool has a lot of bugs at the moment, this will help a lot! Sjoerd de Bruin (talk) 09:28, 7 August 2016 (UTC)
- Very valuable tool to improve the reliability of Wikidata; such improvements definitely deserve the effort. -- Laddo (talk) 21:05, 7 August 2016 (UTC)
- Happy to endorse the continued improvement to this tool: definitely going to be an important part of developing Wikidata, Sadads (talk · contribs)
- H4stings (talk) 06:11, 9 August 2016 (UTC)
- Been a pleasure to use so far. Keep going! Telaneo (User talk page) 22:10, 9 August 2016 (UTC)
- Of course! -- Bodhisattwa (talk) 18:34, 10 August 2016 (UTC)
- I strongly endorse the continued use and development of StrepHit for the improvement of Wikidata and other Wikimedia Projects --- Nimbosa (talk) 13:42, 14 August 2016 (UTC)
- Just started playing with the tool; I think there is a lot of opportunity here and I would like to see a good thing get better! harej (talk) 00:24, 18 August 2016 (UTC)
- Very useful Afernand74 (talk) 20:39, 18 August 2016 (UTC)
- The StrepHit project has very worthy goals and a good track record but still needs a lot of work in terms of usability and integration with existing and emerging workflows, at least for the areas that I am most active in, i.e. sciences and medicine. This renewal has the potential to help significantly to move things forward in this regard. -- Daniel Mietchen (talk) 01:00, 19 August 2016 (UTC)
- The project has a lot of potential but needs some more work -- usability to be improved, integration with other pipelines and workflows, as many others pointed out. -- Giulio Petrucci 22 August 2016
- --Crazy1880 (talk) 17:46, 22 August 2016 (UTC)
- YES, this project should be continued for another 6 months. Good work! Projekt ANA (talk) 26 August 2016
- +1 --Kippelboy (talk) 10:18, 29 August 2016 (UTC)
- --Uomovariabile (talk) 17:40, 31 August 2016 (UTC)
- Also yes ·addshore· talk to me! 12:12, 2 September 2016 (UTC)
- It's a helpful tool but it still needs a lot of work to reach its promise. Funding it seems vital. ChristianKl (talk) 09:43, 12 September 2016 (UTC)
- This tool is valuable, but it needs improvement. --Molarus (talk) 19:53, 15 September 2016 (UTC)
- Yes, this tool is very useful, even if it needs some work. Mateon1 (talk) 17:57, 17 September 2016 (UTC)
- Conny (talk) 19:27, 17 September 2016 (UTC).
- Yes, the tool is useful and its improvement should definitely be funded. Mushroom (talk) 09:42, 19 September 2016 (UTC)
- --CristianNX (talk) 10:01, 14 October 2016 (UTC)
- Pajn (talk) 19:39, 21 October 2016 (UTC)
- Sadads (talk) 15:02, 20 October 2017 (UTC)