Grants:IEG/StrepHit: Wikidata Statements Validation via References


This Individual Engagement Grant has been renewed (see the renewal scope).

Status: selected
StrepHit: Wikidata Statements Validation via References

Summary: StrepHit is a Natural Language Processing pipeline that understands human language, extracts facts from text and produces Wikidata statements with reference URLs. StrepHit will enhance the data quality of Wikidata by suggesting references to validate statements, and will help Wikidata become the gold-standard hub of the Open Data landscape.
Target: Wikidata
Strategic priority: improving quality
Theme: tools
Amount: 30,000 USD
Grantee: Hjfocs
Advisor: Cgiulianofbk
Contact: fossati@spaziodati.eu
Created on: 17:07, 3 September 2015 (UTC)
Round: 2, 2015


Project Idea

StrepHit (pronounced "strep hit", short for "Statement? repherence it!")[1] is a Natural Language Processing pipeline that harvests structured data from raw text and produces Wikidata statements with reference URLs. Its datasets will feed the primary sources tool.[2]
In this way, we believe StrepHit will dramatically improve the data quality of Wikidata through a reference suggestion mechanism for statement validation, and will help Wikidata become the gold-standard hub of the Open Data landscape.

The Problem

The vision of a Web as a freely available repository of machine-readable structured data has not only engaged a long strand of research, but has also been absorbed by the biggest Web industry players. Crowdsourced efforts following the wiki paradigm have enabled the creation of several knowledge bases - most notably DBpedia,[3] Freebase,[4] and Wikidata[5] - which have proven useful for a variety of applications: question answering, entity summarization and entity linking, to name a few.

However, the trustworthiness of Wikidata assertions plays the most crucial role in delivering a high-quality, reliable knowledge base: to assess their truth, assertions should be validated against third-party resources, yet few efforts have been carried out in this direction. One form of validation can be achieved via references to external (i.e., non-wiki), authoritative sources. This has motivated the development of the primary sources tool: it will serve as a platform for users to either accept or reject new references and/or assertions coming from third-party datasets.
We argue that there is a need for datasets which guarantee at least one reference for each assertion, and StrepHit is conceived to do so.

The Solution

StrepHit applies Natural Language Processing techniques to a selected corpus of authoritative Web sources in order to harvest structured facts. These facts will serve two purposes: to validate existing Wikidata statements, and ultimately to enrich them with references to such sources. More specifically, the solution is based on the following main steps:

  1. Corpus-based relation discovery, as a completely data-driven approach to knowledge harvesting;
  2. Linguistically-oriented fact extraction from reliable third-party Web sources.

The solution details are best explained through the use case shown below. The technical implementation is provided in the implementation details section.
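The two steps above can be sketched as follows. This is a deliberately naive, self-contained illustration: the function names, the suffix-based verb heuristic and the pattern matching are our own placeholders, not the actual StrepHit implementation (which relies on proper part-of-speech tagging and frame classification, as detailed in the implementation section).

```python
from collections import Counter

# Hypothetical sketch of the two-step flow. The "-ed" suffix test is a crude
# stand-in for POS tagging; real relation discovery is statistical.

def discover_relations(corpus):
    """Step 1: corpus-based relation discovery (here: crude verb counting)."""
    verbs = Counter()
    for doc in corpus:
        for token in doc.lower().split():
            if token.endswith("ed"):  # placeholder for verb detection
                verbs[token] += 1
    return [verb for verb, _ in verbs.most_common()]

def extract_facts(corpus, relations):
    """Step 2: extract (subject, relation, object) triples around each verb."""
    facts = []
    for doc in corpus:
        tokens = doc.split()
        for i, token in enumerate(tokens):
            if token.lower() in relations and 0 < i < len(tokens) - 1:
                facts.append((tokens[i - 1], token.lower(), tokens[i + 1]))
    return facts

corpus = ["Germany eliminated Austria", "Beckenbauer managed Germany"]
relations = discover_relations(corpus)
facts = extract_facts(corpus, relations)
```

Each extracted triple would then be paired with the URL of the document it came from, yielding a referenced candidate statement.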

Use Case

Soccer is a widely attested domain in Wikidata: it counts a total of 188,085 items describing soccer-related entities,[6] which is a significant portion (around 1.27%) of the whole knowledge base. Moreover, those Items are generally very rich in terms of statements (cf. for instance the Germany national football team).

On account of these observations, the soccer domain is well suited to the main challenge of this proposal, namely to automatically validate Wikidata statements against a knowledge base built upon the text of third-party Web sources (from now on, the Web Sources Knowledge Base).

The following table displays four example statements with no reference from the Germany national football team Item, which can be validated by candidate statements extracted from the given references.

Wikidata statement | Sentence | Extracted statement | Reference
<Germany, participant of, Miracle of Cordoba> | "(...) The Miracle of Cordoba, when they eliminated Germany from the 1978 World Cup" | <Germany, eliminated in, Miracle of Cordoba> | The Telegraph
<Germany, team manager, Franz Beckenbauer> | "In 1984 Beckenbauer was appointed manager of the West German team" | <West German team, manager, Beckenbauer> | Encyclopædia Britannica
<Germany, inception, 1908> | "The story of the DFB’s national team began (...) on April 5th 1908" | <DFB’s national team, start, 1908> | DFB
<Germany, captain, Michael Ballack> | "Michael Ballack, the captain of the German national football team" | <German national football team, captain, Michael Ballack> | Spiegel
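The validation step behind these examples can be sketched as follows. The data structures, the matching rule and the reference URL are all hypothetical; a real implementation would also need entity linking, since extracted subjects such as "German national football team" must first be resolved to the same Item as "Germany" (the sketch assumes that normalization has already happened).

```python
from dataclasses import dataclass

# Illustrative structures, not the actual StrepHit schema: a candidate
# statement extracted from text carries the supporting sentence and its
# reference URL, and is matched against an unreferenced Wikidata claim.

@dataclass(frozen=True)
class Candidate:
    subject: str
    relation: str
    obj: str
    sentence: str
    reference_url: str

@dataclass(frozen=True)
class Claim:
    subject: str
    relation: str
    obj: str

def suggest_reference(claim, candidates):
    """Return the reference URL of the first candidate whose subject and
    object match the claim (relation labels may differ across vocabularies)."""
    for c in candidates:
        if (c.subject, c.obj) == (claim.subject, claim.obj):
            return c.reference_url
    return None

claim = Claim("Germany", "captain", "Michael Ballack")
candidates = [Candidate(
    "Germany", "captain", "Michael Ballack",
    "Michael Ballack, the captain of the German national football team",
    "http://www.spiegel.de/example")]  # placeholder URL
url = suggest_reference(claim, candidates)
```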

Proof of Work

The soccer use case has already been partially implemented: the prototype has yielded a small demonstrative dataset, namely FBK-strephit-soccer, which has been uploaded to the primary sources tool.

We invite reviewers to play with it by following the instructions on the project page.
The dataset will serve as a proof of work to demonstrate the technical feasibility of the project idea.

Google Summer of Code 2015

As part of the Google Summer of Code 2015 program,[7] we proposed a project under the umbrella of the DBpedia Association. The goal was to enrich the DBpedia knowledge base via fact extraction techniques, leveraging Wikipedia as the input source. The project was accepted,[8] and yielded a dataset similar to FBK-strephit-soccer, which is currently integrated into the Italian DBpedia chapter. An informal overview can be found on the Italian DBpedia chapter Web site.[9] We successfully carried out the implementation,[10] and attracted interest from different communities.[11][12][13] We believe the fact extractor is complementary to StrepHit, and plan to reuse its codebase as a starting point for the full implementation.

Project Goals

The technical goals of this project are as follows:

  1. to identify a set of authoritative third-party Web sources and to harvest the Web Sources Corpus;
  2. to recognize important relations between entities in the corpus via lexicographical and statistical analysis;
  3. to implement the StrepHit Natural Language Processing pipeline, serving in all respects as an open source framework that maximizes reusability;
  4. to build the Web Sources Knowledge Base for the validation and enrichment of Wikidata statements;
  5. to deploy a stable system that automatically suggests references given a Wikidata statement.

The above goals have been formulated keeping in mind that they should be as realistic, pragmatic, precise and measurable as possible. On account of the outreach objective (cf. below), additional emphasis will be given to the maintainability of the StrepHit codebase and the extensibility of its architecture.

Community Outreach

The target audience is represented by several communities: each one will play a key role at different phases of the project (detailed in the community engagement), and will be attracted accordingly. We list them below, in descending order of specificity:

  • Wikidata users, involved as data curators;
  • Wikipedia users and librarians, involved as consultants for the identification of reliable Web sources;
  • technical contributors (i.e., Natural Language Processing developers and researchers), involved through standard open source and social coding practices;
  • data donors, encouraged by the availability of a unified platform to push their datasets into Wikidata.

We intend to achieve this goal via constant dissemination activities (cf. timeline of task T10 and its subtasks in the work package), which will also cater for post-mortem sustainability. Special attention will be paid to stimulate multilingual implementations of the StrepHit pipeline.

In Scope

At the end of the project's minimal time frame (6 months), we roughly estimate the following outcomes:

  1. the Web Sources Corpus is composed of 250,000 documents (where 1 document yields 1 reference URL), harvested from 50 different sources, in the English language;
  2. the corpus analysis yields a set of top 50 relations;
  3. the StrepHit pipeline is released as a beta version with an open source compliant license;
  4. the Web Sources Knowledge Base contains 2.25 million Wikidata statements;
  5. the primary sources tool has a stable release.

The above numbers are extrapolated from the Google Summer of Code 2015 project output: the input corpus contained approximately 55,000 documents from a single source and returned 50,000 facts expressing 5 relations. Each fact can be translated into 1 Wikidata statement.
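Under the assumption (ours, not explicitly stated above) that the yield scales linearly with both corpus size and the number of target relations, the estimate can be reproduced as:

```python
# Starting point: the Google Summer of Code 2015 figures quoted above.
gsoc_docs, gsoc_facts, gsoc_relations = 55_000, 50_000, 5
target_docs, target_relations = 250_000, 50

facts_per_doc = gsoc_facts / gsoc_docs               # ~0.91 facts per document
relation_scale = target_relations / gsoc_relations   # 10x more relations
estimated_statements = target_docs * facts_per_doc * relation_scale
# ~2,272,727 statements, i.e. roughly the 2.25 million quoted in scope
```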

Project Plan

Implementation Details

Figure 1: Implementation workflow

The main linguistic theory we aim to implement is Frame Semantics.[14] A frame can be informally defined as an event triggered by some term in natural language text and embedding a set of participants, called frame elements. For instance, the sentence “Germany played Argentina at the 2014 World Cup Final” evokes the Match frame (triggered by the verb “played”) together with the Team and Opponent participants (respectively Germany and Argentina). This theory has led to the creation of FrameNet,[15] a general-purpose lexical database for English containing manually annotated textual examples of frame usage. Specialized versions include Kicktionary[16] for the soccer domain. Frame Semantics will enable the discovery of relations that hold between entities in raw text. Its implementation takes as input a collection of documents from a set of Web sources (i.e., the corpus) and outputs a structured knowledge base composed of machine-readable statements (according to the Wikibase data model terminology). The workflow is depicted in Figure 1 and proceeds as follows:

  1. Extraction of verbs via text tokenization, lemmatization, and part of speech tagging. Verbs serve as the frame triggers (also known as Lexical Units);
  2. Selection of top-N meaningful verbs through lexicographical and statistical analysis of the input corpus. The ranking is produced via a combination of term weighting measures such as TF/IDF and purely statistical ones such as standard deviation;
  3. Each selected verb will trigger one or more frames, depending on its ambiguity. The set of frames, together with their participants, represents the input labels for an automatic frame classifier, based on supervised machine learning,[17] namely Support Vector Machines (SVM);[18]
  4. Construction of a fully annotated training set, leveraging a novel crowdsourcing methodology[19][20] (implemented and published in our previous top-conference publications);
  5. Massive frame extraction on the input corpus via the classifier trained in the previous step;
  6. Structuring the extraction results to fit the Wikibase Data Model. A frame would map to a property, while participants would either map to Items or to values, depending on their role.
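The final structuring step can be sketched with the Match example from the Frame Semantics paragraph above. Everything here is hypothetical: the class, the mapping table and the role assignments are illustrative stand-ins for the actual frame-to-Wikibase mapping to be researched in task T8.1.

```python
from dataclasses import dataclass, field

# Illustrative encoding of a classified frame and its mapping onto a
# Wikibase-style (subject, property, value) statement.

@dataclass
class Frame:
    name: str                                     # e.g. "Match"
    lexical_unit: str                             # the trigger verb
    elements: dict = field(default_factory=dict)  # frame element -> entity

# Assumed mapping (placeholder): frame name -> (property label,
# frame element playing the subject role, element playing the value role).
FRAME_TO_PROPERTY = {"Match": ("played against", "Team", "Opponent")}

def frame_to_statement(frame):
    prop, subj_role, val_role = FRAME_TO_PROPERTY[frame.name]
    return (frame.elements[subj_role], prop, frame.elements[val_role])

match = Frame("Match", "played",
              {"Team": "Germany", "Opponent": "Argentina"})
statement = frame_to_statement(match)
```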

Contributions to the Wikidata Development Plan

In general, this project is intended to play a central role in the primary sources tool. A list of specific open issues follows.

Open issue | Phabricator ID | Reason
Framework for source checking | T90881 | StrepHit seems like a perfect match for this issue
Nudge editors to add a reference when adding a new claim | T76231 | Automatically suggesting references would encourage editors to fulfill these duties
Nudge when editing a statement to check reference | T76232 | Same as above

Work Package

The work package consists of the following tasks:

ID | Title | Objective | Month | Effort
T1 | Development corpus | Gather 200,000 documents from 40 authoritative Web sources | M1-M3 | 15%
T2 | State of the art review | Investigate reusable implementations for the StrepHit pipeline | M1 | 5%
T3 | Corpus analysis | Select the top 50 verbal lexical units that emerge from the corpus | M2-M3 | 5%
T4 | Production corpus | Regularly harvest 50,000 new documents from the selected sources | M2-M6 | 5%
T5 | Training set | Construct the training data via crowdsourcing | M3-M4 | 15%
T6 | Classifier testing | Train and evaluate the supervised classifier to achieve reasonable performance | M3-M4 | 20%
T7 | Frame extraction | Transform candidate sentences of the input corpus into structured data via frame classification | M5 | 5%
T8 | Web Sources Knowledge Base | Produce the final 2.5 million statements dataset and upload it to the primary sources tool | M5-M6 | 15%
T9 | Stable primary sources tool | Fix critical issues in the codebase | M5-M6 | 5%
T10 | Community dissemination | Promote the project and engage its key stakeholders | M1-M6 | 10%

Overlaps between certain tasks (in terms of timing) are needed for iterative planning.

Tasks Breakdown

The above tasks may be further split into the following subtasks, depending on the stated effort:

ID | Title | Description
T1.1 | Sources identification | Select the set of Web sources that meet minimal requirements
T1.2 | Sources scraping | Build scrapers to harvest documents from the set of Web sources
T3.1 | Verb extraction | Extract verbal lexical units via part of speech tagging
T3.2 | Verb ranking | Produce a ranked list of the most meaningful verbal lexical units via lexicography and statistics
T5.1 | Lexical database selection | Investigate the most suitable resource containing frame definitions
T5.2 | Crowdsourcing job | Post the dataset to be annotated to a crowdsourcing platform
T5.3 | Training set creation | Translate the annotation results into the training format
T6.1 | Evaluation set creation | Build a gold-standard dataset to assess the classifier performance
T6.2 | Frame evaluation | Reach an F1 measure of 0.75 in the frame classification
T6.3 | Frame elements evaluation | Reach an F1 measure of 0.70 in the frame elements classification
T8.1 | Data model mapping | Research a sustainable way to map the frame extraction output into the Wikibase data model
T8.2 | Dataset serialization | Serialize the frame extraction output into the QuickStatements syntax,[21] based on T8.1
T10.1 | Wikipedians + librarians engagement | These communities represent precious support for T1.1
T10.2 | Wikidatans engagement | Data curation and feedback loop for the Web Sources Knowledge Base
T10.3 | NLP developers engagement | Find collaborators to make StrepHit go multilingual
T10.4 | Open Data organizations engagement | Encourage them to donate data to Wikidata via the primary sources tool
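The target format of task T8.2 can be illustrated with a minimal sketch of the QuickStatements v1 syntax: tab-separated fields (item, property, value), optionally followed by reference pairs whose property is prefixed with "S" (S854 corresponds to P854, Wikidata's "reference URL" property). The helper function is hypothetical, and the Q/P identifiers and URL below are placeholders rather than real mappings.

```python
# Sketch of T8.2: serialize one extracted statement into a QuickStatements
# v1 line. S854 = P854 ("reference URL") used as a reference property.

def to_quickstatements(item, prop, value, ref_url=None):
    fields = [item, prop, value]
    if ref_url is not None:
        fields += ["S854", f'"{ref_url}"']  # string values are quoted
    return "\t".join(fields)

line = to_quickstatements("Q1", "P2", "Q3",
                          ref_url="http://example.org/reference")
# One line per statement; a dataset is simply these lines concatenated.
```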

Budget

The total amount requested is 30,000 USD.

Budget Breakdown

Item | Description | Commitment | PM(1) | Cost
Project Leader | Responsible for the whole work package | Full time (40 hrs/week) | 6 | 16,232 €
NLP Developer | Assistant for the StrepHit pipeline implementation (English language) | Part time (20 hrs/week) | 3 | 7,095 €
Training Set | Crowdsourced job payment for the annotation of training sentences | One-off | N.A. | 1,090 €
Dissemination | Participation (travel, board & lodging) in relevant community conferences, e.g., Wikimania 2016 | One-off | N.A. | 1,500 €
Total | | | | 25,917 €

(1) Person Months

The item costs are computed as follows:

  • the project leader's and the NLP developer's gross salaries are estimated from the standard salaries of the hosting research center (i.e., Fondazione Bruno Kessler),[22] namely "Ricercatore di terza fascia" (grade 3 researcher) and "Tecnologo/sperimentatore di quarto livello" (level 4 technologist). The salaries comply both with (a) the provincial collective agreement as per provincial law n. 14,[23] and with (b) the national collective agreement as per national law n. 240.[24] These laws respectively regulate research and innovation activities in the area where the research center is located (i.e., Trentino, Italy) and at the national level. More specifically, the former position is set to a gross labor rate of 16.91 € per hour, and the latter to 14.78 € per hour. The rates are in line with those of other national research institutions, such as the universities of Trieste,[25] Firenze,[26] and Roma;[27]
  • the training set construction job has an average cost of 4.35 ¢ per annotated sentence, for a total of 500 sentences for each of the 50 target relations;
  • the dissemination boils down to attending 2 relevant conferences.
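The training set line item can be checked directly from the figures above:

```python
# Crowdsourcing line item: 4.35 euro cents per annotated sentence,
# 500 sentences for each of the 50 target relations.
cost_per_sentence = 0.0435   # euros
sentences = 500 * 50         # 25,000 annotated sentences
training_set_cost = cost_per_sentence * sentences
# 1,087.50 euros, consistent with the 1,090 euros budgeted above
```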

The total budget expressed in Euros is approximately equivalent to the requested amount in U.S. Dollars, given the current exchange rate of 1.14 USD = 1 €.

N.B.: Fondazione Bruno Kessler will be physically hosting the grantees, but it will not be directly involved in this proposal: the project leader will serve as the main grantee and will appropriately allocate the funding.

Community Engagement

All the following target communities have been notified before the start of the project (cf. the community notification) and will be involved according to the different phases:

  • Wikidatans;
  • Wikipedians;
  • Librarians (and GLAM-related communities);
  • Natural Language Processing developers and researchers;
  • Open Data organizations.

The engagement process will mainly be based on a constant presence on community endpoints and social media, as well as on the physical presence of the project leader at key events.

Phase 0: Testing the Prototype

The FBK-strephit-soccer demonstrative dataset contains references extracted from sources in Italian. Hence, we have invited the relevant Italian communities to test it. This effort has a twofold impact:

  1. it may catch early signals to assess the potential of the project idea;
  2. it spreads the word about the primary sources tool.

Phase 1: Corpus Collection

The Wikipedia community has defined comprehensive guidelines for source verifiability.[28] It will therefore be crucial in the early stage of the project, as it can discover and/or review the set of authoritative Web sources that will form the input corpus. Librarians are also naturally vital to this phase, given the close relation to their professional activity.

Phase 2: Multilingual StrepHit

Besides the Italian demo dataset, the first StrepHit release will support the English language. We aim to attract Natural Language Processing experts to implement further language modules, since Wikidata publishes multilingual content and benefits from a multilingual community. We believe that references from sources in multiple languages will have a huge impact on the overall data quality.

Phase 3: Further Data Donation

The project outcomes will serve as an encouragement for third-party Open Data organizations to donate their data to Wikidata through a standard workflow, leveraging the primary sources tool.

Sustainability

Once the project gets integrated into the Wikidata workflow and the target audience gets involved, we can ensure further self-sustainability by fulfilling the following requirements:

  1. to enable a shared vision with strategic partners;
  2. to foster multilingual implementations of the StrepHit pipeline.

Out of Scope: the Vision

The project builds upon the findings of our previous research efforts, which aim to construct a knowledge base with large amounts of real-world entities of international and local interest (cf. Figure 2). The different Wikipedia chapters constitute its core. Governmental and research Open Data are interlinked to the knowledge base. This will allow the deployment of a central data hub acting as a reference access point for the user community. Hence, data consumers such as journalists, digital libraries, software developers or Web users in general will be able to leverage it as input for writing articles, enriching a catalogue, building applications or simply satisfying their information retrieval needs.

Figure 2: High-level project vision

Strategic Partners

We aim at sharing the aforementioned vision with the following partners (besides Wikidata):

Partner | Reason | Supporting references
Wikimedia Engineering community | Actively working on a similar vision | Wiki Loves Open Data[29][30] initiative, part of the quarterly goals[31]
Google | Responsible for the primary sources tool development | Primary sources tool codebase[32]
Freebase (now Google Knowledge Graph team) | Eventual migration of Freebase data to Wikidata | Freebase shutdown announcement,[33] migration project page,[34] migration FAQ[35]
Ontotext | Interested in collaborating under the umbrella of the Multisensor FP7 project | Multisensor project homepage,[36] Ontotext involvement[37]

Measures of Success

All the quantitative local metrics to measure success are related to the primary sources tool; they can be verified at the Wikidata primary sources status page and are presented below in descending order of specificity.(1)

  1. 50,000 new curated statements (namely the sum of approvals and rejections), currently 19,201;
  2. 100 new primary sources tool active users,(2) given that (a) the top 10 users have performed 22,971 actions and (b) the currently active Wikidata users amount to 15,603;[38]
  3. involvement of 5 data donors from Open Data organizations.

The following global metrics naturally map to the local ones:

  1. the number of active editors involved may apply to the primary sources tool;
  2. the number of articles created or improved in Wikimedia projects may apply to Wikidata Items;
  3. bytes added to or removed from Wikimedia projects.

From a qualitative perspective, success signals will be collected through:

  1. a dedicated Wikidata project page (similar to e.g., WikiProject Freebase);
  2. a Wikidata request for comment process;
  3. a survey.

(1) the displayed numbers were looked up on September 11th 2015;
(2) in order to count the users specifically engaged by StrepHit, a distinction among datasets should be clearly visible. However, the primary sources status API endpoint does not seem to handle dataset grouping yet. An issue has been filed in the code repository and will be included in task T9 of the work package.

References

  1. A typical English phonetic trick.
  2. https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
  3. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al.: DBpedia – a Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web (2014)
  4. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. pp. 1247–1250. ACM (2008)
  5. Vrandečič, D., Krötzsch, M.: Wikidata: a Free Collaborative Knowledge Base. Communications of the ACM 57(10), 78–85 (2014)
  6. According to the following query: http://tools.wmflabs.org/autolist/autolist1.html?q=claim[31:(tree[1478437][][279])]%20or%20claim[31:(tree[15991303][][279])]%20or%20claim[31:(tree[18543742][][279])]%20or%20claim[106:628099]%20or%20claim[106:937857]
  7. http://www.google-melange.com/gsoc/homepage/google/gsoc2015
  8. http://www.google-melange.com/gsoc/project/details/google/gsoc2015/edorigatti/5733935958982656
  9. http://it.dbpedia.org/2015/09/meno-chiacchiere-piu-fatti-una-marea-di-nuovi-dati-estratti-dal-testo-di-wikipedia/?lang=en
  10. https://github.com/dbpedia/fact-extractor
  11. http://us2.campaign-archive1.com/?u=e2e180baf855ac797ef407fc7&id=1a32bc675e
  12. https://twitter.com/pythontrending/status/639350253435621376
  13. https://www.reddit.com/r/MachineLearning/comments/3jdrds/fact_extraction_from_wikipedia_text_a_google/
  14. Fillmore, C.J.: Frame Semantics and the Nature of Language. In: Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language, pp. 20–32. Blackwell Publishing (1976)
  15. Baker, C.F.: FrameNet: a Knowledge Base for Natural Language Processing. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. vol. 1929, pp. 1–5 (2014)
  16. Schmidt, T.: The Kicktionary – a Multilingual Lexical Resource of Football Language (2009)
  17. https://en.wikipedia.org/wiki/Supervised_learning
  18. https://en.wikipedia.org/wiki/Support_vector_machine
  19. Fossati, M., Giuliano, C., Tonelli, S.: Outsourcing FrameNet to the Crowd. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. pp. 742–747 (2013)
  20. Fossati, M., Tonelli, S., Giuliano, C.: Frame Semantics Annotation Made Easy with DBpedia. Crowdsourcing the Semantic Web (2013)
  21. http://tools.wmflabs.org/wikidata-todo/quick_statements.php
  22. http://hr.fbk.eu/sites/hr.fbk.eu/files/ccpl_28set07_aggiornato_2009.pdf - page 82, Tabella B
  23. http://www.consiglio.provincia.tn.it/doc/clex_26185.pdf
  24. http://www.camera.it/parlam/leggi/10240l.htm
  25. https://www.units.it/intra/personale/tabelle_stipendiali/?file=tab.php&ruolo=RD - third item of the central dropdown menu
  26. http://www.unifi.it/upload/sub/money/2012/ric_td_lg_costo_tp.xls
  27. https://web.uniroma2.it/modules.php?name=Content&action=showattach&attach_id=15798
  28. https://en.wikipedia.org/wiki/Wikipedia:Verifiability
  29. https://www.wikidata.org/wiki/Wikidata:Wiki_Loves_Open_Data
  30. https://phabricator.wikimedia.org/T101950
  31. https://phabricator.wikimedia.org/T101100
  32. https://github.com/google/primarysources
  33. https://plus.google.com/109936836907132434202/posts/bu3z2wVqcQc
  34. https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase
  35. https://www.wikidata.org/wiki/Help:FAQ/Freebase
  36. http://www.multisensorproject.eu/
  37. http://www.multisensorproject.eu/project/partners/ontotext/
  38. https://www.wikidata.org/wiki/Special:Statistics

Get Involved

Participants

Marco Fossati is a researcher with a double background in Natural Languages and Information Technologies. He works at the Data and Knowledge Management (DKM) research unit at Fondazione Bruno Kessler (FBK), Trento, Italy. He is a member of the DBpedia Association board of trustees, and founder and representative of its Italian chapter. He has interdisciplinary skills both in linguistics and in programming. His research focuses on bridging the gap between Natural Language Processing techniques and large-scale structured knowledge bases in order to drive the Web of Data towards its full potential. His current interests involve Structured Data Quality, Crowdsourcing for Lexical Semantics annotation, and Content-based Recommendation Strategies.

  • Advisor Claudio Giuliano is a researcher with more than 16 years' experience in Natural Language Processing and Machine Learning. He is currently head of the Future Media High Impact Initiative unit at FBK, focusing on applied research to meet industry needs. He founded and led Machine Linking, a spin-off company incubated at the Human Language Technologies research unit: its main outcome is The Wiki Machine, an open source framework that performs word sense disambiguation in more than 30 languages by finding links to Wikipedia articles in raw text. Among The Wiki Machine applications, Pokedem is a socially-aware intelligent agent that analyses Italian politicians' profiles, integrating data from social media and news sources. Claudio will serve as the scientific advisor of this project.
  • Volunteer We use FrameNet in FP7 MultiSensor and devised embedding in NIF. Ontotext would be interested to help. Mainly with large-scale data/NLP wrangling. Vladimir Alexiev (talk) 13:43, 9 September 2015 (UTC)
  • Volunteer I would like to contribute embodied cognition concepts to StrepHit, by expanding the concept of lexical units, frames and scenarios, starting from Perception_active, Perception_body, Perception_experience.

Furthermore I would like to contribute to the discussion of the project, and to test the use case and how it could be applied in the future to a medical domain. Projekt ANA (talk) 22:17, 19 September 2015 (UTC)

  • Volunteer Use, test and give feedback. Danrok (talk) 17:13, 22 September 2015 (UTC)
  • Volunteer I am a student with background in computer science and computational linguistics. I have close to two years of experience in NLP. I can join this project as a freelance NLP developer Nisprateek (talk) 05:03, 25 September 2015 (UTC)
  • Volunteer Use case and validation (Andrea Bolioli, CELI) BolioliAndrea (talk) 14:36, 1 October 2015 (UTC)
  • Volunteer I'm a PhD student in NLP and I'd like to work as a freelance programmer. So far I have worked mainly on QA and, more precisely, on searching supporting passages for automatically answering open-domain questions.

I'd like to give a hand by building software for validating the statements already present on Wikidata, using the candidate statements extracted from text passages on authoritative sources. Auva87 (talk) 14:05, 24 December 2015 (UTC)

  • Volunteer Hi Marco, I'm Pasquale Signore, from the Università degli Studi di Bergamo and an intern at Wikimedia Italia.

I look forward to your feedback ;) PasqualeSignore (talk) 07:28, 5 May 2016 (UTC)

Community Notification

The following list collects the links where relevant communities have been notified of this proposal, along with any other relevant community discussions. The list is sorted in descending order of community specificity.

  • Wikidata

N.B.: As per the phase 0 of the community engagement plan, we have invited the relevant Italian-speaking communities to test the soccer demo dataset, since the references are extracted from Italian Web sources.

  • Wikipedia
  • Librarians
  • Natural Language Processing practitioners
  • Open Data organizations

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Community member: add your name and rationale here.
  • I support StrepHit because humans play an important role in improving the data quality on sites such as Wikidata. The reality of the Web of Data will come about through human and machine cooperation. BernHyland (talk) 15:15, 21 September 2015 (UTC)
  • Because it is an important issue to have reliable References. Crazy1880 (talk) 11:20, 22 September 2015 (UTC)
  • This is a promising direction for technological development, and an IEG project is a good umbrella for kick-starting this. I believe that the project can yield valuable results for Wikidata, in a way that is complementary to other ongoing efforts. I also see some critical points:
    • The project plan is very fine-grained, maybe too fine-grained for a 6 month project (speaking from experience here).
    • I would like a clearer commitment to creating workable technical infrastructure here. Content (extracted facts) should not be the main outcome of an IEG; rather there should be a fully open sourced processing pipeline that can be used after the project.
    • How does the interaction with OntoText fit into the open source strategy of WMF? (As far as I recall, OntoText does not have open source options for its products.)
      • I'm offering help with embedding FrameNet to NIF and data wrangling. Ontotext tools don't need to be used in the project at all. --Vladimir Alexiev (talk) 12:54, 8 December 2015 (UTC)
    • One of the main goals are 100 new active users of the primary sources tool. But how would this be measured? Since Primary Sources is still under development, it is to be expected that the user numbers will grow naturally over the next year. How can this be distinguished from the users attracted by this IEG project?
It would be good if these could be clarified. Yet, overall, I support the project. --Markus Krötzsch (talk) 12:34, 22 September 2015 (UTC)
  • I am very interested in this project and I would like to see it go forward. --CristianCantoro (talk) 07:50, 23 September 2015 (UTC)
  • Clearly Wikidata's database will be too big to be created and updated only by hand, so we do need some means to automate or semi-automate things as much as possible. It is surely possible for StrepHit to be a part of that. Danrok (talk) 09:27, 23 September 2015 (UTC)
  • Sourcing on Wikidata is important for the Wikidata project, as providing sourced claims is one pillar of Wikidata and of the MediaWiki world. Any efficient way to source Wikidata is welcome, since sourcing on Wikidata as it stands can be tedious; this is especially true of tools that enable reusing existing sources in the Wikimedia world and making them available to other projects. This will allow contributors to continue to source as they usually do, and the whole MediaWiki world to benefit. TomT0m (talk) 13:43, 23 September 2015 (UTC)
  • I'd like to endorse this proposal from the Wikidata development side. The quality of the data in Wikidata is the biggest topic around Wikidata development, and we have spent a lot of time on it over the past year. The proposal is a great extension to this. I am also happy to see continued effort put into the primary sources tool to make it the preferred tool for data imports. It is a crucial part of Wikidata's pipeline. --Lydia Pintscher (WMDE) (talk) 11:58, 24 September 2015 (UTC)
  • On behalf of the Wikimedia Italia board, I'd like to endorse this proposal. We already had a project for an annotation tool in a previous edition of the Google Summer of Code, but it wasn't followed up as we planned. We are really interested in this, and we look forward to the eventual results. -- Sannita - not just another it.wiki sysop (on behalf of WM-IT) 14:47, 25 September 2015 (UTC)
  • I think this is a great idea. Being able to include supporting references for each statement is one of the things which makes Wikidata stand out, but manually finding and adding good references is very time consuming, so it's no surprise that people (including me) rarely do it. The existing primary sources tool is a step in the right direction, but a large number of the references it currently suggests are questionable sources. What's needed is something like this which focuses on finding references in more reliable sources. Nikki (talk) 12:08, 28 September 2015 (UTC)
  • CELI (an Italian Natural Language Processing company) is interested in this project BolioliAndrea (talk) 15:02, 1 October 2015 (UTC)
  • It seems very promising, and strategic. --Alexmar983 (talk) 22:35, 1 October 2015 (UTC)
  • Support Support Having dabbled in this myself, I think it is quite worthwhile. --Magnus Manske (talk) 13:45, 7 October 2015 (UTC)
  • An automatic way to extract facts and supportive statements from web sources could be a significant contribution. Brilliant idea! Giulio.petrucci (talk) 13:05, 14 October 2015 (UTC)
  • StrepHit is a step towards developing an open Artificial Intelligence. Projekt ANA (talk) 16:06, 14 October 2015 (UTC)
  • It seems like a really interesting topic to investigate, independent of the outcome. Tobias1984 (talk) 16:28, 14 October 2015 (UTC)
  • Adding referenced statements to Wikidata is a key part of improving its quality and reliability. The project seems to address this important piece of improving Wikidata and free knowledge in general. Bene* (talk) 18:03, 14 October 2015 (UTC)
  • Because I consider most tasks that can reasonably be done by machines to be ones that should be done by machines, saving the more valuable human time for more difficult/interesting tasks. Popcorndude (talk) 21:05, 14 October 2015 (UTC)
  • This is a great idea Jimkont (talk) 07:40, 15 October 2015 (UTC)
  • I have enabled the "Primary Sources list" tool for Wikidata and tried it out. AFAIU, Primary Sources list gets its information from Freebase and from this suggested StrepHit project, which has a particular focus on football. I do not see with Primary Sources list which corpus the information comes from, but I tried Michael Laudrup as I thought StrepHit might be the source for that. What I experience is quite good suggestions for new claims in Wikidata. I think it is a very interesting project and well-aligned with a research project proposal of my own which focuses on scientific data. I would very much like to see it funded, both for the further development and for the documentation of methods and experiences, which will help the rest of us. Comment: "to identify a set of authoritative third-party Web sources..." It would be nice if there were an associated tool or standardized way so other people could add URLs and HTML tags as authoritative, rather than just identifying sources for this particular project. Finn Årup Nielsen (fnielsen) (talk) 10:12, 15 October 2015 (UTC)
  • Great project Platonides (talk) 19:19, 18 October 2015 (UTC)
  • Lack of referenced statements is the single biggest reason people I ask give for not using Wikidata more. --Saehrimnir (talk) 02:09, 19 October 2015 (UTC)
  • Support Support This is a big and important task the team has signed up for. We will learn a lot from this research. I'm happy to chat with the team to share some of my experiences from research in this area if they find it useful. --LZia (WMF) (talk) 17:51, 20 October 2015 (UTC)
  • Support Support Amazing idea. As someone responsible for a lot of unsourced wikidata statements :) I endorse this project. Maximilianklein (talk)
  • Support Support Part of #WikiCite2016 -- Would like to see some sort of end-user outcome, if that exists. -- Erika aka BrillLyle (talk) 16:13, 23 August 2016 (UTC)