Grants talk:IEG/StrepHit: Wikidata Statements Validation via References

From Meta, a Wikimedia project coordination wiki

Your feedback is crucial to the project! Feel free to add your thoughts here!

Repherence -> reference?

Is "Repherence" intended or should it be read as "Reference"? -- CristianCantoro (talk) 17:55, 7 September 2015 (UTC)

Exactly, you got the phonetic trick :-) --Hjfocs (talk) 17:03, 8 September 2015 (UTC)

Wikidata Annotation Tool Google Summer of Code 2014 project

In the GSoC of last year (2014) Apsdehal worked to build a Wikidata annotation tool. You can review the weekly updates for the project and grab the source code on GitHub. The idea at the core of the Wikidata Annotation Tool was using the semantic annotation tool Pundit to create statements to be automatically fed into Wikidata (keeping the original source as a reference). It looks to me that this project is quite similar in its spirit and goal; is my interpretation correct? I am (of course) happy to know that there are several efforts in this direction. --CristianCantoro (talk) 18:24, 7 September 2015 (UTC)

This is very much for texts outside of WMF. Consequently the tools part is most welcome. Thanks, GerardM (talk) 09:25, 8 September 2015 (UTC)
Thanks @CristianCantoro for the pointer! It indeed seems closely related work. Unfortunately, I cannot seem to try it: the Web service does not exist and I cannot log in to Pundit using the provided bookmarklet. From what I understand from the project completion report, this tool allows one to perform manual annotation based on some text fragment of interest, while StrepHit will do it automatically. Is my understanding close to the truth?
--Hjfocs (talk) 17:33, 8 September 2015 (UTC)
@Hjfocs, you are absolutely right. The Wikidata Annotation Tool is a tool for manually annotating facts from arbitrary web pages and pushing them to Wikidata. It was never meant to be automatic. --CristianCantoro (talk) 19:21, 8 September 2015 (UTC)

authoritative web sources and Wikipedia references

Hello Marco, congratulations, I like your draft proposal!

Your work packages and project goals are ambitious and realistic. For your use case soccer you propose a corpus collection from a set of authoritative web sources from 50 different sources like The Telegraph, Encyclopedia Britannica, DFB and Spiegel. 250,000 documents should be chosen from these sources, where 1 document yields 1 reference URL. In the use case in the proposal these are two media sources and one online encyclopedia which is not Wikipedia and one "official" site.

The English Wikipedia article for Cordoba already has 10 references, so it would be useful (and maybe quite easy) to add these references to StrepHit as some primary suggestions. (Or are they already somewhere there in the primary sources tool?)

Instead of the source The Telegraph I would recommend a more authoritative one, e.g. one containing this sentence: "The Germans, for example, always enjoy recalling the 1954 World Cup Final, better known as the ‘Miracle of Bern’, …. Whenever these two nations meet, football aficionados rummage through the history books to reference Austria’s 3-2 win at the 1978 World Cup in Argentina. …. The encounter is remembered as the ‘Miracle of Cordoba’ by one set of fans and the ‘Shame of Cordoba’ by the other."

I understand that this is probably a more challenging task than the sentence "The Miracle of Cordoba, when they eliminated Germany from the 1978 World Cup". But this task could show a lot of natural language processing capabilities. (By the way, is there really a property "eliminated in"? I did not find it.) --Projekt ANA (talk) 22:59, 12 September 2015 (UTC)

Hi @Projekt ANA, thanks for your interest and for your suggestions!
  • Regarding references in Wikipedia, there is Sourcery by @Magnus Manske. It is interesting to see that reference URLs proposed by StrepHit can also be added to Wikidata through this tool (I tried with Andrea Pirlo). I think that an integration between Sourcery and the primary sources tool can be of great benefit for Wikidata.
  • The sources selection phase is intended to stick to the Wikidata verifiability guidelines and the Wikipedia ones. That said, I agree the FIFA source you mentioned may be more authoritative than the Telegraph one, since FIFA is the "official" reference for soccer. The examples were all meant to be simple, just to facilitate the understanding of the idea: that is also the reason why you found a property that does not exist in Wikidata.
--Hjfocs (talk) 09:55, 14 September 2015 (UTC)

Frame Semantics and embodied cognition

The proposed implementation is based on Frame Semantics. Did you also consider an embodied cognition approach, like ITALK and Poeticon for the iCub? The soccer use case deals also with the human body, e.g. Player_Body_Parts has 41 Lexical Units in Embodied cognition and a model of human anatomy would be very useful for the use case. --Projekt ANA (talk) 12:07, 13 September 2015 (UTC)

Hi @Projekt ANA,
thanks again for your stimulating links, I did not consider such an approach. In which way do you think it can be integrated into StrepHit? Currently the focus is on verbal lexical units, which are likely to trigger factual frames and may be more suitable for Wikidata properties, while I assume human anatomy will be modeled through nominal lexical units, as in the kicktionary example. I am not sure how to map such relations to Wikidata, and am eager to hear your thoughts.
--Hjfocs (talk) 10:27, 14 September 2015 (UTC)
Hello @Hjfocs,
thank you for your quick response!
I think an integration of embodied cognition concepts into StrepHit could be started by expanding the concept of lexical units, frames and scenarios, starting from Perception_active, _body, _experience. Nominal lexical units can be transformed very often to verbs (e.g. the knee – to knee; feeling - to feel).
Let us take the sentence again:
The Miracle of Cordoba, when they eliminated Germany from the 1978 World Cup
One result is
<Germany, eliminated in, Miracle of Cordoba> (assumed that there will be this property)
But there are also:
<Germany, participant, Miracle of Cordoba>
<Germany, participant, 1978 World Cup>
<they, participant, 1978 World Cup>
<they, participant, Miracle of Cordoba>
<Miracle of Cordoba, location, Cordoba>
<Miracle of Cordoba, point in time, 1978>
<1978 World Cup, point in time, 1978>
<1978 World Cup, location, Cordoba>
<Miracle of Cordoba, part of, 1978 World Cup>
And if we take
"The Germans, for example, always enjoy recalling the 1954 World Cup Final, better known as the ‘Miracle of Bern’, …. Whenever these two nations meet, football aficionados rummage through the history books to reference Austria’s 3-2 win at the 1978 World Cup in Argentina. …. The encounter is remembered as the ‘Miracle of Cordoba’ by one set of fans and the ‘Shame of Cordoba’ by the other."
We could also get:
<Miracle of Cordoba, named after, Miracle of Bern>
This would be because of the similarity of the phrase (Miracle of …) and because of the context of this reference in which both “Miracles” are cited (remains to be seen if this named after is correct or not).
Embodied cognition also aims at emotions and feelings – to enjoy a victory, but also to mock the loser, and the corresponding frame has to deal with the sarcasm of the miracle.
May I ask, what are your plans concerning the parser? Will you use TurboParser v2.3 (on GitHub)? And will you use the 1214 Lexical Units? And SEMAFOR and the 1926 Lexical Units (en-de-fr) of the kicktionary, including the 16 scenes? Or will you develop a different parser? Thank you very much for your information!
Cheers --Projekt ANA (talk) 22:02, 15 September 2015 (UTC)
Hey @Projekt ANA,
Let me clarify that the approach I propose follows a data-driven, bottom-up strategy: the set of frames that will be used will emerge after the input corpus analysis step, according to the set of top verbal lexical units (cf. point #2 in the implementation details and T3 in the work package).
That said, I like your idea to first choose the frames you mentioned and then develop the pipeline.
The additional statements you pointed out can indeed be extracted, as they may all represent frame elements (depending on the frame definition). This is not made explicit in my examples just for the sake of simplicity.
Your example seems a challenging one indeed, as it would require further NLP techniques: for instance, co-reference resolution may be needed to extract <Miracle of Cordoba, named after, Miracle of Bern>.
With respect to the parser, the idea is to apply part-of-speech tagging only and to avoid syntactic parsing. Hence, we may consider reusing the TurboParser part-of-speech tagging module. A review is planned, cf. T2 in the work package. The same applies to SEMAFOR, although the starting idea is to use a supervised classifier based on SVM.
Regarding the lexical units, it is not planned to exploit the full FrameNet or Kicktionary, since lexical units will emerge from the corpus analysis step. A subset of those lexical databases shall then be considered.
Cheers! --Hjfocs (talk) 10:54, 17 September 2015 (UTC)
P.S.: why don't you consider collaborating on the project? You can do so through the blue join button on the proposal page.
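
To make the corpus analysis step discussed in this thread more concrete, here is a minimal sketch of how the top verbal lexical units might be ranked from part-of-speech tagged text. It is only an illustration, not the StrepHit implementation: the function name, the input format (already lemmatized tokens paired with Penn Treebank tags) and the toy sample are all assumptions.

```python
from collections import Counter


def top_verbal_lexical_units(tagged_tokens, n=10):
    """Rank verb lemmas by frequency in a POS-tagged corpus.

    tagged_tokens: iterable of (lemma, tag) pairs. The tags are assumed
    to follow the Penn Treebank convention, where verb tags start with "VB".
    """
    verbs = Counter(
        lemma.lower()
        for lemma, tag in tagged_tokens
        if tag.startswith("VB")
    )
    return verbs.most_common(n)


# Toy, hand-tagged input (hypothetical; a real run would use a tagger's output)
sample = [
    ("they", "PRP"), ("eliminate", "VBD"), ("Germany", "NNP"),
    ("Austria", "NNP"), ("win", "VBD"), ("the", "DT"), ("match", "NN"),
    ("fan", "NNS"), ("recall", "VBP"), ("how", "WRB"),
    ("Austria", "NNP"), ("eliminate", "VBD"), ("Germany", "NNP"),
]

ranking = top_verbal_lexical_units(sample, n=2)
print(ranking)
```

Only part-of-speech tags are needed here, which matches the idea of avoiding syntactic parsing; the frames would then be chosen starting from the most frequent verbs.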

9/29/15 Proposal Deadline: Reminder to change status to 'proposed'

Hi Hjfocs,

This draft is looking like it's well on its way. I'm writing to remind you to make sure to change the status of your proposal from 'draft' to 'proposed' by the September 29, 2015 deadline in order to submit it for review by the committee in the current round of IEG. If you have any questions or would like to discuss your proposal, let me know. We're hosting a few IEG proposal help sessions this month in Google Hangouts. I'm also happy to set up an individual session. Warm regards, --Marti (WMF) (talk) 20:44, 20 September 2015 (UTC)

Thanks for the reminder @Marti (WMF), the status is updated. Looking forward to hearing @I_JethroBT (WMF)'s thoughts as well, then we can schedule a discussion. Cheers --Hjfocs (talk) 08:37, 22 September 2015 (UTC)
@Hjfocs: Great! Thanks for the heads up. I_JethroBT and I are jointly hosting the remaining IEG Proposal Clinics via Hangouts. If you'd like live feedback, you're welcome to join! --Marti (WMF) (talk) 12:59, 22 September 2015 (UTC)

Clarifications as per Markus's endorsement

Hi @Markus, thanks for your endorsement!

Let me address here the points you raised:

  • The project plan is very fine-grained, maybe too fine-grained for a 6 month project (speaking from experience here).
The Work Package has been built with as much specificity and pragmatism as possible. On the other hand, I understand that the tasks may be broken down too much: the risk here would be the additional effort of reporting possible changes in the subtasks. Or do you think the work package is too optimistic and I should be more conservative?
  • I would like a clearer commitment to creating workable technical infrastructure here. Content (extracted facts) should not be the main outcome of an IEG; rather there should be a fully open sourced processing pipeline that can be used after the project.
I completely agree and have not highlighted this, since I assumed it is an implicit requirement. I will stress this in the goals.
  • How does the interaction with OntoText fit into the open source strategy of WMF? (As far as I recall, OntoText does not have open source options for its products.)
Ontotext has expressed its willingness to collaborate as a partner of the European Union Multisensor research project. Due to the public nature of such EU-funded efforts, I assume its outcomes will be published as open source. I will check that with @Vladimir Alexiev.
We'd contribute effort and advice, not closed source tools. In Multisensor we're using open source tools (e.g. SEMAFOR), and converting to RDF (an embedding of FN into NIF). We're interested to advance our FN knowledge and skills, and interested in WD in general. --Vladimir Alexiev (talk) 14:04, 28 September 2015 (UTC)
  • One of the main goals are 100 new active users of the primary sources tool. But how would this be measured? Since Primary Sources is still under development, it is to be expected that the user numbers will grow naturally over the next year. How can this be distinguished from the users attracted by this IEG project?
A qualitative measure to catch StrepHit-specific success signals can be set up, using for instance a dedicated project page, a request for comment and a survey.
From a quantitative perspective, the primary sources status API endpoint does not seem to handle dataset grouping yet. I opened a ticket in the repo, and will explicitly mention this in the proposal.

Cheers! --Hjfocs (talk) 15:38, 23 September 2015 (UTC)

Basic flaw

In this proposal it is suggested that the primary sources tool is a success. It is not. Arguably data languishes in there and there is hardly any move of data to Wikidata. When this is how success is measured, show me failure. Thanks, GerardM (talk) 11:04, 27 September 2015 (UTC)

@GerardM: these seem to be value judgments that are not supported by evidence. The primary sources tool status page reports a constantly growing number of edits day by day: as of September 28th, they amount to 24,472, compared to 19,201 as of two weeks earlier, thus showing a steady progression. New top users are also regularly appearing. Hence, this suggests the tool is gaining traction even though it is still under development. See also this article for a higher level overview and @Markus's comment above. The quantitative measures of success for StrepHit are built upon such evidence, which contradicts your criticism. Cheers, --Hjfocs (talk) 08:57, 28 September 2015 (UTC)
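
As a quick check of the figures quoted above, the growth they imply can be worked out as follows. This is only arithmetic on the two reported counts; taking "two weeks" as exactly 14 days is an assumption.

```python
edits_start = 19_201   # count reported for two weeks earlier
edits_end = 24_472     # count reported as of September 28th
days = 14              # assumption: "two weeks" taken as exactly 14 days

added = edits_end - edits_start           # new edits in the period
per_day = added / days                    # average new edits per day
growth_pct = 100 * added / edits_start    # relative growth over the period

print(f"{added} new edits, about {per_day:.1f} per day, {growth_pct:.1f}% growth")
```

That is roughly 5,300 additional edits in two weeks, or about a quarter more than the earlier total.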

Eligibility confirmed, round 2 2015

This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for round 2 2015 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 2 2015 begins on 20 October 2015, and grants will be announced in December. See the schedule for more details.

Questions? Contact us.

Marti (WMF) (talk) 19:20, 29 September 2015 (UTC)

fnielsen's endorsement feedback

Hi @fnielsen,

first, thanks for your endorsement! I like your idea of a URL + HTML tags whitelist for verifiable sources a lot. I'm not sure whether the tool you suggest is already available or not. @Platonides maintains a spam blacklist, maybe he can point us to something relevant. Cheers! --Hjfocs (talk) 09:22, 19 October 2015 (UTC)

@Fnielsen: This page contains a URL blacklist for the primary sources tool, although it does not seem to be as exhaustive as @Platonides's one. As a first step, we could add a whitelist page there. Cheers, --Hjfocs (talk) 09:55, 20 October 2015 (UTC)
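
To illustrate how a URL whitelist could work alongside a blacklist, here is a minimal sketch. It covers only the URL part of the idea (not the HTML tags), it is not the primary sources tool's actual logic, and the function names and example domains are hypothetical.

```python
from urllib.parse import urlparse


def _matches(host, domains):
    """True if host equals one of the domains or is a subdomain of it."""
    return any(host == d or host.endswith("." + d) for d in domains)


def is_allowed(url, whitelist, blacklist):
    """Accept a reference URL only if its host is whitelisted and not blacklisted."""
    host = urlparse(url).hostname  # lowercased by urlparse; None if missing
    if host is None:
        return False
    if _matches(host, blacklist):
        return False
    return _matches(host, whitelist)


# Hypothetical lists, for illustration only
whitelist = {"fifa.com", "britannica.com"}
blacklist = {"spam.example"}

print(is_allowed("https://www.fifa.com/some-article", whitelist, blacklist))
print(is_allowed("http://spam.example/page", whitelist, blacklist))
```

In this sketch the blacklist takes precedence, and any host that is on neither list is rejected, which is the stricter of the two possible defaults.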

Aggregated feedback from the committee for StrepHit: Wikidata Statements Validation via References

Scoring criteria (see the rubric for background) Score
1=weak alignment 10=strong alignment
(A) Impact potential
  • Does it fit with Wikimedia's strategic priorities?
  • Does it have potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Innovation and learning
  • Does it take an innovative approach to solving a key problem?
  • Is the potential impact greater than the risks?
  • Can we measure success?
(C) Ability to execute
  • Can the scope be accomplished in 6 months?
  • How realistic/efficient is the budget?
  • Do the participants have the necessary skills/experience?
(D) Community engagement
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
  • Does it support diversity?
Comments from the committee:
  • Anything that helps improve references in Wikidata can potentially amplify knowledge across all projects. This is a great idea and I hope it works.
  • Automated or semi-automated addition of references to Wikidata (and other Wikimedia projects) has been on the movement's wish-list for a long time. A successful implementation, even if limited to a subset of commonly-used sources, has the potential to produce significant impact on content quality.
  • The project seems to be something that would produce content, and afterwards the code will be available but should be run again to produce additional content. I see it as an experimental project.
  • The (answered) criticisms of GerardM and MarkusK reflect my thinking.
  • Any project that uses Wikidata can use the outcomes of this project
  • The measures of success are well-defined and easy to measure, and the results of the project are likely to lead to further research in the use of NLP for automated or assistive editing.
  • A good approach and the next natural step. The risk is that the ask is for $30,000 and, as MarkusK points out, there is a good chance of failure.
  • Not sure whether the team has the necessary Wikimedia experience, but they seem to have lots of volunteers and support.
  • The implementation plan seems reasonable, and the proposed budget is commensurate to the level of effort required. The applicants should be commended for identifying a third-party basis for the proposed labor rates.
  • See Markus' concerns about scope. The budget is probably reasonable for the hours they're specifying and they have the necessary skills and experience.
  • Wikidata is a way to help smaller projects move forward and the translation tool is a way to amplify work already done in specific projects.
  • There is significant community engagement and support, as well as support from movement partner institutions.
  • Seems to have buy in from the Wikidata community!
  • Whether or not this succeeds, I believe that we need what they are trying to achieve, so we can learn from this attempt.
  • I would like to know how it will integrate into Wikimedia projects. The delivery would be the creation of this content and the code to produce or replicate it. It would be interesting, would like to analyze impacts in greater detail.
I would like to thank all the reviewers for their comments.
Let me address below the points needing additional explanations.
  • The applicants should be commended for identifying a third-party basis for the proposed labor rates.
I have included in the proposal a supplementary clarification of the budget items concerning the human resources.
  • See Markus' concerns about scope.
I would like to stress that the Work Package has been built upon previous work carried out under the Google Summer of Code 2015 program (cf. this section). The project was successful not only in terms of technical implementation, but also in terms of community attraction, even though it was achieved in a shorter time frame (i.e., 3 months), and at a smaller scale.
  • how it will integrate into Wikimedia projects.
StrepHit is intended to serve as a reusable tool which will generate content for Wikidata, when run. Wikidata was born to become the central structured data hub for all the Wikimedia projects: currently, a large number of wikis can already be fed by Wikidata via the arbitrary access efforts, most notably 250 Wikipedia language chapters, Wikisource, and Wikivoyage. Hence, StrepHit will potentially benefit all the Wikimedia projects thanks to its inclusion in what is meant to become the standard flow for content addition in Wikidata, namely the primary sources tool.
As a final note, let me highlight a crucial point for further development beyond the IEG time frame, for which I feel I have a special interest and skills: multilingualism, as pointed out in phase 2 of the community engagement.
English is scheduled for implementation in the 6-month scope, due to its high coverage, thus high impact. However, I would love to make StrepHit support more and more languages: since I am proficient in French and Spanish, and a native speaker of Italian, these would be the next ones. Therefore, reaching out to such language communities at an early stage is an important step.
Cheers, --Hjfocs (talk) 14:36, 20 November 2015 (UTC)

Round 2 2015 decision

Congratulations! Your proposal has been selected for an Individual Engagement Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $30,000.

Comments regarding this decision:
The Committee is excited to be a funding partner in the investigation into what has been described as a “canonical question” of the Wikimedia movement. We appreciate your work to establish key partnerships--as reflected in your endorsements--which are critical to this project’s success, and we would like to see some of these formalize into advisor positions, where possible. We look forward to the research documentation to come from this project, as well as the concrete data quality enhancements on Wikidata.

Next steps:

  1. You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
  2. Review the information for grantees.
  3. Use the new buttons on your original proposal to create your project pages.
  4. Start work on your project!
Questions? Contact us.

Letters of Support

Support from the Spanish DBpedia

Many members of our research group at UPM (Universidad Politécnica de Madrid) and I are delighted with the StrepHit project. We consider that this initiative will increase the value of DBpedia, one of the most used datasets. Additionally, this project can also provide a valuable generic infrastructure for multilingual conversion of natural language to semantic data, applicable to other areas outside DBpedia.

Therefore, we want to show our interest to the people responsible for funding this project.

The Spanish DBpedia is the second DBpedia, just after the English version, in terms of number of semantic data generated from Wikipedia. The results of the phase 1 of the StrepHit project, focused on the Italian DBpedia, are very promising, and its extension to languages such as Spanish would produce an important increase in the number of semantic data available. These new data could be exploited not only in academia but by companies. The analysis of the data requests made to the Spanish DBpedia shows an increasing interest by companies; even higher, in terms of number of requests, than the ones from academia.

- Mariano Rico (responsible for the Spanish DBpedia)

Support from ContentMine

ContentMine and StrepHit have a very similar interest in mining and extracting information in bulk from sources and producing a machine readable output. We're interested in making any data that the community would like in Wikidata quickly and easily available. To this end we're very excited about the work done by Marco on both the Primary Sources tool and in aligning the kind of information extracted from fulltext sources with the Wikidata datamodel. We're hoping to start using the Primary Sources tool in the next few days/weeks as a means of making our output available.

- T. Arrow ContentMine Developer Bomarrow1 (talk) 13:23, 28 June 2016 (UTC)

The role of usability

It seems the grant currently focuses a lot on the data-mining part. Improving the usability, however, isn't a big part of the grant. I would advocate that future grants for StrepHit keep usability concerns more in mind and fix issues such as the page reloading every time a statement is approved. ChristianKl (talk) 08:07, 24 August 2016 (UTC)