Grants:IEG/StrepHit: Wikidata Statements Validation via References/Midpoint


Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.


In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

       Planned achievements, as per the project goals and timeline:
  1. Web Sources Development Corpus: 1.6 M items, 500 k documents (biographies), 53 reliable sources;
  2. Candidate Relations Set: 50 relations;
  3. Primary Sources Tool: increased development activity [1], 2 merged pull requests [2], [3].
       Bonus achievements, beyond the goals:
  1. Web Sources Corpus: +300 k (+150%) documents, +13 sources, compared to the expected size of the development corpus (cf. Work Package T1);
  2. Semi-structured Development Dataset: 100 k Wikidata statements.
       Codebase: 6.7 k lines of Python code, 311 commits, 10 open issues, 13 closed issues.

Methods and activities

Whiteboard with Yellow Stickers to Manage Consolidated and New Ideas Respectively

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

Technical Setup

  • We requested the credentials and created a GitHub repository within the Wikidata organization [4];
  • the official documentation page is hosted at [5] (work in progress);
  • besides the planned work package, special development efforts have been devoted to:
    • a modular architecture;
    • parallel processing;
    • caching;
    • the possibility to use StrepHit both as a library and as a set of command line tools;
    • an easy-to-use command line to run all the pipeline steps;
    • a flexible logging facility.
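As an illustration of how the items above can fit together, here is a minimal sketch, not StrepHit's actual code. All names (normalize, process_all, main, the pipeline_sketch logger) are hypothetical. It assumes a single toy pipeline step, an in-memory cache, and a thread pool; a process pool would be the more likely choice for CPU-bound steps.

```python
import argparse
import logging
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

logger = logging.getLogger("pipeline_sketch")  # hypothetical logger name


@lru_cache(maxsize=None)
def normalize(text):
    """Hypothetical pipeline step: collapse whitespace. Results are cached in memory."""
    logger.debug("normalizing %r", text)
    return " ".join(text.split())


def process_all(texts, workers=4):
    """Library entry point: apply the step to all inputs in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalize, texts))


def main(argv=None):
    """Command line entry point that wraps the same library function."""
    parser = argparse.ArgumentParser(description="Toy pipeline step")
    parser.add_argument("texts", nargs="+")
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args(argv)
    logging.basicConfig(level=args.log_level.upper())
    for line in process_all(args.texts, args.workers):
        print(line)
    return 0
```

The same function is reachable from Python code via process_all([...]) and from a shell via main(), which is the point of the library/command line split; the log level is set once, at the command line boundary.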

Project Management

  • Monday face-to-face meetings for brainstorming ideas and weekly planning;
  • daily scrums, especially for unexpected technical issues, but also for brainstorming;
  • whiteboard for crystallized ideas;
  • yellow stickers on the whiteboard for ideas to be investigated;
  • regular interaction with relevant mailing lists and key people to discuss potential impacts and to gather suggestions;
  • project dissemination in the form of seminars and talks.

Research Activities

As a further outreach point for research communities, we have submitted a full article to the Semantic Web Journal [6], which ranks among the top journals worldwide in the Information Systems field [7]. The whole process is known to be time-consuming: so far, we have uploaded a first version [8], which focuses on past efforts carried out with DBpedia. It has passed the first round of reviews. We are currently working on a major revision that will include more details concerning StrepHit.

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

From a technical perspective, the project has so far delivered software and content. Specifically, with respect to software, the following modules have reached a mature state:

  • Web Sources Corpus [9], i.e., a set of Web spiders that harvest data from the selected biographical authoritative sources [10];
  • Corpus Analysis [11], i.e., a set of scripts to process the corpus and to generate a ranking of the Candidate Relations;
  • Commons [12], i.e., several facilities to ensure a scalable and reusable codebase. The general-purpose ones include parallel processing, fine-grained logging, and caching. On the Natural Language Processing (NLP) side, the components, such as tokenization [13], sentence splitting [14], and part-of-speech tagging [15], are kept modular, in order to ease future multilingual implementations.
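To make the idea of modular NLP components concrete, here is a minimal sketch of a per-language component registry. It is an assumption about one possible design, not the actual StrepHit code: the COMPONENTS registry, the register decorator, and the naive regex-based English components are all hypothetical.

```python
import re

# Hypothetical registry: language code -> {component name -> function}.
COMPONENTS = {}


def register(language, name):
    """Decorator that registers a component function for a given language."""
    def decorator(func):
        COMPONENTS.setdefault(language, {})[name] = func
        return func
    return decorator


@register("en", "sentence_splitter")
def split_sentences_en(text):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


@register("en", "tokenizer")
def tokenize_en(sentence):
    # Naive rule: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def get_component(language, name):
    """Look up a component, failing loudly if the language is not supported yet."""
    try:
        return COMPONENTS[language][name]
    except KeyError:
        raise ValueError(f"no {name!r} registered for language {language!r}") from None
```

With such a layout, supporting a new language would amount to registering new functions under its language code, while the callers keep using get_component unchanged.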

The following modules have been started and are in active development:

  • Extraction [16], i.e., the logic needed to extract different sets of sentences, to be used for training and testing the classifier, as well as for the actual production of Wikidata content;
  • Annotation [17], i.e., a set of scripts to interact with the CrowdFlower crowdsourcing platform APIs, in order to create and post annotation jobs, and to pull results.

Content outcomes are presented in the next sections.

Milestone #1: Web Sources Development Corpus

Download the resource.

Distribution of StrepHit IEG Web Sources Corpus Biographies according to the length in characters
Distribution of StrepHit IEG Web Sources Corpus Items with Biographies and without Biographies
Pie Chart of StrepHit IEG Web Sources Corpus Items across Source Domains
Pie Chart of StrepHit IEG Web Sources Corpus Biographies across Source Domains

Items & Biographies across Web Domains

Source domain # items # biographies
447,045 10,621
355,784 7,988
206,993 0
199,502 199,496
118,883 101,117
60,403 60,355
40,331 40,331
38,018 1,321
37,313 0
19,696 9,848
19,086 19,086
13,858 13,850
10,679 0
8,721 8,719
7,044 0
6,959 6,921
6,378 5,631
6,340 0
4,470 4,470
3,952 3,927
3,407 3,407
2,442 2,259
2,060 2,060
1,596 1,580
650 0
627 585
601 601
525 0
Total 1,623,381 504,189

Items & Biographies Wikisource Breakdown

Source # items # biographies
DNB 28,001 27,997
Catholic Encyclopedia 11,466 11,462
Naval Bio 4,692 4,688
Indian Bio 2,440 2,427
American Bio 2,209 2,207
National Bio 1912 1,631 1,631
Australasian Bio 1,590 1,590
Irish Officers 1,530 1,524
Bio English Lit 1,346 1,340
Men at the Bar 1,115 1,115
National Bio 1901 1,033 1,033
Christian Bio 921 921
Musicians 702 702
Freethinkers 546 546
Men of Time 432 431
Chinese Bio 245 245
English Artists 223 223
Medical Bio 109 109
Portraits and Sketches 50 50
Who is who in China 47 47
Greek Roman bio Myth 37 37
Modern English Bio 11 11
Who is who America 10 10
Total 60,403 60,355

Milestone #2: Candidate Relations Set

The ranking is composed of verbs discovered via the corpus analysis module. Each of them will trigger a set of Wikidata properties, depending on the number of frame elements (FEs, cf. the set above).

Currently, a total of 173 distinct FEs are extracted. The final number of Wikidata properties will depend on a mapping, planned as per Work Package T8.1. We have already implemented a straightforward automatic mapping facility, based on string matching.


  1. bear
  2. issue
  3. work
  4. print
  5. play
  6. live
  7. include
  8. exhibit
  9. write
  10. paint
  11. serve
  12. study
  13. appoint
  14. return
  15. go
  16. name
  17. appear
  18. call
  19. leave
  20. draw
  21. lead
  22. record
  23. move
  24. found
  25. join
  26. begin
  27. teach
  28. elect
  29. remain
  30. succeed
  31. produce
  32. act
  33. enter
  34. establish
  35. add
  36. create
  37. continue
  38. travel
  39. win
  40. visit
  41. form
  42. send
  43. command
  44. bring
  45. attend
  46. retire
  47. promote
  48. meet
  49. kill
  50. employ
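The string-matching mapping mentioned before the ranking can be sketched as follows. This is a minimal illustration under stated assumptions, not the facility actually implemented in StrepHit: the function names are hypothetical, the FE labels in the example are made up, and matching is assumed to be exact equality after a simple normalization.

```python
import re


def normalize(label):
    """Lowercase a label and reduce it to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", label.lower()))


def map_labels(fe_labels, property_labels):
    """Map each FE label to the Wikidata property IDs whose label matches it.

    property_labels is a dict of property ID -> English label.
    Matching is exact string equality after normalization (an assumption).
    """
    index = {}
    for pid, label in property_labels.items():
        index.setdefault(normalize(label), []).append(pid)
    return {fe: index.get(normalize(fe), []) for fe in fe_labels}


# Illustrative input: three Wikidata property labels and made-up FE labels.
properties = {"P19": "place of birth", "P569": "date of birth", "P108": "employer"}
mapping = map_labels(["Place_of_birth", "Employer", "Instrument"], properties)
print(mapping)
# {'Place_of_birth': ['P19'], 'Employer': ['P108'], 'Instrument': []}
```

FE labels left with an empty list are the ones that a purely string-based approach cannot resolve, and would need a manual or more flexible mapping.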

Bonus Milestone: Semi-structured Development Dataset

Download the resource.

During the corpus collection phase, we were asked (thanks Spinster!) to include sources with semi-structured data (cf. the list of selected sources), typically names and dates.

The result is a dataset that caters for the following Wikidata properties:

Sample Statements

Machine-readable ones are expressed in the QuickStatements syntax [18].

Correct Examples
Machine | Human
Q389547 P570 +00000001837-01-01T00:00:00Z/9 S854 "" | According to BBC Your Paintings, Charles Howard Hodges died in 1837
Q17355708 P1477 "emma nicol" S854 ",_Emma_(DNB00)" | According to the Dictionary of National Biography, Emma Nicol's birth name is "emma nicol"
Q594729 P21 Q6581097 S854 "" | According to the Union List of Artist Names, Anton Teichlein is male
Q215502 P742 "Morgan, Henry" S854 "" | According to the British Museum, Henry Morgan's pseudonym is "Morgan, Henry"
Q1562861 P569 +00000001939-08-21T00:00:00Z/11 S854 "" | According to the Notable Names Database, Clarence Williams III was born on 21 August 1939
Questionable Examples
Machine | Human | Comments
Q3770981 P1477 "giusepe melani" S854 "" | According to Union List of Artist Names, Giuseppe Melani's birth name is "giusepe melani" | the source is wrong (typo?)
Q598060 P742 "Martyr Vermigli, Peter" S854 "" | According to the British Museum, Peter Martyr Vermigli's pseudonym is "Martyr Vermigli, Peter" | debatable source assertion and Wikidata property label
Q57297 P742 "E.W.L.T.; Ernesto Guglielmo Temple ;" S854 "" | According to the Database of Scientific Illustrators, Wilhelm Tempel's pseudonym is "E.W.L.T.; Ernesto Guglielmo Temple ;" | wrong parsing of the source data
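For readers unfamiliar with the machine-readable form, the sketch below shows how statements shaped like the samples above could be assembled. It is an illustration only, not the project's serializer: the helper names are hypothetical, the reference URL is a placeholder, and the time format simply mirrors the sample values (year padded to 11 digits, precision 9 for a year and 11 for a day). QuickStatements separates fields with tabs, hence the default separator.

```python
YEAR, MONTH, DAY = 9, 10, 11  # Wikidata time precision codes


def time_value(year, month=1, day=1, precision=YEAR):
    """Format a date like the samples, e.g. +00000001837-01-01T00:00:00Z/9."""
    return f"+{year:011d}-{month:02d}-{day:02d}T00:00:00Z/{precision}"


def quoted(text):
    """String values and URLs are wrapped in double quotes."""
    return f'"{text}"'


def statement(item, prop, value, reference_url, sep="\t"):
    """Build one statement backed by a reference URL (source property S854)."""
    return sep.join([item, prop, value, "S854", quoted(reference_url)])


# Placeholder URL: the actual reference URLs are not shown in the samples.
url = "https://example.org/source"
print(statement("Q389547", "P570", time_value(1837), url, sep=" "))
# Q389547 P570 +00000001837-01-01T00:00:00Z/9 S854 "https://example.org/source"
print(statement("Q215502", "P742", quoted("Morgan, Henry"), url, sep=" "))
# Q215502 P742 "Morgan, Henry" S854 "https://example.org/source"
```

Keeping the precision explicit matters: the same timestamp with /9 asserts only the year, while /11 asserts the full day.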

References Statistics

Domain # references
6,262
17,456
238
418
1,166
366
247
103
5,923
2,416
254
387
33,452
9,847
240
501
17,296
2,465
1,577
39
Total 100,266

Community Outreach


  • Kick-off seminar at FBK, Trento, Italy
  • Talk at Wikipedia's 15th anniversary, co-located with the event Web 3.0, il potenziale del web semantico e dei dati strutturati, Lugano, Switzerland
  • Semantic Web Journal submission



Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far?

Yes. Compared to the initial plan, the only variance (below 10% of the overall budget) concerns the NLP developer's starting date, which was expected to be the beginning of the project (11th January 2016), but was actually 1st February 2016. Consequently, the expense difference should move to the project leader budget item. The Finances page has been updated accordingly: items are converted to USD and rounded to fit the total budget.

Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

The dissemination budget item may be lower than planned, due to the relatively low cost of the scheduled events. We expect the variance to be negligible: if needed, we will adjust the item and reallocate the difference to the training set creation or the project leader items.


The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.


Almost every challenge is technical, and most of them stem from NLP. In order of decreasing impact:

  • input corpus
    • a relatively big input corpus from several sources introduces the need to cope with high language variability;
    • certain documents are written in old English, others stem from the OCR output of a paper scan, etc.
  • target lexical database
    • it is unlikely that FrameNet would be a perfect fit for the data we aim at generating;
    • this especially applies to the crowdsourcing part, since labels and definitions are minted by expert linguists, but presented to laymen.
  • primary sources tool
    • contributing to the maintenance of a third-party resource with generally low development activity can be time-consuming;
    • it entails various tasks, from understanding possibly undocumented source code, to nudging the maintainers to address issues, all the way to accessing the machine that hosts the tool.
  • scalability
    • it should always be taken into account when writing code.

Going Forward

Keeping a strong pragmatic attitude in mind:

  • praise the unexpected
    • we would like to give higher priority to unexpected findings that may have an overall positive impact;
    • specifically, we will invest more time in the improvement of the semi-structured dataset, which may cater for a huge amount of unsourced Wikidata statements.
  • adapt as needed
    • the work package can be modified to suit new tasks, as long as it does not prevent the implementation of planned ones.
  • let people play with the data
    • the first half of the project was devoted to the back-end development. We expect to engage more and more users once we are able to generate data.

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.

Technical steps

  • to take special care in running pilot crowdsourcing experiments;
  • to build a fully crowdsourced training set;
  • to reach a satisfactory performance in the automatic classification;
  • to find a suitable mapping to Wikidata for the final datasets.


  • Presentations and networking at the scheduled events are crucial to engage data donors from third-party Open Data organizations;
  • during the development of StrepHit, the team gets to contribute to external software via standard social coding practices. These are tremendous opportunities that may have a great impact: for instance, we have submitted a pull request to the popular Python documentation generator Sphinx, in order to support the MediaWiki syntax.


We are definitely considering a renewal of this IEG to extend StrepHit capabilities to widespread languages other than English (the only language supported by the current implementation). The project leader has both the linguistic and the NLP skills needed to plan the implementation for Spanish, French, and Italian.

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

  • The IEG/IdeaLab hangout sessions during the project proposal period were really useful and motivating;
  • the Wikimedian community seems to have a silent minority, rather than a silent majority: when asking for feedback, we always received constructive answers;
  • monthly reports are a nice way to keep fine-grained track of the progress;
  • iterative planning is essential to face everyday technical issues (mostly with code libraries and Web service downtime);
  • the team should always take into account completely unexpected changes.