Grants:IEG/StrepHit: Wikidata Statements Validation via References/Midpoint

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.

Summary

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

  • Planned achievements, as per the project goals and timeline:
    1. Web Sources Development Corpus: 1.6 M items, 500 k documents (biographies), 53 reliable sources;
    2. Candidate Relations Set: 50 relations;
    3. Primary Sources Tool: increased development activity [1], 2 merged pull requests [2], [3].
  • Bonus achievements, beyond the goals:
    1. Web Sources Corpus: +300 k (+150%) documents, +13 sources, compared to the expected size of the development corpus (cf. Work Package T1);
    2. Semi-structured Development Dataset: 100 k Wikidata statements.
  • Codebase: 6.7 k lines of Python code, 311 commits, 10 open issues, 13 closed issues.

Methods and activities

Whiteboard with Yellow Stickers to Manage Consolidated and New Ideas Respectively

How have you setup your project, and what work has been completed so far?

Describe how you've setup your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

Technical Setup

  • We requested the credentials and created a GitHub repository within the Wikidata organization [4];
  • the official documentation page is hosted at mediawiki.org [5] (work in progress);
  • besides the planned work package, special development efforts have been devoted to:
    • a modular architecture;
    • parallel processing;
    • caching;
    • usability of StrepHit both as a library and as a set of command-line tools;
    • an easy-to-use command line to run all the pipeline steps;
    • a flexible logging facility.
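As an illustration of how caching and parallel processing can be combined in such a pipeline, here is a minimal, self-contained sketch; the names (`cached`, `process_document`, `run_pipeline`) and the cache location are hypothetical and do not reproduce StrepHit's actual code:

```python
import hashlib
import json
import multiprocessing
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical cache location for this sketch
CACHE_DIR = Path(tempfile.gettempdir()) / "strephit_cache_demo"

def cached(func):
    """Memoize results on disk, keyed by a hash of the arguments."""
    @wraps(func)
    def wrapper(*args):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        key = hashlib.sha1(json.dumps(args, sort_keys=True).encode()).hexdigest()
        entry = CACHE_DIR / key
        if entry.exists():
            return json.loads(entry.read_text())
        result = func(*args)
        entry.write_text(json.dumps(result))
        return result
    return wrapper

@cached
def process_document(text):
    """Stand-in for a single pipeline step over one document."""
    return {"length": len(text), "sentences": text.count(".")}

def run_pipeline(documents, processes=4):
    """Run the step over many documents in parallel worker processes."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(process_document, documents)
```

The same function can then be called directly (library use) or wired to a command-line entry point, mirroring the dual library/CLI design mentioned above.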

Project Management

  • Monday face-to-face meetings for brainstorming ideas and weekly planning;
  • daily scrums, especially for unexpected technical issues, but also for brainstorming;
  • whiteboard for crystallized ideas;
  • yellow stickers on the whiteboard for ideas to be investigated;
  • regular interaction with relevant mailing lists and key people to discuss potential impacts and to gather suggestions;
  • project dissemination in the form of seminars and talks.

Research Activities

As a further outreach point for research communities, we have submitted a full article to the Semantic Web Journal [6], one of the top journals worldwide in the Information Systems field [7]. The whole process is known to be time-consuming: so far, we have uploaded a first version [8], focusing on past efforts carried out with DBpedia, which has passed the first round of reviews. We are currently working on a major revision that will include more details concerning StrepHit.

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

From a technical perspective, the project has so far delivered software and content. Specifically, with respect to software, the following modules have reached a mature state:

  • Web Sources Corpus [9], i.e., a set of Web spiders that harvest data from the selected biographical authoritative sources [10];
  • Corpus Analysis [11], i.e., a set of scripts to process the corpus and to generate a ranking of the Candidate Relations;
  • Commons [12], i.e., several facilities to ensure a scalable and reusable codebase. On the general-purpose side, these include parallel processing, fine-grained logging, and caching. On the Natural Language Processing (NLP) side, special attention is paid to fostering future multilingual implementations, thanks to the modularity of the NLP components, such as tokenization [13], sentence splitting [14], and part-of-speech tagging [15].
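A minimal sketch of how modular, per-language NLP components can enable future multilingual implementations; the registry and classes below are illustrative stand-ins, not StrepHit's actual interfaces:

```python
import re

class EnglishTokenizer:
    """Naive whitespace/punctuation tokenizer (illustrative only)."""
    def tokenize(self, sentence):
        return re.findall(r"\w+|[^\w\s]", sentence)

class EnglishSentenceSplitter:
    """Naive splitter on sentence-final punctuation (illustrative only)."""
    def split(self, text):
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Adding a new language would mean registering new component classes here
COMPONENTS = {
    "en": {"tokenizer": EnglishTokenizer, "splitter": EnglishSentenceSplitter},
}

def analyze(text, language="en"):
    """Split a document into sentences, then tokenize each sentence."""
    bundle = COMPONENTS[language]
    splitter = bundle["splitter"]()
    tokenizer = bundle["tokenizer"]()
    return [tokenizer.tokenize(s) for s in splitter.split(text)]
```

The point of the registry is that the pipeline code calls `analyze` without hard-coding any language-specific logic.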

The following modules have been started and are in active development:

  • Extraction [16], i.e., the logic needed to extract different sets of sentences, to be used for training and testing the classifier, as well as for the actual production of Wikidata content;
  • Annotation [17], i.e., a set of scripts to interact with the CrowdFlower crowdsourcing platform APIs, in order to create and post annotation jobs and to pull results.
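As a rough illustration of the extraction step, the sketch below keeps sentences mentioning candidate verbs and splits them for classifier training and testing; the function names and the sampling strategy are hypothetical, not StrepHit's actual logic:

```python
import random

def extract_sentences(sentences, verbs):
    """Keep only sentences mentioning one of the candidate verbs.

    Naive surface matching; a real implementation would match lemmas
    (e.g. 'bear' matching 'born') via the NLP components.
    """
    verbs = {v.lower() for v in verbs}
    return [s for s in sentences
            if verbs & {w.strip(".,").lower() for w in s.split()}]

def train_test_split(items, train_ratio=0.8, seed=42):
    """Shuffle and split items for classifier training and evaluation."""
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]
```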

Content outcomes are presented in the next sections.

Milestone #1: Web Sources Development Corpus

Download the resource.

Distribution of StrepHit IEG Web Sources Corpus Biographies according to the length in characters
Distribution of StrepHit IEG Web Sources Corpus Items with Biographies and without Biographies
Pie Chart of StrepHit IEG Web Sources Corpus Items across Source Domains
Pie Chart of StrepHit IEG Web Sources Corpus Biographies across Source Domains

Items & Biographies across Web Domains

Source domain # items # biographies
www.genealogics.org 447,045 10,621
www.metal-archives.com 355,784 7,988
rkd.nl 206,993 0
vocab.getty.edu 199,502 199,496
collection.britishmuseum.org 118,883 101,117
en.wikisource.org 60,403 60,355
www.nndb.com 40,331 40,331
www.bbc.co.uk 38,018 1,321
www.catholic-hierarchy.org 37,313 0
www.daao.org.au 19,696 9,848
adb.anu.edu.au 19,086 19,086
gameo.org 13,858 13,850
www.uni-stuttgart.de 10,679 0
archive.org 8,721 8,719
cesar.org.uk 7,044 0
munksroll.rcplondon.ac.uk 6,959 6,921
sculpture.gla.ac.uk 6,378 5,631
structurae.net 6,340 0
yba.llgc.org.uk 4,470 4,470
www.wga.hu 3,952 3,927
collection.cooperhewitt.org 3,407 3,407
dictionaryofarthistorians.org 2,442 2,259
www.newulsterbiography.co.uk 2,060 2,060
royalsociety.org 1,596 1,580
www.parliament.uk 650 0
www.museothyssen.org 627 585
www.brown.edu 601 601
www.academia-net.org 525 0
Total 1,623,381 504,189

Items & Biographies Wikisource Breakdown

Source # items # biographies
DNB 28,001 27,997
Catholic Encyclopedia 11,466 11,462
Naval Bio 4,692 4,688
Indian Bio 2,440 2,427
American Bio 2,209 2,207
National Bio 1912 1,631 1,631
Australasian Bio 1,590 1,590
Irish Officers 1,530 1,524
Bio English Lit 1,346 1,340
Men at the Bar 1,115 1,115
National Bio 1901 1,033 1,033
Christian Bio 921 921
Musicians 702 702
Freethinkers 546 546
Men of Time 432 431
Chinese Bio 245 245
English Artists 223 223
Medical Bio 109 109
Portraits and Sketches 50 50
Who is who in China 47 47
Greek Roman bio Myth 37 37
Modern English Bio 11 11
Who is who America 10 10
Total 60,403 60,355

Milestone #2: Candidate Relations Set

The ranking is composed of verbs discovered via the corpus analysis module. Each verb will trigger a set of Wikidata properties, depending on its number of Frame Elements (FEs) (cf. the set above).

Currently, a total of 173 distinct FEs has been extracted. The final number of Wikidata properties will depend on a mapping, planned as per Work Package T8.1. We have already implemented a straightforward automatic mapping facility based on string matching.
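A naive version of such a string-matching mapping might look as follows; the FE labels here are invented for illustration, while the property IDs and labels are real Wikidata ones:

```python
# A few real Wikidata property labels, used as the matching target
PROPERTY_LABELS = {
    "P569": "date of birth",
    "P570": "date of death",
    "P106": "occupation",
    "P19": "place of birth",
}

def map_fe_to_property(fe_label):
    """Map a Frame Element label to a Wikidata property via substring matching.

    Returns the property ID, or None when no label matches.
    """
    fe = fe_label.lower().replace("_", " ")
    for pid, label in PROPERTY_LABELS.items():
        if fe in label or label in fe:
            return pid
    return None
```

FEs without a textual match would fall through to the manual mapping planned in Work Package T8.1.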

Ranking

  1. bear
  2. issue
  3. work
  4. print
  5. play
  6. live
  7. include
  8. exhibit
  9. write
  10. paint
  11. serve
  12. study
  13. appoint
  14. return
  15. go
  16. name
  17. appear
  18. call
  19. leave
  20. draw
  21. lead
  22. record
  23. move
  24. found
  25. join
  26. begin
  27. teach
  28. elect
  29. remain
  30. succeed
  31. produce
  32. act
  33. enter
  34. establish
  35. add
  36. create
  37. continue
  38. travel
  39. win
  40. visit
  41. form
  42. send
  43. command
  44. bring
  45. attend
  46. retire
  47. promote
  48. meet
  49. kill
  50. employ

Bonus Milestone: Semi-structured Development Dataset

Download the resource.

During the corpus collection phase, we were asked (thanks Spinster!) to include sources with semi-structured data (cf. the list of selected sources), typically names and dates.

The result is a dataset that caters for a set of Wikidata properties, as illustrated by the sample statements below.

Sample Statements

The machine-readable statements are expressed in the QuickStatements syntax [18].

Correct Examples

  • Machine: Q389547 P570 +00000001837-01-01T00:00:00Z/9 S854 "http://www.bbc.co.uk/arts/yourpaintings/artists/hodges-charles-howard-17641837"
    Human: According to BBC Your Paintings, Charles Howard Hodges died in 1837
  • Machine: Q17355708 P1477 "emma nicol" S854 "https://en.wikisource.org/wiki/Nicol,_Emma_(DNB00)"
    Human: According to the Dictionary of National Biography, Emma Nicol's birth name is "emma nicol"
  • Machine: Q594729 P21 Q6581097 S854 "http://vocab.getty.edu/ulan/500110819"
    Human: According to the Union List of Artist Names, Anton Teichlein is a male
  • Machine: Q215502 P742 "Morgan, Henry" S854 "http://collection.britishmuseum.org/id/person-institution/156902"
    Human: According to the British Museum, Henry Morgan's pseudonym is "Morgan, Henry"
  • Machine: Q1562861 P569 +00000001939-08-21T00:00:00Z/11 S854 "http://www.nndb.com/people/103/000024031/"
    Human: According to the Notable Names Database, Clarence Williams III was born in 1939
Questionable Examples

  • Machine: Q3770981 P1477 "giusepe melani" S854 "http://vocab.getty.edu/ulan/500051662"
    Human: According to the Union List of Artist Names, Giuseppe Melani's birth name is "giusepe melani"
    Comment: the source is wrong (typo?)
  • Machine: Q598060 P742 "Martyr Vermigli, Peter" S854 "http://collection.britishmuseum.org/id/person-institution/112005"
    Human: According to the British Museum, Peter Martyr Vermigli's pseudonym is "Martyr Vermigli, Peter"
    Comment: debatable source assertion and Wikidata property label
  • Machine: Q57297 P742 "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696" S854 "http://www.uni-stuttgart.de/hi/gnt/dsi2/index.php?table_name=dsi&function=details&where_field=id&where_value=5752"
    Human: According to the Database of Scientific Illustrators, Wilhelm Tempel's pseudonym is "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696"
    Comment: wrong parsing of the source data
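The machine-readable statements above follow the TAB-separated QuickStatements (v1) syntax: item, property, value, optionally followed by S854 and a quoted reference URL. A minimal serialization helper, hypothetical but consistent with the examples above, could be:

```python
def to_quickstatement(item, prop, value, source_url=None):
    """Serialize a claim in QuickStatements (v1) TAB-separated syntax.

    S854 (reference URL) is appended only when a source is given.
    """
    fields = [item, prop, value]
    if source_url:
        fields += ["S854", '"%s"' % source_url]
    return "\t".join(fields)
```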

References Statistics

Domain # references
adb.anu.edu.au 6,262
collection.britishmuseum.org 17,456
gameo.org 238
munksroll.rcplondon.ac.uk 418
archive.org 1,166
collection.cooperhewitt.org 366
sculpture.gla.ac.uk 247
dictionaryofarthistorians.org 103
en.wikisource.org 5,923
rkd.nl 2,416
structurae.net 254
viaf.org 387
vocab.getty.edu 33,452
www.bbc.co.uk 9,847
www.museothyssen.org 240
www.newulsterbiography.co.uk 501
www.nndb.com 17,296
www.uni-stuttgart.de 2,465
www.wga.hu 1,577
yba.llgc.org.uk 39
Total 100,266

Community Outreach

Done

  • Kick-off seminar at FBK, Trento, Italy
  • Talk at Wikipedia's 15th anniversary, co-located with the event Web 3.0, il potenziale del web semantico e dei dati strutturati, Lugano, Switzerland
  • Semantic Web Journal submission

Planned

Finances

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far?

Yes. Compared to the initial plan, the only variance (below 10% of the overall budget) concerns the NLP developer's starting date, which was expected to be the beginning of the project (11 January 2016) but was actually 1 February 2016. Consequently, the expense difference moves to the project leader budget item. The Finances page is updated accordingly: items are converted to USD and rounded to fit the total budget.

Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

The dissemination budget item may be lower than planned, due to the relatively low cost of the scheduled events. We expect the variance to be negligible: if needed, we will eventually adjust the item and feed the remainder into the training set creation or project leader items.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

Challenges

Almost every challenge is technical, and most of them stem from NLP. In order of decreasing impact:

  • input corpus
    • a relatively big input corpus from several sources introduces the need to cope with high language variability;
    • certain documents are written in old English, others stem from the OCR output of paper scans, etc.
  • target lexical database
    • it is unlikely that FrameNet is a perfect fit for the data we aim at generating;
    • this especially applies to the crowdsourcing part, since labels and definitions are minted by expert linguists but presented to non-expert laymen.
  • primary sources tool
    • contributing to the maintenance of a third-party resource with generally low development activity can be time-consuming;
    • it entails various tasks, from understanding possibly undocumented source code, to nudging the maintainers to address issues, all the way to accessing the machine that hosts the tool.
  • scalability
    • it should always be taken into account when writing code.

Going Forward

Keeping a strong pragmatic attitude in mind:

  • praise the unexpected
    • we would like to give higher priority to unexpected findings that may have an overall positive impact;
    • specifically, we will invest more time in improving the semi-structured dataset, which may cater for a huge amount of unsourced Wikidata statements.
  • adapt as needed
    • the work package can be modified to suit new tasks, as long as it does not prevent the implementation of planned ones.
  • let people play with the data
    • the first half of the project was devoted to back-end development. We expect to engage more and more users once we are able to generate data.

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.

Technical steps

  • to take special care in running pilot crowdsourcing experiments;
  • to build a fully crowdsourced training set;
  • to reach a satisfactory performance in the automatic classification;
  • to find a suitable mapping to Wikidata for the final datasets.

Opportunities

  • Presentations and networking at the scheduled events are crucial to engage data donors from third-party Open Data organizations;
  • during the development of StrepHit, the team gets to contribute to external software via standard social coding practices. These are tremendous opportunities that may have a great impact: for instance, we have submitted a pull request to the popular Python documentation generator Sphinx, in order to support the MediaWiki syntax.

Renewal

We are definitely considering a renewal of this IEG to extend StrepHit's capabilities towards widespread languages other than English (the language of the current implementation). The project leader has both the linguistic and NLP skills needed to foresee implementations for Spanish, French, and Italian.

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

  • The IEG/IdeaLab hangout sessions during the project proposal period were really useful and motivating;
  • the Wikimedian community seems to have a silent minority rather than a silent majority: whenever we asked for feedback, we received constructive answers;
  • monthly reports are a nice way to keep fine-grained track of the progress;
  • iterative planning is essential to face everyday technical issues (mostly code library problems and Web service downtime);
  • the team should always take into account completely unexpected changes.