Grants:IEG/StrepHit: Wikidata Statements Validation via References/Timeline

From Meta, a Wikimedia project coordination wiki

This project is funded by an Individual Engagement Grant.

This Individual Engagement Grant is renewed.


Timeline for StrepHit: Wikidata Statements Validation via References

Timeline | Date
Development Corpus | April 11, 2016
Candidate Relations Set | April 11, 2016
StrepHit Pipeline Beta | June 11, 2016
Production Corpus | July 11, 2016
Web Sources Knowledge Base | July 11, 2016


Overview

Monthly updates

Each update covers a 1-month time span, starting from the 11th day of the month. For instance, January 2016 means January 11th to February 11th, 2016.

January 2016

Dissemination activities

  • Jan 15: Kick-off seminar at FBK, Trento, Italy
  • Jan 20: Talk at the event Web 3.0, il potenziale del web semantico e dei dati strutturati, Lugano, Switzerland

Sources identification

We identified 3 candidate domains that may serve as good use cases for the project:

  • Biographies
  • Companies
  • Biomedical literature

Domain | Reasons
Biographies
  • plenty of existing data
  • broad coverage
  • potentially easy to find valuable primary sources
  • perfect fit for the current prototype
Companies
  • relatively biased domain
  • ad-prone content
  • companies edit the pages about themselves
  • low-quality data
Biomedical literature
  • great primary source, i.e., PubMed
  • proof of usage for an Open Access corpus
  • complex implementation

We have gathered feedback from different communities (Wikimedians following our seminars, GLAM), which by far prefer the biographical domain. Hence, we have selected it and harvested the list of primary sources.

Biographies

The mix'n'match tool maintains a list of biographical catalogues, which can serve as candidate reliable sources. The table below displays an informal analysis of the catalogues available in English:

Source | URL | Comments | Candidate?
Oxford Dictionary of National Biography | [1] | subscription needed to access the full text; full version available on Wikisource | maybe
Dictionary of Welsh Biography | [2] | | Support
Dictionary of art historians | [3] | | Support
The Royal Society | [4] | few entries (< 1,600) | Support
Members of the European Parliament | [5] | almost no raw text | Oppose
Thyssen-Bornemisza museum | [6] | | Support
BBC your paintings | [7] | | Support
National Portrait Gallery | [8] | almost no raw text | Oppose
Stanford Encyclopedia of Philosophy | [9] | not only biographies (encyclopedia) | Support
A Cambridge Alumni Database | [10] | lots of abbreviations | Oppose
Australian dictionary of biographies | [11] | | Support
General Division of the Order of Australia | [12] | | Semi-structured
The Union List of Artist Names | [13] | Structured data (RDF) with public endpoint | Structured
AcademiaNet | [14] | | Semi-structured
Appletons' Cyclopædia of American Biography | [15] | from Wikisource | Support
Artsy | [16] | commercial web site, as suggested by Spinster | Oppose
British Museum | [17] | Structured data (RDF) with public endpoint | Structured
Bénézit Dictionary of Artists | [18] | subscription needed | maybe
French theatre of the seventeenth and eighteenth centuries | [19] | few data | Semi-structured
Catholic Hierarchy Bishops | [20] | microdata | Semi-structured
China Vitae | [21] | lots of items have no actual biography | Support
Cultural Objects Name Authority | [22] | | Semi-structured
Cooper Hewitt | [23] | API available (seems not to return all the text that appears in a person's page though) | Support
Design&Art Australia Online | [24] | must browse to the biography tab for full text | Support
Database of Scientific Illustrators | [25] | | Semi-structured
The Dictionary of Ulster Biography | [26] | | Support
Encyclopedia Brunoniana | [27] | not only biographies (encyclopedia) | Support
Early Modern Letters Online | [28] | no biographies | Oppose
Global Anabaptist Mennonite Encyclopedia Online | [29] | third-party wiki | Support
Genealogics person ID | [30] | secondary source resulting from personal research | Oppose; semi-structured
The Hermitage - Authors | [31] | no biographies | Oppose
LoC artists | [32] | no biographies | Oppose
MOMA | [33] | no biographies | Oppose
MSBI | [34] | short utterances (may be hard to parse) | Semi-structured
MUNKSROLL | [35] | | Support
Metallum bands | [36] | | Support
National Gallery of Art | [37] | no biographies | Oppose
National Gallery of Victoria | [38] | no biographies | Oppose
National Library of Ireland | [39] | no biographies | Oppose
Notable Names Database | [40] | | Support
Belgian people and things | [41] | no biographies | Oppose
Open Library | [42] | no biographies | Oppose
ORCID | [43] | biographies may be missing | maybe
OpenPlaques | [44] | no biographies | Oppose
Project Gutenberg | [45] | no biographies | Oppose
National Library of Australia | [46] | actually links to other catalogues | Oppose
Smithsonian American Art Museum | [47] | no biographies | Oppose
Structurae persons | [48] | | Semi-structured
Theatricalia | [49] | no biographies | Oppose
Web Gallery of Art | [50] | lots of embedded frames in pages; commercial web site, as suggested by Spinster | Oppose
Parliament UK | [51] | | Semi-structured
Catholic Encyclopedia (1913) | [52] | Wikisource; not only biographies (encyclopedia) | Support
Baker's Biographical Dictionary of Musicians | [53] | full raw text available | Support
RKDartists | [54] | suggested by Spinster | Semi-structured

The table below instead shows a list of Wikisource sources, as per wikisource:Category:Biographical_dictionaries and wikisource:Wikisource:WikiProject_Biographical_dictionaries. @Nemo bis: many thanks for suggesting the links.

Source | URL | Comments | Candidate?
Dictionary of National Biography | [55] | | Support
History of Alabama and Dictionary of Alabama Biography | [56] | almost no data (except for Montgomery) | Oppose
American Medical Biographies | [57] | | Support
A Biographical Dictionary of Ancient, Medieval, and Modern Freethinkers | [58] | everything in one page, may be tricky to parse | Support
The Dictionary of Australasian Biography | [59] | | Support
Dictionary of Christian Biography and Literature to the End of the Sixth Century | [60] | | Support
A Dictionary of Artists of the English School | [61] | quite incomplete (only A, F, K); one page per letter, may be tricky to parse | Support
A Short Biographical Dictionary of English Literature | [62] | | Support
Dictionary of Greek and Roman Biography and Mythology | [63] | | Support
The Indian Biographical Dictionary (1915) | [64] | | Support
Modern English Biography | [65] | very little data | Support
Who's Who, 1909 | [66] | 2 persons | maybe
Who Was Who (1897 to 1916) | [67] | almost nothing in Wikisource, but full text available at archive.org | Support
A Dictionary of Music and Musicians | [68] | not only biographies | Support
Men-at-the-Bar | [69] | lots of abbreviations | Support
A Naval Biographical Dictionary | [70] | | Support
Makers of British botany | [71] | few people, very long biographies | maybe
Biographies of Scientific Men | [72] | few people, very long biographies | maybe
A Chinese Biographical Dictionary | [73] | | Support
Who's Who in China (3rd edition) | [74] | | Support
A Compendium of Irish Biography | [75] | no data | Oppose
Chronicle of the law officers of Ireland | [76] | one page per chapter, may be tricky to parse | Support
A biographical dictionary of eminent Scotsmen | [77] | no data | Oppose
Dictionary of National Biography, 1901 supplement | [78] | need to check the intersection with the original one | Support
Dictionary of National Biography, 1912 supplement | [79] | need to check the intersection with the original one | Support
Woman's Who's Who of America, 1914-15 | [80] | very little data | Support
The Biographical Dictionary of America | [81], [82], [83], [84], [85], [86], [87], [88], [89], [90] | almost nothing in Wikisource, but full text available at archive.org | Support
Historical and biographical sketches | [91] | few people, very long biographies | maybe
Cartoon portraits and biographical sketches of men of the day | [92] | | Support
Men of the Time, eleventh edition | [93] | | Support

Relevant Wikidata Properties Statistics

The table below captures 3 different usage signals of Wikidata properties relevant to the biographical domain, namely those reported in the Frequency Ranking, Unsourced Statements % and External Use columns.

This list of properties may serve as a valid starting point for the Candidate Relations Set milestone.

Label | ID | Frequency Ranking | Unsourced Statements % | External Use | Domain | Comments | Lexical Unit | Frame | Range FE | Candidate?
country | P17 | 4th | 48% | yes | places | unsuitable domain | | | | Oppose
sex or gender | P21 | 6th | 70% | yes | persons, animals | use semi-structured scraped data | | | | Support
date of birth | P569 | 9th | 37% | yes | person, organism | use semi-structured scraped data | bear | | | Support
given name | P735 | 10th | 92% | no | person | use semi-structured scraped data | | | | Support
occupation | P106 | 14th | 79% | yes | person | | be (very risky), work | Being_employed | Position | Support
country of citizenship | P27 | 16th | 71% | yes | person, term | | come from, originate | People_by_origin | Origin | Support
date of death | P570 | 21st | 37% | yes | person, organism | use semi-structured scraped data | die | | | Support
place of birth | P19 | 23rd | 18% | yes | person | use semi-structured scraped data | bear | | | Support
official name | P1448 | 26th | almost 0% | yes | all items | tricky one, may use semi-structured scraped data | | Being_named | Name | Support
place of death | P20 | 32nd | 35% | yes | person | use semi-structured scraped data | die | | | Support
educated at | P69 | 44th | 77% | yes | person | | study, educate, train, learn | Education_teaching | Institution | Support
languages spoken or written | P1412 | 55th | 32% | no | person | no FrameNet data, should use a custom frame | speak, write | | | Support
position held | P39 | 60th | 82% | yes | person | the mapping seems reasonable, but conflicts with P106 | work | Being_employed | Position | Support
award received | P166 | 62nd | 81% | yes | person, organization, creative work | | win | Win_prize | Prize | Support
member of political party | P102 | 67th | 60% | yes | people that are politicians | | | | | Support
family name | P734 | 68th | 70% | yes | persons | use semi-structured scraped data | | | | Support
creator | P170 | 73rd | 39% | yes | N.A. | inverse property (person is the range) | create, co-create, develop, establish, found, generate, make, produce, set up, synthesize | Intentionally_create | Created_entity (domain), Creator (range) | Support
author | P50 | 75th | 34% | yes | work | inverse property (person is the range); sub-property of creator for written works; constrain to domain type = Written work | same as creator | same as creator | same as creator | Support
director | P57 | 77th | 18% | yes | work | inverse property (person is the range); seems a sub-property of creator for motion pictures, plays, video games; constrain to domain types = (Movie, Play, VideoGame) | direct, co-direct | Behind_the_scenes | Production (domain), Artist (range) | Support
member of | P463 | 87th | 92% | yes | person | | belong | Membership | Group | Support
participant of | P1344 | 89th | 92% | no | human, group of humans, organization | | engage, participate, take part | Participation | Event | Support
employer | P108 | 97th | 92% | yes | human | | commission, employ | Employing | Employer | Support

February 2016

Week 1

  • The development corpus is already in good shape, with ca. 700,000 items scraped from 50 sources. Ca. 180,000 items contain raw text biographies to feed the NLP pipeline (cf. https://github.com/Wikidata/StrepHit/issues/13#issuecomment-185314081);
  • the corpus analysis module baseline is implemented, and currently yields ca. 20,000 verb lemmas;
  • we are checking whether the relevant Wikidata properties shown above can be triggered by our verb lemmas: if so, they will definitely serve as the first 21 candidate relations. The remaining 29 will be extracted according to the lexicographical and statistical rankings;
  • updating the Wikidata properties table above.

Week 2

After inspecting the set of verb lemmas, we found lots of noise, mainly caused by the default tokenization logic of the POS-tagging library we used. Therefore, we implemented our own tokenizer, to be leveraged by all modules. A second run of the corpus analysis yielded ca. 7,600 verb lemmas: fewer items of much higher quality.
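
As a rough illustration of what a custom tokenizer can look like, the sketch below splits text with a single regular expression. The pattern and the function name `tokenize` are assumptions made for this example; the actual StrepHit tokenizer may follow different rules.

```python
import re

# Hypothetical pattern: a word (optionally with inner apostrophes or hyphens),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:['\-]\w+)*|[^\w\s]", re.UNICODE)


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)
```

Keeping punctuation as separate tokens is one way to prevent fragments such as `1856,` from reaching the POS tagger as a single unit.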

Verb Rankings

The final output of the corpus analysis module consists of 2 rankings, one based on lexicographical evidence (i.e., TF-IDF), and one on statistical evidence (i.e., standard deviation). Cf. https://github.com/Wikidata/StrepHit/issues/5#issuecomment-188293650 for more technical details.

For each ranking, we intersected the top 50 lemmas with FrameNet data: the data can be found at https://github.com/Wikidata/StrepHit/issues/5#issuecomment-189376446
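
The two rankings can be sketched as follows. This is a minimal example under assumptions: each biography is given as a list of verb lemmas, the lexicographical score is the mean TF-IDF across biographies, and the statistical score is the population standard deviation of the per-biography relative frequency. The exact scoring used by StrepHit is described in the linked issue and may differ; `rank_lemmas` is a hypothetical name.

```python
import math
from collections import Counter
from statistics import pstdev


def rank_lemmas(docs):
    """docs: list of lists of verb lemmas, one list per biography.

    Returns (tfidf_ranking, stdev_ranking), each a list of lemmas, best first.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocabulary = set().union(*docs)
    # Document frequency: number of biographies in which each lemma occurs.
    doc_freq = Counter(lemma for c in counts for lemma in c)

    tfidf, spread = {}, {}
    for lemma in vocabulary:
        idf = math.log(n_docs / doc_freq[lemma])
        # Relative frequency of the lemma in each biography (0 when absent).
        tfs = [c[lemma] / len(doc) if doc else 0.0
               for c, doc in zip(counts, docs)]
        tfidf[lemma] = sum(tf * idf for tf in tfs) / n_docs
        spread[lemma] = pstdev(tfs)

    # Ties are broken alphabetically to keep the output deterministic.
    by_tfidf = sorted(vocabulary, key=lambda l: (-tfidf[l], l))
    by_stdev = sorted(vocabulary, key=lambda l: (-spread[l], l))
    return by_tfidf, by_stdev
```

A lemma that appears in every biography gets an IDF of zero and therefore sinks to the bottom of the TF-IDF ranking, while it can still score high on standard deviation if its frequency varies a lot between biographies.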

Week 3

  • We observed that our scrapers may also harvest semi-structured data, which can be a very valuable source of statements. Hence, we have been working on a Wikidata dataset: we plan to upload it to the primary sources tool backend instance, hosted at the Wikimedia Tool Labs, and announce it to the community;
  • the entity linking facility is implemented, and currently supports the Dandelion Entity Extraction API;
  • the crowdsourcing annotation module is underway, and currently interacts with the CrowdFlower API for posting annotation jobs and pulling results.

Week 4

  • started working on the extraction of sentences from the corpus: they will get sampled and serve as seeds for the training and test sets, as well as for the actual classification;
  • brainstorming session to come up with three extraction strategies:
    • n2n (default), i.e., many sentences per many LUs. This entails that the same sentence is likely to be extracted multiple times;
    • 121, i.e., one sentence per LU. This entails that a single sentence will be extracted only once;
    • syntactic (to be implemented), i.e., extraction based on dependency parsing. We argue that this may be useful to split long complex sentences;
  • we have contacted the primary sources tool maintainers, and asked them to grant us either access to the specific machine at Tool Labs, or a token for the /import service (undocumented in the tool homepage, but documented in Google's codebase);
  • currently, we are waiting for their answer, and will upload the semi-structured dataset as soon as we are granted access.
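
The difference between the n2n and 121 strategies can be sketched as below. This is a simplified example, assuming that sentences are plain strings and that a lexical unit (LU) matches when it appears as a whitespace-separated token; the real module presumably works on lemmas. Both function names are hypothetical.

```python
def extract_n2n(sentences, lexical_units):
    """Many-to-many: emit one (sentence, LU) pair for every LU found in a sentence."""
    pairs = []
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        for lu in lexical_units:
            if lu in tokens:
                pairs.append((sentence, lu))
    return pairs


def extract_121(sentences, lexical_units):
    """One-to-one: each LU gets at most one sentence, and each sentence is used once."""
    pairs, used = [], set()
    for lu in lexical_units:
        for index, sentence in enumerate(sentences):
            if index not in used and lu in sentence.lower().split():
                pairs.append((sentence, lu))
                used.add(index)
                break
    return pairs
```

With two sentences such as "he was born in rome and died in paris" and "she died in london", and the LUs "born" and "died", n2n yields three pairs (the first sentence appears twice), while 121 yields two pairs with no repeated sentence.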

March 2016

Week 1

  • testing the sentence extraction module;
  • experimenting with extraction strategies:
    • basic ones (n2n, 121) are noisy;
    • the syntactic one is computationally intensive.
  • computing corpus statistics: sources, biography distribution;
  • working on the semi-structured dataset:
    • resolving honorifics more reliably.
  • caching facilities: general-purpose and entity linking caching.
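
A general-purpose cache of the kind mentioned above could look like the following sketch. It is an assumption-laden example, not the StrepHit implementation: results are stored as JSON files in a directory, keyed by a hash of the function name and its arguments, and `link_entities` is a stand-in for an expensive entity linking call.

```python
import functools
import hashlib
import json
import os
import tempfile


def cached(cache_dir):
    """Decorator that stores JSON-serializable results on disk, one file per call signature."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            raw_key = json.dumps([func.__name__, list(args)])
            key = hashlib.sha1(raw_key.encode("utf-8")).hexdigest()
            path = os.path.join(cache_dir, key + ".json")
            if os.path.exists(path):
                with open(path) as cache_file:
                    return json.load(cache_file)
            result = func(*args)
            with open(path, "w") as cache_file:
                json.dump(result, cache_file)
            return result
        return wrapper
    return decorator


# Demo: the second identical call is served from the cache.
calls = []


@cached(tempfile.mkdtemp())
def link_entities(text):
    calls.append(text)  # records every real (non-cached) invocation
    return {"text": text, "entities": []}


link_entities("Nikola Tesla was born in Smiljan.")
link_entities("Nikola Tesla was born in Smiljan.")
```

Caching remote calls this way avoids paying twice for the same request when the pipeline is re-run on the same corpus.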

Week 2

Week 3

  • second pull request to the primary sources tool codebase: https://github.com/google/primarysources/pull/87
  • first crowdsourcing job pilots:
    • input data created with the 3 extraction strategies;
    • failed: workers get confused by (a) overly long sentences, and (b) difficult labels returned by FrameNet.
  • working on the integration of the crowdsourcing platform CrowdFlower:
    • dynamic generation of the worker interface based on input data;
    • automate the flow via the API;
    • interacting with the CrowdFlower help desk to fix issues in the API.
  • working on the scientific article revision;
  • investigating the relevance of the FrameNet frame repository:
    • FEs will be annotated if they can be mapped to Wikidata properties, otherwise they are useless.
  • implementing a simple FE to Wikidata properties matcher:
    • functions to retrieve Wikidata property IDs, full entity metadata, and labels and aliases only;
    • extract FEs only if they map to Wikidata properties via exact matching of labels and aliases.
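
The exact matching of frame elements (FEs) against property labels and aliases can be sketched as follows. The metadata below is toy data written for this example (a real run would retrieve labels and aliases from Wikidata), and the function names are hypothetical.

```python
def build_label_index(properties):
    """Map every lowercased label and alias to its property ID."""
    index = {}
    for pid, meta in properties.items():
        for name in [meta["label"], *meta.get("aliases", [])]:
            index[name.lower()] = pid
    return index


def match_fe(fe_name, index):
    """Return the property ID whose label or alias exactly matches the FE name, else None."""
    normalized = fe_name.replace("_", " ").lower()
    return index.get(normalized)


# Toy metadata for illustration only; the aliases are not guaranteed to match Wikidata.
PROPERTIES = {
    "P19": {"label": "place of birth", "aliases": ["birthplace", "born in"]},
    "P108": {"label": "employer", "aliases": ["worked for"]},
}
INDEX = build_label_index(PROPERTIES)
```

Here `match_fe("Employer", INDEX)` resolves to `P108`, whereas an FE such as `Position` has no exact match and would not be extracted under this strategy.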

Week 4

  • working on the simplification of the annotation jobs:
    • filter out numerical FEs, which should instead be classified directly with a rule-based strategy;
    • let extra FEs be annotated too, not only core ones;
    • FEs with no mapping to Wikidata should not be skipped; rather, their labels should be made understandable.
  • testing and documenting parallel processing facilities;
  • new sentence extraction strategy: grammar-based.

April 2016

Week 1

  • midpoint report written;
  • working on the technical documentation;
  • corpus stats: length distribution of biographies;
  • working on the scientific article revision.

Week 2

  • almost full-time work on the scientific article revision;
  • dissemination: participating in the HackAtoka hackathon at SpazioDati, Trento:
    • implemented a rule-based classifier for the companies domain.

Week 3

Week 4

May 2016

Week 1

Week 2

  • bug fixing based on the SOD hackathon feedback (cf. #Week_4_3):
    • DSI and rkd.nl sources scraping;
    • changed wrong Wikidata property mapping with possibly high impact;
  • work on the normalization of numerical expressions (typically dates):
    • regular expressions to capture them;
    • transformation rules to fit the Wikidata data types;
    • tests;
  • first version of the supervised classifier.
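
The date normalization step (regular expressions plus transformation rules) can be illustrated with the sketch below. It assumes English dates written as "10 July 1856", "July 1856" or "1856", and emits QuickStatements-style time strings, where the trailing number is the precision (9 = year, 10 = month, 11 = day). The pattern and the function name are assumptions for this example; the rules actually used by StrepHit are likely broader.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
MONTH_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}

# Optional day, optional month name, mandatory 4-digit year.
DATE_RE = re.compile(
    r"\b(?:(\d{1,2})\s+)?(?:(" + "|".join(MONTHS) + r")\s+)?(\d{4})\b",
    re.IGNORECASE,
)


def normalize_date(text):
    """Return a time string for the first date found in text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    if month is None:
        # A day without a month is not meaningful: fall back to year precision.
        return "+%s-00-00T00:00:00Z/9" % year
    month_number = MONTH_NUMBER[month.lower()]
    if day is None:
        return "+%s-%02d-00T00:00:00Z/10" % (year, month_number)
    return "+%s-%02d-%02dT00:00:00Z/11" % (year, month_number, int(day))
```

The sketch does not validate day ranges or handle other date layouts, which a production rule set would need to cover.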

Week 3

Week 4

  • major unplanned outcome: entities that could not be resolved to Wikidata IDs during dataset serialization may serve as new Wikidata Items. Action: the final list of unresolved entities will be proposed to the community;
  • entities that are places should not undergo annotation, but rather be directly classified;
  • lexical database improvements:
    • FE-to-Wikidata property mappings;
    • marked FEs that should become the subjects of the output statement;
  • plug a gazetteer as an extra set of features for the supervised classifier;
  • resolving countries of citizenship from nationalities;
  • prepare the StrepHit pipeline 1.0 beta release.
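
Resolving a country of citizenship (P27) from a nationality can be sketched as a simple lookup. The table below is a tiny hand-written sample for illustration; how StrepHit builds or stores its mapping is not stated here, and the function name is hypothetical.

```python
# Small illustrative sample: nationality adjective -> Wikidata country item.
NATIONALITY_TO_COUNTRY = {
    "italian": "Q38",    # Italy
    "french": "Q142",    # France
    "german": "Q183",    # Germany
    "british": "Q145",   # United Kingdom
    "american": "Q30",   # United States of America
}


def citizenship_statement(subject_qid, nationality):
    """Return a (subject, P27, country) triple, or None if the nationality is unknown."""
    country = NATIONALITY_TO_COUNTRY.get(nationality.strip().lower())
    if country is None:
        return None
    return (subject_qid, "P27", country)
```

Unknown nationalities return `None` rather than a guess, so they can be logged and reviewed instead of producing wrong statements.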

June 2016

Week 1

  • StrepHit pipeline version 1.0 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.0-beta
  • the ULAN scraper now harvests URLs to human-readable resources;
  • support for multiple subjects in the dataset serialization;
  • improvements to the Sphinx Wikitext documentation extension;
  • automatic model selection to pick the best model for the supervised classifiers.

Week 2

Week 3

  • Parametrizable script for supervised training;
  • stopwords should not be features;
  • normalization of names for better entity resolution;
  • optional feature reduction facility;
  • major change in entity resolution: resolve Wikidata QIDs by looking up the URIs of linked entities;
  • skipping non-linked chunks in feature extraction;
  • optional K-fold validation in training script;
  • handle qualifiers at dataset serialization.
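
Serializing a statement together with its qualifiers and a reference can be sketched as below, assuming a QuickStatements-style tab-separated line in which qualifiers follow the main value as property/value pairs and the reference URL is given with the source property `S854`. The function name and the sample values are made up for this example.

```python
def serialize_statement(subject, prop, value, qualifiers=(), reference_url=None):
    """Build one tab-separated, QuickStatements-style line."""
    fields = [subject, prop, value]
    for qualifier_prop, qualifier_value in qualifiers:
        fields += [qualifier_prop, qualifier_value]
    if reference_url:
        # S854 is the source form of P854 (reference URL); strings are double-quoted.
        fields += ["S854", '"%s"' % reference_url]
    return "\t".join(fields)


# Illustrative values only: an "educated at" statement with an end time qualifier.
line = serialize_statement(
    "Q42", "P69", "Q691283",
    qualifiers=[("P582", "+1974-00-00T00:00:00Z/9")],
    reference_url="http://example.org/biography",
)
```

Keeping the reference URL on the same line as the statement is what lets each extracted claim be validated against its source.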

Week 4

  • Do not serialize frame elements that have a wrong class;
  • performance evaluation of supervised classifiers:
    • 10-fold cross validation;
    • comparison with dummy classifiers;
    • accuracy values against a gold standard of 249 fully annotated sentences;
  • correctly handling places;
  • StrepHit pipeline version 1.1 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.1-beta
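
The evaluation setup listed above (k-fold cross validation and comparison with a dummy classifier) can be sketched in plain Python. This is a minimal example under assumptions: folds are contiguous and unshuffled, and the dummy classifier always predicts the most frequent label of the training split. The function names are hypothetical, and the actual evaluation may rely on a machine learning library instead.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def majority_baseline_accuracy(labels, k=10):
    """Mean accuracy of a dummy classifier that predicts the majority training label.

    Assumes len(labels) >= k, so that no fold is empty.
    """
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

With 249 items and 10 folds, this split produces nine folds of 25 items and one of 24. A supervised classifier is only worth keeping if its cross-validated accuracy clearly exceeds the baseline returned by `majority_baseline_accuracy`.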