Grants:Project/Hjfocs/soweego/Timeline

This project is funded by a Project Grant

Timeline for soweego

Timeline	Date
Target databases selection	September 2018
Link validator	October 2018
Link merger	February 2019
Target databases linkers	July 2019
Identifiers datasets	July 2019
Software package	July 2019

Overview

Project start date: July 9, 2018
Workboard: https://github.com/Wikidata/soweego/projects/1
Codebase: https://github.com/Wikidata/soweego

Monthly updates

Each update will cover a 1-month time span, starting from the 9th day of the current month. For instance, July 2018 means July 9th to August 8th 2018.

July 2018: target selection & small fishes

       TL;DR:

Mix'n'match is the tool for small fishes. soweego will not handle them.

The very first task of this project is to select the target databases.^[1] We see two directions here: either we focus on a few big and well-known targets as per the project proposal, or we try to find a technique to link a lot of small ones from the long tail, as suggested by ChristianKl^[2] (thanks for the precious feedback!).

We used SQID as a starting point to get a list of people databases that are already used in Wikidata, sorted in descending order of usage.^[3] This is useful to split the candidates into big and small fishes, namely the head and the (long) tail of the result list respectively. Let's start with the small fishes.

Quoting ChristianKl, it would be ideal to create a configurable tool that enables users to add links to new databases in a reasonable timeframe. Consequently, we carried out the following investigation: we considered as small fishes all the entries in SQID with an external ID datatype, used for class human (Q5), and with fewer than 15 uses in statements. We detail below a set of critical issues about this direction, as well as their possible solutions.

The analysis of a small fish can be broken down into a set of steps. This is also useful to translate the process into software and to make each step flexible enough for dealing with the heterogeneity of the long tail targets. The steps have been implemented into a piece of software by MaxFrax96.^[4]

Retrieving the dump

This sounds pretty self-evident: if we aim at linking two databases, then we need access to all their entities. Since we focus on people, it is therefore necessary to download the appropriate dump for each small fish we consider.

Problem
In the real world, such a trivial step raises a first critical issue: not all the database websites give us the chance to download the dump.

Solutions

Cheap, but not scalable: contact the database administrator and discuss dump releases for Wikidata;
expensive, but possibly scalable: build the dump autonomously. If a valid URI exists for each entity, we can re-create the dump. However, this is not trivial to generalize: sometimes it is impossible to retrieve the list of entities, sometimes the URIs are merely HTML pages that require Web scraping. See the following examples:

Welsh Rugby Union men's player ID (P3826) needs scraping for both the list of entities and each entity;
Berlinische Galerie artist ID (P4580) needs scraping for both the list of entities and each entity;
FAI ID (P4556) needs scraping for both the list of entities and each entity;
Debrett's People of Today ID (P2255) does not seem to expose any list of people;
AGORHA event ID (P2345) does not seem to expose any list of people.

Handling the format

The long tail is roughly broken down as follows:

XML;
JSON;
RDF;
HTML pages with styling and whatever a Web page can contain.

Problem
Formats are heterogeneous. We focus on open data and RDF, as dealing with custom APIs is out of scope for this investigation. We also hoped that the open data trend of recent years would help us. However, a manual scan of the small fishes yielded poor results: out of 16 randomly picked candidates, only YCBA agent ID (P4169) was in RDF, and it has thousands of uses in statements at the time of writing this report.

Solution
Define a way (by scripting, for instance) to translate each input format into a standard project-wide one. This could be achieved during the next step, namely ontology mapping between a given small fish and Wikidata.
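A minimal sketch of such a translation step follows. It assumes a hypothetical project-wide record format with `id` and `name` keys; the input field names (`identifier`, `label`, the `person` element) are invented for illustration and do not come from any real small fish.

```python
import json
import xml.etree.ElementTree as ET


def from_json(text):
    """Translate a JSON array of objects into common records."""
    return [
        {"id": str(item["identifier"]), "name": item["label"]}
        for item in json.loads(text)
    ]


def from_xml(text):
    """Translate <person id="..."><name>...</name></person> elements."""
    root = ET.fromstring(text)
    return [
        {"id": person.get("id"), "name": person.findtext("name")}
        for person in root.iter("person")
    ]


# Hypothetical registry: one converter per input format
CONVERTERS = {"json": from_json, "xml": from_xml}


def normalize(text, fmt):
    """Dispatch to the converter registered for the given format."""
    return CONVERTERS[fmt](text)


json_dump = '[{"identifier": 42, "label": "Ada Lovelace"}]'
xml_dump = '<people><person id="42"><name>Ada Lovelace</name></person></people>'
```

With a registry like this, supporting a new small fish would mostly mean writing one more converter, while the rest of the process only sees the common format.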

Mapping to Wikidata

Linking Wikidata items to target entities requires a mapping between both metadata/schemas.

Solution
The mapping can be manually defined by the community: a piece of software will then apply it. To implement this step, we also need the common data format described above.
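As an illustration, a manually defined mapping could be a plain lookup table that software applies to each record in the common format. The sketch below is hypothetical: the target field names (`born`, `died`, `shelf_mark`) are invented, and only the two Wikidata properties for dates of birth and death are used.

```python
# Hypothetical community-defined mapping: target field -> Wikidata property
MAPPING = {
    "born": "P569",  # date of birth
    "died": "P570",  # date of death
}


def apply_mapping(record, mapping):
    """Rename the mapped fields of a target record and drop the others."""
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping and value is not None
    }


record = {"born": "1815-12-10", "died": "1852-11-27", "shelf_mark": "X1"}
mapped = apply_mapping(record, MAPPING)
```

Fields without a mapping are silently dropped here; a real tool might rather report them, so that the community can extend the mapping.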

Side note: available entity metadata
Small fishes may contain entity metadata which are likely to be useful for automatic matching. The entity linking process may dramatically improve if the system is able to mine extra property mappings. This is obvious when metadata are in different languages, but in general we cannot be sure that two different databases hold the same set of properties, or that they have any in common at all.

Conclusion

It is out of scope for the project to perform entity linking over the whole set of small fishes. On the other hand, it may make sense to build a system that lets the community plug in new small fishes with relative ease. Nevertheless, this would require a reshape of the original proposal, which comes with its own risks:

it is probably not a safe investment of resources;
results, if any, would not come in the short term, as they would require a lot of work to create a flexible system for everybody's needs;
further problems that the team has not faced yet are likely to emerge in this phase.

Mix'n'match

Most importantly, a system to plug new small fishes already exists. Mix'n'match^[5] is specifically designed for the task.^[6] Instead of reinventing the wheel, we will join efforts with our advisor Magnus Manske in his work on big fishes.^[7]

August 2018: big fishes selection

       TL;DR:

the soweego team selected 4 candidate targets:

~~BIBSYS (Q4584301). Coverage = 21%~~ discarded, see #September 2018
Discogs (Q504063). Coverage = 33%
Internet Movie Database (Q37312). Coverage = 42%
MusicBrainz (Q14005). Coverage = 35%
X (Q918). Coverage = 31%

the soweego team will join efforts with Magnus Manske's work on large catalogs.

Motivation #1: target investigation

The following table displays the result of our investigation on candidate big fishes. We computed the Wikidata item counts as follows.

Wikidata item count queries on specific classes:
- 4,502,944 humans, [1];
- 495,177 authors, [2];
- 230,962 musicians, [3];
- 74,435 bands, [4];
- 239,137 actors, [5];
- 42,164 directors, [6];
- 13,588 producers, [7];
Wikidata link count queries: use the property for identifiers:
- humans, e.g., 581,397 for LoC, [8];
- authors, e.g., 168,188 for GND, [9];
- musicians, e.g., 77,640 for Discogs, [10];
- bands, e.g., 37,158 for MusicBrainz, [11].

Resource	# entries	Reference to # entries	Dump download URL	Online access (e.g., SPARQL)	# Wikidata items with link / without link	Available metadata	Links to other sources	In mix'n'match	TL;DR:	Candidate?
[12]	7,037,189	[13]	[14]	SRU: [15], OAI-PMH: [16]	humans: 571,357 / 3,931,587; authors: 168,188 / 326,989	id, context, preferredName, surname, forename, describedBy, type, dateOfBirth, dateOfDeath, sameAs	GND, BNF, LoC, VIAF, ISNI, English Wikipedia, Wikidata	Yes (large catalogs)	Already processed by Mix'n'match large catalogs, see [17]	Oppose
[18]	> 8 millions	names authority file [19]	[20]	Not found	humans: 581,397 / 3,921,547; authors: 204,813 / 290,364	URI, Instance Of, Scheme Membership(s), Collection Membership(s), Fuller Name, Variants, Additional Information, Birth Date, Has Affiliation, Descriptor, Birth Place, Associated Locale, Birth Place, Gender, Associated Language, Field of Activity, Occupation, Related Terms, Exact Matching Concepts from Other Schemes, Sources	Not found	[21]	Already well represented in Wikidata, low impact expected	Oppose
[22]	8,738,217	[23]	[24]	Not found	actors, directors, producers: 197,626 / 104,392	name, birth year, death year, profession, movies	Not found	No	Metadata allow simple yet effective matching strategies; the license can be used for linking, see [25]; quite well represented in Wikidata (2/3 of the relevant subset)	Support
[26]	2,181,744	authors, found in home page	[27]	SPARQL: [28]	humans: 356,126 / 4,146,818; authors: 148,758 / 346,419	country, language, variants of name, pages in data.bnf.fr, sources and references	LoC, GND, VIAF, IdRef, Geonames, Agrovoc, Thesaurus W	Yes (large catalogs)	Seems well shaped; already processed by Mix'n'match large catalogs, see [29]	Oppose
[30]	about 1.5 M	dataset described at [31]	[32]	SPARQL	humans: 94,009 / 4,408,935; authors: 40,656 / 454,521	depend on the links found for the ID	VIAF, GND	[33]	Underrepresented in Wikidata, small subset (47k entries) in Mix'n'match, of which 67% is unmatched, high impact expected	Strong support
[34]	About 500 k	Search for a,b,c,d... in the search window	Not found	SOLR	humans: 378,261 / 4,124,683; authors: 153,024 / 342,153	name, language, nationality, notes	[35], [36]	No	Discarded: no dump available	Oppose
[37]	417 k	[38]	[39]	API	humans: 303,235 / 4,199,709; authors: 4,966 / 490,211	PersonId, EngName, ChName, IndexYear, Gender, YearBirth, DynastyBirth, EraBirth, EraYearBirth, YearDeath, DynastyDeath, EraDeath, EraYearDeath, YearsLived, Dynasty, JunWang, Notes, PersonSources, PersonAliases, PersonAddresses, PersonEntryInfo, PersonPostings, PersonSocialStatus, PersonKinshipInfo, PersonSocialAssociation, PersonTexts	list of external sources: [40]	[41]	The database is in a proprietary format (Microsoft Access)	Weak support
[42]	About 500 k	found as per [43]	[44]	API, SRU	humans: 274,574 / 4,228,370; authors: 92,662 / 402,515	name, birth, death, identifier, SKOS preferred label, SKOS alternative labels	VIAF, GeoNames, Wikipedia, LoC Subject Headings	[45]	Lots of links to 3 external databases, but few metadata; seems to be the same as BNF.	Oppose
[46]	1,393,817	[47]	[48]	API	humans: 98,115 / 4,404,829; musicians & bands: 114,798 / 190,599	URI, Type, Gender, Born, Born in, Died, Died in, Area, IPI code, ISNI code, Rating, Wikipedia bio, Name, Discography, Annotation, Releases, Recordings, Works, Events, Relationships, Aliases, Tags, Detail	[49], [50], [51], [52], [53], [54], [55], [56], VIAF, Wikidata, English Wikipedia, YouTube/Vevo, [57], [58], resource official web page, [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75]	No	High quality data, plenty of external links, totally open source, regular dump releases	Strong support
[76]	About 1 M	People found as per [77], with restriction to people	Not found	API	humans: 128,536 / 4,374,408; authors: 42,991 / 452,186	birth, death, nationality, language, archival resources, related resources, related external links, ark ID, SNAC ID	[78], [79]	[80]	No dump available, 99.9% already matched in Mix'n'match	Oppose
[81]	840,883	[82]	[83]	API	humans: 167,663 / 4,335,281; authors: 410 / 494,767	name, country, keywords, other IDs, education, employment, works	[84], [85]	Yes (large catalogs)	Already processed by Mix'n'match large catalogs, see [86]	Oppose
[87]	6,921,024	[88]	[89]	API	humans: 140,883 / 4,362,061; authors: 58,823 / 436,354	name, birth year, death year	Not found	[90]	Only names, birth and death dates; no dedicated pages for people entries; source code: [91]	Neutral
[92]	Not found	Not found	Not found	Not found	Not found	Not found	Not found	[93]	Seems closed. The dataset providers claimed they would publish a new site, which has not happened so far [94]	Oppose
[95]	336 M active	[96]	Not found	API	humans: 85,527 / 4,417,417	verified account, user name, screen name, description, location, followers, friends, tweets, listed, favorites, statuses	Plenty	No	No official dump available, but the team has collected the dataset of verified accounts. Links stem from home page URLs, should be filtered according to a white list. Underrepresented in Wikidata, high impact expected	Strong support
[97]	Not found	Not found	Not found	Not found	Not found	Not found	Not found	No	Seems it does not contain people data, only books by author	Oppose
[98]	20,255	[99]	[100]	SPARQL	humans: 81,455 / 4,421,489; authors: 34,496 / 460,681	Subject line, Homepage ID, Synonym, Broader term, Hyponym, Related words, NOTE Classification symbol (NDLC), Classification symbol (NDC 9), Classification symbol (NDC 10), Reference (LCSH), Reference (BSH 4), Source (BSH 4), Source edit history, Created date, last updated	VIAF, LoC	No	Mismatch between the actual dataset and the links in Wikidata; it extensively refers to VIAF and LoC (see [101], entry 12 of the table and [102])	Oppose
[103]	Not found	Not found	API	humans: 97,599 / 4,405,345; authors: 28,404 / 466,773	Not found	Not found	Not found		No dump available	Oppose
[104]	5,736,280	[105]	[106]	API + Python client	humans: 66,185 / 4,436,759; musicians & bands: 78,522 / 226,875	artist name, real name, short bio, aliases, releases, band membership	Plenty, top-5 frequent: [107], [108], [109], [110], [111]	[112]	CC0 license, 92% not matched in Mix'n'match, high impact expected	Strong support
[113]	1,269,331	[114]	[115], [116]	SPARQL	humans: 106,839 / 4,396,105; authors: 50,251 / 444,926	name, bio	[117], [118], [119], [120], [121], [122], [123], [124], [125]	[126]	Outdated dump (2013), low quality data, 75% not matched in Mix'n'match	Oppose

Motivation #2: coverage estimation

We computed coverage estimations over Strong support and Support candidates to assess their impact on Wikidata, as suggested by Nemo bis^[8] (thanks for the valuable comment!). In a nutshell, coverage means how many existing Wikidata items could be linked.

For each candidate, the estimation procedure is as follows.

pick a representative 1% sample of Wikidata items with no identifier for the candidate. Representative means e.g., musicians for MusicBrainz (Q14005): it would not make sense to link generic people to a catalog of musical artists;
implement a matching strategy:
- perfect = perfect matches on names, sitelinks, external links;
- similar = matches on names and external links based on tokenization and stopwords removal;
- SocialLink, as per our approach;^[9]
compute the percentage of matched items with respect to the sample.
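The procedure above can be sketched as follows. This is an illustration only, not the code used for the estimation: it matches on names alone (no sitelinks or external links), the stopword list is hypothetical, and the sample and target names are made up.

```python
import re

STOPWORDS = {"the", "of", "and"}  # hypothetical, illustrative list


def tokens(name):
    """Lowercase, split on non-word characters, drop stopwords."""
    return frozenset(
        t for t in re.split(r"\W+", name.lower()) if t and t not in STOPWORDS
    )


def coverage(sample, target_names, similar=False):
    """Percentage of sample names that match at least one target name."""
    if similar:
        # Similar strategy: compare sets of tokens
        target_keys = {tokens(n) for n in target_names}
        matched = [n for n in sample if tokens(n) in target_keys]
    else:
        # Perfect strategy: compare raw strings
        target_keys = set(target_names)
        matched = [n for n in sample if n in target_keys]
    return 100 * len(matched) / len(sample)


sample = ["The Beatles", "Queen", "Nirvana", "Unknown Artist"]
target = ["Beatles, The", "Queen"]
```

On this toy input, the perfect strategy only links "Queen", while the similar one also links "The Beatles" to "Beatles, The".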

The table below shows the result. It is worth observing that similar coverage percentages correspond to different matching strategies: this suggests that each candidate may require a different algorithm to achieve the same goal. Our hypothesis is that higher data quality entails simpler solutions: for instance, MusicBrainz (Q14005) seems to be a well-structured catalog, so the simplest strategy is sufficient.

Target	Sample	Matching strategy	# matches	% coverage
BIBSYS (Q4584301)	4,249 authors and teachers	Perfect	899	21%
Discogs (Q504063)	1,253 musicians	Perfect & similar	414	33%⁽¹⁾
Internet Movie Database (Q37312)	1,022 actors, directors and producers	Perfect	432	42%
MusicBrainz (Q14005)	1,100 musicians	Perfect	388	35%
X (Q918)	15,565 living humans	SocialLink	4,867⁽²⁾	31%

(1) using perfect matching strategy only: 4.6%
(2) out of which 609 are confident matches

September 2018

We manually assessed small subsets of the matches obtained after the coverage estimations. Given the scores and the evaluation, we decided to discard BIBSYS (Q4584301). The main reasons besides the mere score follow.

The dump is not synchronized with the online data;
- identifiers in the dump may not exist online;
- cross-catalog links in the dump may not be the same as online;
the dump suffers from inconsistency:
- the same identifier may have multiple links, thus flawing the link-based matching strategy;
- links from different catalogs may have different quality, e.g., one may be correct, the other not;
online data can also be inconsistent. A match may be correct, but the online identifier may have a wrong cross-catalog link.

We report below a first round of evaluation that estimates the performance of already implemented matchers over the target catalogs. Note that MusicBrainz (Q14005) was evaluated more extensively thanks to MaxFrax96's thesis work.

Target	Matching strategy	# samples	Precision
BIBSYS (Q4584301)	Perfect links	10	50%
Discogs (Q504063)	Similar links	10	90%
Discogs (Q504063)	Similar names	32	97%
Internet Movie Database (Q37312)	Perfect names	10	70%
MusicBrainz (Q14005)	Perfect names	38	84%
MusicBrainz (Q14005)	Perfect names + dates	32	100%
MusicBrainz (Q14005)	Similar names	24	71%
MusicBrainz (Q14005)	Perfect links	71	100%
MusicBrainz (Q14005)	Similar links	102	99%
X (Q918)	SocialLink	67	91%

Technical

Baseline matchers finalized;
d:User:Soweego_bot account created;
request for the bot flag approved: d:Wikidata:Requests_for_permissions/Bot/soweego_bot;
added first set of identifier statements from baseline matchers;
started work on Internet Movie Database (Q37312), computed coverage estimation;
Validator component:
- delete invalid statements not complying with criterion 1;
- first working version of validation criterion 2 (links).

Dissemination

MaxFrax96 successfully defended his bachelor thesis^[10] on soweego. Congratulations!
Lc_fd joined the project as a volunteer developer. Welcome!

October 2018

During this month, the team devoted itself to software development, with tasks broken down as follows.

Application package

This is how the software is expected to ship. Tasks:

packaged soweego in 2 Docker containers:
1. test launches a local database instance to enable work on a target catalog dump extraction and import;
2. production feeds the shared Toolforge large catalogs database;
let a running container see live changes in the code.

Validator

This component is responsible for monitoring the divergence between Wikidata and a given target catalog. It implements bullet point 3 of the project review committee recommendations^[11] and performs validation of Wikidata content based on 3 main criteria:^[12]

existence of target identifiers;
agreement with the target on third-party links;
agreement with the target on "stable" metadata.
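The three criteria can be pictured with the sketch below. The data structures are hypothetical, and the exact rules for criteria 2 and 3 are assumptions made for illustration: the report does not specify how agreement is computed.

```python
def validate(item, catalog):
    """Return the list of criteria that a Wikidata item fails.

    `item` and `catalog` are hypothetical structures: `item` holds a target
    identifier, a set of third-party links and a metadata dict; `catalog`
    maps target identifiers to entries with the same two fields.
    """
    entry = catalog.get(item["target_id"])
    # Criterion 1: the target identifier must exist in the catalog
    if entry is None:
        return ["existence"]
    failed = []
    # Criterion 2 (assumed rule): when both sides have links, they must overlap
    if item["links"] and entry["links"] and not item["links"] & entry["links"]:
        failed.append("links")
    # Criterion 3 (assumed rule): metadata present on both sides must be equal
    shared = item["metadata"].keys() & entry["metadata"].keys()
    if any(item["metadata"][key] != entry["metadata"][key] for key in shared):
        failed.append("metadata")
    return failed


catalog = {
    "id-1": {"links": {"https://example.org/a"}, "metadata": {"born": "1940"}},
}
valid = {
    "target_id": "id-1",
    "links": {"https://example.org/a"},
    "metadata": {"born": "1940"},
}
missing = {"target_id": "id-9", "links": set(), "metadata": {}}
divergent = {
    "target_id": "id-1",
    "links": {"https://example.org/b"},
    "metadata": {"born": "1941"},
}
```

Criteria 2 and 3 are only checked once criterion 1 holds, since there is nothing to compare against when the identifier does not exist.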

Tasks:

existence-based validation (criterion 1):
- first run over MusicBrainz (Q14005);
- gather target catalog data through database queries;
full implementation of the link-based validation (criterion 2).

Importer

This component extracts a given target catalog data dump, cleans it and imports it into Magnus_Manske's database on Toolforge. It follows ChristianKl's suggestion^[13] and is designed as a general-purpose facility for the developer community to import new target catalogs. Tasks:

worked on MusicBrainz (Q14005):
- split dump into musicians and bands;
- extraction and import of musicians and bands;
- extraction and import of links.

Ingestor

This component is a Wikidata bot that uploads the linker and validator output. Tasks:

deprecate identifier statements not passing validation;
handle statements to be added: if the statement already exists in Wikidata, just add a reference node.
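The second task can be sketched as below. The in-memory `statements` dictionary is a hypothetical stand-in for Wikidata, `Q_ITEM`, `target-id` and the reference names are placeholders, and the "unchanged" outcome is an assumption about what happens when the reference is already there.

```python
def ingest(statements, item, prop, value, reference):
    """Add a statement, or only a reference if the statement already exists.

    `statements` maps (item, property, value) triples to lists of references.
    """
    key = (item, prop, value)
    references = statements.get(key)
    if references is None:
        # The statement is new: add it together with its reference
        statements[key] = [reference]
        return "statement added"
    if reference not in references:
        # The statement exists: just add a reference node
        references.append(reference)
        return "reference added"
    return "unchanged"


store = {}
# P434 is the MusicBrainz artist ID property
first = ingest(store, "Q_ITEM", "P434", "target-id", "ref-1")
second = ingest(store, "Q_ITEM", "P434", "target-id", "ref-2")
```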

Utilities

Complete URL validation: pre-processing, syntax parsing and resolution;
URL tokenization;
text tokenization;
match URLs known by Wikidata as external identifiers and convert them accordingly.
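To give an idea of URL tokenization, here is a minimal sketch based on the Python standard library. The rules (dropping the scheme and the `www` token, splitting on non-word characters) are assumptions for illustration and may differ from the ones implemented in soweego.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split a URL into a set of lowercase tokens, ignoring scheme and 'www'."""
    parts = urlsplit(url)
    raw = f"{parts.netloc} {parts.path} {parts.query}"
    return {t for t in re.split(r"\W+", raw.lower()) if t and t != "www"}


url_tokens = tokenize_url("https://www.discogs.com/artist/82730-The-Beatles")
```

Two URLs that differ only in scheme, `www` prefix or trailing slash yield the same token set, which is what makes tokens usable for similar link matching.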

November 2018

The team focused on the importer and linker modules.

Importer

Worked on Discogs (Q504063):
- split dump into musicians and bands;
- extraction and import of musicians and bands;
- extraction, validation and import of links;
- extraction and import of textual data.
major effort on building full-text indices on the Toolforge database:
- the Python library we use does not natively support them;
- investigated alternative solutions, i.e., https://github.com/mengzhuo/sqlalchemy-fulltext-search;
- managed to implement them from the initial library;
Refinements for MusicBrainz (Q14005).

Linker

Added baseline strategies to the importer workflow. They now consume input from the Toolforge database;
adapted and improved the similar name strategy, which leverages full-text indices on the Toolforge database;
preparations for the baseline datasets.

Validator

First working version of the metadata-based validation (criterion 3).

Ingestor

Add referenced third-party external identifier statements from link-based validation;
add referenced described at URL (P973) statements from link-based validation;
add referenced statements from metadata-based validation.

Dissemination

The project leader attended WikiCite 2018, see WikiCite 2018#Attendees;
joined the SPARQL jam session during day 3, discussed with Fuzheado alternative methods to slice Wikidata subsets via SPARQL;^[14]
new connections: Miriam_(WMF) from WMF research team, Jkatz_(WMF) from WMF readers department, Giovanni from Turing Institute UK, Susannaanas from Open Knowledge Finland, Michelleif from Stanford University;
synchronized with Tpt, T_Arrow, Adam_Shorland_(WMDE), Smalyshev_(WMF), Maxlath, LZia_(WMF), Dario_(WMF), Denny, Sannita.

December 2018

We focused on 2 key activities:

research on probabilistic record linkage;^[15]
packaging of the complete soweego pipeline.

Probabilistic record linkage

Deterministic approaches are rule-based linking strategies and represent a reasonable baseline. On the other hand, probabilistic ones leverage machine learning algorithms and are known to perform effectively.^[16] Therefore, we expect our baseline to serve as the set of features for probabilistic methods.

Explored and got hands-on with the recordlinkage library:^[17]
understood how the library applies the general workflow: cleaning, indexing, comparison, classification, evaluation;
published a report that details the required implementation steps;^[18]
started the first probabilistic linkage experiment, i.e., using the naïve Bayes algorithm;^[19]
recordlinkage extensively employs DataFrame objects from the well-known pandas Python library:^[20] investigated and got hands-on with it;
started work on the training set building:
- gathered the Wikidata training set live from the Web API;
- gathered the target training set from the Toolforge database;
- converted both to suitable pandas dataframes;
custom implementation of the cleaning step;
indexing implemented as blocking on the target identifiers;
started work on feature extraction.
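The general workflow mentioned above (cleaning, indexing, comparison, classification) can be sketched with the standard library alone. Note that this deliberately does not use the recordlinkage or pandas APIs, and that a fixed threshold replaces the naïve Bayes classifier: all records, field names and the threshold are hypothetical.

```python
from difflib import SequenceMatcher


def clean(name):
    """Cleaning: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def block(left, right, key):
    """Indexing: only pair records that share the same blocking key."""
    by_key = {}
    for record in right:
        by_key.setdefault(record[key], []).append(record)
    return [(lrec, rrec) for lrec in left for rrec in by_key.get(lrec[key], [])]


def compare(lrec, rrec):
    """Comparison: one feature vector per candidate pair."""
    name_similarity = SequenceMatcher(
        None, clean(lrec["name"]), clean(rrec["name"])
    ).ratio()
    same_year = 1.0 if lrec["born"] == rrec["born"] else 0.0
    return (name_similarity, same_year)


def classify(features, threshold=0.8):
    """Classification: a fixed threshold on the mean feature value."""
    return sum(features) / len(features) >= threshold


wikidata = [{"id": "Q_A", "name": "John  Lennon", "born": 1940}]
target = [
    {"id": "t1", "name": "john lennon", "born": 1940},
    {"id": "t2", "name": "John Lemon", "born": 1951},
]
pairs = block(wikidata, target, key="born")
links = [(w["id"], t["id"]) for w, t in pairs if classify(compare(w, t))]
```

Blocking on the birth year keeps the second target record out of the candidate pairs, so only one comparison is computed instead of two.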

Pipeline packaging

Finalized work on full-text indices on the Toolforge database;
adapted perfect name and similar link baseline strategies to work against the Toolforge database;
built a utility to retrieve mappings between Toolforge database tables and SQLAlchemy entities;
completed the similar link baseline strategy;
linking based on edit distances now works with SQLAlchemy full-text indices;
baseline linking can now run from the command line interface;
various Docker improvements:
- set up volumes in the production instance;
- allow custom configuration in the test instance;
- set up the execution of all the steps for the final pipeline.

January 2019

Happy new year! We are pleased to announce a new member of the team: Tupini07. Welcome on board! Tupini07 will work on the linker for Internet Movie Database (Q37312). The development activities follow.

IMDb

Clustered professions related to music;
reached out to IMDb licensing department;
understood how the miscellaneous profession is used in the catalog dump.

Probabilistic linker

Investigated Naïve Bayes classification in the recordlinkage Python library;
worked on feature extraction;
grasped performance evaluation in the recordlinkage Python library;
completed the Naïve Bayes linker experiment;
engineered the vector space model feature;
gathered Wikidata aliases for dataset building;
discussed how to handle feature extraction in different languages.

Baseline linker

Assessed similar URLs link results;
piped linker output to the ingestor;
read input data from the Wikidata live stream;
worked on birth and death dates linking strategy.

Importer

Imported MusicBrainz (Q14005) URLs;
added support for multiple dump files;
extracted ISNI code from MusicBrainz (Q14005) artist attributes.

Package

Installed less in Docker;
set up the final pipeline as a Docker container;
hit a segmentation fault when training in a Docker container on a specific machine;
improved Docker configuration.

February 2019

The team fully concentrated on software development, with a special focus on the probabilistic linkers.

Probabilistic linker

Handled missing data from Wikidata and the target;
parsed dates at preprocessing time;
started work on blocking via queries against the target full-text index;
understood custom blocking logic in the recordlinkage Python library;
resolved object QIDs;
dropped data columns containing missing values only;
included negative samples in the training set;
enabled dump of evaluation predictions to CSV;
started work on scaling up the whole probabilistic pipeline:
- implemented chunk processing techniques;
- parallelized feature extraction;
- avoided redundant input/output operations on files when gathering target datasets.
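Chunk processing and parallel feature extraction can be combined as in the sketch below. It is an illustration under stated assumptions: the feature (name length) is a toy one, and threads are used for portability, whereas CPU-bound feature extraction would more likely rely on processes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


def extract_features(chunk):
    """Hypothetical feature: length of each name in the chunk."""
    return [len(name) for name in chunk]


def parallel_features(names, size=2, workers=2):
    """Extract features chunk by chunk, preserving the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(extract_features, chunks(names, size))
        return [f for chunk_features in results for f in chunk_features]


features = parallel_features(["ab", "c", "def"])
```

Processing the dataset in chunks bounds the memory needed at any time, since the dataset never has to be loaded as a whole.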

Importer

The IMDb importer is ready;
handled connection issues with the target database engine;
made the expensive URL resolution functionality optional;
fixed a problem causing the MusicBrainz import to fail;
improved logging of the MusicBrainz dump extractor;
added batch insert functionality;
added import progress tracker;
extra logging for the Discogs dump extractor;
enabled the bulk insertion feature of the SQLAlchemy Python library.

Ingestor

Uploaded a 1% sample of the Twitter linker to Wikidata;
filtered the dataset on confident links;
resolved Twitter UIDs against usernames.

March 2019

This was a crucial month. In a nutshell:

the probabilistic linker workflow is in place;
we successfully ran it over complete imports of the target catalogs;
we uploaded samples of the linkers that performed best to Wikidata;
we produced the following evaluation reports for the Naïve Bayes (NB) and Support Vector Machines (SVM) algorithms.

Discogs NB: https://github.com/Wikidata/soweego/issues/171#issuecomment-476293971;
Discogs SVM: https://github.com/Wikidata/soweego/issues/171#issuecomment-477978766;
IMDb NB: https://github.com/Wikidata/soweego/issues/204#issuecomment-477964956;
IMDb SVM: https://github.com/Wikidata/soweego/issues/204#issuecomment-478038363;
MusicBrainz NB: https://github.com/Wikidata/soweego/issues/203#issuecomment-477558282;
MusicBrainz SVM: https://github.com/Wikidata/soweego/issues/203#issuecomment-478015882.

Linker

Removed empty tokens from full-text index query;
prevented positive sample indices from being empty;
implemented a feature for dates;
implemented a feature based on similar names;
implemented a feature for occupations:
- gathered specific statements from Wikidata;
- ensured that occupation statements are only gathered when needed;
- enabled comparison in the whole occupation classes tree;
handled missing target data;
avoided computing features when Wikidata or target DataFrame columns are not available;
built blocking via full-text query over the whole Wikidata dataset;
built full index of positive samples;
simplified probabilistic workflow;
checked whether relevant Wikidata or target DataFrame columns exist before adding the corresponding feature;
k-fold evaluation;
ensured that a model file is picked based on the supported classifiers;
filtered duplicate predictions;
first working version of the SVM linker;
avoided stringifying list values.
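For readers unfamiliar with k-fold evaluation, the index splitting behind it can be sketched as follows. This is a generic, unshuffled illustration written for this report, not the evaluation code of the project.

```python
def k_fold(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold(5, 2))
```

Each sample ends up in exactly one test fold, so every labeled pair contributes once to the evaluation scores.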

Importer

Fixed an issue that caused the failure of the IMDb import pipeline;
parallelized URL validation;
prevented the import of unnecessary occupations in IMDb;
occupations that are already expressed by the import table name do not get imported;
decompressed Discogs and MusicBrainz dumps are now deleted after a successful import;
avoided populating tables when a MusicBrainz entity type is unknown.

Miscellanea

Optimized full-text index queries;
the perfect name match baseline now runs bulk queries;
set up the Wikidata API login with the project bot;
progress bars do not disappear anymore.

April 2019

After introducing 2 machine learning algorithms, i.e., naïve Bayes and support vector machines, this month we brought neural networks into focus.

The major outcome is a complete run of all linkers over the whole datasets. The evaluation results are available at https://github.com/Wikidata/soweego/wiki/Linkers-evaluation.

Linker

Decided not to handle QIDs with multiple positive samples;
added feature that captures full names;
added an optional post-classification rule that filters out matches with different names;
added an SVM linker based on libsvm instead of liblinear. This allows the use of non-linear kernels at the cost of a longer training time;
first implementation of a single-layer perceptron;
added a set of Keras callbacks;
ensured a training/validation set split when training neural networks;
incorporated early stopping at training time of neural networks;
implemented a rule to enforce links of MusicBrainz entities that already have a Wikidata URL;
built a stopword list for bands;
enabled cache of complete training & classification sets for faster prototyping;
constructed facilities for hyperparameters tuning through grid search, available at evaluation and optionally at training time;
experimented with architectures of multi-layer perceptrons.
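Hyperparameter tuning through grid search boils down to scoring every combination of candidate values and keeping the best one. The sketch below is a generic illustration: the grid and the scoring function are toy examples, whereas a real run would train and evaluate a classifier for each combination.

```python
from itertools import product


def grid_search(param_grid, score):
    """Return the best parameter combination and its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical grid and toy scoring function, for illustration only
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}


def toy_score(params):
    """Toy score that peaks at C=1 with the rbf kernel."""
    return (params["kernel"] == "rbf") - abs(params["C"] - 1)


best, best_value = grid_search(grid, toy_score)
```

The number of combinations grows multiplicatively with each hyperparameter, which is one reason to make tuning optional at training time.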

Importer

Fixed a misleading log message when importing MusicBrainz relationships.

May 2019

This month was pretty packed. The team revolved around 3 main activities:

development of new linkers for musical and audiovisual works;
refactoring & documentation of the code base;
facility to upload medium-confidence results to the Mix'n'match tool.

New linkers

imported Discogs masters;
imported IMDb titles;
imported MusicBrainz releases;
implemented the musical work linker;
implemented the audiovisual work linker.

Refactor & document

Code style:
- format with black;^[21]
- remove unused imports & variables with autoflake;^[22]
- apply relevant suggestions from pylint;^[23]
refactored & documented the pipeline;
refactored & documented the importer module;
refactored & documented the ingestor module.

Mix'n'match client

Interacted with the project advisor, who is also the maintainer of the tool;
added ORM entities for mix'n'match catalog and entry DB tables.

Linker

Added string kernels as a feature for names;
completed the multi-layer perceptron;
handled too many concurrent SPARQL queries being sent when gathering the tree of occupation QIDs;
fixed parallelized full-text blocking, which made IMDb crash.

Importer

Avoid populating DB tables when a Discogs entity type is unknown.

Ingestor

Populate statements connecting works with people.

Continuous integration

Set up Travis;^[24]
added build badge to the README;
let Travis push formatted code.

June 2019

The final month was totally devoted to 5 major tasks:

deployment of soweego in production;
upload of results;
documentation;
code style;
refactoring.

Production deployment

Set up the production-ready Wikimedia Cloud VPS machine;^[25]
dry-ran and monitored production-ready pipelines for each target catalog;
structured the output folder tree;
decided confidence score thresholds;
the pipeline script now backs up the output folder of the previous run;
avoided interactive login to the Wikidata Web API;
enabled the extraction of Wikidata URLs available in target catalogs;
set up scripts for cron jobs.

Results upload

Confident links (i.e., with score above 0.8) are being uploaded to Wikidata via d:User:Soweego bot;
medium-confidence (i.e., with score between 0.5 and 0.8) links are being uploaded to Mix'n'match for curation by the community.
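The routing of links by confidence score can be sketched as below. The two thresholds come from the list above; the handling of boundary values and the discarding of links below 0.5 are assumptions, and the identifiers are placeholders.

```python
def route(links, high=0.8, low=0.5):
    """Split (qid, target_id, score) links by confidence score."""
    wikidata, mixnmatch = [], []
    for link in links:
        score = link[2]
        if score > high:
            # Confident link: goes to Wikidata through the bot
            wikidata.append(link)
        elif score >= low:
            # Medium-confidence link: goes to Mix'n'match for curation
            mixnmatch.append(link)
        # Assumption: anything below `low` is discarded
    return wikidata, mixnmatch


links = [("Q_A", "t1", 0.93), ("Q_B", "t2", 0.8), ("Q_C", "t3", 0.2)]
to_wikidata, to_mixnmatch = route(links)
```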

Documentation

Added Sphinx-compliant^[26] documentation strings to all public functions and classes;
complied with PEP 257^[27] and PEP 287;^[28]
converted and uplifted relevant pages of the GitHub Wiki into Python documentation;
customized the look of the documentation theme;
deployed the documentation to Read the Docs;^[29]
completed the validator module;
completed the Wikidata module;
completed the linker module: this activity required extra effort, since it is soweego's core;
main command line documentation;
full README.

Code style

Complied with PEP 8^[30] and Wikimedia^[31] conventions;
added type hints^[32] to public function signatures.

Refactoring

fixed pylint errors and relevant warnings;
reduced code complexity;
applied relevant pylint refactoring suggestions.

[1] Grants:Project/Hjfocs/soweego#Work_package

[2] Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability

[3] Select datatype set to ExternalId, Used for class set to human Q5

[4] https://github.com/MaxFrax/Evaluation

[5] https://tools.wmflabs.org/mix-n-match/

[6] http://magnusmanske.de/wordpress/?p=471

[7] http://magnusmanske.de/wordpress/?p=478

[8] Grants_talk:Project/Hjfocs/soweego#Coverage_statistics

[9] https://iswc2017.semanticweb.org/wp-content/uploads/papers/MainProceedings/441.pdf

[10] https://tools.wmflabs.org/soweego/MaxFrax96_BSc_thesis.pdf

[11] Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision

[12] https://github.com/Wikidata/soweego/issues/19#issuecomment-413622924

[13] Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability

[14] https://etherpad.wikimedia.org/p/WikiCite18Day3sparql

[15] :Record_linkage

[16] http://axon.cs.byu.edu/~randy/pubs/wilson.ijcnn2011.beyondprl.pdf

[17] https://recordlinkage.readthedocs.io

[18] https://github.com/Wikidata/soweego/wiki/Notes-on-the-recordlinkage-Python-library

[19] https://github.com/Wikidata/soweego/issues/146

[20] https://pandas.pydata.org/

[21] https://black.readthedocs.io/

[22] https://pypi.org/project/autoflake/

[23] https://pylint.readthedocs.io/

[24] https://travis-ci.com/

[25] https://tools.wmflabs.org/openstack-browser/project/soweego

[26] https://www.sphinx-doc.org/

[27] https://www.python.org/dev/peps/pep-0257/

[28] https://www.python.org/dev/peps/pep-0287/

[29] https://soweego.readthedocs.io/

[30] https://www.python.org/dev/peps/pep-0008/

[31] https://www.mediawiki.org/wiki/Manual:Coding_conventions/Python

[32] https://docs.python.org/3/library/typing.html
