Grants:Project/Hjfocs/soweego/Midpoint

This project is funded by a Project Grant

Report accepted

This midpoint report for a Project Grant approved in FY 2017-18 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:Project/Hjfocs/soweego.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's midpoint report! This report shares progress and learnings from the grantee's first 6 months.

Summary

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

ethics advisors on board: thanks to Piotrus and CristianCantoro for their availability;
target selection finalized: Discogs (Q504063), Internet Movie Database (Q37312), MusicBrainz (Q14005), and X (Q918);
the importer ships the targets into the freely accessible large catalogs database on Toolforge;
the validator is a watchdog that detects inconsistencies between Wikidata and the targets;
a set of baseline linking strategies emulates the first steps that a human would take to link a Wikidata person with a target one;
the ingestor is a Wikidata bot that starts small to scale up;
dissemination to the scientific community through a BSc thesis and to the Wikimedia community through WikiCite_2018.

Methods and activities

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

Project management

A GitHub repository under the Wikidata organization umbrella is the principal location for managing the whole project.^[1] We opted for it because everything we need is nicely integrated, namely:
- task administration;
- workboard;
- embedded wiki for informal documentation;
- software version control with Git;
- user-friendly code review system;
a specific Phabricator project^[2] acts as a mirror to GitHub. We mainly use it for interaction with the Wikimedia Cloud Services team, i.e., for system administration tasks;
the project leader holds bi-weekly meetings to report on progress and to plan short-term tasks, usually in front of the workboard;^[3]
yellow Post-its carry impromptu ideas and fine-grained technical issues.

Outreach

The project leader devotes substantial effort to building connections with strategic third-party partners, so far the Discogs (Q504063) and the MusicBrainz (Q14005) teams;
he is also responsible for systematic communication with relevant members of the target Wikimedia communities;
the project has attracted a number of local volunteers, thanks to dissemination activities at the University of Trento, such as flyers and a bachelor thesis defense.

Community feedback & sustainability

We strive to integrate comments from the project review committee and the community members. Specifically, we have focused on:
- finding an ethics advisor;
- a detailed analysis of the target catalog candidates;
- addressing major feature requests, namely:
  - a plan for syncing when external database and Wikidata diverge;^[4]
  - A tool (...) where normal users can add new databases in a reasonable timeframe;^[5]
we make software architecture choices with the long term in mind. This has translated into the following activities:
- tight partnership with Magnus Manske's Mix'n'match tool;^[6]
- use of the centralized Toolforge database;^[7]
- walkthroughs on the GitHub wiki to facilitate new technical contributions.^[8]

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

Ethics advisor

Respect for privacy is a fundamental principle of the Wikimedia movement: the use case of this project involves Wikidata items about people, and the review committee suggested having an ethics advisor on board.^[9] We took this recommendation very seriously and have been reaching out to relevant community members since the very start of the project.

We are pleased to have filled this role, and also an extra one:

Piotrus has kindly agreed to act as our main ethics advisor. The supervision is currently proceeding smoothly;^[10]
CristianCantoro has volunteered to support Piotrus as an ethics co-advisor.^[11]

We would like to express our deep gratitude here.

Target selection

The first key task was the selection of target catalogs that would pilot the implementation of the whole project. This was a very demanding activity, because we tried to integrate all the community feedback on this topic. Specifically, we used the following points as a basis for our investigation:

analysis of the long tail of small catalogs, as suggested by ChristianKl;^[5]
exploration of large catalogs to join efforts with our technical advisor Magnus Manske's work;^[12]
coverage estimation, raised by Nemo bis.^[13]

The outcomes are:

the small fishes report;^[14]
the big fishes report;^[15]
coverage estimation over candidate big fishes.^[16]

Importer

This is the first component of the soweego pipeline: it is responsible for the download, preprocessing, and import of target catalog dumps. Together with our technical advisor Magnus Manske, we decided to place soweego and the strategic partner tool Mix'n'match^[6] side by side. While Mix'n'match caters for small catalogs, soweego supports the import of large catalogs.^[12] With the long-term community adoption in mind, we chose to make the import result available in a shared database on the Toolforge infrastructure: in this way, all users with a Wikimedia developer account^[17] can access the data and build applications on top of it.

In addition, we implemented related community feedback,^[5] where ChristianKl advocated for "A tool (...) where normal users can add new databases in a reasonable timeframe". This required working toward more general-purpose architectural choices, and resulted in what we consider a reasonable trade-off, where a technical user can plug in a new target catalog/database with relatively low effort. We published a tutorial on the GitHub wiki to help newcomers.^[18]

Validator

This component is in charge of assessing whether Wikidata and a given target catalog disagree on entities and/or statements. Our goal here is to fully address the third point of the committee caveats.^[4] Since the central ambition of this project is to improve the quality of Wikidata, we believe this component can yield an additional positive impact.

We do so by validating Wikidata content against 3 criteria,^[19] each answering one question:

existence of target identifiers, i.e., does a Wikidata external identifier still exist in the target catalog?
consensus with the target on third-party links, i.e., do Wikidata and the target catalog share the same links on a given entity?
consensus with the target on a subset of "stable" statements, i.e., do Wikidata and the target catalog share the same set of stable statements on a given entity?
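
The three criteria can be illustrated with a minimal Python sketch. It assumes that entities are plain dictionaries and that the target catalog is an in-memory mapping; the function name `validate` and the field names `target_id`, `links`, and `statements` are hypothetical and are not taken from the soweego code base.

```python
def validate(wd_entity, target_catalog):
    """Check one Wikidata entity against a target catalog.

    Both structures are hypothetical stand-ins for illustration:
    - wd_entity: {'target_id': str, 'links': set, 'statements': dict}
    - target_catalog: {target_id: {'links': set, 'statements': dict}}
    """
    target = target_catalog.get(wd_entity['target_id'])

    # Criterion 1: does the identifier still exist in the target catalog?
    if target is None:
        return {'exists': False, 'links_agree': None, 'statements_agree': None}

    # Criterion 2: do both sides share the same links?
    links_agree = wd_entity['links'] == target['links']

    # Criterion 3: do both sides share the same set of stable statements?
    statements_agree = wd_entity['statements'] == target['statements']

    return {
        'exists': True,
        'links_agree': links_agree,
        'statements_agree': statements_agree,
    }
```

A real validator would likely report which links or statements differ, rather than a boolean; the boolean result here only keeps the three questions visible.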

Baseline linking strategies

The linker is the heart of soweego: it is responsible for linking Wikidata to a given target. So far, we developed a set of rule-based strategies, namely:

exact match on full names;
combined exact match on full names and birth/death dates;
exact match on URLs;
similar match on preprocessed full names;
similar match on preprocessed URLs.
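
The five strategies above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the soweego implementation: the record layout (`name`, `born`, `died`, `urls`), the normalization steps, the use of `difflib.SequenceMatcher` for name similarity, and the `0.9` threshold are all hypothetical choices.

```python
import unicodedata
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def preprocess_name(name):
    """Lowercase, strip accents, and sort tokens (assumed normalization)."""
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return ' '.join(sorted(no_accents.lower().split()))


def preprocess_url(url):
    """Drop scheme, 'www.' prefix, and trailing slash (assumed normalization)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    return host + parts.path.rstrip('/')


def exact_name_match(wd, target):
    return wd['name'] == target['name']


def exact_name_and_dates_match(wd, target):
    keys = ('name', 'born', 'died')
    return all(wd.get(k) == target.get(k) for k in keys)


def exact_url_match(wd, target):
    # A single shared URL is enough to propose a link
    return bool(wd['urls'] & target['urls'])


def similar_name_match(wd, target, threshold=0.9):
    ratio = SequenceMatcher(
        None, preprocess_name(wd['name']), preprocess_name(target['name'])
    ).ratio()
    return ratio >= threshold


def similar_url_match(wd, target):
    wd_urls = {preprocess_url(u) for u in wd['urls']}
    target_urls = {preprocess_url(u) for u in target['urls']}
    return bool(wd_urls & target_urls)
```

For instance, two records named `Beyoncé Knowles` and `knowles beyonce` fail the exact name match but pass the similar one, because both normalize to the same string.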

During the target selection phase, we conducted a preliminary evaluation for coverage estimation purposes.^[16]^[20]

Ingestor

This module is responsible for uploading the linker and validator results to Wikidata. We focused on output to be handled by a bot, i.e., d:User:Soweego_bot. The outcomes are:

the first task covers the linker output, i.e., to add referenced identifier statements. Requested and approved by the community;^[21]
the second task covers the validator criterion 2, i.e., link-based. Requested and approved by the community;^[22]
we performed a first run of test edits;^[23]
successful feedback loop with the community, as per reactions.^[24]^[22]

More specifically, d:User:Soweego_bot executed edits as follows.

Addition of referenced statements, namely:
- external identifiers from baseline linkers;
- third-party external identifiers from link-based validation;
- described at URL (P973) claims from link-based validation;
- several claims from metadata-based validation;
deprecation of external identifiers not passing validation.
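
To make the two kinds of edits concrete, the following Python sketch builds a referenced external identifier statement in the Wikibase JSON format and a deprecated copy of it. It only constructs the payloads and does not talk to the Wikidata API; how the bot actually submits edits is not described in this report. The helper names are hypothetical, the identifier value is made up, and the property IDs (`P1953` for the Discogs artist ID, `P248` for "stated in") are assumptions that do not appear in this report and should be double-checked.

```python
def identifier_statement(property_id, identifier, catalog_qid, rank='normal'):
    """Build a Wikibase statement with a 'stated in' (assumed P248) reference."""
    return {
        'type': 'statement',
        'rank': rank,
        'mainsnak': {
            'snaktype': 'value',
            'property': property_id,
            'datavalue': {'value': identifier, 'type': 'string'},
        },
        'references': [{
            'snaks': {
                'P248': [{
                    'snaktype': 'value',
                    'property': 'P248',
                    'datavalue': {
                        'value': {
                            'entity-type': 'item',
                            'numeric-id': int(catalog_qid[1:]),
                            'id': catalog_qid,
                        },
                        'type': 'wikibase-entityid',
                    },
                }],
            },
        }],
    }


def deprecate(statement):
    """Return a copy of the statement with its rank set to 'deprecated'."""
    return {**statement, 'rank': 'deprecated'}


# Made-up identifier '12345' referenced to Discogs (Q504063)
statement = identifier_statement('P1953', '12345', 'Q504063')
deprecated = deprecate(statement)
```

Deprecating rather than deleting keeps the invalid identifier visible in the item history and to data consumers, which is the behavior the last list item describes.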

Wikimedians having lunch at the Internet Archive (Q461) headquarters after WikiCite_2018

Dissemination

Bachelor thesis on the baseline linkers by MaxFrax96;^[25]
WikiCite_2018:^[26]
- project leader's lightning interview recorded by Konrad_Foerstner for Open Science Radio;^[27]
- SPARQL jam session held by Fuzheado;^[28]
- new connections: Miriam_(WMF), Jkatz_(WMF), Susannaanas, Michelleif;
- old friends: Tpt, T_Arrow, Adam_Shorland_(WMDE), Smalyshev_(WMF), Maxlath, LZia_(WMF), Dario_(WMF), Denny, Sannita;
Liaison with the Discogs (Q504063) and MusicBrainz (Q14005) folks, ongoing private discussion about contributing dead links to their databases.

Finances

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far?

Yes.

Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

We do not expect any major changes.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

Finding a trade-off between in-scope goals and long-term sustainability is probably the most delicate challenge;
from a software perspective, this translates into architectural decisions that may lead to a larger implementation effort;
it is not possible to set up a fully working development environment on the Wikimedia Cloud Services VPS machines,^[29] due to the complexity of the infrastructure;
the retention of volunteer software contributors is difficult: folks may sound very enthusiastic when they first approach the project, then they may just disappear.

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a renewal of this grant at the end of your project, please also mention this here.

The chief next step will be the implementation of linkers based on artificial intelligence (machine learning);
we will pursue an incremental approach: start with simple methods, evaluate their effectiveness, then raise complexity as needed. Specific algorithms follow:
1. naïve Bayes;^[30]
2. support vector machines;^[31]
3. neural networks;^[32]
the baseline linking strategies will serve as the feature set for the machine learning-based ones;
we will try to find the most relevant community outreach event for dissemination;
while the validator component was originally out of scope, work on it is paving the way to further chances for data quality improvements: therefore, we are definitely taking into consideration a renewal of this grant.
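
As an illustration of the first planned step, where the outcomes of the baseline strategies serve as features for a naïve Bayes linker, here is a minimal Bernoulli naïve Bayes sketch in plain Python. The feature order (exact name match, exact URL match, similar name match), the toy training pairs, and the function names are hypothetical; this is not the planned implementation, which may well rely on an existing machine learning library.

```python
import math


def train(samples, labels):
    """Fit a Bernoulli naive Bayes model with Laplace smoothing.

    samples: list of equal-length tuples of 0/1 features
    labels: list of 0/1 values (1 = the pair is a correct link)
    """
    n_features = len(samples[0])
    model = {}
    for label in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == label]
        prior = (len(rows) + 1) / (len(samples) + 2)
        likelihoods = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2)
            for i in range(n_features)
        ]
        model[label] = (prior, likelihoods)
    return model


def predict(model, sample):
    """Return the label with the highest log-posterior score."""
    scores = {}
    for label, (prior, likelihoods) in model.items():
        score = math.log(prior)
        for value, p in zip(sample, likelihoods):
            score += math.log(p if value else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy pairs: (exact name match, exact URL match, similar name match)
samples = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0), (1, 0, 0)]
labels = [1, 1, 1, 0, 0, 0]
model = train(samples, labels)
```

With these toy pairs, `predict(model, (1, 1, 1))` yields a link and `predict(model, (0, 0, 0))` does not. Smoothing keeps every probability strictly between 0 and 1, so the logarithms are always defined.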

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being a grantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 6 months?

During the proposal phase, the community review was an incredibly challenging step, due to the impressive amount of feedback we collected;
we tried our best to address the comments and rethink the proposal, and we particularly enjoyed this moment;
as always, the community is what makes Wikimedia projects so powerful: we feel part of this society, and try to apply the same principles to this project.

References

[1] https://github.com/Wikidata/soweego/

[2] :project/view/3476/

[3] https://github.com/Wikidata/soweego/projects/1

[4] Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision

[5] Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability

[6] https://tools.wmflabs.org/mix-n-match/

[7] wikitech:Help:Toolforge/Database#User_databases

[8] https://github.com/Wikidata/soweego/wiki

[9] Grants_talk:Project/Hjfocs/soweego#Aggregated_feedback_from_the_committee_for_soweego

[10] Grants_talk:Project/Hjfocs/soweego/Timeline

[11] Grants:Project/Hjfocs/soweego#Participants

[12] http://magnusmanske.de/wordpress/?p=471

[13] Grants_talk:Project/Hjfocs/soweego#Coverage_statistics

[14] Grants:Project/Hjfocs/soweego/Timeline#July_2018:_target_selection_&_small_fishes

[15] Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection

[16] Grants:Project/Hjfocs/soweego/Timeline#Motivation_#2:_coverage_estimation

[17] wikitech:Help:Create_a_Wikimedia_developer_account

[18] https://github.com/Wikidata/soweego/wiki/Import-a-new-database

[19] https://github.com/Wikidata/soweego/issues/19#issuecomment-413622924

[20] Grants:Project/Hjfocs/soweego/Timeline#September_2018

[21] :Wikidata:Requests_for_permissions/Bot/soweego_bot

[22] :Wikidata:Requests_for_permissions/Bot/Soweego_bot_2

[23] :Special:Contributions/Soweego_bot

[24] :User_talk:Hjfocs

[25] https://tools.wmflabs.org/soweego/MaxFrax96_BSc_thesis.pdf

[26] WikiCite_2018#Attendees

[27] http://www.openscienceradio.org/2019/01/06/osr134-wikicite-2018-enjoy-the-community-en/?t=14:30

[28] https://etherpad.wikimedia.org/p/WikiCite18Day3sparql

[29] wikitech:Portal:Cloud_VPS

[30] :Naive_Bayes_classifier

[31] :Support-vector_machine

[32] :Artificial_neural_network
