soweego is a machine learning system that connects Wikidata to large-scale third-party catalogs.
It takes a set of Wikidata items and a given catalog as input, and links them through record linkage techniques based on supervised learning.
The main output is a dataset of Wikidata and third-party catalog identifier pairs.
The story so far
- the soweego Wikidata bot uploaded hundreds of thousands of confident links;
- medium-confident ones are in Mix'n'match (Q28054658) for curation;
- there is strong community support;
- outcomes fit the Wikidata development roadmap.
We see two principal growth directions:
- the validator component;
- addition of new third-party catalogs.
The former was out of the initial Project Grant scope. Nevertheless, we developed a prototype to address a key suggestion from the Wikimedia research team: "Develop a plan for syncing when external database and Wikidata diverge". The latter is a natural way to expand beyond the initial use case, thus increasing impact through more extensive coverage. Contributor engagement will be crucial for this task.
Why: the problem
soweego complements the Wikidata development roadmap with respect to the Increase data quality and trust part.
It aims at addressing three open challenges, moving from a high-level perspective down to specific issues:
- missing feedback loops between Wikidata data donors and re-users;
- lack of methodical efforts to keep Wikidata in sync with third-party catalogs/databases;
- under-use of the statement ranking system, with a few bots performing most of the edits.
These challenges are intertwined: synchronizing Wikidata with a given external database is a precondition for a feedback loop between both communities. At the same time, sync results can affect usage of the ranking system.
How: the solution
We synchronize Wikidata with a given target catalog at a given point in time through a set of validation criteria:
- existence: whether a target identifier found in a given Wikidata item is still available in the target catalog;
- links: to what extent all URLs available in a Wikidata item overlap with those in the corresponding target catalog entry;
- metadata: to what extent relevant statements available in a Wikidata item overlap with those in the corresponding target catalog entry.
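As a rough illustration, the three criteria boil down to membership and overlap checks on sets of values. The sketch below is ours, not soweego's actual API: function names and the reduction of criteria 2 and 3 to a single overlap ratio are illustrative assumptions.

```python
def check_existence(wd_identifier, catalog_ids):
    """Criterion 1: the identifier found in Wikidata must still exist
    in the target catalog."""
    return wd_identifier in catalog_ids

def overlap_ratio(wd_values, catalog_values):
    """Criteria 2 & 3: extent of overlap between the values available
    in a Wikidata item and those in the corresponding catalog entry."""
    wd, cat = set(wd_values), set(catalog_values)
    if not wd and not cat:
        return 1.0  # nothing to compare: treat as full agreement
    return len(wd & cat) / len(wd | cat)

# Criterion 2 applied to the URLs of a fictitious entry
print(overlap_ratio(
    ["a.example", "b.example", "c.example"],
    ["b.example", "c.example", "d.example"],
))  # → 0.5
```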
The application of these criteria to our running example translates into the following actions.
- Elvis Presley (Q303) has a MusicBrainz identifier 01809552, which does not exist in MusicBrainz anymore.
Action = mark the identifier statement with a deprecated rank;
- Elvis Presley (Q303) has 7 URLs, MusicBrainz 01809552 has 8 URLs, and 3 overlap.
Action = add 5 URLs from MusicBrainz to Elvis Presley (Q303) and submit 4 URLs from Wikidata to the MusicBrainz community;
- Wikidata states that Elvis Presley (Q303) was born on January 8, 1935 in Tupelo, while MusicBrainz states that 01809552 was born in 1934 in Memphis.
Action = add 2 referenced statements with MusicBrainz values to Elvis Presley (Q303) and notify the MusicBrainz community of 2 Wikidata values.
In case of either full or no overlap in criteria 2 and 3, the Wikidata identifier statement should be marked with a preferred or a deprecated rank, respectively. Note that community discussion is essential to refine these criteria.
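The running example above can be reproduced with plain set arithmetic. This is an illustrative sketch with made-up URL placeholders, not soweego code; the rank mapping mirrors the rule just stated and is equally subject to community discussion.

```python
def plan_url_actions(wd_urls, catalog_urls):
    """Criterion 2 actions: split the two URL sets into one-sided differences."""
    wd, cat = set(wd_urls), set(catalog_urls)
    return {
        "add_to_wikidata": cat - wd,    # URLs only in the target catalog
        "submit_to_catalog": wd - cat,  # URLs only in Wikidata
    }

def decide_rank(shared, union_size):
    """Map overlap in criteria 2/3 to a statement rank."""
    if union_size and shared == union_size:
        return "preferred"   # full overlap
    if shared == 0:
        return "deprecated"  # no overlap
    return "normal"          # partial overlap: leave the rank untouched

# Running example: 7 Wikidata URLs, 8 MusicBrainz URLs, 3 shared
wd = {"wd1", "wd2", "wd3", "wd4", "s1", "s2", "s3"}
mb = {"mb1", "mb2", "mb3", "mb4", "mb5", "s1", "s2", "s3"}
actions = plan_url_actions(wd, mb)
print(len(actions["add_to_wikidata"]), len(actions["submit_to_catalog"]))  # → 5 4
print(decide_rank(len(wd & mb), len(wd | mb)))  # → normal
```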
What: things done
soweego 1 has an experimental validator module that implements the aforementioned criteria.
If you know how to use a command line and feel audacious, you can install soweego, import a target catalog, and try the validator out.
Besides that, major contributions in terms of content addition are:
- roughly 255,000 confident identifier statements uploaded by the soweego bot, totalling 482,000 Wikidata edits;
- around 126,000 medium-confident identifiers submitted to Mix'n'match (Q28054658) for curation.
- G1: take the soweego validator component from experimental to stable;
- G2: submit validation results to the target catalog providers;
- G3: engage the Wikidata community via effective communication of
- G4: expand soweego coverage to additional target catalogs.
- O1: production-ready validator module, implementing criteria acknowledged by the community;
- O2: datasets to enable feedback loops on the Wikidata users side, namely
- automatically ranked identifier dataset, as a result of validation criteria actions;
- entity enrichment statement dataset, based on available target data;
- O3: datasets to enable feedback loops on the target catalog providers side, namely rotten URLs and additional URLs & values;
- O4: engagement tools for Wikidata users, including visualization of soweego datasets and data curation tutorials;
- O5: procedure to plug a new target catalog that minimizes programming efforts.
- Feedback loop between target data donors and Wikidata users: as a target catalog consumer, soweego can shift the target maintenance burden. Edits performed by the soweego bot (properly communicated, for instance through visualization) close the loop from the Wikidata users side. A use case that emerged during the development of soweego 1 is the sanity check over target URLs: it yielded a rotten URLs dataset, which can be submitted to the target community;
- checks against target catalogs, which entail the enrichment of Wikidata items based on available data, especially relationships among entries, as encountered in soweego 1 target catalogs;
- automatic ranking of statements: this has a potentially huge impact on the Wikidata truthy statements dumps, which are widely used as the "official" ones and underpin a vast portion of the Query Service.
The following numerical metrics are projections over all target catalogs currently supported by soweego and all experimental validation criteria, based on estimates for a single catalog (i.e., Discogs (Q504063)) and a single criterion.
Note that the actual amount of total statements will depend on data available in Wikidata and in target catalogs.
- Validator datasets (O2):
- 250k ranked statements. Estimate: 21k identifier statements to be deprecated, as a result of criterion 1;
- 120k new statements. Estimate: 10k statements to be added or referenced, as a result of criterion 2;
- 440k rotten URLs. Estimate: 110k URLs, as a result of the sanity check at import time;
- 128k extra values. Estimate: 16k values, as a result of criterion 3.
From a qualitative perspective, a project-specific request for comment will act as a survey to collect feedback.
The 3 shared metrics are as follows.
- 50 total participants: sum of target catalog community fellows, Wikidata users facilitating the feedback loop, and contributors to the soweego codebase;
- 25 newly registered users: to be gathered from the Mix'n'match (Q28054658) user base;
- 370k content pages created or improved: sum of Wikidata statements edited by the soweego bot.
|M2||Feedback loop, data providers side||G2||3-12||25%|
|M3||Feedback loop, data users side||G3||3-12||25%|
Note that some milestones overlap in terms of timespan: this is required to exploit mutual interactions among them.
|M1.1||Validation criteria||Refine criteria through community discussion||35%|
|M1.2||Automatic ranking||Submit validation criteria actions on Wikidata statement ranks||20%|
|M1.3||Automatic enrichment||Leverage target catalog relationships to generate Wikidata statements||20%|
|M1.4||Interaction with constraints check||Intercept reports of this Wikibase extension to improve soweego||25%|
|M2.1||Rotten URLs||Contribute rotten URLs to target catalog providers||30%|
|M2.2||Extra URLs & content||Submit additional URLs (criterion 2) and values (criterion 3) from Wikidata to target catalog providers||70%|
|M3.1||Improve communication of Wikidata edits made by soweego||
|M3.2||Data curation guidelines||Explain how to curate Wikidata and Mix'n'match contributions made by soweego||
|M3.3||Real-world evaluation||Switch from in vitro to in situ evaluation of the soweego system||
|M4.1||New domains||Extend support of target catalogs beyond people and works (initial use case)||20%|
|M4.2||Codebase generalization||Minimize domain-dependent logic||30%|
|M4.3||Simple import||Squeeze the effort needed to add support for a new catalog||50%|
The total amount requested is 80,318 €.
|Project lead||Responsible for the full project implementation||Full time (40 hrs/week), 12 PM||52,735 €|
|Core system architect||Technical operations head||Full time, 12 PM||12,253 €|
|Research assistant||In charge of the applied machine learning components||Part time (20 hrs/week), 6 PM||14,330 €|
|Dissemination||Expenses to attend 2 relevant community events||One shot||1,000 €|
Gross salaries of the human resources are computed upon average estimates based on roles and locations. Gross labor rates per hour follow.
Note that the project lead will serve as the main grantee and will appropriately allocate the funding.
We identify the following relevant communities, both inside and outside the Wikimedia landscape.
- Wikidata development team;
- Mix'n'match (Q28054658) users;
- Target catalog owners:
Besides the currently supported target catalogs, we would like to engage Royal Botanic Gardens, Kew (Q18748726) through a formal collaboration with the Intelligent Data Analysis team.
The organization maintains a set of biodiversity catalogs, such as Index Fungorum (Q1860469).
The team is exploring directions to leverage and disseminate their collections.
Moreover, this would let soweego scale up to a totally different domain.
Hence, we believe the collaboration would yield a 3-fold benefit: for Wikidata, Kew, and soweego.
We acknowledge receipt of the e-mail sent by the Project Grant program officers, regarding updated WMF requirements in light of the current COVID-19 health emergency. We detail the guidelines below and how this proposal fully complies with them:
- travel and/or offline events are a minor focus of this proposal
- The whole #Work package does not include any offline events. On the other hand, a #Budget line includes participation in 2 community events. This line does not explicitly mention specific offline events: the project lead is responsible for their selection.
- we can complete the core components of the proposed work plan without offline events or travel
- Absolutely, 100% of the #Work package is dedicated to the technical development of the project. All the tasks can be carried out in an online/remote setting.
- we are able to postpone any planned offline events or travel until WMF guidelines allow for them, without significant harm to the goals of this project
- As mentioned above, there are no specific offline events planned. The project lead is responsible for choosing relevant ones: this choice will be postponed until WMF guidelines allow for it.
- how this project would be impacted if travel and offline events prove unfeasible throughout the entire life of this project
- This would not be a problem, since we are able to fully convert the dissemination of the project output into online forms. For instance, this would translate into the allocation of more effort to activities M3.1 and M3.2 (see #Work package).
- see for instance https://tools.wmflabs.org/mix-n-match/#/group/Music
- see more in #What:_things_done
- see third main bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
- 4 catalogs, see Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection
- see the feedback loops with data re-users block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
- see the checks against 3rd party databases block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
- see for instance d:Special:Contributions/PreferentialBot
- the examples are fictitious and do not reflect the actual data
- plenty of queries use the truthy prefix wdt, see for instance d:Wikidata:SPARQL_query_service/queries/examples#Showcase_Queries
- see #How:_the_solution
- person months
- corresponds to 25% of the salary. The rest is funded by the hosting university
- corresponds to 50% of the salary. The rest is funded by the hosting university
- see #Participants
- Marco Fossati
- Hjfocs is a research scientist with a double background in natural languages and information technology. He holds a PhD in computer science from the University of Trento (Q930528).
- His profile is highly hybrid: an enthusiastic leader of applied research in natural language processing for the Web, backed by a strong dissemination attitude and an innate passion for open knowledge, all blended with software engineering skills and a deep affection for human languages.
- He is currently focusing on Wikidata data quality and has been leading soweego since the very first prototype, as well as the StrepHit project, both funded by the Wikimedia Foundation.
- Emilio Dorigatti
- Edorigatti is a PhD candidate at the Ludwig Maximilian University of Munich (Q55044), where he is applying machine learning methods to the design of personalized vaccines for HIV and cancer, in collaboration with the Helmholtz Zentrum München (Q878592).
- He holds a BSc in computer science and a double MSc in data science, with a minor in innovation and entrepreneurship. He also has several years of working experience as a software developer, data engineer, and data scientist. He is mostly interested in Bayesian inference, uncertainty quantification, and Bayesian Deep Learning.
- Emilio was a core team member of the StrepHit project.
- Massimo Frasson
- MaxFrax96 has been a professional software engineer for more than 5 years. He has mainly worked as an iOS developer and game developer at Belka.
- He has always had a strong interest in software architecture and algorithm performance. He is currently attending an MSc in computer science at the University of Milan (Q46210) to deepen his knowledge of data science.
- Massimo has been involved in soweego since the early stages, and has made key contributions to its development.
- Advisor Volunteering my experience from Mix'n'match, amongst others Magnus Manske (talk) 10:47, 17 February 2020 (UTC)
- Volunteer I'm part of OpenMLOL, which has a a property in Wikidata (https://www.wikidata.org/wiki/Property:P3762) as the identifier of an author in openMLOL. I'm trying to standardize our author ID with the Wikidata ID. .laramar. (talk) 16:18, 20 February 2020 (UTC)
- Volunteer Volunteer Back ache (talk) 10:29, 2 March 2020 (UTC)
The links below reference notifications to relevant mailing lists and Wiki pages. They are sorted in descending order of specificity.
- Wikidata: https://lists.wikimedia.org/pipermail/wikidata/2020-February/013842.html
- Wikidata project chat: d:Wikidata:Project_chat#soweego_2_proposal
- Wikidata weekly summary: d:Wikidata:Status_updates/2020_02_24
- Wikidata Telegram channel: https://t.me/joinchat/AZriqUj5UagVMHXYzfZFvA
- Wiki research: https://lists.wikimedia.org/pipermail/wiki-research-l/2020-February/007127.html
- AI: https://lists.wikimedia.org/pipermail/ai/2020-February/000296.html
- Wikimedia: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-February/094292.html
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project by clicking the blue button in the infobox, or edit this section directly. (Other constructive feedback is welcome on the discussion page).
- soweego 1: Grants:Project/Hjfocs/soweego#Endorsements
- soweego 1.1: Grants:Project/Rapid/Hjfocs/soweego_1.1#Endorsements
- Very cool project idea. 2001:4CA0:0:F235:2486:8ED5:C3AA:C914 09:30, 27 January 2020 (UTC)
- Connections between DBs. Frettie (talk) 15:59, 12 February 2020 (UTC)
- Important update/sync mechanism for large third-party catalogs Magnus Manske (talk) 10:20, 17 February 2020 (UTC)
- Helps linking item around the world ;) Crazy1880 (talk) 17:16, 19 February 2020 (UTC)
- As Wikidata matures and becomes more important for knowledge graphs and the products (commercial or not) built upon them, it also becomes more important to keep its quality high. This project contributes to Wikidata's quality by detecting obsolete links between Wikidata and 3rd party databases, discovering new links between Wikidata and 3rd party databases, and helping synchronize facts between Wikidata and 3rd party databases. (Nicolastorzec (talk) 00:43, 20 February 2020 (UTC)
- Matching existing items is our biggest bottleneck in integrating new data sources. Any technological help in this regard is a good thing. The team also looks strong. 99of9 (talk) 01:35, 20 February 2020 (UTC)
- sounds reasonable and project site is well designed 220.127.116.11 05:42, 20 February 2020 (UTC)
- Link curation is hard and this will continue to make it easier and more efficient. StudiesWorld (talk) 11:55, 20 February 2020 (UTC)
- As we import more catalogs, we need more tools to improve the matching process. This will help. - PKM (talk) 21:44, 20 February 2020 (UTC)
- To complete previous work and provide more automated tools to data catalogs. Sabas88 (talk) 14:28, 21 February 2020 (UTC)
- This is useful, a benefit for Wikidata as well as the other catalogues, and makes it easy and fun to participate in the creation of knowledge records. Sebastian Wallroth (talk) 09:04, 22 February 2020 (UTC)
- I've always been a supporter of the project, and continue to be one! Sannita - not just another it.wiki sysop 17:52, 22 February 2020 (UTC)
- Any technology that helps linking item between DBs and makes easier to participate it's a good idea Tiputini (talk) 18:07, 23 February 2020 (UTC)
- I particularly like the sync-ing with third party databases. 18.104.22.168 15:22, 24 February 2020 (UTC)
- Essential for keeping Wikidata updated and usable for rapidly changing databases.-Nizil Shah (talk) 02:14, 26 February 2020 (UTC)
- This is potentially valuable. To get the most of it, though, the catalogs generated for Mix 'n' Match must be improved (better identifiers and descriptions to allow decisions without *always* having to click through). Some reflection is also due on why relatively little work has been done with MnM so far. Finally, I'd like to see a clear plan for maintenance and sustainability of the project beyond the project lead's current academic context. Will community members be able to keep running the tool? Ijon (talk) 13:07, 26 February 2020 (UTC)
- The proposal promises a good extension of the first proposal, and the first proposal has really shown it's worth. Identifiers and mappings are an important part of Wikidata, and will become crucial for quality efforts. This is an extremely helpful step to further enable a quality checking approach against external data sources. I agree with Ijon that a focus should be put on making it sustainable so that the project results can be kept active without the project ongoing in the future. denny (talk) 15:02, 26 February 2020 (UTC)
- As Magnus Manske. Epìdosis 15:28, 26 February 2020 (UTC)
- I'd love to see it action. As a regular MnM user, I believe feedback loop and Wikidata content validation are essential and will also be very useful for data donors.--HakanIST (talk) 19:09, 28 February 2020 (UTC)
- I would like to see Wikidata landscape linked to other external catalogs and also learn how this process can be further improved. John Samuel 15:13, 29 February 2020 (UTC)
- Looks certainly promising. ESM (talk) 06:27, 2 March 2020 (UTC)
- It would save many human/volunteer time. Maybe at the beginning will need some human review, I would like to hear how this revisions will be. Please take on account data repositories outside USA and Europe, and non-English-based alphabets, don't get the data biased. Salvador (talk) 05:10, 3 March 2020 (UTC)
- Mix'n'Match is a great tool, but needs good input. The ChEBI database is a good example where name matching does not work, and the matching should happen on other properties of the entities in ChEBI being matched to Wikidata (particularly: the InChIKey). If soweego can fill that gap, that would be awesome. Egon Willighagen (talk) 11:29, 4 March 2020 (UTC)
- Support This would be a critical addition to the mission and work of Wikidata! Great work so far! Todrobbins (talk) 22:33, 4 March 2020 (UTC)
- Its necessary for Wikidata; we need a tool to quickly and easy add statements from sourced databases. Matlin (talk) 10:05, 5 March 2020 (UTC)
- Support Important tool for establishing Wikidata as a hub for different databases. --Nw520 (talk) 13:29, 5 March 2020 (UTC)
- Support Sounds like a good way of handling large datasets. Richard Nevell (WMUK) (talk) 14:46, 5 March 2020 (UTC)
- Support this looks like a great development for GLAM-wiki projecs too! Marta Arosio (WMIT) (talk) 07:27, 9 March 2020 (UTC)
- third party databases are super important for wikidata! Icebob99 (talk) 20:58, 11 March 2020 (UTC)
- Support Obvious win, promotes w:WP:V and w:MOS:BUILD. Feeding high-confidence linkages back to donor databases will both nourish them and help to highlight any issues of falsely high confidence. LeadSongDog (talk) 15:45, 12 March 2020 (UTC)
- connecting data sources is super useful Alexzabbey (talk) 15:21, 13 March 2020 (UTC)
- Linking identifiers is arguably the most important role of Wikidata, and soweego can do it more efficiently than any human. Vahurzpu (talk) 17:17, 13 March 2020 (UTC)
- Support MargaretRDonald (talk) 15:15, 15 March 2020 (UTC)
- This could be a good way to improve Wikidata on a large scale. Also, academic libraries are interested in this sort of thing and could help with data-gathering. Rachel Helps (BYU) (talk) 20:55, 16 March 2020 (UTC)
- The work on soweego has been very useful so far and I'd love to see the work on it continue. I really appreciate that it aligns well with our roadmap and the current priorities for Wikidata and the focus on increasing the quality of its data. I have confidence in the people behind this proposal to deliver on their promises. --Lydia Pintscher (WMDE) (talk) 13:30, 17 March 2020 (UTC)
- Reduced human maintenance effort Frenzie (talk) 20:08, 20 March 2020 (UTC)
- Would help make processes more efficient, and reduce human labour! Support! TheFrog001 (talk) 14:01, 22 March 2020 (UTC)
- Support Interesting project Afernand74 (talk) 21:40, 25 March 2020 (UTC)
- Very important for Wikidata and Wikimedia ecosystem in general. Tubezlob (talk) 10:03, 29 March 2020 (UTC)
- Support I love using Mix'n'Match, I teach it during my Wikidata workshops and this will make it even better! Powerek38 (talk) 05:21, 1 April 2020 (UTC)
- Support I am satisfied with the in progress Grants:Project/Hjfocs/soweego and support to keep this going in this proposal. Blue Rasberry (talk) 12:48, 1 April 2020 (UTC)
- The distinguishing of the Wiki entities and their interconnection to other databases is an important thing! I'm glad someone's taking care of it. Kommerz (talk) 09:57, 3 April 2020 (UTC)
- Support A useful project, as connecting with other databases is really important. --Marcok (talk) 15:19, 3 April 2020 (UTC)
- Support Wikidata needs that kind of tools to improve more and more it's contents. Bye, Elisardojm (talk) 07:52, 4 April 2020 (UTC)
- This would very much help the cooperation between catalog holders and the open source data community. Beireke1 (talk) 14:25, 6 April 2020 (UTC)
- Support Great initiative to facilitate cooperation between third-party catalogs/databases and Wikidata community. Sam.Donvil (talk) 14:42, 6 April 2020 (UTC)
- Support Huge support for Soweego 2 from me! The proposal addresses the most difficult and time consuming issues that come up when trying to import external datasets and keep them in sync with Wikidata. This will definitely increase Wikidata quality and connectivity with the rest of the web, while saving countless hours of repetitive maintenance work for community members. NavinoEvans (talk) 12:22, 8 April 2020 (UTC)
- Similar to Freebase Review queue and would like to see this developed again somehow for Wikidata. Soweego 2 is a good beginning effort for allowing community review of potential mass uploads through external tools such as OpenRefine, etc. Perhaps a dedicated Tag in the queue could be added to Soweego 2 for OpenRefine to be used for the uploads, so we know that uploads came from OpenRefine tool users. Thadguidry (talk) 21:04, 8 April 2020 (UTC)
- Support: sounds useful. Nomen ad hoc (talk) 07:27, 9 April 2020 (UTC).
- Support: sounds like a good tool. 2800:A4:3169:A000:A065:E1A:54B:9A9A 19:34, 9 April 2020 (UTC)
- Support: Strong support. Looks like a cool project. Ranjithsiji (talk) 14:29, 10 April 2020 (UTC)
- This project sounds like a logic next step, first start with addiding data (semi-automatically e.g. Mix'n'match, second step start an automatic feedback loop through AI/machine learning on data extracted from multiple source to keep on improving Wikidata and other sources. Setting up such an feedback sycle and making the software open source, will be valuable for many projects that can use it as a template example or join the data web by linking their own data sources. NAZondervan (talk) 08:57, 13 April 2020 (UTC)
- Support: Olea (talk) 17:14, 15 April 2020 (UTC)
- Support: Like tears in rain (talk) 16:03, 16 April 2020 (UTC)
- Support: Important for keeping and improving the data in Wikidata and keep said data synchronized with outside databases. Tm (talk) 13:47, 18 April 2020 (UTC)
- Support:We need to improve the reliabilty of Wikidata items also with automated tools that work like human users, so this tool could be very useful for this purpose. Mess (talk) 08:51, 25 April 2020 (UTC)
- Support: It's an important project, as it tackles the issue of synchronizing Wikidata with partner databases, such as MusicBrainz. Beat Estermann (talk) 17:08, 1 May 2020 (UTC)
- Mapping WikiData QIDs to third party open data sources such as Musicbrainz is a no brainer. It's beneficial to the entire open web infrastructure and useful in countless ways. Audiodude (talk) 02:21, 2 December 2020 (UTC)