Grants:Project/Hjfocs/soweego 2/Timeline

Timeline for Hjfocs

| Timeline | Date |
| --- | --- |
| Validator | July 2022 |
| Feedback loop, data providers side | November 2022 |
| Feedback loop, data users side | November 2022 |
| New catalogs | November 2022 |


Overview

Monthly updates

Each of the following sections spans one month, starting on the 5th day of the month it is named after. For instance, July 2021 stands for July 5 to August 4, 2021.
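The reporting-period convention can be made explicit with a small helper. This is only an illustrative sketch: the function name `reporting_period` is hypothetical and not part of soweego.

```python
from datetime import date, timedelta
from typing import Tuple


def reporting_period(year: int, month: int) -> Tuple[date, date]:
    """Return the first and last day covered by the section named after a month."""
    start = date(year, month, 5)
    # The next period starts on the 5th of the following month,
    # so this one ends the day before.
    if month == 12:
        next_start = date(year + 1, 1, 5)
    else:
        next_start = date(year, month + 1, 5)
    return start, next_start - timedelta(days=1)


start, end = reporting_period(2021, 7)
print(start.isoformat(), end.isoformat())  # 2021-07-05 2021-08-04
```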

July 2021

The very first task was the refinement of validation criteria, as proposed in Grants:Project/Hjfocs/soweego_2#How:_the_solution. We started the discussion with the community on the Wikidata chat and mailing list.

While we think that consensus was reached on criterion 2, i.e., link validation, we plan to leave the discussion open until agreement on the automatic ranking actions is achieved.

Besides that, we started technical work with a focus on the validator component:

  • refresh target catalog imports;
  • use a URL blacklist;
  • output catalog IDs;
  • improve extraction of Wikidata identifiers from URLs.
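To illustrate the last two points, here is a minimal sketch of how an identifier might be pulled out of a URL with the help of a formatter URL and a domain blacklist. It is not the soweego code: the property key `P_EXAMPLE`, both example domains, and the function names are hypothetical, and the tolerance for `http`/`https`, an optional `www.` prefix, and a trailing slash is an assumption about which URL variants are worth accepting.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical data: a real run would load these from Wikidata and from a
# maintained blacklist rather than hard-coding them.
BLACKLISTED_DOMAINS = {"spam.example.net"}
FORMATTER_URLS = {"P_EXAMPLE": "https://catalog.example.org/artist/$1"}


def formatter_to_regex(formatter_url: str) -> "re.Pattern[str]":
    """Turn a formatter URL with a $1 placeholder into a tolerant regex."""
    prefix, _, suffix = formatter_url.partition("$1")
    # Drop the scheme and an optional "www." so both variants can match.
    bare_prefix = re.sub(r"^https?://(www\.)?", "", prefix)
    return re.compile(
        r"^https?://(?:www\.)?"
        + re.escape(bare_prefix)
        + r"([^/?#]+)"
        + re.escape(suffix)
        + r"/?$"
    )


def extract_identifier(url: str) -> Optional[Tuple[str, str]]:
    """Return (property, identifier) for a URL, or None if nothing matches."""
    if urlparse(url).netloc.lower() in BLACKLISTED_DOMAINS:
        return None
    for prop, formatter_url in FORMATTER_URLS.items():
        match = formatter_to_regex(formatter_url).match(url)
        if match:
            return prop, match.group(1)
    return None


print(extract_identifier("http://www.catalog.example.org/artist/12345/"))
```

URLs that return `None` here are the kind of leftovers that the domain statistics in this section are meant to summarize.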

With respect to link validation, we implemented a suggestion that Azertus raised during the discussion:

"You could generate some statistics on the URLs that could be added in a second phase, like prevalence of domains. Based on that list, new properties could be proposed or domains could be whitelisted, etc."

The following sub-sections hold frequency statistics about URLs that could not be automatically converted to valid identifiers. We submitted them for community discussion at d:Wikidata:Project_chat#URLs_statistics_for_Discogs_(Q504063)_and_MusicBrainz_(Q14005).
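As a rough illustration of how such frequency statistics could be computed, the sketch below counts the host part of each URL. It is not the soweego implementation: the function name, the lowercasing of hosts, and the `min_count` threshold are assumptions made for the example, and the sample URLs are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def domain_frequencies(urls: Iterable[str], min_count: int = 1) -> List[Tuple[str, int]]:
    """Count how often each Web domain occurs, most frequent first."""
    counts: Counter = Counter()
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain:  # skip strings that do not parse as absolute URLs
            counts[domain] += 1
    return [(domain, n) for domain, n in counts.most_common() if n >= min_count]


sample = [
    "https://video.example.org/user/a",
    "https://video.example.org/user/b",
    "http://ids.example.net/123",
    "not a url",
]
for domain, n in domain_frequencies(sample):
    print(domain, n)
```

Note that hosts with and without a `www.` prefix are counted separately here, so two spellings of the same site can show up as distinct rows.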

Most frequent Web domains

This table displays Web domains that occur more than 1,000 times in Discogs (Q504063) and MusicBrainz (Q14005): some should actually map to known identifiers, some may be candidates for new Wikidata properties. See Grants:Project/Hjfocs/soweego_2/Stats for less frequent ones.

| Domain | Frequency | Comment | Property candidate |
| --- | --- | --- | --- |
| isni.oclc.org | 54007 | d:Property:P213#P1793 regex contains spaces, URLs don't | Oppose |
| www.youtube.com | 18395 | user namespace, not to be confused with channel (d:Property:P2397) | Support |
| itunes.apple.com | 17468 | artist URLs, not to be confused with existing iTunes properties | Support |
| www.musik-sammler.de | 13541 | | Support |
| www.myspace.com | 12063 | add optional www to d:Property:P3265#P8966 | Oppose |
| www.bbc.co.uk | 10596 | musical work review URLs, not to be confused with existing BBC properties | Support |
| www.metal-archives.com | 7703 | URLs not matching d:Property:P1952#P1630: consider adding the new value https://www.metal-archives.com/bands/$1 | Oppose |
| muzikum.eu | 5548 | | Support |
| lyrics.wikia.com | 4153 | | Support |
| www.generasia.com | 3360 | | Support |
| plus.google.com | 2084 | obsolete URLs | Oppose |
| nla.gov.au | 1962 | | Support |
| www.reverbnation.com | 1811 | | Support |
| musicmoz.org | 1747 | | Support |
| www.45cat.com | 1632 | artist URLs, not to be confused with seven inches (d:Property:P9083) | Support |
| web.archive.org | 1213 | can be used as d:Property:P1065 value, plus d:Property:P2960 and d:Property:P485 | Oppose |
| www.purevolume.com | 1106 | | Support |
| www.amazon.com | 1076 | URLs not matching d:Property:P6276#P1630: consider adding the new value https://www.amazon.com/-/e/$1 | Oppose |
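The isni.oclc.org row describes a purely formal mismatch: the format constraint on d:Property:P213 expects space-separated blocks, while URLs carry the identifier without spaces. A minimal sketch of the conversion follows, assuming the standard 16-character ISNI shape (15 digits plus a final digit or X). The function name and the sample URL host are hypothetical.

```python
import re
from typing import Optional

# 16 characters: 15 digits followed by a digit or "X", not embedded in a
# longer run of digits.
COMPACT_ISNI = re.compile(r"(?<!\d)(\d{15}[\dX])(?![\dX])")


def spaced_isni(text: str) -> Optional[str]:
    """Find a compact ISNI in a string and return it in four blocks of four."""
    match = COMPACT_ISNI.search(text)
    if match is None:
        return None
    compact = match.group(1)
    return " ".join(compact[i:i + 4] for i in range(0, 16, 4))


print(spaced_isni("https://isni.example.org/0000000121032683"))
# 0000 0001 2103 2683
```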

Discogs Band

Domain Frequency Examples
www.myspace.com 7876 1. URL, record; 2. URL, record; 3. URL, record;
www.youtube.com 2647 1. URL, record; 2. URL, record; 3. URL, record;
www.reverbnation.com 541 1. URL, record; 2. URL, record; 3. URL, record;
instagram.com 487 1. URL, record; 2. URL, record; 3. URL, record;
web.archive.org 483 1. URL, record; 2. URL, record; 3. URL, record;
www.twitter.com 363 1. URL, record; 2. URL, record; 3. URL, record;
www.metal-archives.com 270 1. URL, record; 2. URL, record; 3. URL, record;
facebook.com 217 1. URL, record; 2. URL, record; 3. URL, record;
plus.google.com 181 1. URL, record; 2. URL, record; 3. URL, record;
www.soundcloud.com 143 1. URL, record; 2. URL, record; 3. URL, record;
bandzone.cz 139 1. URL, record; 2. URL, record; 3. URL, record;
www.ProgArchives.com 137 1. URL, record; 2. URL, record; 3. URL, record;
itunes.apple.com 129 1. URL, record; 2. URL, record; 3. URL, record;
www.AllMusic.com 129 1. URL, record; 2. URL, record; 3. URL, record;
www.audioculture.co.nz 121 1. URL, record; 2. URL, record; 3. URL, record;
open.Spotify.com 107 1. URL, record; 2. URL, record; 3. URL, record;
de-de.facebook.com 102 1. URL, record; 2. URL, record; 3. URL, record;

Discogs Musician

Domain Frequency Examples
www.myspace.com 4186 1. URL, record; 2. URL, record; 3. URL, record;
www.youtube.com 1415 1. URL, record; 2. URL, record; 3. URL, record;
repertoire.bmi.com 644 1. URL, record; 2. URL, record; 3. URL, record;
instagram.com 406 1. URL, record; 2. URL, record; 3. URL, record;
adp.library.ucsb.edu 380 1. URL, record; 2. URL, record; 3. URL, record;
films.discogs.com 275 1. URL, record; 2. URL, record; 3. URL, record;
web.archive.org 268 1. URL, record; 2. URL, record; 3. URL, record;
www.twitter.com 264 1. URL, record; 2. URL, record; 3. URL, record;
www.bach-cantatas.com 202 1. URL, record; 2. URL, record; 3. URL, record;
www.ascap.com 188 1. URL, record; 2. URL, record; 3. URL, record;
musicianbio.org 162 1. URL, record; 2. URL, record; 3. URL, record;
www.drummerworld.com 147 1. URL, record; 2. URL, record; 3. URL, record;
www.soundcloud.com 133 1. URL, record; 2. URL, record; 3. URL, record;
plus.google.com 130 1. URL, record; 2. URL, record; 3. URL, record;
www.famousbirthdays.com 125 1. URL, record; 2. URL, record; 3. URL, record;
www.radioswissjazz.ch 125 1. URL, record; 2. URL, record; 3. URL, record;

MusicBrainz Band

Domain Frequency Examples
itunes.apple.com 7218 1. URL, record; 2. URL, record; 3. URL, record;
isni.oclc.org 6825 1. URL, record; 2. URL, record; 3. URL, record;
www.youtube.com 6757 1. URL, record; 2. URL, record; 3. URL, record;
www.metal-archives.com 5024 1. URL, record; 2. URL, record; 3. URL, record;
www.musik-sammler.de 3299 1. URL, record; 2. URL, record; 3. URL, record;
www.bbc.co.uk 2513 1. URL, record; 2. URL, record; 3. URL, record;
muzikum.eu 2384 1. URL, record; 2. URL, record; 3. URL, record;
lyrics.wikia.com 2165 1. URL, record; 2. URL, record; 3. URL, record;
www.purevolume.com 939 1. URL, record; 2. URL, record; 3. URL, record;
musicmoz.org 935 1. URL, record; 2. URL, record; 3. URL, record;
d-nb.info 905 1. URL, record; 2. URL, record; 3. URL, record;
www.45cat.com 891 1. URL, record; 2. URL, record; 3. URL, record;
www.reverbnation.com 826 1. URL, record; 2. URL, record; 3. URL, record;
plus.google.com 694 1. URL, record; 2. URL, record; 3. URL, record;
www.45worlds.com 560 1. URL, record; 2. URL, record; 3. URL, record;
www.generasia.com 536 1. URL, record; 2. URL, record; 3. URL, record;
www.spirit-of-metal.com 533 1. URL, record; 2. URL, record; 3. URL, record;
www.amazon.com 417 1. URL, record; 2. URL, record; 3. URL, record;
www.7digital.com 332 1. URL, record; 2. URL, record; 3. URL, record;
www.progarchives.com 269 1. URL, record; 2. URL, record; 3. URL, record;
www.spirit-of-rock.com 230 1. URL, record; 2. URL, record; 3. URL, record;
www.pandora.com 227 1. URL, record; 2. URL, record; 3. URL, record;
store.cdbaby.com 220 1. URL, record; 2. URL, record; 3. URL, record;
www.killfromtheheart.com 192 1. URL, record; 2. URL, record; 3. URL, record;
uk.7digital.com 170 1. URL, record; 2. URL, record; 3. URL, record;
web.archive.org 153 1. URL, record; 2. URL, record; 3. URL, record;
www.livefans.jp 129 1. URL, record; 2. URL, record; 3. URL, record;
www.zenial.nl 123 1. URL, record; 2. URL, record; 3. URL, record;
us.7digital.com 121 1. URL, record; 2. URL, record; 3. URL, record;
www.sonymusic.co.jp 120 1. URL, record; 2. URL, record; 3. URL, record;
nla.gov.au 109 1. URL, record; 2. URL, record; 3. URL, record;
www.onkyomusic.com 108 1. URL, record; 2. URL, record; 3. URL, record;
cafe.daum.net 104 1. URL, record; 2. URL, record; 3. URL, record;
www.cdjapan.co.jp 102 1. URL, record; 2. URL, record; 3. URL, record;

MusicBrainz Musician

Domain Frequency Examples
isni.oclc.org 47182 1. URL, record; 2. URL, record; 3. URL, record;
itunes.apple.com 10035 1. URL, record; 2. URL, record; 3. URL, record;
www.youtube.com 7575 1. URL, record; 2. URL, record; 3. URL, record;
muzikum.eu 3157 1. URL, record; 2. URL, record; 3. URL, record;
www.bbc.co.uk 2924 1. URL, record; 2. URL, record; 3. URL, record;
nla.gov.au 1852 1. URL, record; 2. URL, record; 3. URL, record;
www.musik-sammler.de 1717 1. URL, record; 2. URL, record; 3. URL, record;
www.metal-archives.com 1498 1. URL, record; 2. URL, record; 3. URL, record;
lyrics.wikia.com 1141 1. URL, record; 2. URL, record; 3. URL, record;
www.generasia.com 1103 1. URL, record; 2. URL, record; 3. URL, record;
plus.google.com 1079 1. URL, record; 2. URL, record; 3. URL, record;
ibdb.com 869 1. URL, record; 2. URL, record; 3. URL, record;
www.rockabilly.nl 759 1. URL, record; 2. URL, record; 3. URL, record;
www.45cat.com 646 1. URL, record; 2. URL, record; 3. URL, record;
anison.info 558 1. URL, record; 2. URL, record; 3. URL, record;
www.amazon.com 542 1. URL, record; 2. URL, record; 3. URL, record;
www.ibdb.com 496 1. URL, record; 2. URL, record; 3. URL, record;
www.encyclopedisque.fr 454 1. URL, record; 2. URL, record; 3. URL, record;
musicmoz.org 436 1. URL, record; 2. URL, record; 3. URL, record;
www.reverbnation.com 365 1. URL, record; 2. URL, record; 3. URL, record;
soundtrackcollector.com 319 1. URL, record; 2. URL, record; 3. URL, record;
store.cdbaby.com 317 1. URL, record; 2. URL, record; 3. URL, record;
www.bach-cantatas.com 298 1. URL, record; 2. URL, record; 3. URL, record;
ocremix.org 283 1. URL, record; 2. URL, record; 3. URL, record;
www.findagrave.com 283 1. URL, record; 2. URL, record; 3. URL, record;
www.rocky-52.net 251 1. URL, record; 2. URL, record; 3. URL, record;
rcs-discography.com 221 1. URL, record; 2. URL, record; 3. URL, record;
utaitedb.net 213 1. URL, record; 2. URL, record; 3. URL, record;
www.junodownload.com 189 1. URL, record; 2. URL, record; 3. URL, record;
pomus.net 179 1. URL, record; 2. URL, record; 3. URL, record;
web.archive.org 179 1. URL, record; 2. URL, record; 3. URL, record;
www.worldcat.org 175 1. URL, record; 2. URL, record; 3. URL, record;
www.7digital.com 161 1. URL, record; 2. URL, record; 3. URL, record;
stage48.net 156 1. URL, record; 2. URL, record; 3. URL, record;
www.facebook.com 149 1. URL, record; 2. URL, record; 3. URL, record;
tower.jp 144 1. URL, record; 2. URL, record; 3. URL, record;
imusti.com 136 1. URL, record; 2. URL, record; 3. URL, record;
www.audionetwork.com 124 1. URL, record; 2. URL, record; 3. URL, record;
anidb.net 123 1. URL, record; 2. URL, record; 3. URL, record;
www.purevolume.com 120 1. URL, record; 2. URL, record; 3. URL, record;
www.onkyomusic.com 118 1. URL, record; 2. URL, record; 3. URL, record;
www.naxos.com 118 1. URL, record; 2. URL, record; 3. URL, record;
www.qim.com 116 1. URL, record; 2. URL, record; 3. URL, record;
www.directlyrics.com 115 1. URL, record; 2. URL, record; 3. URL, record;
www.todotango.com 112 1. URL, record; 2. URL, record; 3. URL, record;
play.google.com 110 1. URL, record; 2. URL, record; 3. URL, record;
www.sonymusic.co.jp 109 1. URL, record; 2. URL, record; 3. URL, record;
www.classicalarchives.com 106 1. URL, record; 2. URL, record; 3. URL, record;
www.cmt.com 104 1. URL, record; 2. URL, record; 3. URL, record;
www.musicapopular.cl 103 1. URL, record; 2. URL, record; 3. URL, record;
operabase.com 102 1. URL, record; 2. URL, record; 3. URL, record;

MusicBrainz Musical Work

Domain Frequency Examples
www.musik-sammler.de 8519 1. URL, record; 2. URL, record; 3. URL, record;
www.bbc.co.uk 5096 1. URL, record; 2. URL, record; 3. URL, record;
www.generasia.com 1705 1. URL, record; 2. URL, record; 3. URL, record;
www.metal-archives.com 870 1. URL, record; 2. URL, record; 3. URL, record;
lyrics.wikia.com 842 1. URL, record; 2. URL, record; 3. URL, record;
pitchfork.com 574 1. URL, record; 2. URL, record; 3. URL, record;
www.nme.com 414 1. URL, record; 2. URL, record; 3. URL, record;
musicmoz.org 376 1. URL, record; 2. URL, record; 3. URL, record;
www.spirit-of-metal.com 356 1. URL, record; 2. URL, record; 3. URL, record;
thesession.org 277 1. URL, record; 2. URL, record; 3. URL, record;
soundtrackcollector.com 241 1. URL, record; 2. URL, record; 3. URL, record;
exclaim.ca 176 1. URL, record; 2. URL, record; 3. URL, record;
www.avclub.com 164 1. URL, record; 2. URL, record; 3. URL, record;
www.progarchives.com 158 1. URL, record; 2. URL, record; 3. URL, record;
www.inmusicwetrust.com 155 1. URL, record; 2. URL, record; 3. URL, record;
www.angrymetalguy.com 138 1. URL, record; 2. URL, record; 3. URL, record;
stage48.net 135 1. URL, record; 2. URL, record; 3. URL, record;
web.archive.org 130 1. URL, record; 2. URL, record; 3. URL, record;
drownedinsound.com 130 1. URL, record; 2. URL, record; 3. URL, record;
www.popmatters.com 123 1. URL, record; 2. URL, record; 3. URL, record;
www.rollingstone.com 105 1. URL, record; 2. URL, record; 3. URL, record;

August 2021

We consider the community discussion around validation criteria to have reached a satisfactory level. The latest updates on automatic ranking are discussed at d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_4.

Outreach

Criterion 2: URLs

Criterion 3: biographical data

  • Give priority to Wikidata in case of more precise date values;
  • dump shared statements, to be used as references in Wikidata;
  • resolve QIDs of place strings coming from target catalogs;
  • dump Wikidata values not available in target catalogs.
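The first point can be read as a rule on date precision. The sketch below is one possible interpretation, not the soweego logic: the class and function names are hypothetical, and the decision to report a conflict when the two dates disagree at their shared precision is an assumption. The precision codes 9, 10, and 11 follow the Wikibase time datatype (year, month, day).

```python
from dataclasses import dataclass

YEAR, MONTH, DAY = 9, 10, 11  # Wikibase time precision codes


@dataclass(frozen=True)
class PartialDate:
    year: int
    month: int = 1
    day: int = 1
    precision: int = DAY


def agree_up_to(a: PartialDate, b: PartialDate, precision: int) -> bool:
    """Compare two dates, ignoring components finer than the given precision."""
    if a.year != b.year:
        return False
    if precision >= MONTH and a.month != b.month:
        return False
    if precision >= DAY and a.day != b.day:
        return False
    return True


def pick_date(wikidata: PartialDate, catalog: PartialDate) -> str:
    """Return 'wikidata', 'catalog', or 'conflict' for a pair of date values."""
    shared = min(wikidata.precision, catalog.precision)
    if not agree_up_to(wikidata, catalog, shared):
        return "conflict"
    # Wikidata wins when it is at least as precise as the catalog value.
    return "wikidata" if wikidata.precision >= catalog.precision else "catalog"


print(pick_date(PartialDate(1940, 10, 9), PartialDate(1940, precision=YEAR)))
```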

Feedback loops with data providers

We decided to make a very early first step towards this key project goal, by reigniting the discussion with catalog owners. More specifically, we:

  • ran full validation of Discogs URLs;
  • submitted rotten artist URLs to relevant team members at Discogs;
  • ran full validation of MusicBrainz URLs;
  • submitted rotten artist URLs to relevant team members at MusicBrainz.

We really look forward to enabling feedback loops with them.

September 2021

Outreach

Soweego at Wikidata data quality days

Criterion 2

The soweego bot keeps ingesting third-party identifiers. As a result, we are receiving community feedback on the project lead's talk page, mostly about problematic MusicBrainz URLs. We devoted a considerable amount of time to addressing it regularly.

Criterion 3

As an outcome of d:Wikidata:Events/Data_Quality_Days_2021 discussions, we agreed that biographical data is delicate and should be reviewed before being ingested. The d:Wikidata:Mismatch_Finder tool seems the ideal candidate: it is in active development, and we contributed a sample real-world dataset coming from MusicBrainz validation.

Feedback loops with data providers

Discussion follow-ups with target catalog owners about rotten URLs:

  • Discogs stated they currently don't have mechanisms to remove rotten URLs;
  • they may leverage the dataset we submitted to notify their users about the issue, following a crowdsourced paradigm;
  • MusicBrainz decided to start building their own URL checks, given the large number of rotten ones we submitted;
  • we pointed them to relevant pieces of the soweego code base that are in charge of such checks;
  • in the MusicBrainz database, a specific field marks URLs as ended, and we should take it into account.
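A rotten-URL check that honors such an "ended" marker could look roughly like the sketch below. It is not taken from the soweego code base: the `CatalogUrl` record, its `ended` field name, the choice to skip ended URLs, and the rule that any status of 400 or above counts as rotten are all assumptions for illustration. Only standard-library calls are used.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


class CatalogUrl(NamedTuple):
    url: str
    ended: bool = False  # hypothetical name for the catalog's "ended" marker


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    request = Request(url, method="HEAD", headers={"User-Agent": "url-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:  # the server answered with an error status
        return error.code
    except (URLError, OSError, ValueError):  # DNS failure, timeout, bad URL
        return None


def rotten_urls(
    records: Iterable[CatalogUrl],
    status_of: Callable[[str], Optional[int]] = http_status,
) -> List[str]:
    """List URLs that look dead, skipping those already marked as ended."""
    rotten = []
    for record in records:
        if record.ended:
            continue  # the catalog already knows about this one
        status = status_of(record.url)
        if status is None or status >= 400:
            rotten.append(record.url)
    return rotten
```

Passing the status function as a parameter keeps the check testable without network access, for example `rotten_urls(records, status_of=fake_statuses.get)` with a plain dictionary of canned responses.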

Technical

  • Replaced pip with Conda for dependency management;
  • bumped Python and all dependencies to their latest versions;
  • handled pywikibot timeouts caused by high lags of Wikidata Query Service servers;
  • backed up the soweego VPS;
  • deleted the Debian Stretch instance, to be deprecated soon;
  • spawned a fresh Debian Bullseye one.
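The report does not say how the timeouts were handled. One common pattern for lag-induced timeouts is retrying with exponential backoff, sketched below in generic form. Nothing here is pywikibot API: `with_retries` is a hypothetical helper, and `TimeoutError` stands in for whatever exception the real client raises.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> T:
    """Run call(), retrying on timeouts with exponentially growing pauses."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A caller would wrap the flaky operation in a zero-argument function, for example `with_retries(lambda: run_query(sparql))`, where `run_query` is a placeholder for the actual request.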

October 2021

Outreach

Criterion 2

Feedback loops with data providers

Updates from target catalog owners about rotten URLs:

  • the database owner at Discogs stated that he cannot directly remove the URLs in the dataset we provided;
  • he needs to schedule development time to implement the removal automation;
  • discussion with Discogs users is in progress.

Technical

  • Removed pip requirements;
  • bumped all versions of project dependencies;
  • replaced Travis with pre-commit for continuous integration;
  • stopped failing builds after pre-commit autofixes;
  • refined the script for low-level claims deletion.

Extension request

Request #1

New start date

July 5, 2021

New end date

July 4, 2022

Rationale

The main grantee is currently involved in a third-party research project on a full-time basis. As a result, we would like to shift the actual start date of this project, to ensure that the whole team is fully engaged.

Approval

Noting here that the grant extension request to July 4, 2022 was approved by program officer, Mjohnson (WMF) in January 2021. -- JTud (WMF), Grants Administrator (talk) 22:26, 13 September 2021 (UTC)

Request #2

New end date

November 4, 2022

Rationale

Starting January 2022, the main grantee will be involved in a research project with a 40% commitment (where 100% stands for full time). As a result, we computed the additional time needed to ensure the planned commitment for soweego 2.

Side note: the new project has tight connections to Wikidata and will be carried out together with fellow Wikimedian Daniel Mietchen, so we foresee some overlap with soweego 2, and really look forward to mutual benefits.

Approval

Noting here that the grant extension request to November 4, 2022 has been approved. The new midpoint report is due by January 30, 2022 and new final report due date is December 4, 2022. -- JTud (WMF), Grants Administrator (talk) 22:26, 13 September 2021 (UTC)

Request #3

New end date

November 19, 2021

Rationale

Starting January 2022, the main grantee will join WMF as a full-time staff member. This request supersedes the previous one.

Acknowledgement

Thank you the update, User:Hjfocs, and congratulations on your new role at the Foundation! I confirm that your project grant is officially 'closed' in our records. -- All the best, JTud (WMF), Grants Administrator (talk) 19:31, 10 December 2021 (UTC)