Grants talk:Project/DBpedia/GlobalFactSyncRE


For direct feedback or questions, please contact us at gfs@infai.org.

Archived Discussions

To keep the talk page clearly laid out, we will archive older discussions here:

https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE/Archive_1

Interfacing with Wikidata's data quality issues in certain areas

@SebastianHellmann: Sorry if this is the wrong time or place to bring this up (possibly I'm slightly too late). However, I think it's appropriate to mention this somewhere, since it does affect a lot of Wikidata items as well as the work to be done as a result of the grant.

  • Many book, song and album statements on Wikidata are non-conforming – they do not match the currently preferred item models (see d:WD:BOOKS and d:WD:MUSIC), which are based on the FRBR model. For example, data that only applies to an edition of a book (e.g. number of pages, ISBN) may be misplaced and used instead on the item about the written work itself, particularly if an item for the specific edition does not exist yet.
  • In addition, items are inconsistently structured because only some items have been edited to match the models, and infoboxes from Wikipedia articles lump together data about entities which are treated as different in other databases (e.g. singles, compositions and recordings). Essentially, most of those hundreds of thousands of items represent the amalgamations shown in the infoboxes rather than the actual entities.
  • Other items do not have these issues, either because they do not represent creative works or because different approaches have been taken in modelling those works (e.g. video games published on multiple operating systems are usually modelled using one item). However, there may be similar issues, such as some Wikipedias having an article for "radiology" and others having an article for "radiologist".
  • The current item structure may be detrimental to the work done as part of the grant, because the items are not consistent and e.g. certain Wikipedia articles for songs describe multiple recordings (and therefore have multiple infoboxes), so it would probably be in everyone's interest to clean up these items. If the references are added too early or are added incorrectly, it will take a little more effort to properly fix all of the statements.
  • Furthermore, data for these sorts of creative works (aside from review scores and genres, which might not even be noted in infoboxes to begin with) does rely heavily on primary sources such as streaming services, publishers and chart archives, or can be completely unverifiable because it's based on the article author's analysis of the work (even statements like "this song is written in English" can be absurdly difficult to actually verify). It's plausible that an import of references might not be sufficiently beneficial; in addition, because much of the data was imported from the English Wikipedia, which discourages the use of references in infoboxes, there might not be much relevant data to begin with. Importing that data directly from the primary sources might constitute copyright infringement or infringement of database rights.
  • Fixing items for both books and musical works would involve creating new items, transferring statements to the new items, and creating links between the items (see the sketch after this list). Fixing items for musical works would additionally involve fixing links from other items and adding other items, as well as importing other statements (e.g. data from Wikipedia, MusicBrainz and other databases, mainly for song covers and soundtrack albums). MusicBrainz (which is CC0) could be used as a source for more structured musical work data, although it is by no means complete or infallible and may rely on Wikipedia's data in some cases.
  • It should be possible to automate much of this repair process (especially since most of the relevant statements were added through imports to begin with), but it's likely beyond my ability even if it's possible.
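A minimal sketch of the statement-transfer step described above, assuming a plain dictionary representation of items (the item IDs are placeholders; the property IDs are the real Wikidata ones: number of pages (P1104), ISBN-13 (P212), edition or translation of (P629), has edition or translation (P747)):

```python
# Sketch: split edition-level statements off a conflated "work" item.
# Item IDs are placeholders; property IDs are the Wikidata ones named above.
EDITION_PROPS = {"P1104", "P212"}  # number of pages, ISBN-13

work = {
    "id": "Q_work",  # placeholder for the conflated written-work item
    "claims": {"P50": "Q_author", "P1104": 320, "P212": "978-0-00-000000-0"},
}
edition = {"id": "Q_new_edition", "claims": {}}

# Move edition-level claims from the work to the new edition item.
for prop in list(work["claims"]):
    if prop in EDITION_PROPS:
        edition["claims"][prop] = work["claims"].pop(prop)

# Link the two items following the FRBR-style book model.
edition["claims"]["P629"] = work["id"]   # edition or translation of
work["claims"]["P747"] = edition["id"]   # has edition or translation

print(work)
print(edition)
```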

I realize that much of this is likely to be outside the scope of the grant, but it is a problem that will probably need to be addressed at some point, possibly by fixing the respective items beforehand or by leaving those items to be improved separately. Jc86035 (talk) 10:16, 14 April 2019 (UTC)

@Jc86035: Hi, you are not too late; this is the perfect time, as we are starting preparations for the project. In order to track progress, we are looking for 10 sync targets, and your suggestions sound quite good. We should narrow them down a bit to make them more measurable, maybe by focusing on single classes or properties, or on a set of infoboxes across languages. The idea is to capture a decent baseline at the beginning and then measure improvements on concrete problems.
Regarding the scope: in principle, what you are writing is in scope for the project. However, some of the problems, e.g. the granularity mismatch mentioned in several of your points, are probably too hard to solve completely. Expectation-wise, the project should deliver some clear improvements, i.e. help to fix and sync data where the granularity of IDs is the same, and maybe categorize/manage the entities that are vague between sources (this is basically what you suggested at the end).
So, yes, we can have a look at d:WD:BOOKS and d:WD:MUSIC as two of the sync targets. These are much harder than other targets like monuments or persons due to their difficult individuation. Here, the goal would be to sync/disambiguate entities in addition to syncing references and properties.
I will go over your comments one more time to distill action points from them. Do you have input on where, very concretely, to start cleaning? SebastianHellmann (talk) 05:54, 29 April 2019 (UTC)
@SebastianHellmann: I don't know where you would start. However, I'm most familiar with the situation for singles, as I haven't done a lot of work around books.
Probably the easiest singles to fix would be items like Boys Don't Cry (Q3020026), with a fairly limited set of statements and a MusicBrainz release group ID (although an item with a release group ID and nothing else might be easier). The release group indicates that you would need to import at least four and at most nine new items to model the releases (3), recordings (4) and compositions (2) (you could technically omit the recording and composition items for the B-side, but this would leave the release item incomplete; the other five come from adding all three releases instead of just one). The original item would become the composition, or the sitelinks would be moved to the appropriate composition item. You would also have to fix all of the statements which link to the original item; for example, all tracklist (P658) statements on albums would have to be fixed, and all of the chronologies would need to be repaired (I'm not sure if the approach at d:Q55977453#P179 is the best one but I've used that). Unfortunately, as I've noted previously, MusicBrainz is not perfect and errors and duplications (particularly for recordings) are not uncommon. (MusicBrainz might also have some database rights issues, but I digress.)
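As a minimal sketch, the release and recording structure mentioned above can be read from the MusicBrainz web service; the release-group MBID below is a placeholder for the one linked from Q3020026:

```python
import json
import urllib.request

# Placeholder MBID; substitute the release group ID linked from Q3020026.
MBID = "00000000-0000-0000-0000-000000000000"
URL = (f"https://musicbrainz.org/ws/2/release-group/{MBID}"
       "?inc=releases&fmt=json")

# MusicBrainz asks API clients to identify themselves via User-Agent.
request = urllib.request.Request(
    URL, headers={"User-Agent": "gfs-sketch/0.1 (gfs@infai.org)"})
with urllib.request.urlopen(request) as response:
    release_group = json.load(response)

# Each release listed here would need its own Wikidata item under the
# releases/recordings/compositions model discussed above.
print(release_group["title"])
for release in release_group.get("releases", []):
    print("-", release.get("title"), release.get("date", "?"))
```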
Furthermore, if the original item has a YouTube link (or similar), you would have to create an item for the music video and fix all of the respective statements; you might have to create items for songwriters, music video directors, other personnel, albums, album editions (to repair tracklist (P658) statements properly) and discographies without Wikidata items; and you might have to import and cross-check data from other databases (e.g. importing and cross-checking ISRCs from Spotify/IFPI/MusicBrainz). (In further imports and in repairs to existing data, you might also have to create items for time signatures, chord progressions and chords. And for other items, singles might have been re-released; singles with multiple performers could have been covered, resulting in a difficult-to-parse performer (P175)*; singles might not be linked to MusicBrainz; performer (P175) might be used to indicate personnel instead of main/featured artists; charted in (P2291) statements might be broken and it might not be clear whether they're for the single or the main recording; etc.)
Considering the complexity, and that doing this manually takes about half an hour per single, it really depends on whether you think it is achievable (I suspect human input may be needed in some cases). I think it would be easier for books and albums, since usually there are only editions/releases and works/albums to deal with, and infoboxes may still amalgamate information from different editions/releases (e.g. number of pages, ISBN). However, for books you would need to deal with translations and Wikisource links, and for albums you would need to deal with track lists, genres, awards, personnel and re-releases (dealing with track lists could also necessitate repairing singles, or at least creating separate items for the tracks and fixing related statements).
If you think these are good objectives to pursue (particularly in the case of written works with multiple editions), I think it would be beneficial to ask other Wikidata contributors who manually fix data in these areas for their advice. Jc86035 (talk) 08:56, 29 April 2019 (UTC)
* Regarding performer (P175) specifically, I think it would be useful at some point to create a new property to unambiguously indicate main and featured artists, instead of using a generic and confusing property. Unfortunately, when I tried to propose a property for this purpose, the discussion was (possibly accidentally) bludgeoned/derailed. I think it would still be beneficial to propose another property for this purpose, but I haven't done that yet. Jc86035 (talk) 08:56, 29 April 2019 (UTC)
@Jc86035: Boys Don't Cry (Q3020026) seems like a good starting point. There is a ton of conflicting information, e.g. the publication date is either 15/3 or 16/3. There are also Moulin Rouge and Wink listed as different performers. The duration is 8:14 (ja) and 8:18 (fr), and missing in Wikidata. We can already detect this partially with the prototype: https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F2nrbo&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FreleaseDate&src=general and https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F2nrbo&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Fruntime&src=general but this can be improved a lot. SebastianHellmann (talk) 13:22, 29 April 2019 (UTC)
@SebastianHellmann: Oh, I didn't even notice that the Wikipedia article had an infobox for only the cover and not the original. That complicates things. A better example for a simple single item might be Deeper Shade of Blue (Q5250598) or Love Will Find a Way (Q6691501). (Most singles in Wikidata are not connected to MusicBrainz.)
For the publication date, I would expect internationally released singles to have multiple release dates depending on region up until the early 2010s. MusicBrainz's interface does usually attach a region to each release date (XW is used to indicate worldwide releases). However, for more recent releases, it could be appropriate to distinguish between "released worldwide at midnight in each country" and "released worldwide simultaneously" (I don't think this is done yet).
For the different performers, it would be necessary to create a different set of singles, releases, recordings and compositions for the original Moulin Rouge single (which doesn't have a MusicBrainz release group). (Otherwise, it would be difficult for the hypothetical data consumer to distinguish between a single released by both Wink and Moulin Rouge and two singles released by each group on separate occasions; and it would be difficult for Wikidata editors to add information pertaining only to one of the singles.) In this case, the song was also translated when performed by Wink, so two different composition items would be needed.
Duration is a more complicated issue. Is the silence at the end of the song part of the track? Different releases (e.g. CDs vs. vinyl, iTunes vs. Spotify, FLAC vs. AAC) might even have different listed track lengths. I believe MusicBrainz deals with this by arbitrarily choosing one of them. Jc86035 (talk) 13:43, 29 April 2019 (UTC)
For "Boys Don't Cry"'s release date specifically, the discrepancy seems to be because the Japanese Wikipedia article (15 March) was edited to match the English and French Wikipedia articles (16 March), and Wikidata was not updated at the same time. In the Japanese Wikipedia article, "15 March" was introduced in 2012 and replaced in 2017 by two different unregistered users. I've updated Wikidata so that all five sources are in agreement. Jc86035 (talk) 13:57, 29 April 2019 (UTC)
See also d:Topic:Us7pj53kkf6vs36g#flow-post-us7s7qq2wmh9434g and d:Topic:Uxsiyqueknvmvfsm. An issue that will come up is that some properties do not (yet) have exact inverses, resulting in the use of generic properties to add inverse statements (e.g. published in (P1433)/part of (P361) to tracklist (P658)) or overly complicated qualifier usage (e.g. d:Q1886329#P175). I don't know how well DBpedia can represent this data (particularly statement is subject of (P805) and of (P642) qualifiers). A software change or the creation of new properties may improve the situation. Jc86035 (talk) 14:12, 29 April 2019 (UTC)
""I've updated Wikidata so that all five sources are in agreement." Thanks for this. So, the prototype already caused one manual edit ;) now, we only need a faster way to (1) discover and fix, (2) a better way to track edits caused by GlobalfactSync (as we are fans of hard evaluation, but it is difficult) and (3) a million actual edits or so.
Regarding inverses, this is a technical triviality: the P's are given and can then also be shown inversely. The P's should be marked as inverse, functional, or inverse-functional. I guess this is not so much our concern as it is an implementation detail for Wikimedia staff, and maybe a matter of consensus in the community. The DBpedia ontology models the way infoboxes are structured and then adds inverse and other information on top. The actual problem relevant here is the differing entity granularity between Wikipedia, Wikidata and MusicBrainz. We can address this issue; I think it is a good one and quite helpful.
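A minimal sketch of what "shown inversely" could mean in practice, using plain triples and the tracklist (P658) / part of (P361) pair from the discussion above (the item IDs are placeholders):

```python
# Map forward properties to their inverses, e.g. a tracklist (P658)
# statement on an album implies part of (P361) on the track.
INVERSE_OF = {"P658": "P361"}

forward = [
    ("Q_album", "P658", "Q_track1"),  # placeholder item IDs
    ("Q_album", "P658", "Q_track2"),
]

# Materialize the inverse statements instead of maintaining them by hand.
inverse = [(o, INVERSE_OF[p], s) for (s, p, o) in forward if p in INVERSE_OF]
print(inverse)  # [('Q_track1', 'P361', 'Q_album'), ('Q_track2', 'P361', 'Q_album')]
```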
We recently wrote this paper: https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf . So we are technically able to merge MusicBrainz data into GlobalFactSync. We will do so, so you can also sync facts with MusicBrainz more easily. We could also merge your Spotify ISRCs or further data you would like to compare. Personally, I find the albums/singles/bands/artists domain much simpler than FRBR and books. While still a magnitude more complex than monuments or persons, it seems a magnitude less complex than books/documents. SebastianHellmann (talk) 17:30, 1 May 2019 (UTC)
Hi SebastianHellmann, if you are interested in generating Wikidata edits and tracking them, I would suggest implementing a tool similar to HarvestTemplates which would extract references (as mentioned during the application phase). This would generate a lot of interest from the Wikidata community. Like HarvestTemplates, you could set it up to work with EditGroups, so that your edits are tracked: https://tools.wmflabs.org/editgroups/?tool=harvesttemplates . If you focus on merging data outside the Wikimedia ecosystem, as you seem to have been doing so far, the risk is that the Wikimedia community does not feel the project benefited them in a significant way. − Pintoch (talk) 14:25, 6 August 2019 (UTC)
Hi Pintoch, thank you for your feedback! During our last team telco we were actually just discussing the topic of implementing a tool such as HarvestTemplates. Once we have a concrete strategy/a prototype/something to discuss, we will ping you to get your thoughts on it, if you don't mind. Tina Schmeissner (talk) 16:28, 6 August 2019 (UTC)
@Tina Schmeissner: exciting! Yes, do let me know if I can help in any way. − Pintoch (talk) 11:48, 7 August 2019 (UTC)

So what's the plan for population numbers?

So what's the plan for population numbers? Jura1 (talk) 16:33, 25 August 2019 (UTC)

@Jura1: Thanks to your previous feedback, we checked, and progress has been made on developing a classification of how difficult certain properties are, which helps us a lot with planning, i.e. we probably need different approaches for different complexities; see the preliminary study at SyncTargets#problem:_four_layers_of_complexity. An easy property, for example, is the height of NBA basketball players: 1. you can distinguish the players by name in 99% of cases; 2. they have exactly one value for "height", which does not change over time; 3. most of the values have been copied from two websites (the main sources); 4. we can safely assume that the value is between 1.70 m and 2.30 m most of the time (so we can more easily tell whether a value is in inches or meters; see the sketch below). Population count is more difficult: 1. we often don't know the exact reference area for the count (inner/outer city, +/- surrounding areas), which introduces some vagueness, i.e. different population counts could both be correct; 2. the property has a time scope, so each value is only adequate for the point in time it was surveyed; the definition is also vague, i.e. people living in the area (incl. refugees) vs. registered citizens; 3. no clue yet; 4. we know it should be a positive number, but the magnitude is guesswork.
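A minimal sketch of the range-based plausibility check described above (the thresholds are illustrative, not taken from the project):

```python
def normalize_height(value: float) -> float | None:
    """Guess the unit of a raw basketball-player height value and
    return it in meters; the plausibility ranges are illustrative."""
    if 1.40 <= value <= 2.40:     # already plausible as meters
        return value
    if 55.0 <= value <= 95.0:     # plausible as inches
        return round(value * 0.0254, 2)
    if 140.0 <= value <= 240.0:   # plausible as centimeters
        return value / 100.0
    return None                   # implausible -> flag for manual review

print(normalize_height(2.06), normalize_height(81), normalize_height(206))
# 2.06 2.06 2.06
```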
Here is how we will handle it:
* It is a more challenging property, but there are harder ones, so we will accept it as a sync target. There is a risk that we will not perform so well on it. We might at least gain some insight into whether the interlanguage wikilinks point to the same areas in other languages.
* Success here greatly depends on three things:
1. We cannot evaluate it if we are unfocused. We need to do it for a fixed set of articles; this could be the 206 sovereign states and/or the cities of one country. Otherwise, we cannot measure completeness.
2. Whatever we choose, I think the only way to get some clarity here is to include one or several outside authoritative sources with a time scope. Cities and countries are also easy to link by name alone (few duplicates among notable cities within one country). We have a good start here, as DBpedia is the best-linked database in the world, with links taken from Wikipedia/Wikidata and community contributions (algorithms and curated links), plus we can now extract references and cite templates. But if we have a specific, good source, e.g. an official, reliable, open-data country census, even better.
3. We need help with selecting the target for syncing. Wikipedia and Wikidata work differently, with different people and processes, so we need to pick one and try to optimise usefulness for it. You also mentioned that some data is kept on Commons; that is also fine, as long as we pick one. Preferably one where people struggle, are aware of the problems, and can describe them clearly. One goal, for example, could be to show updates whenever new census data becomes available.
Can you help us pick good options for these three points? If we succeed for a small target, I am sure we can generalise it across the WikiVerse. SebastianHellmann (talk) 21:07, 25 August 2019 (UTC)
I had a look at your presentation and liked the sample with elevators better. If something is fairly stable but has different references for different items, it is typically hard to gather otherwise, and an import from various Wikipedias is worthwhile. Coverage for height might be bad in Wikidata, as I don't think HarvestTemplates handles units. Obviously, if there are just one or two source websites, a direct import might be a more effective approach.
As mentioned earlier, I didn't think population numbers were a particularly good use case, but it seems that you have moved away from it. Jura1 (talk) 12:29, 24 September 2019 (UTC)
@Jura1: Currently, we are still studying the properties of properties. So we haven't moved away from population count; rather, we are studying how difficult it is in comparison to other properties. User:Lewoniewski and User:KrzysztofWecel found e.g. https://bdl.stat.gov.pl/api/v1/data/by-unit/023016264011?format=json&var-id=72305 which returns all census data for Poznan and comes from the "official" source. So in this case it is possible to get up-to-date, complete and authoritative data for all Polish cities; it seems tempting to explore that. We will try to innovate here so that integrating the latest value into Wikipedia becomes easier. Other than that, we are looking at HarvestTemplates, but currently we are analysing the frequency of all reference sources: http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.09.01/stats/
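A minimal sketch of reading that endpoint; the JSON layout (a results list with yearly values) is an assumption based on the BDL API's v1 format:

```python
import json
import urllib.request

# Census (population) data for Poznan from the official Polish statistics
# API (GUS BDL), as linked above. The exact JSON layout is an assumption.
URL = ("https://bdl.stat.gov.pl/api/v1/data/by-unit/023016264011"
       "?format=json&var-id=72305")

with urllib.request.urlopen(URL) as response:
    data = json.load(response)

# Print one population value per year, newest first.
for result in data.get("results", []):
    for entry in sorted(result.get("values", []),
                        key=lambda v: v["year"], reverse=True):
        print(entry.get("year"), entry.get("val"))
```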
Going from this, we already found https://ssd.jpl.nasa.gov/sbdb_query.cgi as a public-domain database and the top reference for planets, cited almost a million times across many Wikipedias. We also discovered that IMDb has very good JSON-LD for each movie and actor; see the code example in the study.
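A minimal sketch of reading such embedded JSON-LD: the convention is a <script type="application/ld+json"> tag in the page HTML; the title URL is illustrative, and the regex-based extraction is a simplification rather than a robust parser:

```python
import json
import re
import urllib.request

# Illustrative title URL; any page that embeds JSON-LD works the same way.
URL = "https://www.imdb.com/title/tt0133093/"

request = urllib.request.Request(URL, headers={"User-Agent": "gfs-sketch/0.1"})
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8")

# Pull the first embedded JSON-LD block out of the HTML.
match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
if match:
    movie = json.loads(match.group(1))
    print(movie.get("name"), movie.get("datePublished"))
```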

In our opinion, if we get good info on 30 out of the 50 top sources, it might already cover a huge part, especially if the properties are easy. It would help cover the basic ground, like all asteroids/planets, all movies, or all Polish cities. It is not normative, but it would help to keep things updated with less effort. There could still be conflicting values worthy of discussion, or of having an alternative value. SebastianHellmann (talk) 06:06, 24 October 2019 (UTC)

@SebastianHellmann: How will the sources be selected? As you may know, IMDb is not trusted as a source by many Wikipedias because it mainly comprises user-generated content; I suspect it may only seem "accurate" because a lot of the data has been copied directly from Wikipedia (thus, errors in Wikipedia could be replicated on IMDb). Census data obviously doesn't have this issue, but it may be necessary to ask Wikipedia editors to judge the data sources (even if only to prevent the data from being filled with more references that Wikipedias have to filter out to meet their own guidelines). w:en:WP:RS and w:en:WP:RSP may be helpful for this. Jc86035 (talk) 11:46, 4 November 2019 (UTC)
@Jc86035: Adding sources to GFS is not harmful per se. In general, adding a source means there is an additional value at https://global.dbpedia.org/id/5BMjH with clear provenance. So as a first step, we all get better insight into where data comes from and can compare. We selected the sources non-systematically, but as guidance we did a quick domain count of all references, see here: http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.09.01/stats/ But you are totally right, there are many circular references, e.g. citypopulation.de is used in 71074 references, but they also say that they use WP/WD as a source... If anything, we will make this more overt, so we can identify circular copies and also tackle whole properties if the source is good.
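A minimal sketch of such a domain count; the input list is illustrative, while the real analysis runs over all references extracted from cite templates:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative input: reference URLs as extracted from infobox cite templates.
reference_urls = [
    "https://www.citypopulation.de/en/poland/poznan/",
    "https://ssd.jpl.nasa.gov/sbdb_query.cgi?obj=433",
    "https://www.imdb.com/title/tt0133093/",
    "https://www.citypopulation.de/en/poland/warszawa/",
]

# Count how often each domain is cited, most frequent first.
domains = Counter(urlparse(url).netloc for url in reference_urls)
for domain, count in domains.most_common():
    print(f"{count:>6}  {domain}")
```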
NB: we also added some MusicBrainz properties and data (for you), as well as data from the German National Library and some other sources. It will be online in about a week, and then we can analyse it and also add more sources. SebastianHellmann (talk) 13:38, 4 November 2019 (UTC)
@SebastianHellmann: Okay, that makes sense. Thanks for clarifying how the IMDb data will be used; in asking the question, I was assuming that you would be adding some of the sources to the Wikidata statements. Jc86035 (talk) 14:07, 4 November 2019 (UTC)