Grants:Project/DBpedia/GlobalFactSyncRE/SyncTargets

problem: four layers of complexity

identity check / check for ambiguity -- Are we talking about the same entity? There's some overall ambiguity, e.g., Hamburg = city in Germany, but also 24 cities/places with the name Hamburg located in the United States, 3 other places elsewhere in the world, 3 entries of Hamburg linked to animals and plants, 7 entries for ships, 1 for the Hamburg port, BUT they all have separate wiki entries.
fixed vs. varying property -- Some properties vary depending on nationality (e.g., release dates for different countries), point in time (e.g., population count)
reference -- Depending on the entity's identity check and the property's fixed or varying state the reference might vary. Also, for some targets no query-able online reference might be available.
normalization / conversion of values -- Depending on language/nationality of the article some properties have varying units (e.g., currency, metric vs imperial system, ...).

Keep in mind:

Properties (targets) can be divided into object properties (e.g., employer, NBA player birthplace, name of company CEO) and data properties (population count, NBA player height, release year for album, single, or game)
Data properties belong to one of the XML Schema datatypes, and depending on whether, e.g., a double-precision or a single-precision float type is chosen, the last digits of a number might differ
Data properties that were measured or calculated depend by default on the methodology used
Different references might use different methodologies (e.g., data acquisition for the US census changes with the years)
Dates and times depend on the location on the globe and its respective time zone (e.g., Tokyo is 16 to 17 hours ahead of Los Angeles, depending on daylight saving time). In the worst-case scenario the date can be off by 1 day if the time zone or location is not specified. Best practice: use UTC, which provides some means of normalization.
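
The two pitfalls above (data type precision and time zones) can be illustrated with a minimal Python sketch. The helper name `as_float32`, the sample height, and the sample timestamp are assumptions chosen for illustration only; the fixed UTC offsets ignore daylight saving time.

```python
import struct
from datetime import datetime, timedelta, timezone


def as_float32(value: float) -> float:
    """Round-trip a 64-bit Python float through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# The same height stored as a double vs a single-precision float
# differs in the last digits.
height_m = 2.06
height_as_float = as_float32(height_m)
print(height_m, height_as_float)

# One instant, written down in two local time zones (fixed offsets,
# standard time assumed) and normalized to UTC.
TOKYO = timezone(timedelta(hours=9))
LOS_ANGELES = timezone(timedelta(hours=-8))
instant = datetime(2019, 1, 1, 2, 0, tzinfo=TOKYO)

print(instant.date())                          # 2019-01-01
print(instant.astimezone(LOS_ANGELES).date())  # 2018-12-31
print(instant.astimezone(timezone.utc))        # 2018-12-31 17:00:00+00:00
```

Without the time zone, the two local dates would look like a one-day discrepancy although they describe the same instant.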

Interest from the Wikiverse:

Wikidata users are concerned with 1 property across all articles
Wikipedia users are concerned with all properties of 1 article

Solution: start with easy sync targets

NBA Players

ambiguity
- clearly defined group of people
- Currently no active players with same name
- BUT there used to be e.g. 4 Charles Smiths (each has an individual English Wikipedia page, though)

property variability:
- team and number can change due to trades and free agency
- career highlights and awards can change and accumulate over time
- stats change after every game and are thus only provided for retired players
- trade deadline in February; the new league year and free agency start in July
- after each NBA draft (around June) there will be new players

reference:
- https://www.nba.com/players/ - only active players, though
  - using google structured data tool: no data available/detected
- https://www.basketball-reference.com/ - Eng. wiki uses it for stats
  - e.g., https://www.basketball-reference.com/players/m/millspa02.html
  - using google structured data tool: https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fwww.basketball-reference.com%2Fplayers%2Fm%2Fmillspa02.html
- https://en.hispanosnba.com/
  - according to google structured data tool you can extract data
  - you can see all active players

normalization issues:
- height and weight - metric vs imperial system
- choice of xml data type can have an impact on exact height or weight (dependency on reference and its chosen data type)
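
A minimal sketch of how imperial and metric values could be normalized before comparison. The function names and the tolerance are assumptions, not part of any existing tool; the sample inputs (220 lbs, 6 ft 9 in) are the figures quoted for the example player further down in this section.

```python
LB_TO_KG = 0.45359237  # exact by definition
INCH_TO_CM = 2.54      # exact by definition


def pounds_to_kg(pounds: float) -> float:
    """Convert pounds to kilograms."""
    return pounds * LB_TO_KG


def feet_inches_to_cm(feet: int, inches: float) -> float:
    """Convert a height given as feet and inches to centimetres."""
    return (feet * 12 + inches) * INCH_TO_CM


def same_measurement(a: float, b: float, tolerance: float) -> bool:
    """Treat two values as equal if they differ by at most `tolerance`."""
    return abs(a - b) <= tolerance


weight_kg = pounds_to_kg(220)        # about 99.79 kg
height_cm = feet_inches_to_cm(6, 9)  # about 205.74 cm

print(same_measurement(weight_kg, 100, tolerance=0.5))  # True
print(same_measurement(weight_kg, 98, tolerance=0.5))   # False
```

The tolerance has to be chosen per property: a source that rounds to whole kilograms should not be flagged as deviating from a source that keeps one decimal place.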

to check for completeness:
- all active players (30 teams - 450)
- all players of the eastern or western conference (225)
- all players in one division (Atlantic, Central, Southeast, Northwest, Pacific, Southwest - each division has 5 teams - 75)
- all players of one team (15)
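
The expected counts in the list above follow from 30 teams with 15 roster spots each. A small sketch, with the hypothetical helper `coverage`, of how the share of synced players per scope could be computed:

```python
TEAMS = 30
ROSTER_SIZE = 15  # roster spots per team, as assumed in the list above
CONFERENCES = 2
DIVISIONS = 6

# Expected number of players per scope.
expected = {
    "league": TEAMS * ROSTER_SIZE,
    "conference": TEAMS // CONFERENCES * ROSTER_SIZE,
    "division": TEAMS // DIVISIONS * ROSTER_SIZE,
    "team": ROSTER_SIZE,
}


def coverage(found: int, scope: str) -> float:
    """Share of the expected players in `scope` that were actually found."""
    return found / expected[scope]


print(expected)
print(coverage(60, "division"))  # 0.8
```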

wiki-project https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association

GOOD EXAMPLE: Al-Farouq Aminu - discrepancies in height, weight, place of birth
- NBA.com: 6ft9, 2.06m / 220lbs, 99.8kg, no info on place of birth
- basketball-reference.com: 6-9, 206cm / 220lbs, 99kg, Atlanta, Georgia
- birthplace - en, fr, WD: Atlanta, Georgia
- birthplace de, es: Stone Mountain, Georgia

language	weight	in FCF?
en	220lbs / 100kg	yes
fr	220lbs / 100kg	no
es	216lbs / 98kg	no
it	100kg	no
pt	220lbs / 98kg	yes
pl	98kg	no
WD	100kg	yes
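
The weights in the table above can be checked mechanically. The following sketch is only an illustration of such a check, not the actual fusion logic: it finds the majority value in kg, the sources deviating from it, and the sources whose own lbs and kg values contradict each other. The 1 kg threshold is an assumption.

```python
from collections import Counter

LB_TO_KG = 0.45359237

# (pounds, kilograms) per source, copied from the table above.
# None means the source does not state a value in pounds.
weights = {
    "en": (220, 100), "fr": (220, 100), "es": (216, 98), "it": (None, 100),
    "pt": (220, 98), "pl": (None, 98), "WD": (None, 100),
}

# Most frequent kg value and the sources that disagree with it.
majority_kg, _ = Counter(kg for _, kg in weights.values()).most_common(1)[0]
deviating = sorted(src for src, (_, kg) in weights.items() if kg != majority_kg)

# Sources whose lbs and kg values differ by more than 1 kg after conversion.
inconsistent = sorted(
    src for src, (lb, kg) in weights.items()
    if lb is not None and abs(lb * LB_TO_KG - kg) > 1.0
)

print(majority_kg)   # 100
print(deviating)     # ['es', 'pl', 'pt']
print(inconsistent)  # ['pt']
```

Note that es is internally consistent (216 lbs is about 98 kg) while pt combines the lbs value of one group with the kg value of the other.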

QUESTIONS: @Mrvnhfr: @JohannesFre:
- Why is there no entry in the FCF for fr, es, pl, it?
- Where are the entries for lbs?
- Why are there two different predicates for weight?

Video Games

E.g., DOOM3

ambiguity in the broader sense (DOOM (the franchise) vs DOOM (the 1993 video game) vs DOOM (the 2016 video game) vs DOOM (the 2005 movie)) - in Wikipedia/Wikidata entities can be singled out clearly, but needs to be checked for other sources

property variability:
- different publishers for different regions and platforms
- different composers for different platforms
- different release dates for different platforms (Microsoft Windows, Linux, Mac OS X, Xbox) and different regions (NA vs EU)

reference: multitude of different sources
- https://www.gameinformer.com/ - but no relevant info extractable via GSDT
- https://www.gamereactor.de/ - but no relevant info extractable via GSDT

normalization: no issues

note: the infobox refers to the respective game, but oftentimes the article also includes the whole game series and its impact, often in the abstract already

Films

ambiguity:
- potential issues: remakes

property variability:
- release date for individual countries
- revenue depends on date
- different dubbing actors for releases in different countries

reference:
- IMDb
- https://www.boxofficemojo.com/ has revenue, but according to GSDT no data available

normalization:
- potential for mixup of currency for budget and revenue
- Choice of xml data type can have an impact on exact budget and revenue (dependency on reference and its chosen data type) - e.g., $463.4 million vs $ 463.406.268
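
A sketch of how a rounded and an exact revenue figure could be compared. All function names are hypothetical, and the sketch assumes that both values are in the same currency and that "." in "463.406.268" is a thousands separator.

```python
from decimal import Decimal


def parse_rounded(text: str) -> Decimal:
    """Parse a value such as '$463.4 million' into dollars."""
    number, unit = text.replace("$", "").split()
    factor = {"million": Decimal(10) ** 6, "billion": Decimal(10) ** 9}[unit]
    return Decimal(number) * factor


def parse_grouped(text: str) -> Decimal:
    """Parse a value such as '$ 463.406.268', where '.' groups thousands."""
    return Decimal(text.replace("$", "").replace(".", "").strip())


def agree_at_precision(rounded: Decimal, exact: Decimal, step: Decimal) -> bool:
    """True if `exact`, rounded to multiples of `step`, equals `rounded`."""
    return (exact / step).quantize(Decimal(1)) * step == rounded


rounded = parse_rounded("$463.4 million")
exact = parse_grouped("$ 463.406.268")

# '463.4 million' carries a precision of 0.1 million = 100,000 dollars.
print(agree_at_precision(rounded, exact, Decimal(100_000)))  # True
```

Using `Decimal` instead of binary floats avoids introducing a second, data-type-related discrepancy during the comparison itself.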

notes: infoboxes can be extensive, including cast, dubbing actors, individual release dates depending on country/language

IMDb

IMDb publishes very good data as JSON-LD embedded in the HTML of its pages, for example:

Rambo: Last Blood, see IMDb data

<script type="application/ld+json">{


 "@context": "http://schema.org",
 "@type": "Movie",
 "url": "/title/tt1206885/",
 "name": "Rambo: Last Blood",
 "image": "https://m.media-amazon.com/images/M/MV5BNTAxZWM2OTgtOTQzOC00ZTI5LTgyYjktZTRhYWM4YWQxNWI0XkEyXkFqcGdeQXVyMjMwNDgzNjc@._V1_.jpg",
 "genre": [
   "Action",
   "Adventure",
   "Thriller"
 ],
 "contentRating": "R",
 "actor": [
   {
     "@type": "Person",
     "url": "/name/nm0000230/",
     "name": "Sylvester Stallone"
   },
   {
     "@type": "Person",
     "url": "/name/nm0891895/",
     "name": "Paz Vega"
   },
   {
     "@type": "Person",
     "url": "/name/nm0673824/",
     "name": "Sergio Peris-Mencheta"
   },
   {
     "@type": "Person",
     "url": "/name/nm0056770/",
     "name": "Adriana Barraza"
   }
 ],
 "director": {
   "@type": "Person",
   "url": "/name/nm0344496/",
   "name": "Adrian Grunberg"
 },
 "creator": [
   {
     "@type": "Person",
     "url": "/name/nm1006730/",
     "name": "Matthew Cirulnick"
   },
   {
     "@type": "Person",
     "url": "/name/nm0000230/",
     "name": "Sylvester Stallone"
   },
   {
     "@type": "Person",
     "url": "/name/nm0330108/",
     "name": "Dan Gordon"
   },
   {
     "@type": "Person",
     "url": "/name/nm0000230/",
     "name": "Sylvester Stallone"
   },
   {
     "@type": "Person",
     "url": "/name/nm0606251/",
     "name": "David Morrell"
   },
   {
     "@type": "Organization",
     "url": "/company/co0173285/"
   },
   {
     "@type": "Organization",
     "url": "/company/co0002572/"
   },
   {
     "@type": "Organization",
     "url": "/company/co0360646/"
   },
   {
     "@type": "Organization",
     "url": "/company/co0266645/"
   },
   {
     "@type": "Organization",
     "url": "/company/co0053925/"
   },
   {
     "@type": "Organization",
     "url": "/company/co0758254/"
   }
 ],
 "description": "Rambo: Last Blood is a movie starring Sylvester Stallone, Paz Vega, and Sergio Peris-Mencheta. Rambo must confront his past and unearth his ruthless combat skills to exact revenge in a final mission.",
 "datePublished": "2019-09-18",
 "keywords": "fifth part,sequel,rambo,cartel,rambo character",
 "aggregateRating": {
   "@type": "AggregateRating",
   "ratingCount": 22686,
   "bestRating": "10.0",
   "worstRating": "1.0",
   "ratingValue": "6.7"
 },
 "review": {
   "@type": "Review",
   "itemReviewed": {
     "@type": "CreativeWork",
     "url": "/title/tt1206885/"
   },
   "author": {
     "@type": "Person",
     "name": "UteProud"
   },
   "dateCreated": "2019-10-09",
   "inLanguage": "English",
   "name": "Home Alone on Steroids",
   "reviewBody": "The first thing I thought while watching this was \u0027Home Alone on steroids\u0027. Then I saw that IMDb has a video of an interview with Stallone posted that\u0027s asking him about if he got inspired from Home Alone. Lol. So it wasn\u0027t just me. The movie seemed a little short to me. Rambo is brutal and there were definitely some creative deaths. Pretty graphic too, which is awesome. If you like the Rambo movies, you\u0027ll like this one. The acting was ok. Not spectacular or anything. I feel like I would\u0027ve rated this higher if he upped the body count for about another 30 mins. Or maybe I\u0027m too used to the John Wick films...",
   "reviewRating": {
     "@type": "Rating",
     "worstRating": "1",
     "bestRating": "10",
     "ratingValue": "7"
   }
 },
 "duration": "PT1H29M",
 "trailer": {
   "@type": "VideoObject",
   "name": "Teaser Trailer #2",
   "embedUrl": "/video/imdb/vi2974596121",
   "thumbnail": {
     "@type": "ImageObject",
     "contentUrl": "https://m.media-amazon.com/images/M/MV5BY2Y0MThhOGMtNGNlNC00NTlhLTkyMWMtMzNhOWU0NjViMTVmXkEyXkFqcGdeQWFybm8@._V1_.jpg"
   },
   "thumbnailUrl": "https://m.media-amazon.com/images/M/MV5BY2Y0MThhOGMtNGNlNC00NTlhLTkyMWMtMzNhOWU0NjViMTVmXkEyXkFqcGdeQWFybm8@._V1_.jpg",
   "description": "John Rambo must confront his past and unearth his ruthless combat skills to exact revenge in a final mission.",
   "uploadDate": "2019-08-20T16:54:21Z"
 }

}</script>
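
Embedded JSON-LD like the block above can be read with the Python standard library alone. The following is a minimal sketch; the class and function names are hypothetical, and `sample_html` is a shortened copy of the markup above rather than a live page.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect them all.
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False


def extract_json_ld(html):
    """Return a list with one parsed object per JSON-LD script element."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


sample_html = """<html><head><script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Movie",
  "name": "Rambo: Last Blood",
  "director": {"@type": "Person", "name": "Adrian Grunberg"},
  "datePublished": "2019-09-18",
  "duration": "PT1H29M"
}</script></head></html>"""

movie = extract_json_ld(sample_html)[0]
print(movie["name"], movie["datePublished"], movie["director"]["name"])
```

Fields such as `datePublished` and `duration` are already normalized (ISO 8601), which makes them convenient reference values for a sync target.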

More difficult / complex sync targets:

Music albums

ambiguity:
- "High Voltage" by AC/DC - 1975 (Australia) vs 1976 (international)
  - German wiki: only 1 page for both albums (with two infoboxes) https://de.wikipedia.org/wiki/High_Voltage - links to the 1976 album in English and French
  - English wiki: https://en.wikipedia.org/wiki/High_Voltage_(1975_album) & https://en.wikipedia.org/wiki/High_Voltage_(1976_album)
  - French wiki: https://fr.wikipedia.org/wiki/High_Voltage_(album_australien) & https://fr.wikipedia.org/wiki/High_Voltage
  - also: https://en.wikipedia.org/wiki/High_Voltage_(song)
property variability:
- release date might vary depending on country [[1]]
references: https://musicbrainz.org/
normalization: no issues

Music singles

ambiguity: singles with same name but different artists
- "Stairway to Heaven" https://en.wikipedia.org/wiki/Stairway_to_Heaven_(disambiguation), but clearly separated
- "God is a DJ" by Faithless and by Pink, but clearly separated
- ambiguity within the pages, which refer to the song, but the infobox describes the single, and within the abstract "song" and "single" are mentioned interchangeably
- BUT: "Every Single is a Song but not every Song is a Single." - https://www.quora.com/Whats-the-difference-between-a-song-and-a-single
- "Boys don't cry" - song by multiple artists - https://en.wikipedia.org/wiki/Boys_Don%27t_Cry_(Moulin_Rouge_song) is page for the song by Moulin Rouge but with infobox from song by Wink, and it's linked to other language wikis also referring to the Wink song
property variability:
- release date might vary depending on country [[2]]
  - e.g., "The Real Slim Shady" from "The Marshall Mathers LP": April 15, 2000 (English wiki) vs May 16, 2000 (German wiki), but no reference to a country
- music videos for different countries (e.g., UK vs US version, https://en.wikipedia.org/wiki/Want_U_Back)
references: MusicBrainz
normalization: no issues
notes:
- wiki-project https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Songs

Cloud Types

ambiguity: clearly defined topic, no ambiguities
property variability: no issue
reference: no easily accessible data formats
- international cloud atlas https://cloudatlas.wmo.int/explanatory-remarks-and-special-clouds-cumulus.html
- https://www.clouds-online.com/
- https://library.wmo.int/pmb_ged/wmo_407_en-v1.pdf
normalization: metric vs imperial system
notes:
- differentiation between genera, species, varieties, supplementary features and accessory clouds - once you go further than the 10 genera it becomes quite complex
- most languages don't have an infobox (e.g. German), French: infobox is editable using visual editor, code, or wikidata
- how do we compare the cloud symbols?
- wiki-project https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Meteorology
- using GFS Data Browser it seems that mappings need to be improved (only "type", "label", and "see also" are available)

Cars

e.g. Volkswagen Polo https://en.wikipedia.org/wiki/Volkswagen_Polo
ambiguity: the car has various models over the years, the wiki page "VW Polo" describes the car type in general and its various models, the infobox describes the car type VW Polo (not a certain model), some of the models also have individual wiki pages, it seems that there is no ambiguity
property variability: no issues
references:
- https://www.autoevolution.com
- https://www.cars-data.com/en/
normalization: metric vs imperial system for car dimensions and weight for some car models

e.g. BMW M5:
ambiguity:
- the German wiki page https://de.wikipedia.org/wiki/BMW_M5 has multiple infoboxes for the general car model and for its various generations, all in one page
- same with the English page - How does the extraction work if there are multiple infoboxes?
property variability: no issues
references:
- https://www.autoevolution.com
- https://www.cars-data.com/en/
normalization: metric vs imperial system for car dimensions and weight for some car models

Organizations / Companies

e.g., BMW, IBM, Bank of America, Bayer
ambiguity: no issues, companies and e.g., their product are clearly separated
- I beg to differ: company disambiguation can be a hard problem, depending on how much data is present and what are the source datasets. Specifically, official trade registers are often polluted by small/ inactive/ variant entities and it's not easy to distinguish the interesting/ large entities amongst the total.
- Disambiguating between companies, divisions, products, and companies after significant events (merger/acquisition, renaming, reorganization) is also not so easy.
- Eg many people assume a stock ticker is a fairly unique identifier. But it can be sold together with the exchange seat to another company... --Vladimir Alexiev (talk) 06:39, 22 August 2019 (UTC)

Hi @Vladimir Alexiev:, thank you for your input! We're currently looking deeper into individual sync targets and any advice or information is very much appreciated. Tina Schmeissner (talk) 08:37, 22 August 2019 (UTC)

property variability: production output, revenue and properties referring to monetary value depend on a point in time
references: no database to harvest information about organizations or companies
normalization: potential for mixup of currencies

Relevant: https://www.wikidata.org/wiki/Wikidata:WikiProject_Companies

Cities - property: population count

ambiguity
- cities themselves are easy to link by name alone; there are only very few duplicates for notable cities within one country
- for pop. count it is important to know the area (inner/outer city+/-surrounding areas) - different counts could be accurate depending on reference area
- who is being counted (citizens, temporary workers, refugees)

property variability:
- time dependence
- dependent on measurement methodology (and thus on reference)

reference:
- United States Census - according to Google structured data tool data can be extracted, e.g. for Baltimore city

normalization issues:
- population density #/km² or #/sq_mi
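
Converting between the two density units is a fixed factor, sketched below. The function name and the sample value are hypothetical.

```python
# One international mile is exactly 1.609344 km, so one square mile
# is 1.609344 ** 2 = 2.589988110336 km².
KM2_PER_SQ_MI = 1.609344 ** 2


def per_sq_mi_to_per_km2(density_per_sq_mi: float) -> float:
    """Convert inhabitants per square mile to inhabitants per km²."""
    return density_per_sq_mi / KM2_PER_SQ_MI


# Hypothetical sample value: 7,000 inhabitants per square mile.
print(round(per_sq_mi_to_per_km2(7000), 1))
```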

notes:
- to check for completeness:
  - 206 sovereign states
  - cities of 1 country
- e.g. Grimma:
  - property: population - in the German Wikipedia https://de.wikipedia.org/wiki/Grimma data for this property is derived from a PDF document storing the key for this municipality (the key is given in the infobox template), therefore we are not able to derive any information from the German Wiki page for this property

Employer

WD description: person or organization for which the subject works or worked
ambiguity:
- no...?

property variability:
- interlinked with employment period
- previous employments should be fixed
- current employment: start date - present
- ideally functional property with respect to time, BUT there are people that work two jobs and therefore have 2 employers at the same time!
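
A sketch of employer as a time-dependent property. The class `Employment`, the helper `employers_on`, and all employers and dates are hypothetical. It shows why the property is not strictly functional: a query for a single day can return more than one employer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Employment:
    employer: str
    start: date
    end: Optional[date] = None  # None means "start date - present"

    def active_on(self, day: date) -> bool:
        """True if this employment covers the given day."""
        return self.start <= day and (self.end is None or day <= self.end)


def employers_on(history, day):
    """All employers of one person on a given day, sorted by name."""
    return sorted(e.employer for e in history if e.active_on(day))


history = [
    Employment("Employer A", date(2010, 1, 1), date(2015, 12, 31)),
    Employment("Employer B", date(2014, 1, 1)),
    Employment("Employer C", date(2016, 6, 1)),
]

print(employers_on(history, date(2012, 1, 1)))  # ['Employer A']
print(employers_on(history, date(2014, 6, 1)))  # ['Employer A', 'Employer B']
print(employers_on(history, date(2017, 1, 1)))  # ['Employer B', 'Employer C']
```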

reference:
- LinkedIn - Structured Data Testing Tool shows multiple employers (e.g., Sebastian works for DBpedia Association, AKSW/Kilt and University of Leipzig)

normalization issues:
- no units, no currency
- do names of employers vary depending on language or country?
- intensional vs extensional context, especially with subsidiary companies (e.g., "Sebastian works for InfAI" vs "Sebastian works for DBpedia": in the extensional context the DBpedia Association is affiliated with InfAI)

Geo Coordinates

Ambiguity: no
Property variability:
- pos#lat for Leipzig
- pos#long for Leipzig
- Accuracy of GPS: ~10m (4 decimal places)
- Precision = the number of decimal places
- Accuracy = conformance with reality
- GPS-enabled smartphones are accurate to within a ~5m radius (under ideal conditions) https://www.gps.gov/systems/gps/performance/accuracy/
Reference: GeoNames - All lat/long coordinates in WGS84 (World Geodetic System 1984)
Normalization:
- Choice of precision (number of digits)
- Choice of xml data type can have an impact on the exact coordinate value (dependency on reference and its chosen data type), e.g., 12.37475 vs 12.3748 vs 12.374722222222223
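
A sketch of comparing coordinates at a chosen precision. The function names are hypothetical, and the figure of roughly 111.32 km per degree is an approximation (one degree of longitude at the equator) used only to translate decimal places into metres. The degrees/minutes/seconds input is an assumption about where the long decimal value might come from.

```python
METERS_PER_DEGREE = 111_320  # approximation, one degree at the equator


def dms_to_decimal(degrees: int, minutes: int, seconds: float) -> float:
    """Convert degrees/minutes/seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def resolution_m(places: int) -> float:
    """Approximate ground distance covered by one unit of the last decimal place."""
    return METERS_PER_DEGREE * 10 ** -places


def agree(a: float, b: float, places: int) -> bool:
    """True if both values are equal after rounding to `places` decimals."""
    return round(a, places) == round(b, places)


values = [12.37475, 12.3748, 12.374722222222223]

print(dms_to_decimal(12, 22, 29))                     # close to the third value
print(resolution_m(4))                                # roughly 11 m
print(all(agree(values[0], v, 3) for v in values))    # True
print(agree(values[1], values[2], 4))                 # False
```

At 3 decimal places (roughly 111 m) all three values agree; at 4 decimal places (roughly the GPS accuracy quoted above) the second and third already differ, although they most likely describe the same point.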

Nobel Prize

ambiguity:
property variability:
reference:
normalization issues:

Difficult sync targets (due to ambiguity issues)

So far I could not find a target group that would present major issues, especially with respect to ambiguity.
If you move away from these concrete, tangible targets to more theoretical topics such as mathematics (or mathematical equations), linguistics, or natural sciences, which might have a bigger potential for ambiguity, you find that the wiki pages no longer have infoboxes. (Some have nav-boxes or sidebars for orientation/overview.)

What we've learned so far

there are wiki pages that have multiple infoboxes for some languages (especially for cars)
reference extraction can extract information from pages with multiple infoboxes
fact extraction can also extract information from pages with multiple infoboxes, but during the fusion everything is handled as 1 infobox, which will cause problems (e.g., with two different release dates the fusion might pick the older one)
there are instances when information is not stored directly in the infobox but is included via some kind of sub-routine, which renders the information non-extractable for us (see Grimma example)

Tina Schmeissner (talk) 10:47, 12 July 2019 (UTC)