Grants:Project/DBpedia/GlobalFactSyncRE/SyncTargets

From Meta, a Wikimedia project coordination wiki

Problem: four layers of complexity

  1. identity check / check for ambiguity -- Are we talking about the same entity? Names alone are ambiguous: e.g., Hamburg is a city in Germany, but there are also 24 cities/places named Hamburg in the United States, 3 other places elsewhere in the world, 3 entries of Hamburg linked to animals and plants, 7 entries for ships, and 1 for the Hamburg port. However, they all have separate wiki entries.
  2. fixed vs. varying property -- Some properties vary depending on the country (e.g., release dates for different countries) or the point in time (e.g., population count).
  3. reference -- Depending on the outcome of the identity check and on whether the property is fixed or varying, the appropriate reference might vary. Also, for some targets no queryable online reference might be available.
  4. normalization / conversion of values -- Depending on the language/country of the article, some properties use different units (e.g., currency, metric vs. imperial system, ...).


Keep in mind:

  • Properties (targets) can be divided into object properties (e.g., employer, NBA player birthplace, name of company CEO) and data properties (e.g., population count, NBA player height, release year for album, single, or game)
  • Data properties belong to one of the XML Schema data types; depending on whether, e.g., a double-precision or a float type is chosen, the last digits of a number might differ
  • Measured or calculated data properties inherently depend on the methodology used
  • Different references might use different methodologies (e.g., data acquisition for the US census changes over the years)
  • Dates and times depend on the location on the globe and its respective time zone (e.g., Tokyo is 16 to 17 hours ahead of Los Angeles, depending on daylight saving time). In the worst case, dates can differ by 1 day if the time zone or location is not specified. Best: use UTC (which provides some means of normalization)
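
The time-zone point can be illustrated with a minimal Python sketch. The timestamp is an arbitrary instant chosen for illustration, and the fixed offsets (UTC+9 for Tokyo, UTC-7 for Los Angeles in summer) are simplifying assumptions; a real pipeline would more likely use a time-zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for a summer date (simplification, no DST handling).
TOKYO = timezone(timedelta(hours=9))
LOS_ANGELES = timezone(timedelta(hours=-7))

# Arbitrary illustrative instant, not taken from any article.
instant = datetime(2019, 7, 12, 22, 30, tzinfo=timezone.utc)

tokyo = instant.astimezone(TOKYO)
los_angeles = instant.astimezone(LOS_ANGELES)

# The same instant falls on two different local calendar dates.
print(tokyo.date())        # 2019-07-13
print(los_angeles.date())  # 2019-07-12

offset_hours = (tokyo.utcoffset() - los_angeles.utcoffset()).total_seconds() / 3600
print(offset_hours)        # 16.0


def to_utc_date(local_dt):
    """Normalize a time-zone-aware datetime to its UTC calendar date."""
    return local_dt.astimezone(timezone.utc).date()


# After normalization both local values agree again.
print(to_utc_date(tokyo) == to_utc_date(los_angeles))  # True
```

Comparing dates only after normalizing to UTC avoids reporting a one-day discrepancy that is purely an artifact of the local time zone.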


Interest from the Wikiverse:

  • Wikidata users are concerned with 1 property across all articles
  • Wikipedia users are concerned with all properties of 1 article


Solution: start with easy sync targets

NBA Players

  • ambiguity
    • clearly defined group of people
    • Currently no active players with same name
    • BUT there used to be e.g. 4 Charles Smiths (each has an individual English Wikipedia page, though)
  • property variability:
    • team and number can change due to trades and free agency
    • career highlights and awards can change and accumulate over time
    • stats change after every game and are thus only provided for retired players
    • trade deadline in February; the new league year and free agency start in July
    • after each NBA draft (around June) there will be new players
  • normalization issues:
    • height and weight - metric vs imperial system
    • choice of XML data type can have an impact on the exact height or weight (dependency on reference and its chosen data type)


  • to check for completeness:
    • all active players (30 teams - 450)
    • all players of the eastern or western conference (225)
    • all players in one division (Atlantic, Central, Southeast, Northwest, Pacific, Southwest - each division has 5 teams - 75)
    • all players of one team (15)


  • GOOD EXAMPLE: Al-Farouq Aminu - discrepancies in height, weight, place of birth
    • NBA.com: 6ft9, 2.06m / 220lbs, 99.8kg, no info on place of birth
    • basketball-reference.com: 6-9, 206cm / 220lbs, 99kg, Atlanta, Georgia
    • birthplace - en, fr, WD: Atlanta, Georgia
    • birthplace de, es: Stone Mountain, Georgia
    language | weight          | in FCF?
    en       | 220lbs / 100kg  | yes
    fr       | 220lbs / 100kg  | no
    es       | 216lbs / 98kg   | no
    it       | 100kg           | no
    pt       | 220lbs / 98kg   | yes
    pl       | 98kg            | no
    WD       | 100kg           | yes
  • QUESTIONS: @Mrvnhfr: @JohannesFre:
    • Why is there no entry in the FCF for fr, es, pl, it?
    • Where are the entries for lbs?
    • Why are there two different predicates for weight?
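
The weight discrepancies in the example above can be checked with a small conversion sketch. The conversion factor is the exact definition of the international pound; the tolerance of 0.5 kg and the function names are assumptions made for illustration.

```python
LB_TO_KG = 0.45359237  # exact by definition of the international pound


def lbs_to_kg(pounds: float) -> float:
    """Convert pounds to kilograms."""
    return pounds * LB_TO_KG


def consistent(pounds: float, kilograms: float, tolerance_kg: float = 0.5) -> bool:
    """True if the kg value matches the converted lbs value within the tolerance.

    The default tolerance (assumed) accepts values rounded to the nearest kg.
    """
    return abs(lbs_to_kg(pounds) - kilograms) <= tolerance_kg


for lbs in (220, 216):
    print(lbs, "lbs =", round(lbs_to_kg(lbs), 2), "kg")
# 220 lbs = 99.79 kg
# 216 lbs = 97.98 kg

print(consistent(220, 100))  # True  (en, fr)
print(consistent(216, 98))   # True  (es)
print(consistent(220, 98))   # False (pt: the two values do not match each other)
```

Under these assumptions, 100 kg and 98 kg are not rounding variants of one value: they correspond to two different weights in pounds (220 and 216), so a sync needs to compare values in one unit and keep the original unit as provenance.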


Video Games


E.g., Doom 3

  • ambiguity in the broader sense: DOOM (the franchise) vs. DOOM (the 1993 video game) vs. DOOM (the 2016 video game) vs. DOOM (the 2005 movie). In Wikipedia/Wikidata the entities can be told apart clearly, but this needs to be checked for other sources
  • property variability:
    • different publishers for different regions and platforms
    • different composers for different platforms
    • different release dates for different platforms (Microsoft Windows, Linux, Mac OS X, Xbox) and different regions (NA vs. EU)
  • normalization: no issues
  • note: the infobox refers to the respective game, but the article often also covers the whole game series and its impact, frequently already in the abstract
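
One way to handle release dates that vary by platform and region is to compare two values only when their qualifiers match. The following is a minimal sketch under that assumption; the class and function names are hypothetical, and the sources and dates are invented placeholders, not real release data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ReleaseFact:
    source: str            # hypothetical source label
    platform: str
    region: Optional[str]  # None if the source does not state a region
    released: date


def comparable(a: ReleaseFact, b: ReleaseFact) -> bool:
    """Two facts describe the same thing only if platform and region match."""
    return a.platform == b.platform and a.region == b.region


def conflicts(facts: List[ReleaseFact]) -> List[Tuple[ReleaseFact, ReleaseFact]]:
    """Return pairs of comparable facts whose dates disagree."""
    found = []
    for i, a in enumerate(facts):
        for b in facts[i + 1:]:
            if comparable(a, b) and a.released != b.released:
                found.append((a, b))
    return found


# Placeholder data for illustration only.
facts = [
    ReleaseFact("source_a", "Windows", "NA", date(2000, 1, 1)),
    ReleaseFact("source_b", "Windows", "EU", date(2000, 1, 10)),
    ReleaseFact("source_c", "Windows", "NA", date(2000, 1, 2)),
]

for a, b in conflicts(facts):
    print(a.source, "vs", b.source, ":", a.released, "!=", b.released)
```

Here the NA and EU dates are not reported as a conflict, because they describe different releases; only the two NA values for the same platform are flagged.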


Films

  • ambiguity:
    • potential issues: remakes
  • property variability:
    • release date for individual countries
    • revenue depends on date
    • different dubbing actors for releases in different countries
  • normalization:
    • potential for mixup of currency for budget and revenue
    • choice of XML data type and rounding can have an impact on the exact budget and revenue (dependency on reference and its chosen data type), e.g., $463.4 million vs. $463,406,268
  • notes: infoboxes can be extensive, including cast, dubbing actors, individual release dates depending on country/language
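
The "$463.4 million vs. $463,406,268" case can be treated as agreement at the precision of the coarser value rather than as a conflict. A minimal sketch of such a check follows; the function name and the choice to round with `Decimal.quantize` (default rounding mode) are assumptions, and currency handling is left out.

```python
from decimal import Decimal


def agrees_at_stated_precision(rounded_millions: str, exact_amount: int) -> bool:
    """Check whether exact_amount rounds to the figure stated in millions.

    rounded_millions is the figure as written, e.g. "463.4" for "$463.4 million".
    """
    stated = Decimal(rounded_millions)
    # Number of decimal places in the stated figure, e.g. 1 for "463.4".
    places = -stated.as_tuple().exponent
    quantum = Decimal(1).scaleb(-places)  # e.g. Decimal("0.1")
    in_millions = (Decimal(exact_amount) / Decimal(1_000_000)).quantize(quantum)
    return in_millions == stated


print(agrees_at_stated_precision("463.4", 463_406_268))  # True
print(agrees_at_stated_precision("463.5", 463_406_268))  # False
```

Both values must of course be in the same currency before such a comparison is meaningful.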

IMDb


IMDb publishes very good data in its HTML pages as embedded JSON-LD.
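
A minimal sketch of how embedded JSON-LD could be read from a page, using only the Python standard library. The HTML snippet and all its values are invented placeholders, not actual IMDb output, and the class name `JsonLdExtractor` is hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Invented placeholder page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Movie",
 "name": "Example Film", "datePublished": "2000-01-01"}
</script>
</head><body></body></html>
"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
movie = parser.items[0]
print(movie["@type"], movie["name"], movie["datePublished"])
# Movie Example Film 2000-01-01
```

Because JSON-LD already carries typed keys, a reference extracted this way needs far less template-specific parsing than an infobox.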


More difficult / complex sync targets:

Music albums


Music singles


Cloud Types


Cars

  • e.g. Volkswagen Polo https://en.wikipedia.org/wiki/Volkswagen_Polo
  • ambiguity: the car has had various models over the years. The wiki page "VW Polo" describes the car type in general and its various models, and the infobox describes the car type VW Polo (not a specific model). Some of the models also have individual wiki pages, so there seems to be no ambiguity
  • property variability: no issues
  • references:
  • normalization: metric vs imperial system for car dimensions and weight for some car models



Organizations / Companies

  • e.g., BMW, IBM, Bank of America, Bayer
  • ambiguity: no issues, companies and e.g., their product are clearly separated
    • I beg to differ: company disambiguation can be a hard problem, depending on how much data is present and what are the source datasets. Specifically, official trade registers are often polluted by small/ inactive/ variant entities and it's not easy to distinguish the interesting/ large entities amongst the total.
    • Disambiguating between companies, divisions, products, and companies after significant events (merger/acquisition, renaming, reorganization) is also not so easy.
    • Eg many people assume a stock ticker is a fairly unique identifier. But it can be sold together with the exchange seat to another company... --Vladimir Alexiev (talk) 06:39, 22 August 2019 (UTC)
Hi @Vladimir Alexiev:, thank you for your input! We're currently looking deeper into individual sync targets and any advice or information is very much appreciated. Tina Schmeissner (talk) 08:37, 22 August 2019 (UTC)
  • property variability: production output, revenue and properties referring to monetary value depend on a point in time
  • references: no database to harvest information about organizations or companies
  • normalization: potential for mixup of currencies

Relevant: https://www.wikidata.org/wiki/Wikidata:WikiProject_Companies

Cities - property: population count

  • ambiguity
    • cities themselves are easy to link by name only; there are only very few duplicates for notable cities within one country
    • for the population count it is important to know the reference area (inner/outer city, with or without surrounding areas) - different counts could be accurate depending on the reference area
    • it also matters who is being counted (citizens, temporary workers, refugees)
  • property variability:
    • time dependence
    • dependent on measurement methodology (and thus on reference)
  • normalization issues:
    • population density per km² or per sq mi
  • notes:
    • to check for completeness:
    • e.g. Grimma:
      • property: population - in the German Wikipedia (https://de.wikipedia.org/wiki/Grimma) the data for this property is derived from a PDF document via the key for this municipality (only the key is given in the infobox template); therefore we are not able to extract this property from the German wiki page
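
The density normalization mentioned above is a fixed unit conversion. A minimal sketch, assuming densities are given as plain numbers; the function names are hypothetical and the input value is an arbitrary illustration.

```python
SQ_MI_TO_KM2 = 2.589988110336  # exact: (1.609344 km) squared


def per_sq_mi_to_per_km2(density_per_sq_mi: float) -> float:
    """Convert inhabitants per square mile to inhabitants per km²."""
    return density_per_sq_mi / SQ_MI_TO_KM2


def per_km2_to_per_sq_mi(density_per_km2: float) -> float:
    """Convert inhabitants per km² to inhabitants per square mile."""
    return density_per_km2 * SQ_MI_TO_KM2


# Arbitrary illustrative value: 100 inhabitants per km².
print(round(per_km2_to_per_sq_mi(100.0), 1))  # 259.0
```

The conversion only removes the unit difference; differences caused by the reference area or the counting methodology remain and still need a matching reference.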


Employer

  • WD description: person or organization for which the subject works or worked
  • ambiguity:
    • no...?
  • property variability:
    • interlinked with employment period
    • previous employments should be fixed
    • current employment: start date - present
    • ideally functional property with respect to time, BUT there are people that work two jobs and therefore have 2 employers at the same time!
  • reference:
    • LinkedIn - Structured Data Testing Tool shows multiple employers (e.g., Sebastian works for DBpedia Association, AKSW/Kilt and University of Leipzig)
  • normalization issues:
    • no units, no currency
    • do names of employers vary depending on language or country?
    • intensional vs. extensional context, especially with subsidiary organizations (e.g., "Sebastian works for InfAI" vs. "Sebastian works for DBpedia", an extensional context in which the DBpedia Association is affiliated with InfAI)
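
Whether "employer" behaves as a functional property with respect to time can be tested by checking employment periods for overlap. A minimal sketch under that reading; the class and function names are hypothetical, and the employers and dates are invented placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Employment:
    employer: str
    start: date
    end: Optional[date] = None  # None means "present"


def overlaps(a: Employment, b: Employment) -> bool:
    """True if the two employment periods share at least one day."""
    a_end = a.end or date.max
    b_end = b.end or date.max
    return a.start <= b_end and b.start <= a_end


def concurrent_employers(jobs: List[Employment]) -> List[Tuple[str, str]]:
    """Return employer pairs whose periods overlap (property not functional)."""
    return [
        (a.employer, b.employer)
        for i, a in enumerate(jobs)
        for b in jobs[i + 1:]
        if overlaps(a, b)
    ]


# Placeholder data for illustration only.
jobs = [
    Employment("Employer A", date(2000, 1, 1), date(2005, 12, 31)),
    Employment("Employer B", date(2006, 1, 1)),
    Employment("Employer C", date(2010, 1, 1)),
]

print(concurrent_employers(jobs))  # [('Employer B', 'Employer C')]
```

An overlap is not necessarily an error: as noted above, a person can legitimately have two employers at the same time, so overlapping periods are a signal to keep both values rather than to pick one.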


Geo Coordinates

  • Ambiguity: no
  • Property variability:
  • Reference: GeoNames - All lat/long coordinates in WGS84 (World Geodetic System 1984)
  • Normalization:
    • Choice of precision (number of digits)
    • Choice of XML data type can have an impact on the exact coordinate value (dependency on reference and its chosen data type), e.g., 12.37475 vs. 12.3748 vs. 12.374722222222223
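
The three coordinate values above differ only in precision and derivation, which suggests comparing with a tolerance instead of exact equality. A minimal sketch follows; the tolerance of 0.0001 degrees (roughly 11 m of latitude) and the function names are assumptions. The third value is numerically what 12° 22′ 29″ converts to; whether the source derived it that way is also an assumption.

```python
def dms_to_decimal(degrees: int, minutes: int, seconds: float) -> float:
    """Convert degrees/minutes/seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def same_coordinate(a: float, b: float, tolerance_deg: float = 1e-4) -> bool:
    """True if two decimal-degree values agree within the tolerance."""
    return abs(a - b) <= tolerance_deg


values = [12.37475, 12.3748, 12.374722222222223]

# Exact comparison treats all three as different values.
print(len(set(values)))  # 3

# With the assumed tolerance, every pair is considered the same coordinate.
print(all(same_coordinate(a, b) for a in values for b in values))  # True

# The long value is consistent with a conversion from 12° 22' 29".
print(abs(dms_to_decimal(12, 22, 29) - values[2]) < 1e-12)  # True
```

The appropriate tolerance depends on the entity: a value that is fine for a city may be too coarse for a building.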

Nobel Prize

  • ambiguity:
  • property variability:
  • reference:
  • normalization issues:


Difficult sync targets (due to ambiguity issues)

  • So far I could not find a target group that would present major issues, especially with respect to ambiguity.
  • If you move away from these concrete, tangible targets to more theoretical topics such as mathematics (or mathematical equations), linguistics, or natural sciences, which might have a bigger potential for ambiguity, you find that the wiki pages no longer have infoboxes. (Some have nav-boxes or sidebars for orientation/overview.)


What we've learned so far

  • there are wiki pages that have multiple infoboxes for some languages (especially for cars)
  • reference extraction can extract information from pages with multiple infoboxes
  • fact extraction can also extract information from pages with multiple infoboxes, but during the fusion these will all be handled as 1 infobox, which will cause problems (e.g., with two different release dates the fusion might pick the older one)
  • there are instances when information is not stored directly in the infobox but retrieved via some kind of sub-routine, which renders the information non-extractable for us (see Grimma example)


Tina Schmeissner (talk) 10:47, 12 July 2019 (UTC)