Grants:Project/DBpedia/GlobalFactSyncRE/SyncTargets


Problem: four layers of complexity

  1. identity check / check for ambiguity -- Are we talking about the same entity? Names alone are ambiguous, e.g., Hamburg is a city in Germany, but there are also 24 cities/places named Hamburg in the United States, 3 other places elsewhere in the world, 3 entries for animals and plants, 7 entries for ships, and 1 for the Hamburg port. However, they all have separate wiki entries.
  2. fixed vs. varying property -- Some properties vary depending on nationality (e.g., release dates for different countries) or point in time (e.g., population count).
  3. reference -- Depending on the entity's identity check and the property's fixed or varying state, the reference might vary. Also, for some targets no queryable online reference might be available.
  4. normalization / conversion of values -- Depending on language/nationality of the article some properties have varying units (e.g., currency, metric vs imperial system, ...).

Solution: start with easy sync targets


NBA Players

  • ambiguity
    • clearly defined group of people
    • currently, no two active players share the same name
    • historically there have been namesakes, e.g., 4 Charles Smiths (each has an individual English Wikipedia page, though)
  • property variability: no
    • but for active players: accuracy of stats depends on when the data was last updated
    • also, players can be traded, so a player's franchise might change (trade deadline in February, new season starts in July, free agency starts in July)
    • after each NBA draft (around June) there will be new players
  • normalization issues: height and weight - metric vs imperial system
  • notes:
    • to check for completeness:
      • all active players (30 teams - 450)
      • all players of the eastern or western conference (225)
      • all players in one division (Atlantic, Central, Southeast, Northwest, Pacific, Southwest - each division has 5 teams - 75)
      • all players of one team (15)
    • best infoboxes: English and French wiki
    • wiki-project https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association
    • GOOD EXAMPLE: Al-Farouq Aminu - discrepancies in height, weight, place of birth
      • NBA.com: 6ft9, 2.06m / 220lbs, 99.8kg, no info on place of birth
      • basketball-reference.com: 6-9, 206cm / 220lbs, 99kg, Atlanta, Georgia
      • birthplace - en, fr, WD: Atlanta, Georgia
      • birthplace de, es: Stone Mountain, Georgia
      language   weight            in FCF?
      en         220lbs / 100kg    yes
      fr         220lbs / 100kg    no
      es         216lbs / 98kg     no
      it         100kg             no
      pt         220lbs / 98kg     yes
      pl         98kg              no
      WD         100kg             yes
  • QUESTIONS: @Mrvnhfr: @JohannesFre:
    • Why is there no entry in the FCF for fr, es, pl, it?
    • Where are the entries for lbs?
    • Why are there two different predicates for weight?
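
The weight discrepancies listed above can be checked mechanically once all values are converted to one unit. The following is a minimal sketch, not part of the FCF or the project's code: the names `weights`, `to_kg`, and `internally_consistent` are hypothetical, and the 1 kg tolerance is an assumption chosen to absorb rounding in the infoboxes.

```python
LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound

# Weights per source as listed above, as (value, unit) pairs.
weights = {
    "en": [(220, "lb"), (100, "kg")],
    "fr": [(220, "lb"), (100, "kg")],
    "es": [(216, "lb"), (98, "kg")],
    "it": [(100, "kg")],
    "pt": [(220, "lb"), (98, "kg")],
    "pl": [(98, "kg")],
    "WD": [(100, "kg")],
}


def to_kg(value, unit):
    """Convert a weight to kilograms; only 'kg' and 'lb' are supported."""
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unsupported unit: {unit}")


def internally_consistent(entries, tolerance_kg=1.0):
    """True if all values of one source agree within the tolerance (assumed: 1 kg)."""
    kgs = [to_kg(value, unit) for value, unit in entries]
    return max(kgs) - min(kgs) <= tolerance_kg


if __name__ == "__main__":
    for source, entries in weights.items():
        status = "ok" if internally_consistent(entries) else "MISMATCH"
        print(source, status)
```

With these inputs, only `pt` is flagged: 220 lb is about 99.8 kg, which does not match the 98 kg given in the same infobox. The remaining disagreement (100 kg vs. 98 kg across languages) is a cross-source difference that a fusion step would still have to resolve.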

Video Games

E.g., Doom 3

  • ambiguity in the broader sense (DOOM (the franchise) vs DOOM (the 1993 video game) vs DOOM (the 2016 video game) vs DOOM (the 2005 movie)), but entities can be singled out clearly
  • property variability: different release dates for different platforms (Microsoft Windows, Linux, Mac OS X, Xbox) and different regions (NA vs EU)
  • reference: multitude of different sources
  • normalization: no issues
  • note: the infobox refers to the respective game, but oftentimes the article also includes the whole game series and its impact, often in the abstract already
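
One way to handle the platform- and region-dependent release dates is to treat the date as a value keyed by (platform, region) instead of a single property value. The sketch below only illustrates that idea: the function name `differing_dates` and the dictionary layout are hypothetical, and the dates are placeholder values, not verified release dates.

```python
from datetime import date

# Placeholder values for illustration only, NOT verified release dates.
release_dates = {
    ("Microsoft Windows", "NA"): date(2004, 1, 1),
    ("Microsoft Windows", "EU"): date(2004, 1, 2),
    ("Xbox", "NA"): date(2005, 1, 1),
}


def differing_dates(source_a, source_b):
    """Return the (platform, region) keys present in both sources whose dates differ.

    Keys that appear in only one source are ignored: a missing date is a
    completeness issue, not a contradiction.
    """
    shared = source_a.keys() & source_b.keys()
    return {
        key: (source_a[key], source_b[key])
        for key in shared
        if source_a[key] != source_b[key]
    }
```

Comparing per key means that a Windows/NA date in one language edition is never reported as conflicting with an Xbox/EU date in another.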


Films

  • ambiguity:
    • potential issue: remakes, but this does not seem to be an actual issue
  • property variability: release date for individual countries, revenue depends on date
  • reference:
  • normalization: potential for mixup of currency for budget and revenue
  • notes: infoboxes can be extensive, including cast, dubbing actors, individual release dates


More difficult / complex sync targets:

Music albums

Music singles


Cloud Types


Cars

  • e.g. Volkswagen Polo https://en.wikipedia.org/wiki/Volkswagen_Polo
  • ambiguity: the car has had various models over the years. The wiki page "VW Polo" describes the car type in general and its various models, and the infobox describes the car type VW Polo (not a certain model). Some of the models also have individual wiki pages. Overall, there seems to be no ambiguity.
  • property variability: no issues
  • references:
  • normalization: metric vs imperial system for car dimensions and weight for some car models



Organizations / Companies

  • e.g., BMW, IBM, Bank of America, Bayer
  • ambiguity: no issues, companies and e.g., their product are clearly separated
    • I beg to differ: company disambiguation can be a hard problem, depending on how much data is present and what are the source datasets. Specifically, official trade registers are often polluted by small/ inactive/ variant entities and it's not easy to distinguish the interesting/ large entities amongst the total.
    • Disambiguating between companies, divisions, products, and companies after significant events (merger/acquisition, renaming, reorganization) is also not so easy.
    • Eg many people assume a stock ticker is a fairly unique identifier. But it can be sold together with the exchange seat to another company... --Vladimir Alexiev (talk) 06:39, 22 August 2019 (UTC)
Hi @Vladimir Alexiev:, thank you for your input! We're currently looking deeper into individual sync targets and any advice or information is very much appreciated. Tina Schmeissner (talk) 08:37, 22 August 2019 (UTC)
  • property variability: production output, revenue and properties referring to monetary value depend on a point in time
  • references: no database to harvest information about organizations or companies
  • normalization: potential for mixup of currencies

Relevant: https://www.wikidata.org/wiki/Wikidata:WikiProject_Companies

Cities - property: population count

  • ambiguity
    • cities themselves are easy to link by name alone; there are only very few duplicates for notable cities within one country
    • for the population count it is important to know the reference area (inner/outer city, with or without surrounding areas) - different counts can all be accurate depending on the reference area
    • it also matters who is being counted (citizens, temporary workers, refugees)
  • property variability:
    • time dependence
  • normalization issues:
    • population density: inhabitants per km² or per sq mi
  • notes:
    • to check for completeness:
    • e.g. Grimma:
      • property: population - in the German Wikipedia https://de.wikipedia.org/wiki/Grimma the data for this property is derived from a PDF document via the key for this municipality (the key is given in the infobox template). Therefore, we are not able to derive any information for this property from the German wiki page.
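
The population density normalization mentioned above (per km² vs. per sq mi) comes down to a fixed conversion factor. The following is a minimal sketch; the function names and the 1% relative tolerance are assumptions for illustration, not part of the project's code.

```python
import math

SQ_MI_IN_KM2 = 1.609344 ** 2  # one square mile in km², about 2.59


def per_sq_mi_to_per_km2(density):
    """Convert inhabitants per square mile to inhabitants per km²."""
    return density / SQ_MI_IN_KM2


def per_km2_to_per_sq_mi(density):
    """Convert inhabitants per km² to inhabitants per square mile."""
    return density * SQ_MI_IN_KM2


def same_density(density_km2, density_sq_mi, rel_tol=0.01):
    """True if both values describe the same density within a relative tolerance (assumed: 1%)."""
    return math.isclose(density_km2, per_sq_mi_to_per_km2(density_sq_mi), rel_tol=rel_tol)
```

A tolerance is needed because infoboxes usually round both figures independently. Note that a matching density does not settle the reference-area question: two editions can agree on the unit conversion and still count different areas or populations.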

Difficult sync targets (due to ambiguity issues)

  • So far I could not find a target group that would present major issues, especially with respect to ambiguity.
  • If you move away from these concrete, material targets to more theoretical topics such as mathematics (or mathematical equations), linguistics, or natural sciences, which might have a bigger potential for ambiguity, you find that the wiki pages no longer have infoboxes. (Some have nav-boxes or sidebars for orientation/overview.)


What we've learned so far

  • there are wiki pages that have multiple infoboxes for some languages (especially for cars)
  • reference extraction can extract information from pages with multiple infoboxes
  • fact extraction can also extract information from pages with multiple infoboxes, but during fusion all of them are handled as a single infobox, which causes problems (e.g., with two different release dates, fusion might pick the older one)
  • there are instances where information is not stored directly in the infobox but is pulled in via some kind of sub-routine, which renders the information non-extractable for us (see Grimma example)
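
The multiple-infobox problem described above can be illustrated by comparing two ways of grouping extracted facts before fusion. This is a hypothetical sketch and not the actual extraction or fusion code: the tuple layout `(infobox_index, property, value)` and both function names are assumptions, and the values are placeholders.

```python
from collections import defaultdict

# Hypothetical facts extracted from ONE page with two infoboxes:
# (infobox_index, property, value). Values are placeholders.
facts = [
    (0, "releaseDate", "date-A"),
    (1, "releaseDate", "date-B"),
    (0, "manufacturer", "maker-X"),
]


def group_per_page(facts):
    """Group by property only: all infoboxes of the page collapse into one."""
    grouped = defaultdict(list)
    for _infobox_index, prop, value in facts:
        grouped[prop].append(value)
    return dict(grouped)


def group_per_infobox(facts):
    """Group by (infobox_index, property): each infobox keeps its own values."""
    grouped = defaultdict(list)
    for infobox_index, prop, value in facts:
        grouped[(infobox_index, prop)].append(value)
    return dict(grouped)
```

With per-page grouping, `releaseDate` carries two competing values and a fusion step has to pick one of them. With per-infobox grouping, each value stays attached to the infobox it came from, so the two dates are not treated as a conflict.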


Tina Schmeissner (talk) 10:47, 12 July 2019 (UTC)