Welcome to this project's midpoint report! This report shares progress and learning from the grantee's first 6 months (June-November 2019).
In a few short sentences or bullet points, give the main highlights of what happened with your project so far.
- During the first two months we published a kick-off note with our initial findings. We had our first successful edit based on our early prototype.
- After the publication of the kick-off note, we pursued three directions:
- collecting more feedback
- extracting references from Wikipedia infoboxes and developing microservices as a proof of concept, as well as the prototype for the GlobalFactSync Data Browser
- studying the problems in detail, resulting in a work-in-progress Sync Target Study
- Sync targets
- As a result of the feedback and our analysis, we will now focus on bringing in appropriate external data and on improving the speed of integrating it.
- While we were successful in showing that we can surface useful information from infoboxes across languages to help editors compare and improve articles, we also discovered that we can be even more successful by bringing in the source data directly, where available.
- Based on the study, and on an analysis of reference-usage statistics in Wikipedia and Wikidata, we will explore the following sync targets to bring additional, complete, and reliable information into GlobalFactSync:
- Musicbrainz.org for albums and singles
- Open Data from the German, Dutch, French, and Swiss national libraries, especially for person data
- Geonames.org for geographical data, and possibly the CIA World Factbook, which has also been used as a reference
- Polish and Brazilian data from the national statistical offices
- A complete list of power plants
Our goals for the second half are:
- for these sync targets, study the effectiveness of the "improve one Wikipedia/Wikidata with another" approach
- improve the speed of integrating external data so that we can scale to more sync targets
- compare the effectiveness of Wikiverse-internal syncing with that of including external sources
- For this midpoint, we had some engineering troubles and were not able to iterate on the GlobalFactSync browser in time. A new prototype will be published in January.
- January will also mark the beginning of the phase in which we start focused interface development, with the goal of extending the Wikipedia user script, the GFS Data Browser website, and Harvest Template, and of investigating plugin options in Wikibase.
- We held team meetings on a regular basis
- We participated in 3 events (WikidataCon, Wikimania, ISWC)
- We had 8 volunteers sign up to support our project.
Methods and activities
How have you setup your project, and what work has been completed so far?
Describe how you've setup your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.
With respect to the project plan on our proposal page, we did experience some deviations. Overall, the study part (A1) took longer than we anticipated. We do have a set of good sync targets, but are not at our target number of 10 yet (A4). We did, however, focus heavily on the integration of third-party data and references (A6), and have a solid integration method in place. Regarding the mappings (A2, A5), we decided to make them a major focus for the second half of the project.
Over the course of the last 6 months we...
- had team telcos on a regular basis (usually every 2 or 3 weeks) to discuss the current state of the project and the next steps.
- worked on our first micro-services
- attended the following events:
- presentation of the project idea to the participants of the 13th DBpedia Community Meeting to bring awareness to the project
- Johannes Frey presented the GFS project at Wikimania
- Marvin Hofer was at WikidataCon 2019 with a Global Fact Sync Poster
- were in contact with other Wikimedia users on our talk page to discuss the project in general, and individual sync targets in particular
- created a news page within our Meta-Wiki project page framework to keep volunteers in the loop and encourage exchange. This has led to more volunteers signing up for our 'GFS Feedback Squad' and to users leaving feedback on our sync target study.
- created an overview of the most frequent references used in Wikipedia infoboxes and Wikidata, and filtered out those with easily downloadable and integrable data
- found 6 good sync targets (people's birth dates - German National Library, music albums - MusicBrainz, geo coordinates - Geonames, Polish cities - Polish census data, power plants - Global Power Plant Database, Brazilian municipality codes - IBGE)
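The reference-frequency overview mentioned above can be sketched in a few lines: each reference link is reduced to its domain, and the domains are counted. This is a minimal illustration only — the real pipeline reads the full reference dumps, and the example links below are invented:

```python
from collections import Counter
from urllib.parse import urlparse

def top_reference_domains(links, n=20):
    # Reduce each reference link to its domain and count occurrences;
    # empty netlocs (malformed links) are skipped.
    domains = (urlparse(link).netloc.lower() for link in links)
    return Counter(d for d in domains if d).most_common(n)

# Hypothetical reference links as they might appear in infoboxes.
links = [
    "https://www.imdb.com/title/tt0133093/",
    "https://books.google.com/books?id=abc",
    "https://www.imdb.com/name/nm0000206/",
]
top = top_reference_domains(links)
# top is a list of (domain, count) pairs, most frequent first
```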
What are the results of your project or any experiments you’ve worked on so far?
Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.
We use data from Wikipedia infoboxes in different languages, from Wikidata and DBpedia, as well as third-party data, and fuse them into one big, consolidated dataset – a PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper.
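The fusion idea can be sketched as follows: values for the same property are collected from every source and kept together with their provenance, rather than being overwritten. The record layout below is purely illustrative — it is not the actual PreFusion JSON-LD schema, and the source names and figures are toy data:

```python
from collections import defaultdict

def prefuse(records):
    # Group every (property, value) pair under its property name,
    # keeping the source of each value as provenance.
    fused = defaultdict(list)
    for source, props in records.items():
        for prop, value in props.items():
            fused[prop].append({"value": value, "source": source})
    return dict(fused)

# Toy input: the same entity as seen by three sources.
records = {
    "enwiki":   {"population": 28280, "country": "Germany"},
    "dewiki":   {"population": 28153, "country": "Deutschland"},
    "wikidata": {"population": 28280},
}
fused = prefuse(records)
# fused["population"] now holds all three values with their provenance
```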
We deployed a set of microservices to show the current state of our toolchain.
- [Initial User Interface] The GFS Data Browser is our GlobalFactSync UI prototype (available at http://global.dbpedia.org), which shows all extracted information available for one entity across different sources. It can be used to analyze the factual consensus between different Wikipedia articles about the same thing. Example: look at the variety of population counts for Grimma. This is still a prototype, as we have focused mainly on the data behind it; over the next months we will work on further improving the GFS Data Browser.
- Extension to GFS Data Browser with quality and popularity measures from WikiRank. Example: http://dbpedia.informatik.uni-leipzig.de:9005/infobox?s=https://en.wikipedia.org/wiki/Warsaw
- [PreFusion JSON API] While the UI allows simple and fast browsing one entity at a time, we also provide raw access to the underlying data (the PreFusion dump). The query UI (http://global.dbpedia.org:8990; user: read, pw: gfs) can be used to run simple analytical queries. For example, we can determine the number of locations having at least one population value (1,194,007), but we can also focus on examples with data-quality problems (e.g. one of the 4,268 locations with more than 10 population values). Moreover, documentation for the PreFusion dataset and the download link for the data are available on the Databus website.
- [Reference Data Download] We ran the Reference Extraction Service over 10 Wikipedia languages. Download dumps here.
- [Reference Stat Analysis] We aggregated and counted the domains of all reference links found. The table below shows the Top 20; more detailed statistics can be downloaded here.
- [Reference Extraction Service] Good references are crucial for importing facts from Wikipedia into Wikidata. We are currently working with colleagues from Poznań University of Economics and Business on extracting references for facts from Wikipedia. A reference extraction microservice, currently under development, shows all references and the locations where they were spotted in the infobox – ad hoc – for a given article: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json (‘&format=tsv’ is also available)
- [Infobox Extraction Service] A similar ad hoc extraction of factual information from infoboxes and other Wikipedia article information is available here. This microservice displays information which can be extracted with the help of DBpedia mappings from an infobox e.g. from the German Facebook Wikipedia article: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?title=Facebook&revid=&format=trix&extractors=mappings. See here for more options: http://dbpedia.informatik.uni-leipzig.de:9999/server/extraction/.
- [ID service] Last but not least, we offer the Global ID Resolution Service. It ties together all available identifiers for one thing (i.e. at the moment all DBpedia/Wikipedia and Wikidata identifiers – MusicBrainz coming soon…) and shows their stable DBpedia Global ID.
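The kind of analytical query described for the PreFusion JSON API — counting locations with at least one population value, or flagging those with many conflicting values — can be illustrated on a toy, in-memory version of the dump. The data below is invented; the real queries run against the PreFusion dataset itself:

```python
# Toy PreFusion-style dump: entity name -> list of (value, source) population facts.
dump = {
    "Grimma":  [(str(28000 + i), "src%d" % i) for i in range(12)],  # many conflicting counts
    "Leipzig": [("601866", "wikidata")],
    "NoTown":  [],                                                  # no population data at all
}

# Entities with at least one population value ...
with_population = [name for name, facts in dump.items() if facts]

# ... and entities whose many conflicting values hint at data-quality problems.
suspicious = [name for name, facts in dump.items() if len(facts) > 10]
```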
We changed our definition of sync targets from data inside the Wikiverse to external data. Currently we have 6 suitable sync targets with corresponding third-party data, each providing an as-complete-as-possible list of all subjects of the target group:
- people's birth dates - German National Library
- music albums - MusicBrainz
- geo coordinates - Geonames
- Polish cities - Polish census data
- power plants - Global Power Plant Database
- Brazilian municipality codes - IBGE (Brazilian Institute of Geography and Statistics)
Instead of hiring a dedicated developer, we brought in four PhD students part-time (3 on the budget, 1 in-kind), plus additional in-kind development workforce (for mappings, data quality, and interface creation).
In total, we have underspent the current half-year budget, which will allow us to hire additional development workforce in the second phase to integrate the results seamlessly into Wikipedia and Wikidata. We have listed approximate spending in the finances tab, anonymised for privacy reasons; exact and transparent receipts will be given to Wikimedia at the end of the project.
The planned budget for the second half will include continued collaboration with I2G and developer workforce for dedicated tasks, either as small focused contracts or as internships.
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.
What are the challenges
What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.
- Finding suitable sync targets turned out to be more challenging than anticipated. There are certain requirements: the target has to be unambiguously identifiable, and we need to be able to find a complete list of data about the target from external sources.
- Finding openly licensed third-party data to serve as external references for these sync targets made the task even more challenging.
- We were also hoping to obtain more feedback from the community by setting up a News section, volunteer signup option and email, but so far the turnout is not as good as we would like it to be. We hope that engagement increases once we leave prototype status.
- The complexity of the Wiki-verse made us look into how Wikidata and Wikipedia operate and interact in more detail.
What is working well
What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.
- Grants:Learning patterns/Working_with_developers_who_are_not_Wikimedians
- Grants:Learning patterns/Git_repository_for_software
Next steps and opportunities
What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points.
- extend list of sync targets to 10
- use sync target properties to evaluate and clean mappings
- integrate all third-party data and references linked to sync targets into GFS data browser
- find a suitable way to estimate the accuracy of the data coming from a given source and integrate that measure into the GFS data browser
- improve the GFS data browser following the principles of agile software development
We’d love to hear any thoughts you have on how the experience of being a grantee has been so far. What is one thing that surprised you, or that you particularly enjoyed, from the past 6 months?
We enjoyed coming together as a team to work on the GlobalFactSync project and to gain deeper insight into how Wikidata and Wikipedia work. There are challenges in integrating all the data from these various sources (Wiki-verse and third-party data) into one data browser, but it is a great learning experience.
We especially value the feedback we received, as it helps us shape the project and gives us clear expectations to meet in the second half.