Grants:Project/DBpedia/GlobalFactSyncRE/Final





Welcome to this project's final report! This report shares the outcomes, impact and learnings from the grantee's project.

Part 1: The Project

Summary

The project GlobalFactSyncRE ran from June 2019 to November 2020. DBpedia, which frequently crawls and analyses over 120 Wikipedia language editions, has near-complete information about (1) which facts are in infoboxes across all Wikipedias and (2) where Wikidata is already used in those infoboxes. GlobalFactSyncRE will extract all infobox facts and their references to produce a tool for Wikipedia editors that detects and displays differences across infobox facts in an intelligent way, to help sync infoboxes between languages and/or Wikidata. The extracted references will also be used to enhance Wikidata.

Project Goals

Summary of Results

As a final outcome of the GFS project, we summarized all results in the paper Towards a Systematic Approach to Sync Factual Data across Wikipedia, Wikidata and External Data Sources, which was accepted and will be presented at the Qurator Conference 2021 (Wikimedia Germany is an organizing partner, DBpedia/InfAI a supporting partner). A preprint is publicly available. Throughout this final report, we refer to the paper where it relates to the project.


1. Visibility and Awareness

The GlobalFactSync tool will increase visibility and awareness of data consistency across multiple resources such as Wikipedias, Wikidata, and DBpedia. We envision the tool and the accompanying website as an enabler for these three communities to work together better in the future via this platform. Users can quickly see what data is available and where their help is needed the most. Supporting metrics and other data quality measures will make it possible to judge the overall progress of unification. The information provided will help Wikipedia editors to better judge the accuracy and completeness of the current infobox values in their Wikipedia edition by comparing them to other language versions and Wikidata. Furthermore, any edits made in Wikipedia infoboxes will be visible to other editors and thus allow information to spread among the three systems.

Goals achieved:

  • We implemented a prototype available at global.dbpedia.org. It will be further maintained by DBpedia.
  • We implemented a user script that lets people jump from any Wikipedia/Wikidata page to global.dbpedia.org to compare data across the Wikiverse.
  • We analysed a large amount of data (cf. the paper for details). Overall, Wikipedia's infoboxes grew by ~50% (from 489,805,025 to 725,422,849 facts) over the last three years (WP EN even doubled). The overlap of infobox data with Wikidata for the ~200 properties we analysed was around 15% (cf. fig 5, page 11). The conclusion here is that Wikidata contains mostly disjoint data from Wikipedia's infoboxes and needs to grow a lot to succeed in replacing infoboxes.
  • We implemented reference extraction and extracted 8.8 million references from 11 Wikipedia language editions, which help to identify outside sources for data acquisition. The reference extraction will be merged into the main DBpedia extraction and run each month.
  • We engaged around 50 individuals who had a technical background. The main feedback channels were the conference presentations and the project talk page (a full list is given below). Overall, the feedback was very helpful in reaching the final stage of GFS, i.e. conceptual clarity, analysed data and a prototypical implementation.
  • Dissemination is on-going with the publication of the paper and discussion at the Qurator conference.
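The infobox growth figure quoted in the list above can be checked with a few lines of arithmetic. The sketch below only uses the two fact counts stated in this report; the function name `relative_growth` is chosen here for illustration.

```python
# Fact counts quoted in this report (infobox facts across the analysed Wikipedias).
facts_before = 489_805_025
facts_after = 725_422_849


def relative_growth(before: int, after: int) -> float:
    """Return the growth from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


growth = relative_growth(facts_before, facts_after)
print(f"Infobox facts grew by {growth:.1%}")
```

The result is roughly 48%, which the report rounds to ~50%.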


2. Improvement of Wikipedia Infoboxes

Wikipedia infoboxes are maintained by Wikipedians who know the guidelines and best practices of their own Wikipedia language version best. The GlobalFactSync tool will leave the final decision on what to include in their infoboxes to these editors. We see the main goal of our project as providing a support tool that gives editors better information. Besides the facts shown in our prototype, DBpedia also has extensive technical information about which template is used with which values on which Wikipedia pages, which can be exploited. Editors can receive suggestions and snippets that they can copy into Wikipedia pages, which will greatly ease their editing effort. In general, we would also foster a higher degree of automation for infobox edits and an increased usage of Wikidata in Wikipedia. The integrated maintenance of mappings is a relevant step in this direction.

Goals achieved:

Deviation from goals:

Initially, the project was planned mainly to sync data internally between Wikipedia languages and Wikidata. After intensive discussion, we decided to carry out a more detailed analysis before committing to this goal. The main problem we pinpointed is the question of where data originally comes from. According to our findings, the following reasons justified a deviation:

  • Better integration of Wikidata in Wikipedia's infoboxes: Our analysis shows that Wikidata is still mostly disjoint data-wise from the data needed in Wikipedia's infoboxes. As we discovered a steady growth of data entered directly in Wikipedia's infoboxes, this gap seems to be extremely difficult to fill and would require a massive upload of Wikipedia's infobox data (via DBpedia) to Wikidata only to feed it back to Wikipedia.
  • Language to language: As shown in Fig 5 in the paper, a transfer of data via suggestion is possible for 39 Wikipedia language editions. As we focused on the analysis, the website global.dbpedia.org lacks two major development steps: (1) The interface is not yet intuitive enough. It shows a lot of data, but does not give good guidance on what to edit or improve. (2) The data used in the interface is based on the static Wikimedia data dumps and therefore becomes outdated quickly. Both issues can be tackled with additional effort, but were not resolved in this project. These two features are on-going and future work.

Conclusion:

Overall, we concluded that shifting data around between Wikipedias, Wikidata and DBpedia is suboptimal and could potentially create additional difficulties. The focus therefore shifted towards better exposing high-quality, authoritative data from external sources (in particular Linked Data and the Linked Open Data Cloud). In this manner, editors can review sources and pick individual facts to be cached for display in infoboxes. Preliminary integrations have been done using the German National Library, the Dutch National Library, Musicbrainz and the Dutch Land Office (Kadaster) as well as the Polish Statistical Office (stat.gov.pl). We described the progress in detail in Towards a Systematic Approach to Sync Factual Data across Wikipedia, Wikidata and External Data Sources. In particular, we also described criteria for sources and analysed the references currently used in Wikipedia to lay a foundation for future work.



3. Improvement of Wikidata

Statements on Wikidata items are primarily edited by Wikidatans, whereas data donations (such as the Freebase dataset) are to be ingested via the Primary Sources Tool. The GlobalFactSync project will contribute to Wikidata in the form of a dataset containing verified statements with their respective references. These facts can then be ingested via the Primary Sources Tool in order to add missing statements to Wikidata and to add references to already existing claims. Existing statements in Wikidata which already reference DBpedia itself or a specific Wikipedia language edition can be supplemented with more reliable references, e.g. the citations found in the respective Wikipedia articles. These additions will increase the completeness and trustworthiness of Wikidata statements. Beyond the data contributions created during this project, the software stack will be made available for continuous application and improvement.

Deviation from goal:

During the course of the GFS project, we intensively tested the Primary Sources Tool as well as Harvest Templates. We described the high-level problems encountered in Section 3: Data Curation Mechanics – Lessons Learned of the paper, i.e. a massive upload to Wikidata leads to a high post-processing burden for Wikidata editors. In the paper, we argue for a "link and cache" approach instead of "clean and copy", as it offers a multitude of advantages. Syncing with external sources is preferred. In the paper, we also analysed the sources of data used in Wikipedia/Wikidata.

Project Impact

Important: The Wikimedia Foundation is no longer collecting Global Metrics for Project Grants. We are currently updating our pages to remove legacy references, but please ignore any that you encounter until we finish.

Targets

  1. In the first column of the table below, please copy and paste the measures you selected to help you evaluate your project's success (see the Project Impact section of your proposal). Please use one row for each measure. If you set a numeric target for the measure, please include the number.
  2. In the second column, describe your project's actual results. If you set a numeric target for the measure, please report numerically in this column. Otherwise, write a brief sentence summarizing your output or outcome for this measure.
  3. In the third column, you have the option to provide further explanation as needed. You may also add additional explanation below this table.
For each planned measure of success (including the numeric target, if applicable), the actual result and explanation are given below.

Visibility and Awareness
  Planned measure of success: A prototype of the tool is already online and will stay online during the whole project lifetime. Measures that will guide the development are community feedback and fulfilled feature requests. Over the project duration, we expect to incorporate over 100 issues (bug fixes / feature requests) from community channels (mailing lists, wiki discussion fora and issue tracker). Another important measure of success is the number of users. We target to have over 500 unique users per month for the tool as a whole (including User Scripts and Gadgets) at project end. We expect these visitors to be core editors that can work with the tool effectively as a hub to be redirected and improve the data landscape in Wikipedia and Wikidata. We have already started to ask volunteers to help us with translating the UserScript and deploying it on more Wikipedias. Overall, we hope to deploy the UserScript on 15 language versions. While the development of the tool was funded by this proposal, the DBpedia Association is able to provide further hosting and maintenance after project end. In the long run, DBpedia also benefits from better data and structure in Wikipedia and Wikidata, thus creating an incentive to maintain and further develop the system created here. Overall, this brings the communities closer together.
  Actual result and explanation: The project needed more data analysis than expected, as described in the project goals. Overall, we did not implement user tracking. Code and work were published as open source on GitHub. Contact with technical users was broad and effective, in particular at several conferences.

Improvements of Infoboxes
  Planned measure of success: With the GlobalFactSync tool, we will follow a Pareto-efficient (20/80) approach. This means that we will target smaller Wikipedias that have fewer infoboxes, as well as analyse the data to see where edits on larger Wikipedias are the most effective, with possible customization for active WikiProjects. We have already received some feedback on how to optimize the tool to the interest of editors, which is very domain-specific, e.g. interest is focused on one topic/infobox or one property. Another efficient way is to compare specific templates with respect to completeness of values in Wikidata. In order to generate a specific infobox in one language from Wikidata, we can focus on completing Wikidata (with statements from other languages or by asking the community) for a given infobox. In order to demonstrate the results of GlobalFactSync, we will select 10 Sync Targets. A sync target can be: one infobox in one language, similar infoboxes across languages or a WikiProject. The sync target is reached when the infobox is in consensus with the rest of the WikiVerse and proper references exist. Also, the infobox attributes should be mapped to Wikidata properties. We can give global statistics on consensus between the different infoboxes and Wikidata. Consensus here is based on a conformity measure, e.g. how many of the values agree in the Wikiverse. The global statistic will give an overall sense of achievement and we can show what is in sync and what needs attention. Reaching the 10 sync targets will give a blueprint for other projects/interest groups to sync their area of interest.
  Actual result and explanation: Already covered above.

Improvements of Wikidata
  Planned measure of success: In this project a workflow will be set up that generates valuable datasets and references for ingestion to Wikidata. This dataset has to be of high quality and therefore must obey the following data quality rules: facts should be backed by multiple (2+) Wikipedia language editions, there should be no or only slight (<20%) contradiction between different language editions, facts need a reference in at least one language edition, and the references should be sufficiently described. The software created during this project will be made available for further improvement and application. As DBpedia is continuously improving its data and reference extraction capabilities, the GlobalFactSync tool chain will show its value in the long run as data is curated via the Primary Sources Tool. It is therefore of great importance to involve the community in the development of these processes. We will provide a workflow to provide continuously updated data for future ingestion.
  Actual result and explanation: Data analysis in the paper suggests that good sources should be integrated directly to achieve any significant improvement. Our work addressed the analysis of such sources and the potential for finding high quality statements.
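The data quality rules planned for the Wikidata dataset (backing by 2+ language editions, less than 20% contradiction, a reference in at least one edition) can be illustrated with a minimal sketch. This is not the project's implementation: the names `Observation` and `passes_quality_rules` are hypothetical, one observation per language edition is assumed, and "contradiction" is interpreted here as the share of editions that disagree with the most common value. The rule that references be "sufficiently described" is not modelled.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Observation:
    """One infobox value for a given (entity, property) pair (hypothetical model)."""
    language: str        # Wikipedia language edition, e.g. "en"
    value: str           # normalised infobox value
    has_reference: bool  # whether this edition cites a source for the value


def passes_quality_rules(
    observations: Sequence[Observation],
    min_editions: int = 2,
    max_contradiction: float = 0.20,
) -> Optional[str]:
    """Return the majority value if it satisfies the rules, otherwise None."""
    if not observations:
        return None
    counts = Counter(o.value for o in observations)
    value, support = counts.most_common(1)[0]
    # Assumption: contradiction = share of editions not agreeing with the majority.
    contradiction = 1 - support / len(observations)
    referenced = any(o.has_reference for o in observations if o.value == value)
    if support >= min_editions and contradiction < max_contradiction and referenced:
        return value
    return None


example = [
    Observation("en", "1900-01-01", True),
    Observation("de", "1900-01-01", False),
    Observation("fr", "1900-01-01", False),
    Observation("pl", "1900-01-01", False),
    Observation("nl", "1900-01-01", False),
    Observation("it", "1900-01-02", False),
]
print(passes_quality_rules(example))
```

In this made-up example five of six editions agree (about 17% contradiction) and one agreeing edition carries a reference, so the majority value is accepted. A fact found in only one edition, or one without any reference, would be rejected.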


Story

When we started the project, we expected to make easy progress, as most of the team had been working with Wikipedia's infoboxes and DBpedia for several years. Although we developed many prototypes, the most interesting and also hardest part of the project was the internal discussion. For the most part, we held weekly telcos where we sorted through all the feedback from the community and tried to structure it. As we went deeper into the topic, we found that the original project ideas would lead to a suboptimal solution that, although it would increase data in Wikidata and Wikipedia, would not be sustainable and would in fact create more problems later. The most inspiring moment was when we finalised the paper, which we consider a great achievement, as we managed to bring everything together in a holistic manner. This was suddenly much clearer than the confusing details we discussed in the weekly telcos, which were very grindy and had a lot of side branches that led nowhere.

Survey(s)

n/a

Other

The results are relevant for Wikimedia research and for strategizing about the future of data in Wikimedia.

Methods and activities

Please provide a list of the main methods and activities through which you completed your project.

  • We made a small study to choose two initial sync targets.
  • We translated existing DBpedia mappings to Wikidata properties and we checked the status of mappings for the two targets from the study.
  • GFS Data browser http://global.dbpedia.org/
  • We developed a prototype.
  • We worked on our first micro-services.
  • We made a detailed analysis regarding references in Wikipedia that are not in Wikidata.
  • We developed the GlobalFactSync tool.
  • We published a user script, available at User:JohannesFre/global.js, that shows links from each article and Wikidata to the Data Browser and Reference Web Service.
  • We developed a workflow to populate Wikidata via the Primary Sources Tool and through a bot.
  • A Wikimania 2019 talk was given by Johannes Frey, showcasing the GFS project and prototype.
  • A WikidataCon 2019 poster about the project was created and presented, and positive feedback was collected.

Project resources

Please provide links to all public, online documents and other artifacts that you created during the course of this project. Even if you have linked to them elsewhere in this report, this section serves as a centralized archive for everything you created during your project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

  • GFS Data browser http://global.dbpedia.org/
  • GitHub
  • User:JohannesFre/global.js shows links from each article and Wikidata to the Data Browser and Reference Web Service
  • We created a news page within our Meta-Wiki project page framework for volunteers to keep them in the loop and encourage exchange.
  • We created an overview of the most frequent references used in Wikipedia infoboxes and Wikidata.
  • Another analysis we have done is the potential gain for Wikidata for the three properties birth date, death date and population. The supporting data is available here.
  • The data flow in GlobalFactSync is visualized here.

13th DBpedia Community Meeting in Leipzig

Wikimania 2019

WikidataCon 2019

Tweets and blog posts about the Project

Qurator Conference 2021

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well

What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

What didn’t work

What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

  • We spent a lot of time devising and implementing ways to reach Wikipedians from several language editions and to include volunteers, i.e. we tried having them sign up as volunteers, implemented a ping mechanism for news, and considered a tree of people each gathering feedback from different WP language editions. However, this did not prove very effective. The best ways to get feedback were mailing lists, the talk page and meetings at conferences.

Other recommendations

If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.

Next steps and opportunities

Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.



Part 2: The Grant

Finances

Actual spending

Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them. The total amount requested is 63,000€/71,621 USD.

Item number | Category | Item Description | Approved amount | Actual funds spent | Notes | Source of funding
1 | Personnel cost | Software developer time | 63,000€ | 64,023€ | Two postdocs were paid part-time for development and project management. | This grant
2 | Personnel cost | Data Acquisition Support | 15,500€ | 15,500€ | Several developers (Johannes Frey, Marvin Hofer, Milan Dojchinovski) from the DBpedia Association assisted in different phases of the project and worked on extraction and fusion, supporting the project software development, data analysis and UI. | DBpedia Association
3 | Personnel cost | Supervision | 0€ | 0€ | Dr.-Ing. Sebastian Hellmann, DBpedia/InfAI, as project lead (in-kind). | Budget neutral
4 | Personnel cost | Supervision | 0€ | 0€ | Prof. Krzysztof Węcel from PUEB/I2G/DBpedia Poland as project lead (in-kind). | Budget neutral
5 | Travel | Travel budget and accommodation | 1,000€ | 1,279€ | Travel budget for the developer to go to Wikimania 2019 (Sweden) and WikidataCon 2019 (Berlin). | DBpedia Association
6 | Equipment | Laptop | 1,000€ | 1,000€ | Used by the developer during his work. | DBpedia Association
7 | | Server | 0€ | 2,000€ | Used by all developers during work. | Budget neutral

Project support obtained from DBpedia (planned): 17,500€
Project support obtained from DBpedia (actual spent): 17,779€
Project funding requested from Wikimedia Foundation (planned): 63,000€
Project funding spent from Wikimedia Foundation: 64,023€
Total project cost: 81,802€
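As a cross-check of the summary figures above, the following sketch re-adds the actual amounts from the table. The dictionary keys are labels chosen here for illustration. Note that the reported DBpedia total only matches if the budget-neutral server (2,000€) is left out; this is an inference from the figures, not something the report states.

```python
# Actual amounts in euros, taken from the spending table in this report.
dbpedia_actual = {
    "Data Acquisition Support": 15_500,
    "Travel and accommodation": 1_279,
    "Laptop": 1_000,
}
wmf_approved = 63_000
wmf_actual = 64_023

dbpedia_total = sum(dbpedia_actual.values())
total_project_cost = wmf_actual + dbpedia_total
wmf_difference = wmf_actual - wmf_approved

print(f"DBpedia support (actual): {dbpedia_total:,}€")
print(f"Total project cost: {total_project_cost:,}€")
print(f"Grant-funded spending above approved amount: {wmf_difference:,}€")
```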

Remaining funds

Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.

  • No.

If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:

Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin(_AT_)wikimedia.org, according to the guidelines here?

Please answer yes or no. If no, include an explanation.

  • Yes, we did.

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

  • Yes.

Is your project completed?

Please answer yes or no.

Grantee reflection

We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being a grantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the Project Grant experience? Please share it here!