From Meta, a Wikimedia project coordination wiki

status: not selected
summary: DBpedia, which frequently crawls and analyses over 120 Wikipedia language editions, has near-complete information about (1) which facts appear in infoboxes across all Wikipedias and (2) where Wikidata is already used in those infoboxes. GlobalFactSync will extract all infobox facts and their references to produce a tool for Wikipedia editors that detects and displays differences across infobox facts in an intelligent way, helping to sync infoboxes between languages and/or Wikidata. The extracted references will also be used to enhance Wikidata.
target: All Wikipedia language editions + Wikidata
amount: 63,000€ / 78,205 USD
organization: DBpedia Association
this project needs...
created on: 10:14, 9 January 2018 (UTC)

Project idea[edit]

What is the problem you're trying to solve?[edit]

Wikipedians have spent great effort over the last decade collecting facts in infoboxes. The problems that come with infoboxes are well known in the Wikipedia community: infoboxes are tedious to maintain and structure, and facts cannot be compared with or imported from other language versions with richer infoboxes. Furthermore, many different and sometimes conflicting statements are made across languages for the same item, and these are difficult to spot and improve. While these issues ultimately led to the establishment of Wikidata, the richness of information curated by thousands of Wikipedia editors has not yet been transferred to Wikidata. In particular, Wikidata is missing references for two thirds of its statements (see Wikicite 2017, starting on slide 15). These missing references can already be found in Wikipedia's infoboxes, and transferring them efficiently would greatly strengthen trust in Wikidata.

The main problem is the need for a synchronisation mechanism between all the data contained in the Wikiverse, which can be broken down into two sub-problems:

Problem 1 (WP2WP, WD2WP): Facts from Wikipedia Infoboxes are not managed on a global scale

WP2WP refers to the comparison between the infoboxes of different language versions, such as the English and German pages for the same item. WD2WP adds the respective Wikidata statement to the WP2WP comparison.

Example 1:

Example 2:

The core of this problem is a question that can only be answered on a global scale:

  1. Regarding each infobox fact in each language version, is there better information available in another Wikipedia language version or in Wikidata?
  2. What is the best strategy for resolving the differences?
    1. Edit the infobox with information from another language?
    2. Edit the infobox and import Wikidata?
    3. Edit Wikidata?
    4. Update infoboxes in other languages?
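The strategy choice above can be sketched as a simple rule-based function. This is a hypothetical illustration (the function name and heuristics are ours, not the project's actual algorithm): prefer the value that Wikidata and the majority of language editions agree on, and derive the edit direction from who disagrees.

```python
from collections import Counter

def choose_sync_strategy(local_value, other_values, wikidata_value):
    """Pick a resolution strategy for one infobox fact (illustrative only)."""
    majority_value, majority_count = Counter(other_values).most_common(1)[0]

    if local_value == wikidata_value == majority_value:
        return "in-sync"                           # nothing to do
    if wikidata_value == majority_value:
        return "edit-infobox-import-wikidata"      # strategy 2
    if local_value == majority_value:
        return "edit-wikidata"                     # strategy 3
    if majority_count > 1:
        return "edit-infobox-from-other-language"  # strategy 1
    return "update-other-languages"                # strategy 4
```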

The challenge here is to provide a global comparative view for Wikipedia/Wikidata editors that pinpoints issues like the examples given for Poznan and Ethanol.

Problem 2 (WP2WD): Manually created facts from Wikipedia infoboxes are not included in Wikidata.

While we described the WP2WP and WD2WP problems above, we consider the Wikipedia-to-Wikidata (WP2WD) problem separately, since Wikidata plays a central role where data is concerned. Wikidata has successfully resolved the problems with inter-language links, though the data from infoboxes is to a large extent still missing. Wikipedia editors already spend much effort linking references in article infoboxes.

Example 1:

Example 2:

  • Q153 - Ethanol has two entries each for boiling and melting point: the Fahrenheit values are referenced properly, while the Celsius values reference the Dutch Wikipedia.
  • The German Wikipedia article for Ethanol has a total of 27 references (13 unique) in its infobox. Both boiling and melting point are properly referenced in Celsius (with values that differ from Wikidata's).
  • etc.

This problem has these challenges:

  1. Identify the relevant infoboxes that have referenced facts corresponding to Wikidata properties.
  2. Find the respective line in the infobox that contains the statement.
  3. Find the primary source reference from the infobox relating to that fact.
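Challenges 2 and 3 amount to locating an attribute line inside infobox wikitext and pulling out its <ref> citation. A minimal regex-based sketch (real infobox wikitext is far messier, with nested templates and named refs, so this is illustrative only):

```python
import re

def find_referenced_fact(wikitext, attribute):
    """Return the value and <ref> content for one infobox attribute, if present."""
    pattern = re.compile(
        r"\|\s*" + re.escape(attribute) +
        r"\s*=\s*(?P<value>[^<\n|]*?)\s*"
        r"(?:<ref[^>/]*>(?P<ref>.*?)</ref>)?\s*$",
        re.MULTILINE,
    )
    m = pattern.search(wikitext)
    if not m:
        return None
    return {"value": m.group("value"), "reference": m.group("ref")}

sample = """{{Infobox settlement
| name = Poznan
| population_total = 544612<ref>{{cite web|url=http://example.org/stat}}</ref>
}}"""
print(find_referenced_fact(sample, "population_total"))
```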


The problem we are addressing has a global impact on managing all statements in the Wikiverse

In order to convey the magnitude of this proposal, we have created a prototypical website that demonstrates the direction we envision for the GlobalFactSync tool, one of the major deliverables of this proposal. We have analysed Wikidata entities Q1-Q10000 and compared core values with infoboxes from 128 language versions:

The index of the GlobalFactSync Early Prototype is accessible here:

The prototype was created some months ago. Its data is outdated by now, but it will be updated and kept live in the course of this project. The index lists the potential statements that can be found in Wikipedia's infoboxes but not in Wikidata. Overall, for the first 10,000 Q's, we found 156,538 values in Wikipedia infoboxes that may be missing in Wikidata, extrapolating to approximately 300 million facts for the whole of Wikidata (20 million articles). A corrected value will be determined as part of this project.

If you look at the detail page for Poznan, you can see the differences in population count (P1082). In total there are 18 values across all Wikipedia versions, forming 10 clusters with a uniformity value of 0.278. Wikidata (truthy facts) agrees only with the Russian Wikipedia:

"564951" (es) "557264" (ja) "551627" (en,gl,id,it,mk) "546829" (nl,sv) "550742" (be,sr) "544612" (ru,wikidata) "552" (el) "553564" (bg,cs) "571" (eu) "546503" (ga)
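The uniformity value of 0.278 can be reproduced if one assumes it is the share of the largest agreeing cluster (5 of 18 values agree on "551627"); the exact definition used by the prototype is our assumption here.

```python
from collections import Counter

# One population value per source, as listed above.
values = {
    "es": "564951", "ja": "557264",
    "en": "551627", "gl": "551627", "id": "551627", "it": "551627", "mk": "551627",
    "nl": "546829", "sv": "546829",
    "be": "550742", "sr": "550742",
    "ru": "544612", "wikidata": "544612",
    "el": "552", "bg": "553564", "cs": "553564",
    "eu": "571", "ga": "546503",
}

clusters = Counter(values.values())         # distinct values -> cluster sizes
uniformity = max(clusters.values()) / len(values)
print(len(clusters), round(uniformity, 3))  # 10 0.278
```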

As mentioned before, we found 156,538 potential missing values in Wikidata for the first 10,000 Q's. The number of differences in infoboxes across languages (Wikipedia to Wikipedia) is expected to be even larger, but we have not fully analysed it yet.

Another analysis we have done concerns the potential gain for Wikidata for three properties: birth date, death date, and population. This covers only three properties, but all 42,574,760 data items:

6823 birthDate_gain_wiki.nt
3549 deathDate_gain_wiki.nt
362541 populationTotal_gain_wiki.nt
372913 total

The supporting data is available here. The analysis shows that Wikidata is already quite complete with respect to birth and death dates (only about 10k missing values found). However, the Wikipedias have 362k more values for population count.

A detailed analysis of references present in Wikipedia but absent from Wikidata is part of this proposal.

What is your solution to this problem?[edit]

Background and Assets[edit]

DBpedia has crawled and extracted infobox facts from all Wikipedia language editions for the past 10 years. This information has been stored in a database to be queryable. In addition, we have published this data under CC-BY-SA on the web and thus made it freely accessible to everyone.

The main assets that DBpedia brings into this project are:

The DBpedia Information Extraction Framework (DIEF) is an open-source software with over 30 different extractors that can extract most of the information contained in Wikipedia's wikitext. The software is written in Scala and has been localized to understand all major Wikipedia editions (170 languages in total). We currently do extractions twice a year for a total of 14 billion facts, but a rewrite (using Spark) is currently under way to enable weekly as well as realtime extraction. The software has been the state of the art in information extraction from Wikipedia for over 10 years and has been forked over 200 times on GitHub.

DBpedia's data is available as dump download and also queryable via an online service. Prominent datasets include the data of all infoboxes as well as the category system, custom templates like the German persondata, redirects, disambiguation pages (all localized by our community), geocoordinates and much more.

Mappings from infoboxes to DBpedia's schema We have a mappings wiki in place where over 300 editors maintain translations between the different infoboxes in around 50 languages, i.e. the information on what the values in the individual templates are called. It is, to the best of our knowledge, the most complete information source unifying all existing templates across the Wikipedia language editions. From it we can easily tell that the Dutch infobox for basketballer uses 'geboortedatum' and the Dutch infobox for componist uses 'geboren' for birthDate:

These are just two examples. In total, over 77.48% of all template occurrences and 50% of all infobox attributes are mapped (see the statistics page).

Mappings from DBpedia Schema to Wikidata properties In a separate file, we further keep a mapping of all the names to Wikidata properties. From there, we can easily triangulate which value in which infobox relates to which property in Wikidata.
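The triangulation can be sketched as two dictionary lookups: infobox attribute to DBpedia ontology property, then DBpedia property to Wikidata property. The tables below are tiny illustrative excerpts; the real mappings live in the mappings wiki and the separate property-mapping file.

```python
# (lang, template, attribute) -> DBpedia ontology property
TEMPLATE_TO_DBPEDIA = {
    ("nl", "basketballer", "geboortedatum"): "birthDate",
    ("nl", "componist", "geboren"): "birthDate",
}
# DBpedia ontology property -> Wikidata property id
DBPEDIA_TO_WIKIDATA = {
    "birthDate": "P569",         # date of birth
    "populationTotal": "P1082",  # population
}

def wikidata_property(lang, template, attribute):
    """Triangulate an infobox attribute to its Wikidata property, if mapped."""
    dbo = TEMPLATE_TO_DBPEDIA.get((lang, template, attribute))
    return DBPEDIA_TO_WIKIDATA.get(dbo)

print(wikidata_property("nl", "componist", "geboren"))  # P569
```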

Informed Decisions of Editors[edit]

The tool we are proposing as a solution (GlobalFactSync) will serve the editors of Wikipedia and Wikidata. Its main goal is therefore to present good information about which facts and references are available in the Wikiverse. We have already given examples above: the population count of Poznan (fun fact: Poznan is also the seat of the Polish DBpedia chapter) and the boiling point of Ethanol.

The main information an editor can quickly gain here is that there are currently 8 different values in use across Wikipedia and that 5 language versions (en, gl, id, it, mk) agree on one value. It is also easy for a human to see that eu and el have suffered an error in the automatic extraction. While this information is already quite helpful for keeping an infobox up to date and improving it, we can do much better in the GlobalFactSync project, because in the end the reference is a valuable criterion for deciding which value might be the correct (or latest) one. One solution here is to provide an algorithm that sorts and presents the information better, in the following manner:

  • Values with references are considered better; the editor can, based on their knowledge, judge the validity of the source.
  • Values with dates can help to judge the recency of the information.
  • Outliers like el and eu can be hidden or ranked down, as they might stem from an extraction error.
  • Majority vote (i.e. which value is adopted by most Wikipedias) is a form of consensus, which is one of the principles of Wikipedia.

Note: the sort order is meant to be contextual, i.e. it will be adapted for different types of information. In some cases, certain references are authoritative and are thus rated higher. In the end the editor needs to make the decision and, if necessary, use the talk page.
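The four criteria could be combined into a single score per candidate value. A minimal sketch with invented weights (the actual algorithm would be contextual, as noted above):

```python
def score_value(entry, total):
    """Score one candidate value for an infobox fact.

    `entry` holds 'has_reference', 'date' and 'support' (number of language
    editions carrying this value); `total` is the overall number of values.
    The weights are illustrative only.
    """
    score = 0.0
    if entry.get("has_reference"):
        score += 2.0                    # referenced values are considered better
    if entry.get("date"):
        score += 1.0                    # dated values help judge recency
    score += entry["support"] / total   # majority vote as weak consensus
    if entry["support"] == 1 and total >= 10:
        score -= 1.5                    # lone outliers (el, eu) are likely extraction errors
    return score
```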

Mapping maintenance[edit]

In order to synchronize the facts of multiple language Wikipedia infoboxes and Wikidata, we need to maintain mappings between the templates in use. It is evident that Wikidata properties can serve as mapping targets for the template parameters. The existing mappings from Wikipedia templates to the DBpedia ontology schema already have broad coverage for the bigger community Wikipedias; these can easily be converted to Wikidata properties, which gives us a good set to begin with. From there, these mappings can be maintained and continuously completed in the GlobalFactSync tool. Additionally, a few Wikipedia language editions already maintain such mappings for selected Wikidata-filled infoboxes, e.g.

It is not yet clear how to incorporate these mappings directly, but the mappings resulting from this project can be evaluated against them.

GlobalFactSync tool[edit]

Mockup of the comparison view of the GlobalFactSync tool

As a major outcome of the project, we will continuously improve the prototype, which is already online. The tool will allow per-page access and give a detailed overview of the available data, and will thus serve as an access point to check data completeness and the state of the infoboxes in Wikipedia, and to highlight the values that differ between the Wikipedia versions and Wikidata.

Besides the intended effects of improving the factual data, we also see great potential in bringing all communities closer together to work towards the unified goal of providing knowledge to the world.

Data freshness: at the moment, data is loaded from a static dataset. To be effective, we will keep the data as fresh as possible by integrating the Wikidata API and using live extraction that can analyse infoboxes on the fly. Further APIs such as Diffbot and StrepHit can also be integrated.

Interaction: Wikipedians should be able to interact in several ways. We will include proper edit links, pre-filled forms (where possible), template suggestions and better sorting (i.e. highlighting the most obvious and pressing differences and moving them to the front) to allow editors to work through pages quickly. We envision creating a User Script or Gadget that supports editors in creating infoboxes. Usability feedback will be collected through several channels so that we can improve interactivity based on the wishes of the community.

Smart fact and reference rating: We will devise algorithms to rate and rank facts, in order to sort out and suggest the facts with the highest quality. The algorithm will aim to be Pareto-efficient, meaning that 80% of the results can be achieved with 20% of the editors' work. Once the Wikiverse is synchronised above a critical mass, we hope that the collaboration between all three communities will be established well enough that other projects and ideas will spawn. We will also rank references. The language of a reference can be determined with an HTTP HEAD request and the Content-Language header. For example, a Catalan reference is preferred for the Catalan Ethanol article. Wikidata should have multilingual references in all languages.
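The Content-Language check can be sketched as follows. In practice one would issue an HTTP HEAD request (e.g. `requests.head(url)`) and read the response headers; the example parses the header offline so it stays self-contained.

```python
def reference_language(headers):
    """Return the primary language tag from a Content-Language header, if any."""
    value = headers.get("Content-Language", "")
    first = value.split(",")[0].strip().lower()  # header may list several tags
    return first or None

def prefer_local_references(refs, article_lang):
    """Sort references so those matching the article language come first."""
    return sorted(refs, key=lambda r: reference_language(r["headers"]) != article_lang)

refs = [
    {"url": "http://example.org/en", "headers": {"Content-Language": "en"}},
    {"url": "http://example.org/ca", "headers": {"Content-Language": "ca, es"}},
]
print(prefer_local_references(refs, "ca")[0]["url"])  # the Catalan reference first
```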

Presentation to Editors via UserScript and Gadgets[edit]

Wikipedia editors can receive suggestions for infobox templates. To this end, pre-filled infobox templates will be auto-generated, similar to the approach of the Wikidata module. To generate these suggestions, we will use a mapping between the relevant template parameters and Wikidata property ids.
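Generating a pre-filled template from such a mapping can be sketched as below; the template and parameter names are hypothetical examples, not the actual module's output.

```python
def suggest_infobox(template_name, param_to_property, claims):
    """Emit ready-to-paste wikitext from Wikidata claims via a parameter mapping."""
    lines = ["{{" + template_name]
    for param, pid in param_to_property.items():
        value = claims.get(pid)
        if value is not None:
            lines.append(f"| {param} = {value}")
    lines.append("}}")
    return "\n".join(lines)

wikitext = suggest_infobox(
    "Infobox settlement",
    {"official_name": "P1448", "population_total": "P1082"},
    {"P1448": "Poznan", "P1082": "544612"},
)
print(wikitext)
```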

Adoption of the tool will proceed in three phases. In the first phase, we will create UserScripts for as many Wikipedia language editions as possible. Once a UserScript is accepted by editors, we can ask the admins to upgrade it to a Gadget to reach a wider audience. If this adoption is successful and has shown its usefulness, we will approach the VisualEditor community to discuss inclusion there. The tool we are proposing will contain a strong backend focused on data analysis that can easily be adopted by a variety of interfaces.

Data and Reference Acquisition[edit]

Data flow in GlobalFactSync

Overall, DBpedia holds data complementary to the Wikidata effort, in the same way that Wikipedia infoboxes hold data complementary to Wikidata, which is the focus of this proposal. As discussed with Lydia Pintscher at the DBpedia Community Meeting in Leipzig in 2016, DBpedia data could not be used for Wikidata directly, as only the facts are extracted from infoboxes, not the references for those facts. It would be straightforward to import these facts via DBpedia as a middle layer into Wikidata if the existing citations and qualifiers in Wikipedia were extracted as well. During the last two years, we have discussed the extraction of references within the DBpedia community. From a technical perspective, the necessary adjustments to the DBpedia software are relatively easy and can be finished within a few months of work, resulting in a high-precision extraction of references, which would benefit:

  • Wikidata
  • Information provided by the GlobalFactSync Tool to Wikipedia Editors

Having an overview of all references used in Wikipedia also provides good insight into which external datasets might be interesting for Wikidata in the future.

Note on References[edit]

In this proposal, we focus on extracting existing references from Wikipedia's infoboxes. In our previous proposal we described in detail how to extract facts and references from the natural-language article text. We excluded this from the current proposal to stay focused. We have a collaboration with Diffbot and others, who form the DBpedia Text Extraction department. Work on this has begun and can be reused here. Diffbot also crawls the web regularly and can provide additional references for facts found on the web with high precision.

GlobalFactSync ingest (transfer of data to Wikidata)[edit]

Once the data is sufficiently prepared, we will first set up an ingestion workflow via the Primary Sources Tool, i.e. create an ingestable dataset in the form of QuickStatements or Wikidata-compliant RDF. Beyond these manual approaches, we will evaluate the possibility of importing part of the data in bulk. Prerequisites for such selected portions are, of course, that the data has been thoroughly evaluated and that proper references exist. We are confident that the quality of the applied references will be higher than that of the references currently provided by the simple bots which so far have been importing statements from Wikipedia language editions.
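An ingestable QuickStatements (v1) line is tab-separated, with sources prefixed by `S` (S854 corresponds to the "reference URL" property). A simplified sketch with a made-up item and value:

```python
def quickstatement(item, prop, value, ref_url=None):
    """Build one QuickStatements v1 line, optionally with a reference URL source."""
    parts = [item, prop, value]
    if ref_url:
        parts += ["S854", f'"{ref_url}"']  # S854 = reference URL
    return "\t".join(parts)

line = quickstatement("Q1234", "P1082", "544612", "http://example.org/stat")
print(line)
```

Note that real quantity values with units, and dates, need QuickStatements' own value syntax; this sketch covers only the plain case.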

Project goals[edit]

1. Visibility and Awareness[edit]

The GlobalFactSync tool will increase visibility and awareness of data consistency across multiple resources such as the Wikipedias, Wikidata, and DBpedia. We envision the tool and the accompanying website as an enabler for these three communities to work together better in the future via this platform. Users can quickly see what data is available and where their help is needed the most. Supporting metrics and other data quality measures will make it possible to judge the overall progress of unification. The information provided will help Wikipedia editors better judge the correctness and completeness of the current infobox values in their Wikipedia edition by comparing them with other language versions and Wikidata. Furthermore, any edits made to Wikipedia infoboxes will be visible to other editors and will thus allow information to spread among the three systems.

2. Improvement of Wikipedia Infoboxes[edit]

Wikipedia infoboxes are maintained by Wikipedians who know the guidelines and best practices of their Wikipedia language version best. The GlobalFactSync tool will leave the final decision on what to include in their infoboxes to these editors. We see our project mainly as a support tool that provides better information to editors. Besides the facts shown in our prototype, DBpedia also has extensive technical information about which template is used with which values on which Wikipedia pages, which can be exploited. Editors can receive suggestions and snippets that they can copy into Wikipedia pages, greatly easing their editing effort. In general, we would also foster a higher degree of automation for infobox edits and an increased usage of Wikidata in Wikipedia. The integrated maintenance of mappings is a relevant step in this direction.

3. Improvement of Wikidata[edit]

Statements on Wikidata items are primarily edited by Wikidatans, whereas data donations (such as the Freebase dataset) are to be ingested via the Primary Sources Tool. The GlobalFactSync project will contribute to Wikidata in the form of a dataset containing verified statements with their respective references. These facts can then be ingested via the Primary Sources Tool in order to add missing statements to Wikidata and to add references to already existing claims. Existing statements in Wikidata which currently reference DBpedia itself or a specific Wikipedia language edition can be supplemented with more reliable references, e.g. the citations found in the respective Wikipedia articles. These additions will increase the completeness and trustworthiness of Wikidata statements. Beyond the data contributions created during this project, the software stack will be made available for continuous application and improvement.

Project impact[edit]

How will you know if you have met your goals?[edit]

1. Visibility and Awareness[edit]

  1. Output: A prototype of the tool is already online and will stay online during the whole project lifetime. Measures that will guide development are community feedback and fulfilled feature requests. Over the project duration, we expect to incorporate over 100 issues (bug fixes / feature requests) from community channels (mailing lists, wiki discussion fora and the issue tracker). Another important measure of success is the number of users. We aim to have over 500 unique users per month for the tool as a whole (including User Scripts and Gadgets) by project end. We expect these visitors to be core editors who can use the tool effectively as a hub for being redirected to, and improving, the data landscape in Wikipedia and Wikidata. We have already started asking volunteers to help us translate the UserScript and deploy it on more Wikipedias. Overall, we hope to deploy the UserScript on 15 language versions.
  2. Outcome: While the development of the tool is funded by this proposal, the DBpedia Association is able to provide further hosting and maintenance after project end. In the long run, DBpedia also benefits from better data and structure in Wikipedia and Wikidata, creating an incentive to maintain and further develop the system created here. Overall, we hope that this will bring the communities closer together (though we cannot give a number for this).

2. Improvements of Infoboxes[edit]

  1. Output: With the GlobalFactSync tool, we will follow a Pareto-efficient (20/80) approach. This means that we will on the one hand target smaller Wikipedias that have fewer infoboxes, and on the other hand analyse the data to see where edits on larger Wikipedias are most effective, possibly customizing the tool for active WikiProjects. We have already received feedback on how to tailor the tool to the interests of editors, which are very domain-specific, e.g. an interest in one topic/infobox or one property. Another efficient approach is to compare specific templates against the completeness of values in Wikidata: in order to generate a specific infobox in one language from Wikidata, we can focus on completing Wikidata (with statements from other languages or by asking the community) for that infobox. To demonstrate the results of GlobalFactSync, we will select 10 sync targets. A sync target can be one infobox in one language, similar infoboxes across languages, or a WikiProject. A sync target is reached when the infobox is in consensus with the rest of the Wikiverse and proper references exist. The infobox attributes should also be mapped to Wikidata properties.
  2. Outcome: We can give global statistics on the consensus between the different infoboxes and Wikidata. Consensus here is based on a conformity measure, e.g. how many of the values in the Wikiverse agree. The global statistics will give an overall sense of achievement, and we can show what is in sync and what needs attention. Reaching the 10 sync targets will provide a blueprint for other projects and interest groups to sync their own areas of interest.

3. Improvements of Wikidata[edit]

  1. Outcomes: Two thirds of existing Wikidata statements have missing references or "illegal" references to Wikipedia. Based on the mapping of infoboxes to Wikidata and the extraction of references, we can identify references from Wikipedia for these statements specifically. Provided these completions can be made automatically[1], we expect to add at least 500,000 missing references from Wikipedia to already existing statements (we found 913,695 for English already, see talk page). Based on the analysis of the infoboxes and the sync targets, we set the target that GlobalFactSync users will add at least 100,000 missing curated statements with proper references to Wikidata. These statements will be highly targeted, to specifically increase the completeness and trustworthiness of Wikidata for the 10 sync targets. As a secondary goal, we will also produce more high-quality statements with references that can be vetted via the Primary Sources Tool or QuickStatements. As a side effect, we will provide the whole corpus of extracted references to the Wikidata community, from which sources can be found for linking and authority control.
  2. Output: In this project, a workflow will be set up that generates valuable datasets and references for ingestion into Wikidata. This dataset has to be of high quality and must therefore obey the following data quality rules: facts should be backed by multiple (2+) Wikipedia language editions; there should be no or only slight (<20%) contradiction between different language editions; facts need a reference in at least one language edition; and the references should be sufficiently described. The software created during this project will be made available for further improvement and application. As DBpedia is continuously improving its data and reference extraction capabilities, the GlobalFactSync tool chain will show its value in the long run as data is curated via the Primary Sources Tool. It is therefore of great importance to involve the community in the development of the processes concerned. We will provide a workflow for continuously updated data for future ingestion.
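The data quality rules above can be sketched as a filter. The contradiction measure below (share of values disagreeing with the majority) is our assumption of how "<20% contradiction" would be computed:

```python
def passes_quality_rules(candidate, support_counts):
    """Check one fact against the ingestion quality rules sketched above.

    `candidate` holds 'editions' (number of language editions backing the
    value) and 'references'; `support_counts` lists the support of every
    distinct value for this fact.
    """
    total = sum(support_counts)
    contradiction = 1 - max(support_counts) / total
    return (
        candidate["editions"] >= 2             # backed by multiple editions
        and contradiction < 0.20               # only slight disagreement
        and len(candidate["references"]) >= 1  # referenced in at least one edition
    )

fact = {"editions": 9, "references": ["http://example.org/stat"]}
print(passes_quality_rules(fact, [9, 1]))  # True: 10% contradiction
```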

Do you have any goals around participation or content?[edit]

In particular, we hope to increase participation around the mapping of infobox attributes (template parameters) to Wikidata properties, as this is the basis for synchronisation and for reaching critical mass with respect to data quality and completeness.

Project plan[edit]


Our proposal involves the following tasks:

ID | Title | Description | Month
A1 | Study | Conduct a small study to choose two initial sync targets, e.g. a popular WikiProject (Monuments) that can be improved in a smaller language, and another target with less developed infoboxes and high variety among languages. Analyse the lack of references in Wikidata, e.g. which entities and properties are especially badly referenced. | M1
A2 | Mapping Generation | Translate existing DBpedia mappings to Wikidata properties. Check the status of the mappings for the two targets from the study. | M1-M3
A3 | Source References for Facts | Extract source references for facts within Wikipedia infoboxes. The resulting dataset (reference corpus) will be published. The task concludes with an initial ingestion-ready dataset. | M1-M6
A4 | GlobalFactSync tool | Extend the current early prototype with new features (language filter, snippet suggestion). The task concludes with a more user-friendly GlobalFactSync tool, which will be tested with the two targets. Based on the experience gained, 8 more targets will be chosen. | M2-M8
A5 | Mapping Refinements | Complement the existing mappings from infobox templates to Wikidata with a data-driven method based on the study. Enrich the mappings to match infobox template requirements for template suggestions. | M4-M6
A6 | Third-party Data and Reference Integration | Integrate facts from, and references to, external datasets. The task concludes with data and reference additions to the ingestion-ready dataset. | M7-M9
A7 | GlobalFactSync Wikidata ingest | Develop a workflow to populate Wikidata via the Primary Sources Tool and through a bot. It delivers data available in the Primary Sources Tool in M10 and concludes with bot-ingestable data in M12. | M10-M12
A8 | GlobalFactSync Sprints | Conduct two GlobalFactSync Sprints with the help of the community. Execute and evaluate one sprint in M8 and a second in M11. During M8 and M11, the tool will be extended based on sprint feedback and UserScripts will be created. | M8 & M11
A9 | Community dissemination | Promote the project and present the GlobalFactSync tool at different community events. The 10 sync targets will be published as success stories. | M1-M12



The total amount requested is 63,000€/78,205 USD.

Item number | Category | Item Description | Units for 12 months | Total cost | Notes | Source of funding
1 | Personnel cost | Software developer time | 12 PM | 63,000€ | Full-time position (40 hrs/week) | This grant
2 | Personnel cost | Data Acquisition Support | 6 PM | 31,500€ | Two developers from the DBpedia Association working on extraction and fusion will support the project; we expect a workload of 10 h/week each. | DBpedia Association
3 | Travel | Travel budget and accommodation | 1 | 2,500€ | Travel budget for the developer to attend Wikimania 2018 (Cape Town). | DBpedia Association
4 | Equipment | Laptop | 1 | 1,000€ | Used by the developer during the project. | DBpedia Association

Project support obtained from DBpedia: 35,000€
Project funding requested from Wikimedia Foundation: 63,000€
Total project cost: 98,000€
Total amount requested: 63,000€ / 78,205 USD

Community engagement[edit]

The community engagement strategy aims to provide communication and exchange platforms to discuss the progress of the project, to interface with users, and to gather their feedback. In particular, data science researchers and developers will be involved and asked to give feedback. The community engagement process will include postings on social media and mailing lists as well as presentations of the project results at community events and conferences.

WikiProjects/Interest Groups[edit]

While the tool we propose is very generic and can be applied to a huge number of infoboxes and Wikidata reference issues, we are planning to focus efforts on smaller communities within Wikipedia/Wikidata where our tool can be of specific benefit. We have not yet contacted these groups. In M1 we will conduct a study to see in which areas the Wikiverse is especially underdeveloped and where our tool will therefore be most effective. Based on this study, we will try to engage the matching communities. Our volunteers can mediate contact for the respective languages (Greek, Czech, Macedonian and Spanish are already well represented).

DBpedia Community Meeting Vienna[edit]

We will present this proposal at the 12th DBpedia Community Meeting in Vienna and will continue advertising the progress of the project at the following biannual Community Meetings.

Wikimania 2018[edit]

We will present the tool which supports the editing of infoboxes at the Wikimania 2018 in Cape Town.

GlobalFactSync Sprint[edit]

In addition to the community events, we will send out a Call for Contribution: members of the communities will be asked to make use of the Primary Sources Tool to bring ingestion-ready data (statements and references) to Wikidata. DBpedia has had good experiences with user involvement when calling for the annual Mapping Sprint, where users contributed mappings of Wikipedia templates to the DBpedia ontology.

The following communities (without any claim to completeness) will be notified and will be involved in the project:

  • Wikidatans;
  • Wikipedians;
  • DBpedians;
  • 20 language chapters of DBpedia;
  • Open Source Organizations;
  • Data Science Community;
  • Knowledge Graph Community.

Strategic Partners[edit]

We target collaboration with the following list of partners to maximize the outcomes of this project:

Get involved[edit]



DBpedia Association

The DBpedia Association was founded in 2014 to support DBpedia and the DBpedia community. Since then, we have been making steady progress towards professionalizing DBpedia for its users and forming an effective network out of the loosely organised DBpedia community. The DBpedia Association is currently based in Leipzig, Germany, and affiliated with the non-profit organisation Institute for Applied Informatics (InfAI) e.V.

TBA (Software Developer)

  • Software Development, Data Science, Frontend Development
  • Skills: Scala programming, deep knowledge about DBpedia and Wikidata
  • Developer will be hired/selected from the community.

Julia Holze (DBpedia Association)

  • Organization & Community Outreach, support in organizing and spreading the GlobalFactSync Sprints

Sebastian Hellmann (DBpedia Association and AKSW/KILT) completed his PhD thesis on the transformation of NLP tool output to RDF under the guidance of Jens Lehmann and Sören Auer at the University of Leipzig in 2014. Sebastian is a senior member of the "Agile Knowledge Engineering and Semantic Web" (AKSW) research center, which currently has 50 researchers (PhD students and senior researchers) focusing on semantic technology research, often in combination with other areas such as machine learning, databases, and natural language processing. Sebastian is head of the "Knowledge Integration and Language Technologies (KILT)" Competence Center at InfAI. He is also executive director and board member of the non-profit DBpedia Association. Sebastian is a contributor to various open-source projects and communities such as DBpedia, NLP2RDF, DL-Learner and OWLG, and has written code in Java, PHP, JavaScript, Scala, C & C++, MatLab, Prolog, and Smodels, but now does everything in Bash and Zsh since he discovered the Ubuntu terminal. Sebastian is the author of over 80 peer-reviewed scientific publications (h-index of 21, over 4,300 citations) and started the Wikipedia article on Knowledge Extraction.

Magnus Knuth (DBpedia Head of Technical Development) is a research member of the AKSW research group at Leipzig University and a former member of the "Semantic Multimedia" research group at the Hasso Plattner Institute. Magnus has in-depth knowledge of data extraction pipelines and Linked Data, focusing on data quality and change management.

Volunteers and interest

  • Volunteer To be determined :) Jimregan (talk) 15:32, 19 January 2018 (UTC)
  • Volunteer with testing, providing feedback, and translating Jimkont (talk) 08:35, 23 January 2018 (UTC)
  • Volunteer provide feedback, help with the development of the idea, translating M1ci (talk) 13:35, 23 January 2018 (UTC)
  • Volunteer Translation to Spanish Marianorico2 (talk) 10:15, 25 January 2018 (UTC)
  • Volunteer Translate to Greek, provide feedback. S.karampatakis (talk) 18:01, 27 January 2018 (UTC)
  • Volunteer provide feedback, help with the development of the idea and implementation, translations to Spanish Nandanamihindu (talk) 18:20, 30 January 2018 (UTC)
  • Volunteer Translate to French, provide feedback, help with the development... --Framawiki (talk) 09:56, 2 December 2018 (UTC)
  • Feedback + interest. –SJ talk  20:44, 14 November 2018 (UTC)

Community notification[edit]


Wikimedia & Wikidata Community


We have received constructive feedback from many Wikipedians:

DBpedia Community

Social Media


Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • It is important to have references for further use in the highest possible quality. Crazy1880 (talk) 16:39, 17 January 2018 (UTC)
  • Support Synchronization between the Wikipedias and Wikidata is one of the most relevant missing links at this time. When we are able to concentrate only on the differences between projects, we will be able to improve quality, including BLP quality, effectively. At this time we do not even know what needs attention, and consequently "quality" efforts are often misguided and misdirected. Thanks, GerardM (talk) 11:22, 18 January 2018 (UTC)
  • All infoboxes should get data from Wikidata sooner or later, this would be a huge boost! Sabas88 (talk) 11:09, 23 January 2018 (UTC)
  • Support But I would also like to note that deletions from infoboxes would become more sensitive, like how "Religion" and "Ethnicity" were removed from the English Wikipedia's infoboxes but are still present on other Wikipedias. So we should probably take a more inclusionist approach on Wikidata that local wikis can opt out of, rather than delete a parameter on Wikidata. --Donald Trung (Talk 🤳🏻) 13:12, 23 January 2018 (UTC)
  • Support This sounds like a great idea to finally connect all Wikipedias, and it will especially benefit smaller Wikipedias. --Ehrlich91 (talk) 17:20, 23 January 2018 (UTC)
  • Support Great project! I am especially familiar with the problem that the smaller communities face when updating the infoboxes in articles on sportspeople, which requires a lot of effort relative to the size of these communities.--Kiril Simeonovski (talk) 16:27, 24 January 2018 (UTC)
  • Support Excellent proposal. --Metrónomo-Goldwyn-Mayer 06:01, 27 January 2018 (UTC)
  • Support Very interesting proposal. A tool much needed! S.karampatakis (talk) 18:03, 27 January 2018 (UTC)
  • Support Looks promising. Helder 20:48, 28 January 2018 (UTC)
  • Support Happy to see the connection with DBpedia / the Wikiverse being enhanced and exploited. I'm totally in favor of the cross-project data comparison approach to improve quality and consistency. The project is well written, and quite a bit of thinking and planning is visible on reading. TomT0m (talk) 14:57, 29 January 2018 (UTC)
  • Support DBpedia would improve Wikidata, and therefore all of the Wikimedia sister projects. This project would facilitate that. This is a great idea and this grant will be hugely impactful. -- BrillLyle (talk) 14:34, 1 February 2018 (UTC)
  • Looking forward to seeing this happen. ··· 🌸 Rachmat04 · 08:16, 2 February 2018 (UTC)
  • Support This is a well-written proposal that will strengthen Wikidata with the knowledge and experience gained from DBpedia. Happy to see the community knowledge dissemination and sprints included in the project plan. Dan scott (talk) 15:53, 7 February 2018 (UTC)
  • Support. I support this alignment of resources. YULdigitalpreservation (talk) 18:52, 6 March 2018 (UTC)
  • Support Slowking4 (talk) 00:08, 12 March 2018 (UTC)
  • Smart and promising concept to improve data quality and referencing in many Wikipedia versions, and to achieve a better connection to Wikidata. This could help make more Wikipedia contributors interested in also contributing to Wikidata. X black X (talk) 19:01, 10 May 2018 (UTC)
  • Support An essential idea, overdue + needs precisely the context that produced DBpedia in the first place. –SJ talk  19:46, 14 November 2018 (UTC)
  • Support <3 --Framawiki (talk) 09:55, 2 December 2018 (UTC)