
Grants:Project/DBpedia/GlobalFactSyncRE

status: selected
project: GlobalFactSyncRE
summary: DBpedia, which frequently crawls and analyses over 120 Wikipedia language editions, has near-complete information about (1) which facts are in infoboxes across all Wikipedias and (2) where Wikidata is already used in those infoboxes. GlobalFactSyncRE will extract all infobox facts and their references to produce a tool for Wikipedia editors that detects and displays differences across infobox facts in an intelligent way, helping to sync infoboxes between languages and/or Wikidata. The extracted references will also be used to enhance Wikidata. Click Join below to receive GFS updates via {{ping}} to your wiki account.
target: Wikidata and all Wikipedias; possibly WikiCite
type of grant: tools and software
amount: 63,000€ / 71,621 USD
nonprofit: Yes, the DBpedia Association, Institut für Angewandte Informatik e.V. (InfAI), was already vetted by Wikimedia
advisor: Sj
contact: gfs@infai.org
this project needs: volunteers
join
endorse
created on: 12:34, 27 November 2018 (UTC)


GlobalFactSync (GFS) News and CrossWiki Feedback Squad!

  • Subscribe and get pinged on GFS updates by clicking the join button
  • or watch-list the News page
  • Unsubscribe by editing and removing your account from {{Probox |volunteer=
  • We follow an agile rapid-prototyping approach, so some things change based on feedback and are documented on the News page. We kept the original proposal as is; to view it, scroll down ↓

Main results (so far)


After the kick-off note at the end of July, which described our first edit and the concept in more detail, we shaped the technical microservices and data into more concise tools that are easier to use and to demo during our Wikimania presentation:

  1. We got a first edit on day 1 of the project as mentioned in the Kick-off note
  2. We classified difficulty levels for syncing: basketball players, video games and geo-coordinates are easy, while music albums and other types are harder. We will tackle the easy ones first.
  3. A User Script available at User:JohannesFre/global.js shows links from each article and Wikidata to the Data Browser and the Reference Web Service
    User Script linking to the GFS Data Browser
  4. The GFS Data Browser (GitHub) now accepts any Wikipedia, DBpedia or Wikidata URI as subject; see the Boys Don't Cry example from the kick-off note, Berlin (geo-coordinates, lat/long) and Albert Einstein's religion. It is not live yet, so edits/fixes are not reflected.
  5. The Reference Web Service (Albert Einstein: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json&dbpedia) extracts (1) all references from a Wikipedia page, (2) matches them to the infobox parameter and (3) also extracts the corresponding fact. The service will remain stable, so you can use it (see the sketch after this list).
  6. Furthermore, we are designing a friendly fork of HarvestTemplates to effectively import all that data into Wikidata.
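As a minimal sketch (not part of the service documentation), the Reference Web Service can be queried as follows; the endpoint and parameters are copied from the Albert Einstein example above, and no assumption is made about the shape of the returned JSON, so the sketch only prints the raw payload:

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets

object ReferenceWebServiceDemo {
  def main(args: Array[String]): Unit = {
    // Endpoint and parameters as in the Albert Einstein example above.
    val article = "https://en.wikipedia.org/wiki/Albert_Einstein"
    val url = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references" +
      "?article=" + URLEncoder.encode(article, StandardCharsets.UTF_8) +
      "&format=json"

    val client   = HttpClient.newHttpClient()
    val request  = HttpRequest.newBuilder(URI.create(url)).GET().build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())

    // The payload contains the references, the matched infobox parameters and
    // the extracted facts; print it as-is.
    println(response.body())
  }
}
```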



Project idea


Note: This is a resubmission from GlobalFactSync (Previous Proposal). Please see the cover letter on the talk page for all updates.

What is the problem you're trying to solve?


Wikipedians have put great effort over the last decade into collecting facts in infoboxes. The problems that come with infoboxes are well known in the Wikipedia community: infoboxes are tedious to maintain and structure, and facts cannot be compared with or imported from other language versions with richer infoboxes. Furthermore, many different and sometimes conflicting statements are made across languages for the same item, and these are difficult to spot and to improve. While these issues ultimately led to the establishment of Wikidata, the richness of information curated by thousands of Wikipedia editors has not yet been transferred to Wikidata. In particular, Wikidata is missing references for two thirds of its statements (see WikiCite 2017, starting on slide 15). Many of these missing references can already be found in Wikipedia's infoboxes, and transferring them efficiently would greatly increase trust in Wikidata.

The main problem is that a synchronisation mechanism is needed for all the data contained in the Wikiverse. This can be broken down into two sub-problems:

Problem 1 (WP2WP, WD2WP): Facts from Wikipedia Infoboxes are not managed on a global scale

WP2WP refers to the comparison between the infoboxes of different language versions, such as the English and German pages for the same item. WD2WP adds the respective Wikidata statement to the WP2WP comparison.

Example 1:

Example 2:

The core of this problem is a set of questions that can only be answered on a global scale:

  1. Regarding each infobox fact in each language version, is there better information available in another Wikipedia language version or in Wikidata?
  2. What is the best strategy for resolving the differences?
    1. Edit the infobox with information from another language?
    2. Edit the infobox and import from Wikidata?
    3. Edit Wikidata?
    4. Update infoboxes in other languages?

The challenge here is to provide a global comparative view for Wikipedia/Wikidata editors that pinpoints issues like the examples given for Poznan and Ethanol.

Problem 2 (WP2WD): Manually created facts from Wikipedia infoboxes are not included in Wikidata.

While we described the WP2WP and WD2WP problems above, we consider the Wikipedia-to-Wikidata (WP2WD) problem separately, since Wikidata plays a central role for data. Wikidata has successfully resolved the problem of inter-language links, but the data from infoboxes is to a large extent still missing. Wikipedia editors are already spending much effort on linking references in the article's infobox.

Example 1:

Example 2:

  • Q153 (Ethanol) has two entries each for boiling point and melting point: the Fahrenheit values are referenced properly, while the Celsius values refer to the Dutch Wikipedia.
  • The German Wikipedia article for Ethanol has a total of 27 references (13 unique) in the infobox. Both boiling and melting point are properly referenced in Celsius (different from Wikidata).
  • etc.

This problem has these challenges:

  1. Identify the relevant infoboxes that have referenced facts according to the Wikidata properties.
  2. Find the respective line in the infobox that contains the statement.
  3. Find the primary source reference from the infobox relating to that fact.

Summary


The problem we are addressing has a global impact on managing all statements in the Wikiverse.

To substantiate the scope of this proposal, we have created a prototypical website that demonstrates the direction we envision for the GlobalFactSync tool, one of the major deliverables of this proposal. We analysed Wikidata entities Q1-Q10000 and compared core values with infoboxes from 128 language versions:

Update: While the old prototype is still accessible, we have written about the new prototype on the talk page.

The index of GlobalFactSync Early Prototype is accessible here: http://downloads.dbpedia.org/temporary/globalfactsync/

The prototype was created some months ago. Its data is outdated by now, but it will be updated and kept live in the course of this project. The index lists the potential statements that can be found in Wikipedia's infoboxes but not in Wikidata. Overall, for the first 10,000 Q's we found 156,538 values in Wikipedia infoboxes that may be missing in Wikidata; extrapolated to all of Wikidata (roughly 20 million articles, i.e. 156,538 / 10,000 × 20 million), this amounts to approximately 300 million facts. An actual corrected value will be determined as part of this project.

If you look at the detail page for Poznan, you can see the differences in population count for population (P1082). In total there are 18 values across all Wikipedia versions and 10 clusters, with a uniformity value of 0.278. Wikidata (truthy facts) agrees only with the Russian Wikipedia:

"564951" (es) "557264" (ja) "551627" (en,gl,id,it,mk) "546829" (nl,sv) "550742" (be,sr) "544612" (ru,wikidata) "552" (el) "553564" (bg,cs) "571" (eu) "546503" (ga)


As mentioned before, we found 156,538 potential missing values for Wikidata in the first 10,000 Q's. The number of differences in infoboxes across languages (Wikipedia to Wikipedia) is expected to be even larger, but we have not fully analysed it yet.

Another analysis we have done concerns the potential gain for Wikidata for three properties: birth date, death date and population. This covers only three properties, but all 42,574,760 data items:

6823 birthDate_gain_wiki.nt
3549 deathDate_gain_wiki.nt
362541 populationTotal_gain_wiki.nt
372913 total

The supporting data is available here. The analysis shows that Wikidata is already very complete with respect to birth and death dates (only about 10k missing values found). However, the Wikipedias have 362k more values for population count.

Making a detailed analysis regarding references in Wikipedia that are not in Wikidata is part of this proposal.

What is your solution to this problem?


Background and Assets


DBpedia has crawled and extracted infobox facts from all Wikipedia language editions for the past 10 years. This information is stored in a database so that it is queryable. In addition, we have published this data under CC-BY-SA on the web and thus made it freely accessible to everyone.

The main assets that DBpedia brings into this project are:

The DBpedia Information Extraction Framework (DIEF) is open-source software with over 30 different extractors that can extract most of the information contained in Wikipedia's wikitext. The software is written in Scala and has been localized to understand all major Wikipedia editions (170 languages in total). We currently run extractions twice a year for a total of 14 billion facts, but a rewrite (using Spark) is under way to enable weekly as well as real-time extraction. The software has been the state of the art in information extraction from Wikipedia for over 10 years and has been forked over 200 times on GitHub.

DBpedia's data is available as a dump download and is also queryable via an online service. Prominent datasets include the data of all infoboxes as well as the category system, custom templates like the German persondata, redirects, disambiguation pages (all localized by our community), geo-coordinates and much more.

Mappings from infoboxes to DBpedia's schema: We have a mappings wiki in place at http://mappings.dbpedia.org where over 300 editors maintain translations between the different infoboxes in around 50 languages, i.e. the information on what the individual template parameters are called. It is, to the best of our knowledge, the most complete information source that unifies all existing templates in the different Wikipedia language editions. From there we can easily know that the Dutch infobox for basketballer uses 'geboortedatum' and the Dutch infobox for componist uses 'geboren' for birthDate:

These are just two examples. In total, over 80.73% of all template occurrences and 50% of all infobox attributes are mapped (see the statistics page).


Mappings from the DBpedia schema to Wikidata properties: In a separate file, we further keep a mapping of all property names to Wikidata properties. From there, we can easily triangulate which value in which infobox relates to which property in Wikidata (see the sketch below).
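A minimal sketch of this triangulation, using the Dutch birth-date examples above; the map entries are illustrative stand-ins for the actual mapping files (dbo:birthDate corresponds to Wikidata's P569, date of birth):

```scala
object MappingTriangulationDemo {
  def main(args: Array[String]): Unit = {
    // Template parameter -> DBpedia ontology property (as on mappings.dbpedia.org).
    // Entries follow the Dutch examples given above; the map itself is illustrative.
    val templateToDBpedia: Map[(String, String), String] = Map(
      ("nl:Infobox basketballer", "geboortedatum") -> "dbo:birthDate",
      ("nl:Infobox componist",    "geboren")       -> "dbo:birthDate"
    )

    // DBpedia ontology property -> Wikidata property (kept in a separate file).
    val dbpediaToWikidata: Map[String, String] = Map(
      "dbo:birthDate" -> "P569"
    )

    // Triangulate: which Wikidata property does a given infobox parameter feed?
    def wikidataProperty(template: String, parameter: String): Option[String] =
      templateToDBpedia.get((template, parameter)).flatMap(dbpediaToWikidata.get)

    println(wikidataProperty("nl:Infobox componist", "geboren")) // Some(P569)
  }
}
```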

Informed Decisions of Editors


The tool we are proposing as a solution (GlobalFactSync) will serve the editors of Wikipedia and Wikidata. The main goal of the tool is therefore to present good information about which facts and references are available in the Wikiverse. We have already given examples above: the population count of Poznan (fun fact: Poznan is also the seat of the Polish DBpedia chapter) and the boiling point of Ethanol.

The main information an editor can quickly gain here is that there are currently 8 different values used across Wikipedia and that 5 language versions (en, gl, id, it, mk) agree on one value. It is also easy for a human to see that eu and el have had an error in the automatic extraction. While this information is already quite helpful for keeping an infobox up to date and improving it, we can do much better in the GlobalFactSync project, because in the end the reference is a valuable criterion for deciding which value might be the correct (or latest) one. One solution here is to provide an algorithm that better sorts and presents the information in the following manner (a sketch follows after the note below):

  • Values with references are considered better; the editor can use their own knowledge to judge the validity of the source
  • Values with dates can help to judge the recency of the information
  • Outliers like el and eu can be hidden or ranked down, as they might relate to an extraction error.
  • Majority vote (i.e. which value is adopted by most Wikipedias) is a form of consensus, which is one of the principles of Wikipedia

Note: The sort order is meant to be contextual, i.e. it will be adapted for different types of information. In some cases, certain references are authoritative and are thus rated higher. In the end, the editor needs to make the decision and, if necessary, use the talk page.
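A minimal sketch of such a sorting, assuming a simplified record per value; the field names and weights are illustrative placeholders and, as noted, would be adapted per type of information:

```scala
object FactRankingDemo {
  // Simplified view of one value extracted for a property; field names are illustrative.
  case class FactValue(value: String, languages: Set[String], hasReference: Boolean,
                       hasDate: Boolean, likelyExtractionError: Boolean)

  // Higher score = shown first. Weights are placeholders to be tuned per context.
  def score(v: FactValue): Double =
    v.languages.size.toDouble +                       // majority vote across Wikipedias
      (if (v.hasReference) 5.0 else 0.0) +            // referenced values are considered better
      (if (v.hasDate) 2.0 else 0.0) +                 // dates help judge recency
      (if (v.likelyExtractionError) -100.0 else 0.0)  // demote/hide likely extraction errors

  def main(args: Array[String]): Unit = {
    // Values loosely following the Poznan population example above.
    val values = List(
      FactValue(value = "551627", languages = Set("en", "gl", "id", "it", "mk"),
                hasReference = true, hasDate = true, likelyExtractionError = false),
      FactValue(value = "544612", languages = Set("ru", "wikidata"),
                hasReference = false, hasDate = false, likelyExtractionError = false),
      FactValue(value = "552", languages = Set("el"),
                hasReference = false, hasDate = false, likelyExtractionError = true)
    )
    values.sortBy(v => -score(v)).foreach(v => println(f"${score(v)}%7.1f  ${v.value}"))
  }
}
```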

Mapping maintenance


In order to synchronize the facts of multiple Wikipedia language infoboxes and Wikidata, we need to maintain mappings between the templates used. It is evident that Wikidata properties can serve as mapping targets for the template parameters. The existing mappings from Wikipedia templates to the DBpedia ontology schema already have broad coverage for the Wikipedias with larger communities. These can easily be converted to Wikidata properties, which gives us a good set to begin with. From then on, these mappings can be maintained and continuously completed in the GlobalFactSync tool. Additionally, a few Wikipedia language editions already maintain such mappings for selected Wikidata-filled infoboxes, e.g.

It is yet unclear how to consider these mappings directly, but the mappings resulting from this project can be evaluated against them.

GlobalFactSync tool

Mockup of the comparison view of the GlobalFactSync tool

As a major outcome of the project, we will continuously improve the prototype, which is already online. The tool will allow per-page access and give a detailed overview of the available data. It will therefore serve as an access point to check data completeness and the state of the infoboxes in Wikipedia, and also highlight the values that differ between the Wikipedia versions and Wikidata.

Besides the intended effects of improving the factual data, we also see great potential in bringing all communities closer together to work towards the unified goal of providing knowledge to the world.

Data freshness: At the moment, data is loaded from a static dataset. To be effective, we will try to keep the data as fresh as possible by integrating the Wikidata API and using live extraction that can analyse infoboxes on the fly. Further APIs, such as Diffbot and StrepHit, can also be integrated.

Interaction: Wikipedians should be able to interact in several ways. We will include proper edit links, pre-filled forms (if possible), template suggestions and better sorting (i.e. highlighting the most obvious and pressing differences and moving them to the front) to allow editors to quickly work through pages. We envision creating and providing a User Script or Gadget that supports editors in creating infoboxes. Usability feedback will be collected via several channels, so that we can improve interactivity based on the wishes of the community.

Smart fact and reference rating: We will devise algorithms to rate and rank facts in order to sort out and suggest the facts that have the highest quality. The algorithm will try to be Pareto-efficient, meaning that 80% of the results can be achieved with 20% of the editors' work. Once the Wikiverse is synchronised above a critical mass, we hope that the collaboration between all three communities will be so well established that other projects and ideas will spawn. We will also rank references. The language of a reference can be detected with an HTTP HEAD request and the Content-Language header (see the sketch below). For example, a Catalan reference is preferred in the Catalan Ethanol article. Wikidata should have multilingual references in all languages.
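A minimal sketch of that language check; not every site sends a Content-Language header, so in practice the result may be empty and other heuristics would be needed (the URL below is a placeholder):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Optional

object ReferenceLanguageDemo {
  // Issue an HTTP HEAD request and return the Content-Language header, if any.
  def contentLanguage(url: String): Optional[String] = {
    val client = HttpClient.newBuilder().followRedirects(HttpClient.Redirect.NORMAL).build()
    val request = HttpRequest.newBuilder(URI.create(url))
      .method("HEAD", HttpRequest.BodyPublishers.noBody())
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.discarding())
    response.headers().firstValue("Content-Language")
  }

  def main(args: Array[String]): Unit = {
    // Placeholder reference URL; a Catalan reference would be preferred in ca.wikipedia.
    println(contentLanguage("https://www.example.org/some-reference"))
  }
}
```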

Presentation to Editors via UserScript and Gadgets


Wikipedia editors can receive suggestions for infobox templates. To this end, pre-filled infobox templates will be auto-generated, similar to the approach of the Wikidata module. To generate suggestions for pre-filled infobox templates, we will use a mapping between the relevant template parameters and Wikidata property ids (see the sketch below).
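A minimal sketch of such a pre-filled suggestion; the template name, parameter names and the parameter-to-property mapping are illustrative, and the Wikidata values are assumed to have been fetched already:

```scala
object InfoboxSuggestionDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative mapping: template parameter -> Wikidata property id.
    val parameterToProperty = List(
      "name"        -> "P1559", // name in native language
      "birth_date"  -> "P569",  // date of birth
      "birth_place" -> "P19"    // place of birth
    )

    // Values as they might be fetched from the Wikidata API for one item (illustrative).
    val wikidataValues = Map(
      "P569" -> "14 March 1879",
      "P19"  -> "Ulm"
    )

    // Build a pre-filled wikitext snippet; unmapped or missing values stay empty
    // so the editor can complete them manually.
    val lines = parameterToProperty.map { case (param, pid) =>
      s"| $param = ${wikidataValues.getOrElse(pid, "")}"
    }
    println((List("{{Infobox person") ++ lines ++ List("}}")).mkString("\n"))
  }
}
```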

Adoption of the tool will proceed in three phases. In the first phase, we will create UserScripts for as many Wikipedia language editions as possible. If the UserScripts are accepted by editors, we can ask admins to upgrade them to Gadgets to reach a wider audience. If this adoption is successful and has shown its usefulness, we will approach the VisualEditor community to discuss inclusion there. The tool we are proposing will contain a strong backend focused on data analysis that can be easily adopted by a variety of interfaces.

Data and Reference Acquisition

Data flow in GlobalFactSync

Overall, DBpedia holds data complementary to the Wikidata effort, in the same way that Wikipedia infoboxes hold data complementary to Wikidata, which is the focus of this proposal. As discussed with Lydia Pintscher at the DBpedia Community Meeting in Leipzig in 2016, DBpedia could not be used for Wikidata directly, as only the facts are extracted from infoboxes, but not the references for these facts. It would be straightforward to import these facts via DBpedia as a middle layer into Wikidata, provided the existing citations and qualifiers in Wikipedia were extracted as well. During the last two years, we have discussed the extraction of references within the DBpedia community. From a technical perspective, the necessary adjustments to the DBpedia software are relatively easy and can be finished within a few months of work, resulting in a high-precision extraction of references, which would benefit:

  • Wikidata
  • Information provided by the GlobalFactSync Tool to Wikipedia Editors

Having an overview of all references used in Wikipedia also provides good insight into which external datasets might be interesting for Wikidata in the future.

Note on References


In this proposal, we focus on extracting existing references from Wikipedia's infoboxes. In our previous proposal we described in detail how to extract facts and references from the natural language of the article text. We excluded this from the present proposal to keep it focused. We have a collaboration with Diffbot and others, who form the DBpedia Text Extraction department. Work on this has begun and can be used here as well. Diffbot also crawls the web regularly and can provide additional references for facts found on the web with high precision.

GlobalFactSync ingest (transfer of data to Wikidata)


Once the data is sufficiently prepared, we will first set up an ingestion workflow via the Primary Sources Tool, i.e. create an ingestable dataset in the form of QuickStatements or Wikidata-compliant RDF (a sketch of the QuickStatements output follows below). Beyond these manual approaches, we will evaluate the possibility of importing part of the data in bulk. Prerequisites for such selected portions are, of course, that the data has been thoroughly evaluated and that proper references exist. We are confident that the quality of the attached references will be higher than that of the references currently provided by simple bots importing statements from Wikipedia language editions.
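A minimal sketch of emitting QuickStatements (version 1, tab-separated) lines for referenced facts; the record type is illustrative and the example reference URL is a placeholder, while Q937/P569 (Albert Einstein, date of birth) and the S854 reference-URL source prefix are real identifiers:

```scala
object QuickStatementsDemo {
  // One candidate statement with a supporting reference URL extracted from an infobox.
  // The field names are illustrative; values must already be in QuickStatements syntax.
  case class Candidate(item: String, property: String, value: String, referenceUrl: String)

  // Render a QuickStatements v1 line: item, property, value, plus the reference URL
  // attached as a source via S854 ("reference URL").
  def toQuickStatements(c: Candidate): String =
    List(c.item, c.property, c.value, "S854", "\"" + c.referenceUrl + "\"").mkString("\t")

  def main(args: Array[String]): Unit = {
    val candidates = List(
      // Albert Einstein (Q937), date of birth (P569); the source URL is a placeholder.
      Candidate("Q937", "P569", "+1879-03-14T00:00:00Z/11",
                "https://www.example.org/einstein-biography")
    )
    candidates.map(toQuickStatements).foreach(println)
  }
}
```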


Project goals


1. Visibility and Awareness


The GlobalFactSync tool will increase visibility and awareness of data consistency across multiple resources such as the Wikipedias, Wikidata and DBpedia. We envision the tool and the accompanying website to be an enabler for these three communities to work together better in the future via this platform. Users can quickly see what data is available and where their help is needed the most. Supporting metrics and other data quality measures will make it possible to judge the overall progress of unification. The information provided will help Wikipedia editors to better judge the accuracy and completeness of the current infobox values in their Wikipedia edition by comparing them to other language versions and Wikidata. Furthermore, any edits made in Wikipedia infoboxes will be visible to other editors and thus allow information to spread among the three systems.

2. Improvement of Wikipedia Infoboxes


Wikipedia infoboxes are maintained by Wikipedians who know the guidelines and best practices of their Wikipedia language version best. The GlobalFactSync tool will leave the final decision on what to include in their infoboxes to these editors. We see our project mainly as a support tool that provides better information to editors. Besides the facts shown in our prototype, DBpedia also has extensive technical information about which template is used with which values on which Wikipedia pages, which can be exploited. Editors can receive suggestions and snippets that they can copy into Wikipedia pages, which will greatly ease their editing effort. In general, we would also like to foster a higher degree of automation for infobox edits and an increased usage of Wikidata in Wikipedia. The integrated maintenance of mappings is a relevant step in this direction.

3. Improvement of Wikidata


Statements on Wikidata items are primarily edited by Wikidatans, whereas data donations (such as the Freebase dataset) are ingested via the Primary Sources Tool. The GlobalFactSync project will contribute to Wikidata in the form of a dataset containing verified statements with respective references. These facts can then be ingested via the Primary Sources Tool in order to add missing statements to Wikidata and to add references to already existing claims. Existing statements in Wikidata that currently reference DBpedia itself or a specific Wikipedia language edition can be supplemented with more reliable references, e.g. the citations found in the respective Wikipedia articles. These additions will increase the completeness and trustworthiness of Wikidata statements. Beyond the data contributions created during this project, the software stack will be made available for continuous application and improvement.

Project impact


How will you know if you have met your goals?


1. Visibility and Awareness

  1. Output: A prototype of the tool is already online and will stay online during the whole project lifetime. Measures that will guide the development are community feedback and fulfilled feature requests. Over the project duration, we expect to incorporate over 100 issues (bug fixes / feature requests) from community channels (mailing lists, wiki discussion fora and the issue tracker). Another important measure of success is the number of users. We aim to have over 500 unique users per month for the tool as a whole (including User Scripts and Gadgets) at project end. We expect these visitors to be core editors who can use the tool effectively as a hub from which to be redirected and improve the data landscape in Wikipedia and Wikidata. We have already started to ask volunteers to help us translate the UserScript and deploy it on more Wikipedias. Overall, we hope to deploy the UserScript on 15 language versions.
  2. Outcome: While the development of the tool is funded by this proposal, the DBpedia Association is able to provide further hosting and maintenance after the project ends. In the long run, DBpedia also benefits from better data and structure in Wikipedia and Wikidata, thus creating an incentive to maintain and further develop the system created here. Overall, we hope that this will bring the communities closer together (we cannot give a number for this, however).

2. Improvements of Infoboxes

  1. Output: With the GlobalFactSync tool, we will follow a Pareto-efficient (20/80) approach. This means that we will target smaller Wikipedias that have fewer infoboxes, as well as analyse the data to see where edits on larger Wikipedias are the most effective, with possible customization for active WikiProjects. We have already received some feedback on how to adapt the tool to the interests of editors, which are very domain-specific, e.g. interest is focused on one topic/infobox or one property. Another efficient approach is to compare specific templates with respect to the completeness of values in Wikidata. In order to generate a specific infobox in one language from Wikidata, we can focus on completing Wikidata (with statements from other languages or by asking the community) for that infobox. In order to demonstrate the results of GlobalFactSync, we will select 10 sync targets. A sync target can be one infobox in one language, similar infoboxes across languages, or a WikiProject. A sync target is reached when the infobox is in consensus with the rest of the Wikiverse and proper references exist. Also, the infobox attributes should be mapped to Wikidata properties.
  2. Outcome: We can give global statistics on consensus between the different infoboxes and Wikidata. Consensus here is based on a conformity measure, e.g. how many of the values agree in the Wikiverse. The global statistic will give an overall sense of achievement and we can show what is in sync and what needs attention. Reaching the 10 sync targets will give a blueprint for other projects/interest groups to sync their area of interest.

3. Improvements of Wikidata

  1. Outcomes: Two thirds of existing Wikidata statements have missing references or "illegal" references to Wikipedia. Based on the mapping of infoboxes to Wikidata and the extraction of references, we can identify references from Wikipedia specifically for these statements. On condition that these completions can be made automatically[1], we foresee adding at least 500,000 missing references from Wikipedia to already existing statements (we found 913,695 for English already, see the talk page). Based on the analysis of the infoboxes and the sync targets, we set as a target that GlobalFactSync users will add at least 100,000 missing curated statements with proper references to Wikidata. These statements will be targeted specifically to increase the completeness and trustworthiness of Wikidata for the 10 sync targets. As a secondary goal, we will also produce more high-quality statements with references that can be vetted via the Primary Sources Tool or QuickStatements. As a side effect, we will provide the whole corpus of extracted references to the Wikidata community. From these references, the community can identify sources for linking and authority control.
  2. Output: In this project, a workflow will be set up that generates valuable datasets and references for ingestion into Wikidata. This dataset has to be of high quality and must therefore obey the following data quality rules: facts should be backed by multiple (2+) Wikipedia language editions, there should be no or only slight (<20%) contradiction between different language editions, facts need a reference in at least one language edition, and the references should be sufficiently described (a sketch of these rules as a filter follows after this list). The software created during this project will be made available for further improvement and application. As DBpedia is continuously improving its data and reference extraction capabilities, the GlobalFactSync tool chain will show its value in the long run as data is curated via the Primary Sources Tool. It is therefore of great importance to involve the community in the development of the processes involved. We will provide a workflow that delivers continuously updated data for future ingestion.
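A minimal sketch of the quality rules above applied as a filter over aggregated candidate facts; the record type is illustrative and the "sufficiently described references" rule is left out for brevity (Q153 and P2102 are Ethanol and boiling point):

```scala
object IngestionQualityFilterDemo {
  // A candidate fact aggregated across language editions. Field names are illustrative.
  case class CandidateFact(item: String, property: String, value: String,
                           agreeingEditions: Set[String],      // editions stating this value
                           contradictingEditions: Set[String], // editions stating another value
                           referencedInEditions: Set[String])  // editions with a reference for it

  // Rules from the text: 2+ supporting editions, <20% contradiction across editions,
  // and at least one language edition providing a reference.
  def passesQualityRules(f: CandidateFact): Boolean = {
    val total = f.agreeingEditions.size + f.contradictingEditions.size
    val contradictionRate =
      if (total == 0) 1.0 else f.contradictingEditions.size.toDouble / total
    f.agreeingEditions.size >= 2 &&
      contradictionRate < 0.20 &&
      f.referencedInEditions.nonEmpty
  }

  def main(args: Array[String]): Unit = {
    val fact = CandidateFact("Q153", "P2102", "78.37 °C",
      agreeingEditions = Set("en", "de", "fr", "nl", "pl"),
      contradictingEditions = Set("it"),
      referencedInEditions = Set("de"))
    println(passesQualityRules(fact)) // true: 5 supporters, ~17% contradiction, referenced
  }
}
```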

Do you have any goals around participation or content?


In particular, we hope to increase participation around the mapping of infobox attributes (template parameters) to Wikidata properties, as this is the basis for synchronisation and for reaching critical mass with respect to data quality and completeness.


Project plan


Activities


Our proposal involves the following tasks:

ID | Title | Description | Month
A1 | Study | Conduct a small study to choose two initial sync targets, e.g. a popular WikiProject (Monuments) that can be improved in a smaller language and another target with less developed infoboxes and high variety among languages. Analyse the lack of references in Wikidata, e.g. which entities and properties are especially badly referenced. | M1
A2 | Mapping Generation | Translate existing DBpedia mappings to Wikidata properties. Check the status of mappings for the two targets from the study. | M1-M3
A3 | Source References for Facts | Extract source references for facts within Wikipedia infoboxes. The resulting dataset (reference corpus) will be published. The task concludes with an initial ingestion-ready dataset. | M1-M6
A4 | GlobalFactSync tool | Extend the current early prototype with new features (language filter, snippet suggestion). The task concludes with a more user-friendly GlobalFactSync tool and will be tested with the two targets. Based on the experience gained, 8 more targets will be chosen. | M2-M8
A5 | Mapping Refinements | Complement the existing mappings from infobox templates to Wikidata with a data-driven method based on the study. Enrich the mappings to match infobox template requirements for template suggestions. | M4-M6
A6 | Third-party Data and Reference Integration | Integrate facts from and references to external datasets. The task concludes with data and reference additions to the ingestion-ready dataset. | M7-M9
A7 | GlobalFactSync Wikidata ingest | Develop a workflow to populate Wikidata via the Primary Sources Tool and through a bot. It delivers data available in the Primary Sources Tool in M10 and concludes with bot-ingestable data in M12. | M10-M12
A8 | GlobalFactSync Sprints | Conduct two GlobalFactSync Sprints with the help of the community. Execute and evaluate one sprint in M8 and a second sprint in M11. During M8 and M11, the tool will be extended based on sprint feedback and UserScripts will be created. | M8 & M11
A9 | Community dissemination | Promote the project and present the GlobalFactSync tool at different community events. The 10 sync targets will be published as success stories. | M1-M12

Budget




The total amount requested is 63,000€/71,621 USD.

Item number | Category | Item description | Number of units for 12 months | Total cost | Notes | Source of funding
1 | Personnel cost | Software developer time | 12 PM | 63,000€ | Full-time position (40 hrs/week) | This grant
2 | Personnel cost | Data Acquisition Support | 6 PM | 15,500€ | One developer from the DBpedia Association working on extraction and fusion will support the project; we expect this to require a workload of 5-8 h/week. | DBpedia Association
3 | Travel | Travel budget and accommodation | 1 | 1,000€ | Travel budget for the developer to attend Wikimania 2019 (Sweden). | DBpedia Association
4 | Equipment | Laptop | 1 | 1,000€ | Used by the developer during the project. | DBpedia Association
Project support obtained from DBpedia: 17,500€
Project funding requested from the Wikimedia Foundation: 63,000€
Total project cost: 80,500€
Total amount requested: 63,000€ / 71,621 USD

Community engagement


The community engagement strategy aims to provide communication and exchange platforms to discuss the progress of the project, to interface with users, and to gather their feedback. In particular, data science researchers and developers will be involved and asked to give feedback. The community engagement process will include postings on social media and mailing lists as well as presentations of project results at community events and conferences.

WikiProjects/Interest Groups


While the tool we propose is very generic and can be applied to a huge number of infoboxes and Wikidata reference issues, we plan to focus efforts on smaller communities within Wikipedia/Wikidata where our tool can be of specific benefit. We have not yet contacted these groups. In M1 we will conduct a study to see in which areas the Wikiverse is especially underdeveloped and where our tool will therefore be the most effective. Based on this study, we will try to engage the respective communities. Our volunteers can mediate contact for the respective languages (Greek, Czech, Macedonian and Spanish are already well represented).

DBpedia Community Meetup


We will present this proposal at the DBpedia Meetup on 7 February 2019 in Prague, Czech Republic, and we will continue to report on the progress of the project at the following biannual Community Meetings.

Wikimania 2019

We will present the tool which supports the editing of infoboxes at the Wikimania 2019 in Stockholm, Sweden.

GlobalFactSync Sprint


In addition to the community events, we will send out a Call for Contribution: members of the communities will be asked to use the Primary Sources Tool to bring ingestion-ready data (statements and references) into Wikidata. DBpedia has had good experience with user involvement in the annual Mapping Sprint, where users contributed mappings of Wikipedia templates to the DBpedia ontology.

The following communities (without any claim to completeness) will be notified and will be involved in the project:

  • Wikidatans;
  • Wikipedians;
  • DBpedians;
  • 20 language chapters of DBpedia;
  • Open Source Organizations;
  • Data Science Community;
  • Knowledge Graph Community.

Strategic Partners


We target collaboration with the following list of partners to maximize the outcomes of this project:

Get involved


Participants



DBpedia Association

The DBpedia Association was founded in 2014 to support DBpedia and the DBpedia community. Since then, we have been making steady progress towards professionalizing DBpedia for its users and forming an effective network out of the loosely organised DBpedia community. The DBpedia Association is currently based in Leipzig, Germany, and affiliated with the non-profit organisation Institute for Applied Informatics (InfAI) e.V.

TBA (Software Developer)

  • Software Development, Data Science, Frontend Development
  • Skills: Scala programming, deep knowledge about DBpedia and Wikidata
  • Developer will be hired/selected from the community.

Julia Holze (DBpedia Association)

  • Organization & Community Outreach, support in organizing and spreading the GlobalFactSync Sprints

Sandra Bartsch (InfAI)

  • Organization & Community Outreach, support in organizing and spreading the GlobalFactSync Sprints

Sebastian Hellmann (DBpedia Association and AKSW/KILT) completed his PhD thesis, on the transformation of NLP tool output to RDF, under the guidance of Jens Lehmann and Sören Auer at the University of Leipzig in 2014. Sebastian is a senior member of the “Agile Knowledge Engineering and Semantic Web” (AKSW) research center, which currently has 50 researchers (PhD students and senior researchers) focusing on semantic technology research, often in combination with other areas such as machine learning, databases, and natural language processing. Sebastian is head of the “Knowledge Integration and Language Technologies (KILT)" Competence Center at InfAI. He is also the executive director and a board member of the non-profit DBpedia Association. Sebastian is a contributor to various open-source projects and communities such as DBpedia, NLP2RDF, DL-Learner and OWLG and has written code in Java, PHP, JavaScript, Scala, C & C++, MatLab, Prolog and Smodels, but now does everything in Bash and Zsh since he discovered the Ubuntu terminal. Sebastian is the author of over 80 peer-reviewed scientific publications (h-index of 21 and over 4300 citations) and started the Wikipedia article about Knowledge Extraction.

Volunteers and interest

  • Volunteer To be determined :) Jimregan (talk) 15:32, 19 January 2018 (UTC)
  • Volunteer with testing, providing feedback, and translating Jimkont (talk) 08:35, 23 January 2018 (UTC)
  • Volunteer provide feedback, help with the development of the idea, translating M1ci (talk) 13:35, 23 January 2018 (UTC)
  • Volunteer Translation to Spanish Marianorico2 (talk) 10:15, 25 January 2018 (UTC)
  • Volunteer Translate to Greek, provide feedback. S.karampatakis (talk) 18:01, 27 January 2018 (UTC)
  • Volunteer provide feedback, help with the development of the idea and implementation, translations to Spanish Nandanamihindu (talk) 18:20, 30 January 2018 (UTC)
  • Feedback + interest. –SJ talk  20:44, 14 November 2018 (UTC)
  • Volunteer I have experience with building large knowledge graphs from multiple sources, including from Wikipedia and Wikidata. I can probably advise or help on several tasks if needed, including on science and data. Nicolastorzec (talk) 21:01, 30 November 2018 (UTC)
  • Advisor Helping make wikicite one of the use cases! –SJ talk  01:03, 2 December 2018 (UTC)
  • Volunteer I have been actively involved with both projects and worked a lot on creating a mapping infrastructure between Wikidata data and DBpedia, willing to help wherever needed Jimkont (talk) 11:16, 6 December 2018 (UTC)
  • Volunteer To volunteer to offer any testing of this tool that may be needed. ChristorangeCA (talk) 19:37, 25 July 2019 (UTC)
  • Volunteer any way I can? Moebeus (talk) 15:37, 15 August 2019 (UTC)
  • Volunteer Extraction and analysis of references Lewoniewski (talk) 10:11, 27 August 2019 (UTC)
  • Volunteer Become a contributor 81.170.17.33 11:50, 11 December 2019 (UTC)
  • Volunteer I want to contribute with enhancing Czech dataset 77.75.74.248 16:58, 24 March 2020 (UTC)
  • Volunteer interested to help on HarvestTemplate and sailing topics. Simon.letort (talk) 10:47, 21 April 2021 (UTC)

Community notification



At the beginning of this year, we shared the previous proposal (GlobalFactSync) with different communities and received a lot of feedback. Since this proposal covers the same topic as the previous one, it is already known to the relevant communities.

Wikimedia & Wikidata Community

Wikipedians


For the last proposal we have received constructive feedback from many Wikipedians:

Updates


Social Media

DBpedia Community

Wikipedia & Wikidata Community

Endorsements


Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • Endorsements from the previous proposal
  • Support Support the synchronisation of facts is a necessary step towards an improved appreciation of what Wikidata has to offer. GerardM (talk) 11:32, 29 November 2018 (UTC)
  • Support Support, quality comparison of multilingual data is an actual problem and a challenging task. Lewoniewski (talk) 12:22, 29 November 2018 (UTC)
  • Support Support --Vladimir Alexiev (talk) 00:49, 30 November 2018 (UTC)
  • Support Support I find this project as a very good way to bring Wikidata and DBpedia closer together and benefit both communities. I think it also brings a very good way to feed sourced/cited statements from wikipedia projects (or add references to existing statements) to Wikidata which is something that will benefit Wikidata a lot Jimkont (talk) 09:27, 30 November 2018 (UTC)
  • Support Support Linked Data Community needs more synchronized and high quality of content to be consumed by the end-users, public and private sector stakeholders. YamanBeyza (talk) 10:15, 30 November 2018 (UTC)
  • Fact synchronization and fusion is key to increase Wikipedia's quality across languages in an efficient way. Hopefully it will eventually benefit Wikidata and its users too. Nicolastorzec (talk) 18:48, 30 November 2018 (UTC)
  • Strong support Strong support I would love to see more interaction between the DBpedia project and Wikidata. A collaboration with DBpedia to improve the extraction process from infoboxes sounds like a very natural project. − Pintoch (talk) 21:32, 30 November 2018 (UTC)
  • Support Support this is a promising approach to improve the data quality of Wikidata. The DBpedia community has a lot of experience in this area so seeing the two groups supporting each other more can't be bad (rather they should jointly serve the same purpose - making good data more accessible!). Lambdamusic 17:03, 4 December 2018 (UTC)
  • Support Support It seems an interesting idea, that provides another way to double check data quality in Wikidata and a peculiar way to bring Wikidata and DBpedia closer. Sannita - not just another it.wiki sysop 01:07, 6 December 2018 (UTC)
  • Support Support Synchronization would be very helpful! Digitaleffie
  • Support Support with the hope that this can help make a dent in the number of referenced statements on Wikidata and that we can learn some things from the project for my team's work on editing Wikidata from Wikipedia. --Lydia Pintscher (WMDE) (talk) 20:54, 6 January 2019 (UTC)
  • Support Support Essential problem to solve for the datas of the Wikiverse. Data isolation between projects is a problem for Wikipedias as high level datas available in some project are not easily reusable in other projects. As it seems a lot of wikipedians are relunctant to contribute directly to Wikidata to solve the problems, and data sharing is not an easy task, such a tool is a really interesting step towards a better data sharing. TomT0m (talk) 12:01, 7 January 2019 (UTC)
  • Support Support I think it is very important to strengthen the interplay between the largest data communities of the Web. White gecko

References
