Grants talk:Project/DBpedia/GlobalFactSync/Archive 1


Import reference to Wikidata - quality check - Reusing reference

Moin Moin together, I come from the German-speaking Wikipedia, so apologies in advance for my school English. I also find it difficult to transfer data to all Wikipedias, especially dates of birth and death, but also population figures. The data to be imported would have to comply with certain requirements so that further use can be guaranteed. I would follow the template de:Vorlage:Internetquelle / en:template:cite web, because you also have to consider what happens when a web link goes offline and is only available as an archive link.

I tried to create an example of how the data should be imported to Wikidata, so that further use becomes meaningful and understandable for everyone.

Link in infobox (examples):
|population=50 <ref>[https://examplewebsite.com ''reference'']</ref>
|population=50 <ref>{{Internetquelle |url=https://examplewebsite.com |titel=reference |zugriff=2018-01-16}}</ref>
|population=50 <ref>{{Internetquelle| autor=Example Man, Example woman| url=https://examplewebsite.com| titel=Reference| titelerg=of 2018| hrsg=examplewebsite.com| datum=2018-01-01| archiv-url=| archiv-datum=| zugriff=2018-01-16| sprache=de}}</ref>

Example to import (Modul Wikidata), in Wikidata:
title (P1476): references
reference URL (P854): examplewebsite.com
subtitle (P1680): of 2018
publication date (P577): 1. January 2018
author (P50): Example men, Example woman
retrieved (P813): 1. Januar 2018
archive URL (P1065): -
archive date (P2960): -
file format (P2701): -
original language of film or TV show (P364): german

The data can be displayed with the module Wikidata in other templates (like infoboxes etc.). So my question is: In what format is the data imported with the tool (GlobalFactSync tool or Primary Sources Tool)? For any inquiries, I'm willing to help. Regards --Crazy1880 (talk) 17:32, 16 January 2018 (UTC)

Thank you for pointing to the references of the templates. The data gets extracted by DBpedia from Wikipedia infoboxes and we will trace the respective references for single facts. We intend to make use of the existing Primary Sources Tool for providing data to Wikidata, i.e. datasets will be provided in the form of QuickStatements or Wikidata-compliant RDF. --Mgns (talk) 15:07, 17 January 2018 (UTC)
Crazy1880 Thank you for your interest in the project. Please feel free to register as a volunteer on the project page. We look forward to your ideas and support.--Juliaholze (talk) 16:34, 17 January 2018 (UTC)
Thanks again for the impulse. We adopted the idea in the proposal. --Mgns (talk) 14:01, 23 January 2018 (UTC)
When creating an example for book data import, please respect the decisions about the cataloguing of books. The property to be used for language is P407, not P364, which has been deprecated. See the discussions on Project Books and P364 for more info.
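The mapping from template parameters to reference properties discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the property IDs are the ones listed in the example above (with P407 in place of P364, as requested in the last comment), the function names are made up for this sketch, and the parser naively splits on "|", so it would break on nested templates or piped links. It is not the format used by GlobalFactSync or the Primary Sources Tool.

```python
import re

# Parameter-to-property mapping taken from the example in this thread.
# P407 is used for the language instead of P364, as pointed out above.
# Note: the thread maps "autor" to P50; whether a plain name string fits
# that property is left open here.
PARAM_TO_PROPERTY = {
    "titel": "P1476",         # title
    "url": "P854",            # reference URL
    "titelerg": "P1680",      # subtitle
    "datum": "P577",          # publication date
    "autor": "P50",           # author
    "zugriff": "P813",        # retrieved
    "archiv-url": "P1065",    # archive URL
    "archiv-datum": "P2960",  # archive date
    "sprache": "P407",        # language
}

TEMPLATE_RE = re.compile(r"\{\{\s*Internetquelle\s*\|(.*?)\}\}", re.DOTALL)


def parse_internetquelle(wikitext):
    """Return the named parameters of the first {{Internetquelle}} call."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return {}
    params = {}
    for part in match.group(1).split("|"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        params[key.strip().lower()] = value.strip()
    return params


def to_reference_claims(params):
    """Map non-empty, known template parameters to {property: value}."""
    return {
        PARAM_TO_PROPERTY[key]: value
        for key, value in params.items()
        if key in PARAM_TO_PROPERTY and value
    }
```

Empty parameters such as `archiv-url=` are dropped, and parameters without a property in the table above (here `hrsg`) are ignored rather than guessed.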

Query/Sparql

As Wikidata and DBpedia can both be queried with SPARQL, wouldn't it be possible to query both for the same item, find possible additional statements directly through query.wikidata.org, and upload them with QuickStatements? This especially as some DBpedia mappings have already been imported to Wikidata. --Jura1 (talk) 14:09, 20 January 2018 (UTC)

Currently there is no SPARQL endpoint containing all DBpedia language editions and it would become a mess to do so. So you could do this for English or German or Greek DBpedia separately, but this does not help you either. Secondly, there are no references extracted by DBpedia at the moment which is a major requirement for importing data to Wikidata. And last but not least this project is not primarily about loading data to Wikidata but to provide an overview over the variation of the same-but-different facts in Wikipedia infoboxes and Wikidata. --Mgns (talk) 21:19, 20 January 2018 (UTC)
Yes, quite a good idea. Let's see. So Wikidata is accessible via the item view, i.e. looking at one Q, and via SPARQL at query.wikidata.org. DBpedia only does an extraction based on the dumps. The way we are leaning at the moment is to map infobox values to Wikidata properties, so the extraction can be done directly for each wiki article. So maybe we can do it like this: Starting from a Qx, we can query all interlinked wiki articles and then do an extraction from the WP2WD mappings. This would serve as a QuickStatements input. Do you think this would work? --SebastianHellmann (talk) 14:26, 21 January 2018 (UTC)
I think people at Wikidata generally focus on a given property for a series of items or on several properties for a group of items. So a query for all missing dates of birth/death could be useful, but we already do most of that. --Jura1 (talk) 13:49, 30 January 2018 (UTC)
Ah, ok, I finally understand how you meant it, originally. There are quite a few options here. I think for this proposal, we are focusing on Wikipedia editors, who are mostly doing one article at a time. For your idea: The proposal here also focuses on extracting the references/sources from Wikipedia (and also suggesting new ones). These have a high value as such for Wikidata and can be loaded in numerous ways. We could also do it via SPARQL in the way that you suggested. However, we probably need to set up a special DBpedia SPARQL endpoint with all the language versions + the references and then it might get a bit awkward to query with SPARQL, although it would work. Might be that we can load it into the Blazegraph of Wikidata directly as Blazegraph has native support for metadata like references. This is something we need to think about a bit more and discuss with staff members. An extra API call from Quickstatements might also work. Is there an overview page of all the queries you do and the coverage? I have only seen this occasionally, also the tests, that are created. --SebastianHellmann (talk) 09:56, 31 January 2018 (UTC)
Making a full extraction of 120 languages normally takes 10-14 days on a 64-core server with 128 GB RAM and then needs 3-4 days to load into a SPARQL database. This might be a barrier, as the data is almost 3 weeks old this way, although we are porting to Spark now. We suggested using an on-demand, per-page extraction, which can be done live on request. --SebastianHellmann (talk) 10:52, 31 January 2018 (UTC)
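The idea raised in this thread (start from one item, extract the mapped infobox values from the interlinked articles, and hand the result to QuickStatements) could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the shape of the `extracted` dictionary, and the choice to emit only referenced facts. The output follows the tab-separated QuickStatements V1 syntax, in which source properties are written with an S prefix; values must already be formatted by the caller.

```python
def quickstatements_line(item, prop, value, reference_url=None, retrieved=None):
    """Build one tab-separated QuickStatements (V1) command.

    `value` is passed through unchanged, so it must already be in
    QuickStatements form (e.g. a plain number for a quantity).
    `retrieved` is an ISO date string such as "2018-01-16".
    """
    fields = [item, prop, value]
    if reference_url:
        # Source properties use the prefix S instead of P.
        fields += ["S854", '"{}"'.format(reference_url)]
    if retrieved:
        # /11 marks day precision.
        fields += ["S813", "+{}T00:00:00Z/11".format(retrieved)]
    return "\t".join(fields)


def statements_for_item(item, prop, extracted):
    """Turn per-wiki extraction results into de-duplicated commands.

    `extracted` maps a wiki name to a dict with "value" and, optionally,
    "reference_url" and "retrieved". Facts without a reference are skipped.
    """
    lines = []
    for wiki in sorted(extracted):
        fact = extracted[wiki]
        if not fact.get("reference_url"):
            continue
        line = quickstatements_line(
            item, prop, fact["value"],
            fact["reference_url"], fact.get("retrieved"),
        )
        if line not in lines:
            lines.append(line)
    return lines
```

Skipping unreferenced facts reflects the point made above that references are a major requirement for importing data to Wikidata; a different policy would only need a change in `statements_for_item`.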

Population number sample

Population numbers at Wikidata are somewhat problematic. I'm not sure if much is being added if we import more of them from Wikipedia (or a Wikipedia mirror). --Jura1 (talk) 14:09, 20 January 2018 (UTC)

Well, it is just one example. Could you tell us what the problem is? There is a number, a date and a reference. Should be straightforward. --SebastianHellmann (talk) 14:08, 21 January 2018 (UTC)
There is also the determination method. We came to the conclusion that, given their bulkiness, some are better stored at Commons in tabular format. --Jura1 (talk) 16:24, 21 January 2018 (UTC)
I don't know what to make of this comment. So they are supposed to be stored somewhere else? How do you query them then at query.wikidata.org? I see that the value as such is not as simple as e.g. a birthdate. But the claim/statement + scope + reference structure in Wikidata is designed exactly for such cases. --SebastianHellmann (talk) 12:10, 22 January 2018 (UTC)
for problem 1 , you need to go to the referenced data, from the government source, and build an API / automatic transfer method to wikidata. manual update does not scale. on english once upon a time RAMbot did a one time population of infoboxes, but it is stale. we have talked to US census but they have publishing model, not API. Slowking4 (talk) 14:13, 24 January 2018 (UTC)
Regarding manual updates, it is clear that they don't scale infinitely. The Wikiverse only has a certain number of cities. At the moment, you have 200 pages per city (across WP plus WD) where editors have to do an update. If we make it possible to see the other 199 values in a sensible way, this should help a lot and at least simplify the manual effort, hopefully to a level where it is manageable. Wikipedians have maintained this value in the infoboxes manually for a decade now. So it kind of works, it is just very hard work. --SebastianHellmann (talk) 14:35, 28 January 2018 (UTC)
There are probably at least ten reasons why this sample isn't that great. Some of them: (1) a good basis for data could be a specific source for the location (all years). (2) a good basis could be a specific source for all similar places in a region for a given year. (3) fields with multiple values are hard to maintain in Wikidata (interface-wise) (4) depending on the determination method, there can be several "best" values (e.g. current estimate, most recent census). (5) merely viewing values or even values and sources isn't going to help much to pick among them. (6) monthly values of (1.) should use tabular data at Commons. (7) some infoboxes store values centrally in the template and output the applicable value (8) .. --Jura1 (talk) 13:43, 30 January 2018 (UTC)
Ok, I see your point, maybe there is still time today (deadline) to change to another example. Regarding (5) we made a proposal to put some intelligence into the view. While this might not be perfect, I would argue that it helps a lot. From my experience, I would estimate that we can pinpoint the best value based on data analysis and rank it first in 30% of the cases, and rank it among places 2-5 in 60% of the cases. We can optimize this similarly to Information Retrieval. At the moment, this seems to be a rather manual process, where people are browsing in separate tabs for aid. --SebastianHellmann (talk) 10:23, 31 January 2018 (UTC)
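To make the "rank the best value first" idea from this thread more concrete, here is a deliberately simple sketch that orders candidate values by how many sources agree on them. It is a stand-in technique (plain agreement counting), not the data analysis the proposal has in mind, and it ignores the determination method and the date of each value, which were raised above as relevant. The function name and the sample data are invented for the example.

```python
from collections import Counter


def rank_values(values_by_source):
    """Rank candidate values by the number of sources that agree on them.

    `values_by_source` maps a source name (a Wikipedia language edition
    or "wikidata") to the value it currently shows. Returns a list of
    (value, count) pairs, most frequent first; ties are ordered by the
    string form of the value so that the result is stable.
    """
    counts = Counter(values_by_source.values())
    return sorted(counts.items(), key=lambda pair: (-pair[1], str(pair[0])))


# Hypothetical population values for one city.
candidates = {
    "enwiki": 3500000,
    "dewiki": 3500000,
    "frwiki": 3400000,
    "wikidata": 3500000,
}
ranking = rank_values(candidates)
```

With the sample data, the value shared by three sources comes first and the single diverging value second; an editor would still have to check the sources behind each candidate.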

This has huge potential - providing it's done right

As someone that's been working on migrating English Wikipedia infoboxes to use Wikidata for the last year or so, I can say that the biggest problem right now is to have sufficient information on Wikidata that it can immediately provide a decently filled out infobox. At the moment far too many new statements have to be added to Wikidata to reach basic infobox standards. It sounds like this proposal would do well at improving that.

However, once there is sufficient information on Wikidata to fill an infobox, you immediately run into the issue that the Wikidata information needs to be well-referenced and reliable - if it's not, then it won't be accepted by at least the English Wikipedia. Looking through this proposal, I can't see how it would achieve that. We can't just reference DBpedia - we need to look at sources that aren't derived from Wikipedia to provide good references. And sadly, most infoboxes are completely unreferenced (often by design). The references we need sometimes exist in the article content, and sometimes they are in external databases, and sometimes we need to find completely new sources for them. If this proposal can do that, then that would be fantastic - but it's a hard challenge, and falling short of this might make things worse rather than better.

I'm happy to engage in a conversation here if that would help. Thanks. Mike Peel (talk) 19:17, 20 January 2018 (UTC)

My experience as someone who has been importing data from multiple Wikipedias is that the biggest issue is quality. In awards, in where people went to school there is more than a 5% difference between what a Wikipedia says and what is known elsewhere. This proposal is not about populating info boxes, it is about the quality of the data. When we concentrate on the delta between the various sources, when we curate and source where we face issues, we target our effort on where it makes a difference. The current "source everything" approach is a blunt instrument that does not generate the result we could have when we face our issues. I have been happy to engage in a conversation and hope this helps. Thanks, GerardM (talk) 06:25, 24 January 2018 (UTC)
need to build a team to improve data quality. need to build methods to input data references from awarding institution, or data generating institution source. the "source everything" will harm the wikis that insist on that unreasonable standard, that is their cultural decision. Slowking4 (talk) 14:23, 24 January 2018 (UTC)
Hey @Mike Peel:, note that the old proposal was more focused on reference finding: GlobalFactSync. This was a cooperation with Diffbot and we are still doing the text analytics described there, i.e. mining the facts and references in the article text. Also it is easy to see how to include other tools for that, like StrepHit. We removed that from this proposal as it might distract from the step we have to make first, i.e. working with the references that are already in Wikipedia's infoboxes, but not in Wikidata. So, nobody is planning to reference Wikipedia or DBpedia, but there are ideas on how to bring in more and better references in future work. --SebastianHellmann (talk) 14:48, 28 January 2018 (UTC)

Other initiatives for value comparisons?

If the primary purpose of this is to compare values between infoboxes (at Wikipedia and Wikidata), how does it compare to other initiatives for that (at one or the other Wikipedia and at Wikidata)? How far is there interest for such a feature? --Jura1 (talk) 11:28, 21 January 2018 (UTC)

What other initiatives? VisualEditor does not show anything. Wikidata is sometimes used in infoboxes, but not very often. Do you know anything concrete? --SebastianHellmann (talk) 14:19, 21 January 2018 (UTC)
Ok, thanks for the links, we will look at them. What we are suggesting here would be a more systematic approach, covering all properties and all infoboxes (as well as possible), providing tools for comparison and also helping with the mapping. --SebastianHellmann (talk) 11:52, 22 January 2018 (UTC)
d:Q6823265: Merlissimo's project. Crosswiki comparison on dates of birth and death. (2012-2014) --Metrónomo-Goldwyn-Mayer 06:18, 29 January 2018 (UTC)
@Metrónomo: Was this done manually? It takes maybe 10 minutes for us to write a query producing these figures on a regular basis. --SebastianHellmann (talk) 09:53, 30 January 2018 (UTC)
Ok, got it, he is crossreferencing Wikidata with the Categories such as LivingPeople. Seems to work well, but only for properties like deathdate, i.e. if deathdate exists, the page should not be in the category. It doesn't go into the values as such, i.e. show varying deathdates between wikis. --SebastianHellmann (talk) 09:58, 30 January 2018 (UTC)
Was updated using MerlBot. According to its operator, it was written in Java. I do not know if he used Wikidata, but compared the categories of birth and death (mainly, death) in a certain list of Wikipedias. All biography without category of year of death was considered "living person". --Metrónomo-Goldwyn-Mayer 10:12, 30 January 2018 (UTC)
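The consistency rule described in this thread (if a date of death exists, the article should not be in a living-people category) amounts to a set intersection. The sketch below is an illustration only: it is not MerlBot's code, the function name and input shapes are assumptions, and the identifiers in the example are placeholders rather than real items.

```python
def living_category_conflicts(items_with_death_date, living_members_by_wiki):
    """Find articles categorised as living although a death date exists.

    `items_with_death_date` is a collection of item identifiers that have
    a date of death recorded. `living_members_by_wiki` maps a wiki name to
    the identifiers of the articles in its living-people category.
    Returns {wiki: sorted list of conflicting identifiers}, omitting
    wikis without conflicts.
    """
    dead = set(items_with_death_date)
    conflicts = {}
    for wiki, members in living_members_by_wiki.items():
        overlap = sorted(set(members) & dead)
        if overlap:
            conflicts[wiki] = overlap
    return conflicts


# Placeholder identifiers, not real Wikidata items.
dead_items = {"item-a", "item-b"}
living = {
    "enwiki": {"item-a", "item-c"},
    "dewiki": {"item-c"},
}
report = living_category_conflicts(dead_items, living)
```

As noted above, such a check only says that some death date exists; it does not compare the dates themselves between wikis.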

Experience with dbpedia data import into Wikidata ?

Even if the purpose of this feature is to compare data/sources, is there any experience from project participants with importing comparable Wikipedia-based dbpedia data into Wikidata? --Jura1 (talk) 17:39, 21 January 2018 (UTC)

We have been discussing this for years now. In order to make any data we have been extracting acceptable for Wikidata, we need to make an extension that extracts references. That was also a big barrier in the first place, otherwise Wikidata might have started out with DBpedia data as a seed. The reason why this didn't happen is that the DBpedia community is more focused on the data itself and not on the references. This would become the glue between the communities. --SebastianHellmann (talk) 11:55, 22 January 2018 (UTC)
Personally, I think it's hard to assess the proposal as it's untested whether such imports are (still) useful. --Jura1 (talk) 13:47, 30 January 2018 (UTC)
What about this figure from Wikicite 2017, starting on slide 15? The bar shows that 2/3 of the Wikidata statements are not referenced. So there is huge potential for this, if not for the actual facts. --SebastianHellmann (talk) 10:29, 31 January 2018 (UTC)

Digging your own grave?

DBpedia is slowly becoming less relevant. More data gets added to Wikidata and less data is missing compared to the infoboxes. At some point Wikidata will have more and better-sourced data than the Wikipedia infoboxes. Why would anyone still use DBpedia if we reach this point? Why would we invest in a project that is slowly going to fade? Multichill (talk) 18:21, 22 January 2018 (UTC)

DBpedia is becoming really relevant now. Wikidata has strategic importance for DBpedia and we have been trying for years to contribute in any way. The reference extraction is crucial here, but we don't have the resources to do this, hence the proposal. If you read the proposal, the grant is not about supporting DBpedia. The goal here is to exploit the technologies we have to improve Wikipedia and Wikidata. I see your point that the DBpedia Extraction Framework will not be so relevant if we are successful in this project. From a research perspective, this was super successful (see the 5k citations from Soeren). We are working on other things now, so the real value for DBpedia lies in receiving good data from all Wikimedia projects. So the perspective here is to move from a parallel effort to a symbiotic relation. --SebastianHellmann (talk) 09:58, 23 January 2018 (UTC)
Dear Multichill, it's unfair to make claims/statements without any supporting evidence. How did you measure relevance? Can you please drop some numbers/evidence here? Have a look at Google Trends for the term DBpedia - I see quite a strong and constant DBpedia trend. Also, let's take this from a scientific point of view: there are over 22K research papers mentioning DBpedia in their title, while only 4K for Wikidata. My point here is not to compare the popularity of DBpedia vs Wikidata, but more to confirm the relevance of DBpedia. Another important thing - this project is not about supporting DBpedia, but about developing technology that many existing and future Wikimedia projects would benefit from. This is not about who has more or less data, but about management, comparison and import of data across different Wikimedia projects.--M1ci (talk) 13:07, 23 January 2018 (UTC)
Given that DBpedia imports all Wikipedias and Wikidata, and given that it has its existing audience who know why to trust their effort, it is safe to say that they have a bright future ahead of them. If anything, we would do better when we cooperate with DBpedia. Thanks, GerardM (talk) 06:27, 24 January 2018 (UTC)

Important: Change your proposal status from "draft" to "proposed" to submit by deadline

User:SebastianHellmann,

Please note that you must change your proposal status from "draft" to "proposed" to submit your proposal for review in the current round. You have until end of day today to make the deadline.

Warm regards,

--Marti (WMF) (talk) 21:49, 31 January 2018 (UTC)

Thanks for the reminder, I set it to proposed. I still see some small edits. I hope it is ok, if I edit another hour or so. --SebastianHellmann (talk) 22:54, 31 January 2018 (UTC)
SebastianHellmann, no problem to keep editing. Good luck! --Marti (WMF) (talk) 22:56, 31 January 2018 (UTC)

Estimated number of references available ?

Is there a break-down by property value of the 500,000 estimated references for statements and the 100,000 additional statements? From the experience of several members of the Wikidata community, the number of references available at Wikipedia directly in infoboxes is rather low. This even in a few languages where the information is only available in the infobox and not sourced in the article text itself. --Jura1 (talk) 15:09, 1 February 2018 (UTC)

  • Another number that could be interesting is the break-down by property of number of diverging statements with references (this seems to be main focus of the tool, but no indication is given). Are these included in the 500,000? --Jura1 (talk) 09:47, 2 February 2018 (UTC)
The numbers you are asking for here are the results of tasks A1 and A3, which last 6 months. We can give a very detailed breakdown after the completion of these tasks. We are also very interested in these statistics as they will guide us where to effectively improve the Wikiverse. Do the several people you mention from Wikidata have any further insight on this? The 500,000 from our side is an educated guess. We are extracting 14 billion facts from all Wikipedias. 500,000 means that 3-4% have references, which seems about right. To be on the safe side, there is a parallel effort to extract facts and references from the Wikipedia text by the DBpedia NLP department led by Diffbot plus a consortium of universities and other orgs, so I think that we can definitely reach 500,000, but as I said, it is an educated guess with tasks A1 and A3 bringing definitive clarity on this.
The numbers we can give you are the references for 120 languages; you can go here (http://downloads.dbpedia.org/2016-10/core-i18n/) and then check each language and look for citation-data and citation-links (around 2.5GB for English). These references are from the HTML of the wiki page and are anchored to the text. Relating them to the facts from the infoboxes or text is quite hard. We can also send you full statistics on which facts are in infoboxes and not in Wikidata (we did this for three properties in the grant, but soon we will have a complete view on this). --SebastianHellmann (talk) 10:24, 2 February 2018 (UTC)
Estimations would be fine. Is it possible to say which properties these 500,000 / 100,000 facts relate to? If there are just 100 population datapoints for 5000 / 1000 locations, this isn't really worth the exercise.
From the dbpedia resources available, is there a way to say what type of facts these are? Maybe you could base yourself on enwiki/dewiki where you seem to have sparql access.
For the tool to work, shouldn't there be multiple sourced conflicting facts? At Wikidata, it's fairly easy to spot items that have conflicting claims. Can't this be done at dbpedia? Given that this has taken some time to build and improve at Wikidata, I'm not sure why a WMF grant would be needed to build this functionality at dbpedia. --Jura1 (talk) 08:46, 3 February 2018 (UTC)
The conflict finding is not funded by this proposal. We have a whole team of 4-5 people working on conflict finding for the fusion algorithms in DBpedia for the whole of LOD. We estimated 2 of the 4 people with 10h/week to bring these algorithms into the proposed project. The grant is about extracting the references and then making this information visible to the Wikiverse. We are not in need of funding DBpedia via this grant, we are fine. We just don't have any surplus to do this at our own expense and, if you read it, the project does not benefit DBpedia directly. We don't need the references for our users. The goal here is really to make infoboxes and Wikidata better and form consensus and collaboration between all communities.
Also I don't understand what could be a conflict in Wikidata. The model is claim-based and claims are always correct per definition as they cite the facts of the source. So what would be a conflict here? Maybe you mean that you are rating the claims by confidence of their correctness?
Last year the CTO of DBpedia made the W3C SHACL standard. SHACL shapes can be partially generated automatically and also transferred between ontologies.
I can check next week what kind of statistics we can produce at the moment. This is some work though and might take a week or two. --SebastianHellmann (talk) 08:18, 4 February 2018 (UTC)
It does mention "main information an editor can quickly gain here is that there are currently 8 different values". (If the analysis wouldn't drop two relevant characteristics of population numbers,) this would indicate that there is a conflict.
As I understood it, presenting such diverging values is the main idea of the project.
Statements that are reasonably sourced (i.e. references correctly extracted) without any conflicting values could easily be added directly to Wikidata. Sample: add the date of birth with reference for 8 people with currently no d:Property:P569 in the corresponding Wikidata items. In the unlikely case that there were 8 different sourced values for the same person, it might be worth checking which rank each statement should get rather than adding them automatically. --Jura1 (talk) 20:47, 6 February 2018 (UTC)
Hi @Jura1:, we asked around and in August 2016 somebody from the community did an experiment and tried to match facts to references. The dataset is here: data, preview. I am counting 913,695 for English. I need to investigate more, who made that data and how it was created and what is the quality, but it seems to be feasible to get 500k references, especially, if we extend beyond English. --SebastianHellmann (talk) 08:35, 13 February 2018 (UTC)
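The distinction drawn in this thread, between sourced facts that agree (candidates for direct import) and sourced facts that diverge (candidates for review and ranking), can be sketched as a grouping step. This is a minimal illustration with assumed names and an assumed tuple layout; it treats any two different values as a conflict and therefore ignores qualifiers such as date or determination method, which the discussion above flags as important for population numbers.

```python
from collections import defaultdict


def partition_facts(facts):
    """Split sourced facts into uncontested and conflicting groups.

    `facts` is an iterable of (item, prop, value, reference) tuples.
    Facts with an empty reference are ignored. Returns two dicts keyed
    by (item, prop): the single agreed value, and the sorted list of
    diverging values.
    """
    grouped = defaultdict(set)
    for item, prop, value, reference in facts:
        if reference:
            grouped[(item, prop)].add(value)
    uncontested = {
        key: next(iter(values))
        for key, values in grouped.items()
        if len(values) == 1
    }
    conflicting = {
        key: sorted(values)
        for key, values in grouped.items()
        if len(values) > 1
    }
    return uncontested, conflicting


# Placeholder items and references; P569 is the date-of-birth property
# mentioned in the thread.
sample = [
    ("person-1", "P569", "1950-03-01", "https://examplewebsite.com/a"),
    ("person-1", "P569", "1950-03-01", "https://examplewebsite.com/b"),
    ("person-2", "P569", "1961-07-12", "https://examplewebsite.com/c"),
    ("person-2", "P569", "1962-07-12", "https://examplewebsite.com/d"),
    ("person-3", "P569", "1970-01-01", ""),
]
uncontested, conflicting = partition_facts(sample)
```

In this sample, the first person ends up in the uncontested group, the second in the conflicting group, and the third is dropped because the fact has no reference.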

Basic infobox mapping

There is a feature at https://phabricator.wikimedia.org/T69659 , but it seems there isn't much interest in this at Wikipedia. --Jura1 (talk) 15:09, 1 February 2018 (UTC)

Thanks for pointing to this discussion. We are happy that other users see good reason in maintaining these mappings as well and that there is adoption in some languages already. Even when it is not in scope of this project to import Wikidata into infoboxes automatically, the TemplateData maps look promising for our needs. For certain, we will assess how to align this concept to the project and reach compatibility in order to avoid duplicate effort. --Mgns (talk) 13:28, 5 February 2018 (UTC)

Eligibility confirmed, round 1 2018

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for round 1 2018 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through March 12, 2018.

The committee's formal review for round 1 2018 will occur March 13-March 26, 2018. New grants will be announced April 27, 2018. See the schedule for more details.

Questions? Contact us.

--Marti (WMF) (talk) 01:53, 17 February 2018 (UTC)

Comment

It's a bit difficult to see how this would benefit the Wikimedia movement. If I understand correctly, what the team wants to do is a lot of difficult technical work with relatively small outcome for Wikimedia projects. I asked for clarification at the Czech Village pump (where notification was posted) but unfortunately received no response - if you could look into it (@M1ci:) that would be great.--Vojtěch Dostál (talk) 10:23, 19 February 2018 (UTC)

Dear Vojtech, I have just replied to your comment at the Czech Village pump, sorry for the delayed response. M1ci (talk) 13:26, 20 February 2018 (UTC)

Aggregated feedback from the committee for DBpedia/GlobalFactSync

Scoring rubric (with scores)
(A) Impact potential: 7.6
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Community engagement: 7.8
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
(C) Ability to execute: 8.0
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient?
  • Do the participants have the necessary skills/experience?
(D) Measures of success: 6.8
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
Additional comments from the Committee:
  • The proposal fits with Wikimedia's strategic priorities and has great potential for online impact. However, its sustainability and scalability are less clear because the proposed tool may quickly fall into irrelevance or never have a sufficient number of users.
  • High potential of cross-wiki impact, as high-quality data with references on Wikidata is likely to improve a lot of Wikimedia projects. Minor concern regarding sustainability once the grant ends (tools should continue to be live and maintained)
  • The project looks innovative. Its potential impact is high but there are significant risks. The main risk is that few people will use their tool/website. The success can be measured. The project is more narrowly focused and less risky than its predecessor - CrossWikiFact.
  • A mix of innovation and iteration. On iterative side, DBpedia has a significant experience with extracting data from infoboxes and a good track record. On innovative side, this data was never properly integrated to Wikidata, and it is likely to generate a significant impact. Measures of success look good.
  • The project can be accomplished in the requested 12 months and the budget appears to be realistic. The participants probably have necessary skills.
  • DBpedia has a good experience, planning and budget are all reasonable, pretty sure this is feasible as planned.
  • The community engagement appears to be limited - basically only the Wikidata community - the same as with CrossWikiFact.
  • There is rather high community support, including both Wikidata editors and Wikipedians from underrepresented communities.
  • Considering the sharing of costs, I am slightly in favor of supporting it. We know that DBpedia is a relevant project, still used by a lot of websites.
  • I supported the previous variant of this proposal - CrossWikiFact. I will support it this time for the same reasons, although the proposal has become more narrowly focused and generally better, in my opinion.
  • Support full funding
  • Good project, with specific goals, reasonable plan and budget and good community engagement. Don't see any red flags thus full funding.

This proposal has been recommended for due diligence review.

The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.


Next steps:

  1. Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
  2. Following due diligence review, a final funding decision will be announced on Thursday, May 27, 2021.
Questions? Contact us at projectgrants (_AT_) wikimedia  · org.


Discussion on Infobox RFC

We added a comment here: https://en.wikipedia.org/wiki/Wikipedia:Wikidata/2018_Infobox_RfC#A_third_way? The section highlights an alternative way to treat infobox edits, that is quite in line with our proposal. SebastianHellmann (talk) 12:24, 3 May 2018 (UTC)

Vandalism

As we read the infobox RFC, we saw that vandalism seems to be one of the main problems in the usage of Wikidata in infoboxes. It was not a focus in the proposal as we need to focus on the basics first. However, I am quite confident that it would be easy to shape the algorithms a bit in the direction of vandalism detection. Since we are monitoring all values anyhow, the vandalised value would show up as an outlier compared to other values, i.e. significantly different from other information. We are also working on versioning in a separate project at DBpedia, which would help to compare new values to old values. SebastianHellmann (talk) 12:24, 3 May 2018 (UTC)
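One way to read "the vandalised value would show up as an outlier" for numeric fields is a robust outlier test across all sources. The sketch below uses the median absolute deviation (MAD) with the commonly used modified z-score cut-off of 3.5; this is a swapped-in textbook technique chosen for illustration, not an algorithm described in the proposal, and the function name and sample heights are assumptions.

```python
from statistics import median


def find_outliers(values_by_source, threshold=3.5):
    """Return the sources whose numeric value is far from the others.

    Uses the modified z-score 0.6745 * |x - median| / MAD. If the MAD is
    zero (more than half of the values are identical), every value that
    differs from the median is reported.
    """
    values = list(values_by_source.values())
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    flagged = []
    for source, value in values_by_source.items():
        deviation = abs(value - med)
        if mad == 0:
            is_outlier = deviation > 0
        else:
            is_outlier = 0.6745 * deviation / mad > threshold
        if is_outlier:
            flagged.append(source)
    return sorted(flagged)


# Hypothetical tower heights in metres as shown by different sources.
heights = {
    "enwiki": 324,
    "dewiki": 325,
    "frwiki": 324,
    "eswiki": 330,
    "itwiki": 320,
    "wikidata": 3240,
}
suspicious = find_outliers(heights)
```

Small disagreements such as 320 versus 330 stay below the cut-off in this sample, while the value that is off by a factor of ten is flagged. A flagged value is only a hint for an editor, since a correct value that was updated in one place first would look the same.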

Cofinancing

I want to stress how much I appreciate that this grant request includes 35 k€ in cofinancing by the DBpedia association, although I'm not sure whether those "2 developers from the DBpedia Association working on extraction and fusion will support the project" are an actual expense or a figurative one. --Nemo 09:10, 6 May 2018 (UTC)

Hi Nemo, in the proposal we focused on specifying what the Wikimedia funding will be used for. We have already spent quite some time on preparing the proposal and building the prototypes. In the end, the cofinancing covers the part of the DBpedia infrastructure/tools that needs to be adjusted in order to match the requirements of the work in the proposal. It would be inadequate to let Wikimedia pay for DBpedia development. The only extension covered by the grant is the reference extraction. The cofinancing covers: 1. changes to http://mappings.dbpedia.org in case we find a good place to merge the effort of mapping the data; 2. the adaptation and deployment of the live extraction server, i.e. the component that lets you extract data on the fly (e.g. http://mappings.dbpedia.org/server/mappings/en/extractionSamples/Mapping_en:Infobox_building extracts some examples for Infobox Building); 3. patches to any software tool that are necessary to improve the results of the project.
In short, the number is definitely not figurative, as we expect quite some workload that is not covered by the grant. We are willing to invest this because we expect that we (WP, WD, DBpedia and so the rest of the world) will get better data in the end. SebastianHellmann (talk) 11:58, 7 May 2018 (UTC)

Prototype with more focus

Eiffel Tower
Tour Eiffel
General information
Type: Observation tower, Broadcasting tower
Location [Sync]: 7th arrondissement, Paris, France
Coordinates [Sync]: 48.858222°N 2.294500°E
Construction started [Sync]: 28 January 1887
Completed [Sync]: 15 March 1889
Opening [Sync]: 31 March 1889 (129 years ago)
Owner [Sync]: City of Paris, France
Management [Sync]: Société d'Exploitation de la Tour Eiffel (SETE)
Height
Architectural [Sync]: 300 m (984 ft)[1]
Tip [Sync]: 324 m (1,063 ft)[1]
Top floor [Sync]: 276 m (906 ft)[1]
Technical details
Floor count [Sync]: 3[2]
Lifts/elevators [Sync]: 8[2]
Design and construction
Architect [Sync]: Stephen Sauvestre
Structural engineer [Sync]: Maurice Koechlin, Émile Nouguier
Main contractor [Sync]: Compagnie des Etablissements Eiffel
Website
.paris
References
I. ^ Tower at Emporis

We thank the committee for the feedback and would like to address two interconnected points with this reply. Point 1: We are aware that our community engagement has not yet been very extensive. This is mainly because feedback is best discussed on the basis of a working prototype, as users can see the tool, play around with it and criticize or praise it; this follows Wikipedia:Rapid_application_development. Point 2: Our proposal is written on a meta-level, i.e. it describes how to tackle the problem from a bird's eye view. However, it is very hard to see how this would work out for the individual editor, who is only concerned with the articles she normally chooses to edit. During the last months we have worked in the background on a better engine to aggregate data, and we are now able to provide another prototype for part of the Wikipedia:Template:Infobox_Building of Wikipedia:Eiffel_Tower. We will also post the prototype on Wikipedia:Wikidata/2018_Infobox_RfC to collect first feedback there and to start collecting ideas for the 10 Sync Targets.

The prototype is a hard-coded version of the interactive user script. If activated, the script will insert a sync button next to the template parameter. The sync button could also use different colors, such as green/red, to signal that the attribute might need attention. Upon clicking, the editor is currently redirected to an external page to see the other values. At the moment, this is still an external website, but the underlying code is written in JSON/JavaScript (full data for 50 million articles from wd,en,nl,de,sv,fr is here) and it can be integrated seamlessly into a UserScript and the Wikipedia UI, i.e. with a pop-up box.
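The green/red signal described above could be derived by comparing the value of the requesting source with the values of the other sources. The following is a minimal Python sketch of that idea, not the prototype's actual JavaScript; the function name `sync_status`, the simple majority rule and the sample dates are assumptions made for illustration.

```python
from collections import Counter

def sync_status(values, source):
    """Return 'green' if `source` holds the most common value, else 'red'.

    `values` maps a source name to its value for one infobox attribute.
    On a tie, the value that was seen first counts as the most common one.
    """
    counts = Counter(values.values())
    top_value, _ = counts.most_common(1)[0]
    return "green" if values[source] == top_value else "red"

# Hypothetical opening dates for one attribute; the frwiki value is made up.
opening = {"enwiki": "1889-03-31", "dewiki": "1889-03-31",
           "frwiki": "1889-05-06", "wikidata": "1889-03-31"}
print(sync_status(opening, "enwiki"))
print(sync_status(opening, "frwiki"))
```

A majority rule is the simplest possible choice; a real tool would likely also need to normalise formats (dates, units) before comparing values.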

The presented info is focused on the specific source from which the request comes, in this case ENWIKI, and highlights the value of that source for easier orientation. Note that the data in the prototype is still not up-to-date and that references are still missing (part of this proposal).

Note that this is just another prototype, which might change significantly based on the community feedback. Also, the data is still very sparse.

Round 1 2018 decision

This project has not been selected for a Project Grant at this time.

We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding. This was a very competitive round with many good ideas, not all of which could be funded in spite of many merits. We appreciate your participation, and we hope you'll continue to stay engaged in the Wikimedia context.

Comments regarding this decision:
We will not be funding your project this round. This was a very intriguing project idea and many reviewers were interested in supporting the project. However, this round there were many proposals submitted that the committee was interested in funding, and not enough funds to award grants to all of them. They determined that there were other proposals that were a higher priority for funding at this time.

Next steps: Applicants whose proposals are declined are welcome to consider resubmitting their application in the future. You are welcome to request a consultation with staff to review any concerns with your proposal that contributed to a decline decision, and to help you determine whether resubmission makes sense for your proposal.

Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees and we'd appreciate it if you could help us spread the word to strong candidates--you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.


Quality issues in Wikipedia/DBpedia

On behalf of the research group from Poznań University of Economics and Business I would like to share a few words on what we do with regard to this grant. We basically build machine learning models for estimation of quality of Wikipedia articles based on features extracted from the article. We take into account articles in various language versions linked via interwiki links.

The methods we develop could be useful for the project with respect to the resolution of disputes around facts. If there are several values, we can provide the quality of each source and use it as a weight for assessing the most appropriate value. For this we need an estimation of the quality of each article. Such estimations, so far based on a simple method, can be found for any article in one of the 55 most active languages at Wikirank, a service for presenting the comparative quality and popularity of articles.
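Using quality as a weight could look roughly like the following Python sketch. It is an illustration under assumptions only: the function name `pick_value`, the additive scoring and the quality scores are hypothetical and are not taken from Wikirank.

```python
from collections import defaultdict

def pick_value(candidates, quality):
    """Return the value whose supporting sources have the highest total quality.

    `candidates` maps a source name to the value it states.
    `quality` maps a source name to a quality score (missing sources count as 0).
    """
    scores = defaultdict(float)
    for source, value in candidates.items():
        scores[value] += quality.get(source, 0.0)
    return max(scores, key=scores.get)

# Hypothetical values and quality scores for one disputed attribute.
candidates = {"enwiki": 300, "dewiki": 324, "plwiki": 324}
quality = {"enwiki": 0.9, "dewiki": 0.5, "plwiki": 0.3}
print(pick_value(candidates, quality))
```

With these made-up scores the single high-quality source outweighs the two lower-quality ones; raising either of the lower scores slightly would flip the result, which shows how sensitive such a scheme is to the quality estimates.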

One of the important quality dimensions considered is credibility. Therefore, we also develop methods for extracting references from articles, and we evaluate their importance based on the place of publication (e.g. impact factors for journals). Both in-text references and references in infoboxes are considered. --KrzysztofWecel (talk)

Some of our research has already been presented to the Polish Wikipedia community (e.g. during Wikimedia Polska 2016). --Lewoniewski (talk) 15:08, 18 May 2018 (UTC)

Some of our works related to extraction and analysis of references in Wikipedia also received awards: