Grants talk:Project/Hjfocs/soweego

From Meta, a Wikimedia project coordination wiki

Quality ecosystem

Step number 1 is to attach external identifiers to facts, but we also need to distinguish between the quality of external identifiers and whether we can trust them.

I have been working with P2949 (WikiTree), P535 (Find a Grave), and P4159 (WeRelate), which are all community-based research. I have also been working with P3217, the Dictionary of Swedish National Biography (SBL), which is published by the Swedish National Archives and created by professionals.

My experience is that there is a quality difference, and a source like P3217 (SBL) is an external source that Wikidata should use to build an “ecosystem” where Wikidata compares facts with a “trusted external identifier” like SBL; if there is a difference, a report is produced and an action is taken. By creating this “ecosystem”, Wikidata will more easily find “fake” facts and can correct them.

By using infoboxes in Wikipedia that draw on facts in Wikidata checked against a “trusted” external source, the reader of Wikipedia will have more trust in the content of the article.

Maybe the infobox could display a warning if the data in Wikidata differs from the “trusted external source”. -Salgo60 (talk) 17:31, 24 September 2017 (UTC)

@Salgo60: thanks a lot for your feedback, very much appreciated.
Let me recap it in a sentence, to see if I understood your point correctly: the broad challenge is that Wikidata content should be supported by references to external, reliable sources.
I totally agree with you, and the proposal reflects that position (cf. Grants:Project/Hjfocs/soweego#Why:_the_problem). Zooming in, if you think of an external identifier as a reference to a Wikidata item, the same argument applies (cf. Grants:Project/Hjfocs/soweego#Where:_use_case). That's also why I cast the question to the community, asking which identifiers are worth covering (cf. Grants:Project/Hjfocs/soweego#Community_notification; https://lists.wikimedia.org/pipermail/wikidata/2017-September/thread.html#11109)
By the way, this is pretty much similar to what StrepHit did when building its whitelist of trusted external sources for biographies (cf. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline#Sources_identification).
In conclusion, working with the community to come up with a whitelist of external identifiers is a high priority in this proposal. Currently, I have identified 2 targets (VIAF, MusicBrainz), which are both open and rank high in terms of usage in Wikidata content. However, I'm more than happy to revise those targets based on specific feedback.
Cheers,
--Hjfocs (talk) 14:11, 25 September 2017 (UTC)
Sorry, I am on a mobile phone on a bike tour in Poland, but my point is: yes, a whitelist. But we also need a system telling the reader that this information is confirmed by an external source with a quality system and reviewed by professional historians. How this should be formalised is outside my knowledge. It should not be up to the reader to know what quality VIAF has, etc.
When speaking to university teachers in Sweden, they tell me that if a student doesn't check with the SBL mentioned above, they need to go home and redo the homework. A reference like that is something Wikidata can trust, and we feel this trust should be communicated back to the reader in an easy way. - Salgo60 (talk) 21:56, 25 September 2017 (UTC)

Privacy issues with personal information

On EN WP we say "Posting... personal information is harassment, unless that person has voluntarily posted his or her own information, or links to such information, on Wikipedia... This applies to the personal information of both editors and non-editors."

This is clarified with "Content and sourcing that comply with the biographies of living persons policy do not violate this policy; neither do discussions about sources and authors of sources, unless comments about persons are gratuitous to determining source quality."

With respect to specifics we say "dates of birth that have been widely published by reliable sources, or by sources linked to the subject such that it may reasonably be inferred that the subject does not object".

WD does not appear to have many notability requirements, thus many of these links will likely be created for non-notable or borderline-notable people. I also do not believe that just because someone makes a tweet or posts something on Facebook, it automatically implies that the subject does not object to the details being spread further.

I am concerned that this proposal will make people's personal lives and data too easy to find. Doc James (talk · contribs · email) 16:56, 24 September 2017 (UTC)

I just wanted to add the same comment. The proposal seems to ignore privacy issues. Personal information about living people should not be processed automatically as proposed, but handled by human editors, with judgement and responsibility. -- JakobVoss (talk) 07:13, 25 September 2017 (UTC)
@Doc James and JakobVoss: thank you both for your concerns, which raise an essential point. I'll take this opportunity to explain why the proposal does not aim at importing any personal information at all, by replying to your arguments.
  • Posting personal information is harassment
soweego does not post any information, but only aligns a source database (Wikidata) with a set of target ones. No data imports occur (cf. Grants:Project/Hjfocs/soweego#How:_the_solution), and no content gets automatically propagated into the knowledge base. It's rather a matter of statements about external identifiers, e.g., Robert De Niro (Q36949), VIAF ID, https://viaf.org/viaf/114969074
  • Personal information about living people should not be processed automatically like proposed but handled by human editors
This is exactly what is foreseen: the results of the alignments are not directly added to Wikidata, but first cast to the community for approval or rejection (cf. Grants:Project/Hjfocs/soweego#Project_goals and Grants:Project/Hjfocs/soweego#Implementation). Confident links will be made available via a bot, which generally requires a structured discussion prior to the eventual approval of the request. Non-confident links will instead undergo manual validation via the primary sources tool.
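The confidence-based routing described above can be sketched as follows. This is an illustrative Python snippet, not soweego's actual code: the 0.9 threshold and the record fields are assumptions.

```python
# Sketch: split candidate alignments into a bot queue (confident) and a
# manual-curation queue (non-confident). Threshold and fields are invented.

CONFIDENCE_THRESHOLD = 0.9

def route_matches(matches):
    """Route each candidate link by its confidence score."""
    bot_queue, curation_queue = [], []
    for match in matches:
        if match["confidence"] >= CONFIDENCE_THRESHOLD:
            bot_queue.append(match)       # would be added via an approved bot
        else:
            curation_queue.append(match)  # would go to the primary sources tool
    return bot_queue, curation_queue

matches = [
    {"qid": "Q36949", "viaf": "114969074", "confidence": 0.98},
    {"qid": "Q7086766", "viaf": "305461157", "confidence": 0.55},
]
bot, manual = route_matches(matches)
```

In this sketch, the bot queue would feed a bot request discussion, while the curation queue would be uploaded for item-by-item human review.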
I hope this makes it completely clear that soweego is not affected by privacy issues.
Last, but not least, concerning notability on Wikidata: WD does not appear to have much for notability requirements thus many of these links will likely be created for non or borderline notable people.
I'm more than happy to include a filter that excludes items in Wikidata with no site links to Wikipedia. However, I would love to have a confirmation from the Wikidata community to first understand whether this is an issue.
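A minimal sketch of that filter, assuming items come in the Wikidata entity JSON shape (a "sitelinks" object keyed by site ID); the sample items are invented for illustration:

```python
# Sketch: keep only items that have at least one Wikipedia sitelink.
# Site IDs ending in "wiki" (excluding "commonswiki") are treated as
# Wikipedia editions; the item dicts below are invented examples.

def has_wikipedia_sitelink(item):
    return any(site.endswith("wiki") and site != "commonswiki"
               for site in item.get("sitelinks", {}))

items = [
    {"id": "Q36949", "sitelinks": {"enwiki": "Robert De Niro"}},
    {"id": "Q99999999", "sitelinks": {}},  # hypothetical item, no sitelinks
]
notable = [i for i in items if has_wikipedia_sitelink(i)]
```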
Cheers,
--Hjfocs (talk) 15:48, 25 September 2017 (UTC)
There is also a comment on this topic in the endorsement section of the proposal. Personally, I have tried to use the primary sources tool with the Soweego dataset (see the "proof of work" section) in the past days and I got to add links to the official Twitter profiles of encyclopedic people (all athletes that have an article on en.wiki and also an official Twitter account). I think that:
  • filtering items that have at least one Wikipedia article, or using the primary sources tool, solves any concern about notability;
  • filtering only for verified social media accounts (this was the case for 4 out of 5 of my tests, where the remaining case was not the Twitter account of the athlete, but one of a fan club). Given that Facebook and Twitter have "verified accounts" (with blue marks), I do not see how this would constitute doxing, since it is equivalent to having the person in question publicly declare that this is her social account.
--CristianCantoro (talk) 16:47, 26 September 2017 (UTC)
CristianCantoro, thanks for your message. I agree that filtering in those two ways is a small improvement over not doing so. Even so, it is insufficient to raise the original Soweego proposal to the standard that should be required for WMF grant funding. Why so?
  • Social media account verification is not entirely reliable.
  • Social media lacks editorial standards and accountability. Social media companies' account verification processes seem to be about as close as those companies come to having either, and those efforts are unreliable (see above). IMO, the WMF should not fund the placing of public trust in organisations with such absent editorial standards and accountability. Nor should the WMF fund what could (cynically) be viewed as a sophisticated effort to drive traffic to such websites.
  • The existence of a Wikipedia article for a person does not guarantee that the person is notable. Wikipedia articles about non-notable people can be created by fans or vandals. Eventually, these articles should be discovered and deleted by other editors, but in the meantime an algorithm such as the one you suggest would be misled into assuming the subject was notable.
I'm sorry I wasn't able to agree with you, but I hope you at least feel I have addressed the points you wanted me to. Thanks again for reaching out, Zazpot (talk) 12:07, 29 September 2017 (UTC)
The claim that "soweego is not affected by privacy issues" - stressed as hopefully "completely clear" - for me indicates that the crucial point in the concerns of User:Doc James, User:Zazpot and User:JakobVoss is not addressed, namely that the alignment alone (creating a triple <QID> <PID> "some_identifier") is a privacy breach, if "some_identifier" links to a pseudonymous identity not unquestionably revealed by the person herself. I cannot see in which way the tool - in its originally suggested form, or with a later-added Twitter/Facebook/whatever linker plugin - would address this issue. Jneubert (talk) 09:47, 6 October 2017 (UTC)

┌─────────────────────────────────┘
Please see the #Summarized_response_to_#Privacy_issues_with_personal_information section for a condensed reply to this topic. --Hjfocs (talk) 15:39, 1 December 2017 (UTC)

I thought about privacy issues here, but (speaking with my sociologist hat on, and thinking about ethics in academia) I don't really see an issue, as long as linked social media accounts are verifiable. I looked at Barack Obama, for example: he has two verified Twitter accounts. I see nothing here violating privacy. Perhaps if someone could present a practical, if hypothetical, privacy-concerning example, I could comment further, but so far I am not seeing any serious issues here. --Piotrus (talk) 09:28, 28 August 2018 (UTC)

Questions about past projects

@Cgiulianofbk: In the Participants section, the following links are provided:

These projects look very interesting but need more love from their maintainer! − Pintoch (talk) 14:32, 25 September 2017 (UTC)

@Pintoch: thanks for reporting the broken link. I asked Cgiulianofbk for insights: unfortunately, the pokedem.it project is not online anymore. I'll update the bio accordingly. On the other hand, feel free to contact him for the Wiki Machine.
Cheers,
--Hjfocs (talk) 15:55, 25 September 2017 (UTC)
Forgot to specify that pokedem.it (use case on Italian politicians) is indeed offline, but the project is in active development. You can have a look at a recent demo here: https://www.youtube.com/watch?v=SE7z0y61a9U ; development website here: http://demo.futuro.media/
Cheers,
--Hjfocs (talk) 16:09, 25 September 2017 (UTC)

September 26 Proposal Deadline: Reminder to change status to 'proposed'

As posted on the Project Grants startpage, the deadline for submissions this round is September 26, 2017. To submit your proposal, you must (1) complete the proposal entirely, filling in all empty fields, and (2) change the status from "draft" to "proposed." As soon as you’re ready, you should begin to invite any communities affected by your project to provide feedback on your proposal talkpage.

Warm regards,
--Marti (WMF) (talk) 04:41, 26 September 2017 (UTC)

Soweego integration with Primary Sources Tool is not working?

I have enabled the Primary Sources Tool (PST) per Wikidata:Primary_sources_tool#Give_it_a_try. Every time I click the resulting "Random soweego item" link, I am taken to Chase Holfelder (Q22278221), which shows me no new (i.e. blue-highlighted) claims to accept or reject. Other sources for the PST also exhibit the first problem (i.e. taking me to a given item every time I click the "Random <source> item" link, rather than actually taking me to a random item from that source), so this first problem is probably due to a bug in the PST. However, those others sources do not suffer from the second problem, i.e. they do at least give me a claim to accept or reject.

Without seeing examples of the new claims soweego is going to suggest for addition to Wikidata, it is hard to assess the value of the proposal. Zazpot (talk) 19:43, 26 September 2017 (UTC)

@Zazpot: thanks for trying the proof of work, which is actually working as expected. Probably you didn't see the blue-highlighted statement because identifiers are located at the bottom of item pages. More specifically, in the case you mentioned, it is the very last statement.
To quickly grasp the content of the soweego prototype dataset, I suggest browsing it via the primary sources list (cf. d:Wikidata:Primary_sources_tool#Give_it_a_try, step 5.1). You just have to click on the sidebar link in the Tools section, then on the load button. The Random link works well for large datasets, due to the underlying random algorithm used (cf. https://github.com/Wikidata/primarysources/issues/57).
If you still want to play with the item-based view of the tool, I recommend trying an experimental feature, which facilitates browsing blue statements through a sidebar menu. The feature is not deployed yet, but you can use it if you add the following line to your common.js page (d:User:Zazpot/common.js for you):
importScript( 'User:Hjfocs/browse_pst.js' );
Hope this helps. Cheers,
--Hjfocs (talk) 10:34, 27 September 2017 (UTC)
Hjfocs, there really doesn't seem to be any blue-highlighted statement at all on that page. Please see this screenshot.
Thanks for the link to the bug report about non-random entries. Good to know it wasn't just me, and that the devs are aware of the issue.
Thanks also for the suggestion of the experimental feature, but I normally don't allow JavaScript execution at all, and I'd certainly rather not execute ~2000 lines of experimental JavaScript, in a logged-in browser session, without reviewing it, which I'm afraid I can't promise to do any time soon :( As an alternative, maybe you could add some screenshots to your grant proposal illustrating the intended soweego/PST workflow and some example claims? Thanks again, Zazpot (talk) 11:24, 27 September 2017 (UTC)
@Zazpot: I understand now why you are not able to use the tool: it's a Wikidata gadget, so it's written in JavaScript. You should allow its execution in your browser to see it working. I will update the tool instructions accordingly.
Concerning the screenshots in the proposal, I agree it's a good idea, but I'm not sure I can update the proposal page, as we are past the submission deadline.
Cheers,
--Hjfocs (talk) 13:20, 27 September 2017 (UTC)
Hjfocs, apologies: I assumed you would understand, from the fact that I was able to use the gadget at all, that I had enabled JavaScript in order to do so. To be clear: the JavaScript of the PST was indeed executing, which is why I was able to use the gadget (otherwise, it would not have shown up in the left sidebar, etc) and is also why the blue background was present for one claim per entry when I used the PST with sources other than soweego. The lack of JavaScript wasn't what caused a blue background to be absent from all the claims in Chase Holfelder (Q22278221); I still don't know what caused that problem, sorry.
As for adding screenshots, I guess you could ask the WMF staff if they would be OK with you continuing to improve the proposal. Zazpot (talk) 20:06, 27 September 2017 (UTC)
@Hjfocs:, one of your colleagues has updated the proposal today. With that precedent, perhaps you could also address my screenshot request? (If you would also fulfil the remaining three out of the four requests in my original opposition comment, that would be great, too!) Zazpot (talk) 15:43, 2 October 2017 (UTC)
  • @Hjfocs: I got Primary Sources enabled, and I also don't see any blue info on Chase Holfelder (Q22278221). Furthermore, in the Primary Sources list I don't see any info in the soweego dataset, and all VIAF IDs come from the freebase-id dataset, which is structured (no ML stuff there) --Vladimir Alexiev (talk) 11:46, 2 October 2017 (UTC)
@Vladimir Alexiev: thanks for your notification. Can you please try again? We experienced variable behaviors among different users, but it should be working for everyone now.
Cheers,
--Hjfocs (talk) 14:19, 2 October 2017 (UTC)
@Hjfocs: Now "Random primary sources item" shows blue stuff. When I select "soweego" as dataset, "Random soweego item" shows Oli Silk (Q7086766) and shows a blue id (MusicBrainz), but always goes to that item (i.e. no randomness). When I click "Primary Sources List" and select "soweego" as dataset, I always get MusicBrainz artist IDs, not VIAF ID. Cheers! --Vladimir Alexiev (talk) 17:48, 2 October 2017 (UTC)

┌─────────┘
@Vladimir Alexiev: I'm happy to see that the soweego proof-of-work dataset uploaded to the primary sources tool is working fine.

Cheers,
--Hjfocs (talk) 08:56, 3 October 2017 (UTC)

Vladimir Alexiev, Hjfocs, I am now getting the same results as Vladimir Alexiev. Thanks for getting that working.
Incidentally, w:Oli Silk listed what purported to be the subject's location and date of birth, in apparent violation of w:WP:DOB. (I have now removed that information and contacted w:WP:OVERSIGHT to suppress it, for the subject's privacy.) It's only a bit of a stretch to say that I would never have known Oli Silk's date of birth if it weren't for Soweego... Yes, this happened via Wikipedia, and yes, w:WP:BLP is meant to stop it, and no, Soweego did not use Silk's DOB in any obvious way, but even so, this rather emphasises the privacy concerns expressed elsewhere. As a community, Wikipedians need to be more alert to these risks, and need in general to create tools to improve people's privacy and other human rights, rather than to reduce them. Access to knowledge, yes; except where a completely legitimate privacy concern exists.</soapbox> Zazpot (talk) 17:39, 3 October 2017 (UTC)
@Zazpot: Your concern with privacy is admirable, but may I just say that if we censor full name, DOB (maybe also profession?), that will make authority coreferencing nigh impossible. I'm not sure how my identity can be stolen if someone knows my DOB... but maybe that's because I don't live in an idiotic country like the US ;-) --Vladimir Alexiev (talk) 08:26, 4 October 2017 (UTC)
Vladimir Alexiev, with full name and DOB it used to be possible in at least some countries (e.g. UK and New Zealand) to obtain a birth certificate. Until 2007 in the UK, (and maybe still today in other places?) this could then be used to obtain a passport if the birth certificate's genuine subject did not already have one. Even without that loophole: some institutions use DOB as an authenticator, e.g. for account recovery, usually in conjunction with a very small number of other, often quite readily discoverable pieces of information (email address, &/or phone number, &/or last 4 digits of bank card number, etc). Evidence: [1] [2] [3] [4] [5] [6]. Also, people often use their DOB as their PIN, according to research from the University of Cambridge.
Disambiguation for notable people is typically still very easy without DOB, by seeing whether the two records match on what the subject is notable for. In the case of Oli Silk, I was able to confirm the MusicBrainz page referred to the same person as the Wikipedia article, because the discographies on both contained the same album names, and it seemed very unlikely that there would be two Oli Silks in the world with such similar discographies. Yes, this kind of matching is a manual process and non-trivial to automate. I'm fine with that at the moment. There are other, much more useful things for coders to be automating/fixing right now, and on which I think WMF funds would be better spent. See Wikimedia's bug tracker for myriad examples. Zazpot (talk) 10:34, 4 October 2017 (UTC)
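For what it's worth, the manual discography check described above could be approximated with a simple set-overlap score. This is only an illustrative sketch: the album titles and the 0.5 threshold are assumptions, not part of any proposed implementation.

```python
# Sketch: treat two artist records as a match when their album-title sets
# overlap enough. Titles and threshold are invented for illustration.

def normalize(titles):
    return {t.strip().lower() for t in titles}

def discography_overlap(a, b):
    """Fraction of the smaller discography shared by both records."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

wikipedia_albums = ["So Many Ways", "The Limit's the Sky"]
musicbrainz_albums = ["So Many Ways", "The Limit's the Sky", "Other Album"]
score = discography_overlap(wikipedia_albums, musicbrainz_albums)
same_artist = score >= 0.5
```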

IMPORTANT: revision of target databases

The Soweego proposal has been revised to link GND and the Library of Congress instead of Twitter and Facebook.

In the first draft of Soweego, we were planning to (also) generate links from Wikidata items for persons (not Wikidata registered users!) to their corresponding Facebook and Twitter accounts, motivated by our belief that linking public persons (and, generalizing, organizations, events, pieces of work) in Wikidata to their social media accounts may be of benefit to the Wikidata ecosystem.[1][2]

This has raised some concerns in the community[3][4] regarding the privacy of linked persons, which we tried to address in the proposal and its talk page. In particular, the main concern associated with the proposed linking activity (we exclude any form of data import and/or interaction on social media in Soweego) regards the risk of de-anonymizing a private social media account by linking it to the person who owns it on Wikidata, a practice which may be unlawful and even harmful for the involved person, as pointed out in the received feedback.

Clearly, de-anonymizing a private account is not our intention, and even if we see this risk as unlikely - such private/anonymous accounts are difficult or impossible for the software to find and match, and besides, the resulting links would have to be manually approved - we believe there are several practical solutions for minimizing or completely avoiding that risk (e.g., restricting to persons also in Wikipedia and to verified accounts). However, we acknowledge the need for further discussion with the community on this topic, aimed at identifying and setting clear criteria for the ingestion of those links that may address all the concerns expressed, in relation to Soweego but also to the 88K links to Twitter and Facebook accounts that are already present in Wikidata.

We thus decided to discard Twitter and Facebook as linking targets (as suggested) and to consider in their place GND and the Library of Congress. Together with the already planned VIAF and MusicBrainz, these resources represent the most used external catalogs in Wikidata (for persons), and nicely cover the broad domain of writers and music artists.

This revision does not change the overall idea and proposed linking framework, which has been conceived to be extensible and to support linking to additional datasets (e.g., via community contribution). At the same time, the external datasets that we now consider are all non-commercial and either community-based/peer-reviewed or controlled by authoritative bodies, as suggested. This tackles some of the concerns raised and gives Soweego a chance to demonstrate its benefits, while avoiding the controversial and non-essential (for the goals of Soweego) point of linking to social media.

As expressly asked, we commit to avoid generating any link from Wikidata person items to their social media accounts as part of Soweego (or any possible extension of it) unless clear criteria for their ingestion into Wikidata are agreed upon by the community through the discussion we advocate for.

--Fracorco (talk) 15:35, 27 September 2017 (UTC)

Very much agree that Wikidata should have a much better BLP policy. Discussions have been ongoing, but have not yet been very productive... :-( See d:Wikidata_talk:Living_people_(draft) and d:Special:Search?search=blp&prefix=Wikidata:Project+chat/&fulltext=Search+the+archives&fulltext=Search. (The fact that Wikidata doesn't yet have a good BLP policy is obviously not carte blanche to use it to make arbitrary claims about living people.) Zazpot (talk) 22:28, 27 September 2017 (UTC)
I think ditching Facebook & Twitter is a good move. However, isn't there a lot of redundancy between VIAF, LOC and GND? LOC and GND are already harvested by VIAF, so I am worried that you would end up duplicating OCLC's work, and therefore not bringing much value. Wikidata already has tools to pull identifiers from VIAF once an item has been linked to a VIAF id. − Pintoch (talk) 08:04, 28 September 2017 (UTC)
@Pintoch: collaboration with OCLC is indeed planned.[5] In addition, the VIAF linker is scheduled as the first and most important task in terms of effort.[6] To rephrase what you said ("pull identifiers from VIAF once an item has been linked to a VIAF id"): only after we ensure exhaustive linking between Wikidata and VIAF (which is currently not the case)[7] can we consider leveraging the existing effort by OCLC you mentioned. If OCLC confirms that they have complete mappings to GND and LoC, that would definitely alleviate our efforts for the GND and LoC linkers,[8] and we'll be happy to amend the planned ones.
Thanks for highlighting that working directly with OCLC is even more crucial.
Cheers,
--Hjfocs (talk) 16:59, 28 September 2017 (UTC)
I am not sure why you consider that pulling identifiers from VIAF can only be done after full coverage of the VIAF database? Anyway, this is already happening, for instance with the Authority Control gadget, right? I do not see why we should wait for full coverage before doing that. Or maybe I misunderstood what you meant by only after? − Pintoch (talk) 09:41, 29 September 2017 (UTC)

┌─────────────┘
@Pintoch: I tried to rephrase what you said: "once an item has been linked to a VIAF id". So, first you have to link a Wikidata item to a VIAF ID. Only after that can you eventually pull other identifiers from VIAF.

The gadget you mentioned is indeed related work, as it enables linking to VIAF. However, it works one item at a time, and is totally manual. You can think of soweego as the authority control gadget on steroids: it potentially links all items at a time, in an automatic (when confidence is high) or semi-automatic (when confidence is low)[9] fashion.

Cheers,
--Hjfocs (talk) 16:46, 29 September 2017 (UTC)

Coverage statistics

I read «even the most used external catalog[7] (VIAF)[8] is only linked to 25% of people in Wikidata», but I don't see any explanation of why this would be a problem. It might be that 75% of people on Wikidata are not authors and don't belong in VIAF; also, not all VIAF people belong in Wikidata, and VIAF has many duplicates. Rather than the totals, it would be more useful to have some sampling, like "out of a sample of 250 random person items, x% could be linked to a VIAF record but were not yet". Also, if the goal is to make Wikidata a bridge between the different authority files, it would be more useful to know how many of the known GND identifiers can be translated to a VIAF identifier, and so on. --Nemo 14:47, 28 September 2017 (UTC)

@Nemo bis: thanks a lot for your constructive feedback.
  • We opted for absolute rather than relative statistics because we believe a random sample of, e.g., 250 items (as you suggested) out of 3.6 million[10] would be highly anecdotal.
That said, we would need to implement a baseline VIAF linker to address your comment in a statistically significant way, i.e., to estimate the ratio of VIAF identifiers that are missing in Wikidata and can be linked. We are working on it anyway, since we see the task as a first investigation step for T2.[11] At the time of writing the proposal, our hypothesis was that VIAF is a much bigger superset of Wikidata,[12] thus having the potential for exhaustive coverage of Wikidata people, although we recognize it is very hard to obtain estimates of the people in VIAF at this phase.
  • With respect to integration between GND/LoC records in VIAF, we currently refer to the following statements:
    • GND has provided about 9.5 M (7.4 M people) authority records to VIAF,[13] although it contains "approx. 14 M";[14]
    • LoC has provided about 9 M (6.3 M people) authority records ("not necessarily loaded by VIAF"),[15] although it contains "over 8 M descriptions".[16]
These seem to be evident mismatches that need further investigation on the data dumps. All but one statement[14] are outdated, i.e., from 2013 (assuming "2103" is actually 2013,[13] otherwise somebody wrote it in the future :-) ). Wikidata was a 1-year old baby when these statements were made, and probably things have changed a lot since then.
Cheers,
--Hjfocs (talk) 11:03, 29 September 2017 (UTC)
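The sampling approach under discussion can be sketched as follows: estimate the share of linkable-but-unlinked person items from a random sample, with a normal-approximation 95% confidence interval. The labels here are simulated for illustration; in practice each sampled item would be checked manually against VIAF.

```python
# Sketch: estimate the proportion of Wikidata person items that could be
# linked to VIAF but are not yet, from a random sample. Labels are
# simulated (1 = linkable but unlinked, 0 = otherwise).

import math
import random

random.seed(42)

sample = [1 if random.random() < 0.3 else 0 for _ in range(2500)]

p_hat = sum(sample) / len(sample)                                  # point estimate
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))       # 95% CI half-width
```

With a few thousand items, the margin of error drops below two percentage points, which supports Nemo's point that a sample in the low thousands would already be representative.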
If 250 is anecdotal, make it 3500 or whatever number is needed to be statistically representative (probably not more than thousands). Yes, the national libraries generally contribute only their "best" authority records to VIAF, because the rest is garbage. In the case of SBN, only some 5k records were sent to VIAF (meaning SBN believes over 99% of its authority files are garbage), and even those were only minimally clustered to VIAF records (so they're useless). Is your plan to put the lower-quality authority records to use, even when they are private? What is the expected gain from lower-quality records? --Nemo 08:06, 3 October 2017 (UTC)
  • I asked the same (see #Proposal shortcomings items 1,3,5). I believe that 75-80% of WD persons should be linkable to VIAF. VIAF has at least 370k more links to WD (30% more!) that just need to be imported.
GND and LOC are almost completely covered in VIAF. viaf-links-20161208.txt.gz shows 16.2M "LC" links (LoC) and 26.5M "DNB" links (GND), so don't believe the VIAF webpages. That was 10 months ago; now there can only be MORE links. If these are imported, this will raise GND and LOC coverage to 900k-1M each. --Vladimir Alexiev (talk) 18:07, 2 October 2017 (UTC)
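Such per-source counts could be derived from a links dump along these lines. This sketch assumes a tab-separated format "<VIAF URI>\t<SOURCE|ID>"; the format assumption and the three sample lines are invented for illustration, not taken from the actual dump.

```python
# Sketch: count links per source in a VIAF-style links dump, assuming
# each line is "<VIAF URI>\t<SOURCE|ID>". Sample lines are invented.

from collections import Counter
from io import StringIO

dump = StringIO(
    "http://viaf.org/viaf/114969074\tLC|n50046738\n"
    "http://viaf.org/viaf/114969074\tDNB|123456789\n"
    "http://viaf.org/viaf/305461157\tLC|no2009021087\n"
)

counts = Counter()
for line in dump:
    _, source_id = line.rstrip("\n").split("\t")
    source = source_id.split("|", 1)[0]   # e.g. "LC", "DNB"
    counts[source] += 1
```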

Reply to Jarekt's endorsement

@Jarekt: thanks for your neutral endorsement. Let me address below the concerns you raised:

  • I am not convinced that it can be done automatically
The proof of work[17] is there for you to verify that the approach can be automated. We decided to upload the dataset into the primary sources tool for demonstration purposes, as we needed a way to highlight that those statements originate from soweego. Moreover, the participants of this proposal have developed a related approach, which has already been evaluated and accepted by the research community through peer reviews on relevant conference publications.[18]
  • why build a new tool when often once you find VIAF page it already has a link to Wikidata
Because links may not be bi-directional, and we need a way to ensure that each link is consistent from both sides. That motivates the need for both live maintenance[19] and merging[20] mechanisms.[21]
  • it should be easier to search VIAF, RKD, Benezit, MusicBrainz and all the other databases for a match and then ask a human to approve or reject them.
That's exactly the main goal of soweego: when confidence is low, we allow users to curate the results via the primary sources tool.[22]
That's because version 1 is still deployed as the official gadget. The uplift proposal has been posted,[23] and the StrepHit team is actively working on it, so please stay tuned.[24]

Cheers,
--Hjfocs (talk) 17:59, 29 September 2017 (UTC)

Hjfocs, you point to the Grants:Project/Hjfocs/soweego#What:_proof_of_work section as a means for reviewers to verify your claim that the Soweego approach is effective. This is problematic because, of the four sources in that section:
  • one is broken (Soweego's integration with the Primary Sources Tool; see discussion elsewhere in this page);
  • one is self-published (futuro.media), so is of limited credibility;
  • one is in a closed-access periodical, so cannot be assessed by most Wikimedia-related volunteers when reviewing your grant application; and
  • the remaining one is not published, so cannot be assessed by anybody reviewing your grant application.
That being so, I don't think you should be surprised if people are not convinced by your claims at this time. I think it would be helpful for reviewers if you could rewrite that section with better sources. Zazpot (talk) 18:26, 29 September 2017 (UTC)
@Zazpot:, concerning your last three points above (I'm skipping the issue with the Primary Sources tool that is discussed elsewhere):
  • Reference #17 http://sociallink.futuro.media/ self-published: this is a plain link pointing to our website, where we provide access to all the material about SocialLink (code, dataset, documentation, papers).
  • Reference #18 https://doi.org/10.1145/3019612.3019645 not accessible: we used a URL that requires a subscription to the ACM Digital Library. The paper is actually publicly accessible from the SocialLink website by clicking a special "author link" on this page (only that link works, due to the publisher's policy).
  • Reference #19 not published: this refers to a paper that has just been accepted and will be presented this month at ISWC, in Vienna. We put the accepted manuscript online on the SocialLink website; it is essentially the same paper as the final version to be published by Springer (only the formatting differs), which we cannot post online yet, in accordance with Springer's self-archiving policy.
We fixed the last two points in the proposal. Thanks for pointing them out.
--Fracorco (talk) 14:06, 2 October 2017 (UTC)
Fracorco, thanks for making those improvements. It would be great to see the PST integration working as described in the proposal, too. Zazpot (talk) 15:39, 2 October 2017 (UTC)
The primary sources tool works as expected. --Hjfocs (talk) 17:15, 3 October 2017 (UTC)

Proposal shortcomings[edit]

Authority coreferencing is extremely important to me (see Wikidata:WikiProject_Authority_control: notified that project so people can comment here). Regretfully, I think this proposal has many flaws, see below --Vladimir Alexiev (talk) 09:34, 2 October 2017 (UTC)

1. What is your estimate of what the total number of matches should be to VIAF, LCNAF and GND?

2. Before employing AI methods, I think you should first promulgate identifiers across structured sources

> VIAF is only linked to 25% of people in Wikidata, circa 910 thousand

3. Looking from the VIAF side, viaf-links-20161208.txt.gz had

  1.29M "WKP" links (Wikidata)
  6.27M "Wikipedia" links (various languages)

(Note that these are different, e.g. http://viaf.org/viaf/100144403 has Wikipedia@https://pt.wikipedia.org/wiki/Teresa_Magalhães and WKP|Q17280389.)

So importing these links to WD will bring in at least 370k links, but perhaps a lot more (because VIAF has progressed in the last 10 months and because some of the Wikipedia links may not be reflected as WKP links)
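The cross-check behind this estimate can be sketched in a few lines of Python. This is a minimal sketch, assuming each line of the viaf-links dump pairs a VIAF URI with a source-prefixed link (WKP|Q… for Wikidata, Wikipedia@URL for Wikipedia articles), as in the example above; the exact dump format may differ:

```python
from collections import defaultdict

def analyse_viaf_links(lines):
    """Count links per source and find VIAF clusters that have a
    Wikipedia link but no WKP (Wikidata) link: candidate imports."""
    counts = defaultdict(int)
    wkp, wikipedia = set(), set()
    for line in lines:
        viaf_uri, _, link = line.rstrip("\n").partition("\t")
        # 'WKP|Q17280389' or 'Wikipedia@https://pt.wikipedia.org/wiki/...'
        source = link.split("|")[0].split("@")[0]
        counts[source] += 1
        if source == "WKP":
            wkp.add(viaf_uri)
        elif source == "Wikipedia":
            wikipedia.add(viaf_uri)
    candidates = wikipedia - wkp  # Wikipedia link, no Wikidata link yet
    return dict(counts), candidates

# Toy sample in the assumed format
sample = [
    "http://viaf.org/viaf/100144403\tWikipedia@https://pt.wikipedia.org/wiki/Teresa_Magalhães",
    "http://viaf.org/viaf/100144403\tWKP|Q17280389",
    "http://viaf.org/viaf/15873\tLC|n79021164",
    "http://viaf.org/viaf/999\tWikipedia@https://en.wikipedia.org/wiki/Example",
]
counts, candidates = analyse_viaf_links(sample)
```

Running this over the full dump would give the per-source totals and the set of clusters whose Wikipedia links are not yet reflected as WKP links.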

> The problem is even more evident when looking at the Library of Congress (LoC) and GND: coverage 11%

Both LOC and GND are full members of VIAF, so have you thought about promulgating identifiers across?

4. Of 413k WD entities with LCNAF, 2.1k don't have VIAF. Of 410k with GND, 9.6k don't have VIAF. Both these can increase VIAF coverage.
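The promulgation logic in points 4-5 can be sketched as follows. P214 (VIAF ID) and P244 (Library of Congress authority ID) are the real Wikidata properties; the data shapes and sample values are hypothetical:

```python
def promulgate_viaf(wd_items, lc_to_viaf):
    """Derive VIAF IDs (P214) for Wikidata items that have a Library of
    Congress authority ID (P244) but no VIAF ID yet, using VIAF's own
    LC partner mapping."""
    derived = {}
    for qid, ids in wd_items.items():
        if "P214" in ids:
            continue  # VIAF ID already present
        lc_id = ids.get("P244")
        if lc_id in lc_to_viaf:
            derived[qid] = lc_to_viaf[lc_id]
    return derived

# Hypothetical sample data: item Q1 gains a VIAF ID through its LC link.
wd_items = {
    "Q1": {"P244": "n79021164"},                   # LC only: candidate
    "Q2": {"P244": "n80001234", "P214": "15873"},  # already has VIAF
    "Q3": {},                                      # no authority IDs
}
lc_to_viaf = {"n79021164": "100144403"}
derived = promulgate_viaf(wd_items, lc_to_viaf)
```

The same pattern works in the opposite direction (deriving LC or GND IDs from existing VIAF links), which is what point 5 argues would be far more productive.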

5. The opposite direction will be a lot more productive. viaf-links-20161208.txt.gz had

  16.2M "LC" links (LoC)
  26.5M "DNB" links (GND)

If these links are imported, I bet this will increase both LoC and GND links in WD to 900k-1M each.

6. I checked the "Primary Sources list" tool described in the proposal:

  • the soweego dataset shows no data
  • VIAF ID (P214) loads a number of statements, but they all are from freebase-id, which is a structured dataset. The first item (Lisa Gardner) already has a VIAF ID https://www.wikidata.org/wiki/Q1028373#P214 imported from French Wikipedia, and the Primary Sources tool doesn't suggest any new reference

7. > these relevant subjects will be engaged accordingly: OCLC organization, responsible for VIAF

What is your realistic plan for doing that? If that was easy, why hasn't WD fully engaged with VIAF until now? If it had, WD wouldn't be missing 30% of the links present in VIAF. And I bet that VIAF is missing a lot of links present in WD. Also see problem 10.

8. > Project metrics: soweego datasets should cover 2.1 million people with no identifier... we cannot estimate the size of the non-confident dataset... 20,000 curated statements

20k curated matches is ridiculously low, given the number of links that can be promulgated exactly (see 2-5) and the number of curated matches in Mix-n-Match (I'd guess at least 2-3M person links)

9. The proposal doesn't describe any interaction with Mix-n-Match, the premier global authority matching tool. Despite Magnus' approval above, I think this is a major defect. The project goal of "25 new primary sources tool users" is very low: I'd guess Mix-n-Match has over 3k users, and maybe 1k active.

10. The proposal doesn't address the major problem of authority matching, which is how to keep a network of authority efforts in sync. Person data and links imported automatically from one authority are not corrected/removed automatically if the authority changes. Therefore manual corrections on the VIAF or WD side are not guaranteed to be preserved. Fixes on one side are not promulgated to the other side with any timelines, so fixes are often overwritten by bots that re-instate the same error. Any human effort should be treated as gold, but currently it's treated with disregard.

A major reason for this is the low level of cooperation between VIAF and WD (see 7). It's still hard to describe an "intellectual investigation" about identity or attribution of person-related records in WD (sourcing statements are not enough to describe complex investigations), exacerbating the problem.

11. Machine Learning algorithms should learn incrementally from judgements and especially from corrections described in 10, which I believe is also not addressed.

I would accept the proposal if it first addresses the easy problems (2-5), demonstrates the matching algorithm adequately (6), and works in strong concert with Mix-n-Match (9). Your Machine Learning matcher should replace the Mix-n-Match automatic matcher (assuming yours is better), but leave curation to Mix-n-Match. —The preceding unsigned comment was added by Vladimir Alexiev (talk) 11:40, 2 October 2017 (UTC)

  • "Any human effort should be treated as gold, but currently it's treated with disregard." I love that comment, I will have to remember it! Just to state a bit more clearly (given my weak support vote on the main page) the concerns expressed above by Vladimir Alexiev are close to my own reservations about automation here, and I would urge the proposers to not overpromise or attempt to do too much; let's get the simple things done first. ArthurPSmith (talk) 13:31, 2 October 2017 (UTC)

My comments[edit]

  • I agree with not automatically importing Facebook/Twitter identifiers. ChristianKl (talk) 11:20, 3 October 2017 (UTC)
  • The problem that's supposed to be solved seems to be that VIAF is only linked to 25% of the Wikidata entries and there's a desire for exhaustive coverage (100%). I'm not aware that VIAF has a desire to create an item for every Wikidata entry that represents a person. On the other hand given our current way of dealing with notability, I'm not sure that it's a good idea to automatically create a Wikidata entry for all VIAF entries. For how many people do we currently have entries in VIAF and in Wikidata without there being any link between them? ChristianKl (talk) 11:20, 3 October 2017 (UTC)
    • I don't think anyone says 100%. But I think 90% of WD people are in VIAF; whereas in the opposite direction I agree WD shouldn't be creating all VIAF people just for the heck of it (and that's not proposed). --Vladimir Alexiev (talk) 13:42, 3 October 2017 (UTC)
      • The Project goals of the proposal do say "ideally 100%". ChristianKl (talk) 21:14, 3 October 2017 (UTC)
  • When the key of the proposal is about having more VIAF links, it would likely be good to talk more to the VIAF people before starting this project and get their input. ChristianKl (talk) 11:20, 3 October 2017 (UTC)
    • Agree. The first task before any AI stuff should be to cross-import links from one system to the other (VIAF has at least 370k WD links that are not in WD! And maybe 500k). Also leverage DNB & LOC links that WD has but doesn't have VIAF for that record. Much stronger coordination with VIAF is needed, and I don't know how that group can achieve it. --Vladimir Alexiev (talk) 13:42, 3 October 2017 (UTC)
  • Given that both GND and LoC have VIAF references I don't see the point of training an AI to do matches between Wikidata and GND/LoC. Can you tell us more about why you consider that a good investment of development resources? Databases I would consider to be more interesting are what we currently link with our Bloomberg person ID property. The Bloomberg database has a lot of interesting data that's often hard to access and is currently massively underlinked. ChristianKl (talk) 11:20, 3 October 2017 (UTC)
    • ML is needed to discover more matches. After cross-importing, there'll still be maybe 40-50% potential links to be added --Vladimir Alexiev (talk) 13:42, 3 October 2017 (UTC)
  • MusicBrainz seems to be already well-referenced and I think the Primary Source tool frequently offers suggestions. Why is there a need for more work? ChristianKl (talk) 11:20, 3 October 2017 (UTC)
  • I'm not sure what the Merger & Consistency Checker is supposed to do. Currently, both Wikidata and VIAF have duplicate entries. What's supposed to be done when the tool finds entries that aren't 1-to-1 matches? It would be valuable to have the Primary Source tool suggest Wikidata items that can be merged, but I don't see a reason why this should be focused on VIAF, GND or LoC entries. ChristianKl (talk) 11:20, 3 October 2017 (UTC)

Slightly premature?[edit]

I agree with a lot of the comments made above, but I'd like to add a few more points:

  • soweego plans to use the Primary Sources Tool. I know that a new version of this tool is expected to be released at some point, but it seems a bit risky to base a proposal on a tool that is currently not very usable. I do hope that the overhaul that has started recently will be a success, but this overhaul is a big project and still at a very early stage. So I think it would be safer to wait for the tool to work as expected before committing to objectives that are intrinsically bound to that tool (number of statements added and number of new PST users).
  • I am a bit curious about the objectives: 20,000 curated statements for just 25 new users of the tool? I assume you expect that current users of the tool will also get involved in the soweego dataset, but still, that makes a lot of edits per user! I am myself a PST user and I only add statements when I happen to edit items where they are proposed. I did not count how many times I curated statements, but the figure is clearly far below this order of magnitude. Maybe you will find more diligent users, but I don't know many people who would be happy to curate identifiers for thousands of items manually.
  • Like others, I am worried by the statistics given earlier to justify the proposal: the proportion of human Wikidata items which currently have a VIAF id is really not the right metric! This is very misleading about the potential progress that can be made. I trust the good faith of the proposers so I suspect that it was just an unfortunate mistake. The decision to change the target datasets from Twitter and Facebook to GND and LOC at the last minute is also an indication that the proposal is just not quite ripe yet.
  • I also agree that in terms of authority control in Wikidata, there are much lower-hanging fruits than machine learning solutions. Pulling identifiers from VIAF components is one possibility. Researching how to solve the problems that arise when two databases mutually harvest each other and propagate errors is another… It may not be as intellectually appealing as machine learning, but it would be more useful in terms of practical outcomes.

Overall, this project is quite similar to StrepHit, so I think it would be worth taking advantage of what we have learned from that project. In the renewal application for StrepHit, you wrote that "StrepHit and its sister datasets are simply unprofitable if [the Primary Sources tool] is unusable". I totally agree: machine learning is useless if we lack the basic infrastructure to integrate it with Wikidata. And I think there is still a lot of work to be done on that basic infrastructure, even beyond the PST. Wikibase is nice and stable (at least from a user perspective), but basically everything that gravitates around it is pretty much at alpha stage in terms of quality. I would love to see more grant proposals like Grants:IEG/Wikidata_Toolkit, which gave awesome deliverables. (Can anyone imagine not having the Wikidata Query Service around?) That requires solid software engineering practices to create first-class software that the community can build upon.

So, I think these machine learning projects are very exciting on the long term, but I would recommend waiting a bit before carrying them out as it looks a bit premature to me. − Pintoch (talk) 18:52, 3 October 2017 (UTC)

Eligibility confirmed, round 2 2017[edit]

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for round 2 2017 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through 17 October 2017.

The committee's formal review for round 2 2017 begins on 18 October 2017, and grants will be announced 1 December. See the schedule for more details.

Questions? Contact us.

--Marti (WMF) (talk) 22:00, 3 October 2017 (UTC)

Aggregated response to comments[edit]

We are very grateful to all the users who provided constructive feedback, which allowed us to further improve the proposal. We would like to clarify here a set of key points that were raised.

Problem statement and proposed solution, i.e., what the project does and what it does not:

  1. soweego addresses the reference gap problem in Wikidata,[25] not the coverage of e.g., VIAF;
  2. it is a general-purpose disambiguation tool, linking a source to a target,[26] at a large scale;
  3. the system itself is not bound to specific target databases.[27] Therefore, it does not necessarily focus on the addition of links to e.g., VIAF;
  4. the selection of in-scope targets definitely requires further community discussion;[28]
  5. the implementation of linkers for specific targets should maximize the impact in Wikidata. Hence, extensive catalogs that are trusted and widely used (such as VIAF) may be suitable target candidates;
  6. the given statistics are observations that merely capture current usage signals, and are not the main motivation that drives the proposal;[28]
  7. the maturity of the proposal is not tied to the development status of the primary sources tool, also because a bot is foreseen for direct import of confident data.[29]
  • Candidate tools for data curation
Generally speaking, we believe that the PST should be the standard way for Wikidata to host datasets that need curation, which include machine learning-based outputs. Concerning soweego, we deemed the PST to be generic enough (i.e., it deals with any statement, not just links to identifiers), thus being suitable for the scenario where the community may want to import other statements from the linked targets.
Furthermore, this proposal is actually in sync with the development roadmap of the PST, which counters the maturity argument:[30] from the grant program side, grantees would be announced on December 1;[32] the ingestion task would end in the 7th month.[33] Suppose an ideal setting where the project gets accepted and starts immediately after the announcement: this means that the non-confident dataset uploads would not be ready before the end of June 2018. The PST development is bound to the StrepHit project renewal,[34] which is scheduled for completion by the end of May 2018 (a 1-year time span, with project start on 22 May 2017).[35]
The tool is definitely an alternative option to bear in mind, especially if it has a large user base (we were not aware of the numbers reported by Vladimir_Alexiev). In any case, we can discuss this directly with Magnus_Manske, who created the tool and has accepted to act as an advisor for soweego.[38]
  • project metrics: 20K curated links are too few[39] vs. 20,000 curated statements for just 25 new users of the tool (...) that makes a lot of edits per user![40]
The rationale behind the choice of this threshold originates from local metrics gathered with StrepHit, where more than 127k total (i.e., not dataset-specific) statements were curated in 6 months.[41] Given that it is now possible to filter by dataset, and that the soweego ones are planned between months 4 and 12,[42] we consider it appropriate.
Moreover, the statements and users metrics are not correlated: the 25 new users are not necessarily those who would curate the 20k statements. Success signals are rather based on increased use of the tool.
Finally, our experience in the alignment of Wikipedia to Twitter shows that the time needed to validate a link may be in the order of 1+ minutes, which maps to 2+ person-months of community work to validate the 20K threshold.
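The person-month estimate above works out as follows, assuming roughly 8 working hours per day and 21 working days per month (the 1-minute figure is the lower bound stated above):

```python
links = 20_000
minutes_per_link = 1                   # lower bound from the Twitter alignment
hours = links * minutes_per_link / 60  # about 333 hours of validation work
person_months = hours / (8 * 21)       # ~8 h/day, ~21 working days/month
# person_months comes out at roughly 2, matching the 2+ figure above
```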
The argument was specifically raised for VIAF, but also applies to other databases (for instance, MusicBrainz has links to GND). In any case, we think this is definitely a relevant suggestion, and we are happy to include it as a sub-task for each linker.[46]
We argued in point 6 above that the given numbers are not the main problem tackled by this proposal.
With respect to Nemo_bis's request, we are thankful to Vladimir_Alexiev, who is certainly much more experienced than us with respect to VIAF, and has already partially responded.[47] We are happy to include the exhaustive answer as a sub-task of the target selection,[50] if the community indeed ranks VIAF as a high-priority candidate.
  • authority matching support[51]
The Validator component[52] should detect stale links in Wikidata (e.g., due to changes from one authority) and point them out. We would then need to decide with the community what to do next (e.g., manual curation, deal with bot changes). During the validation of existing links and the suggestion of new ones, we also consider global consistency across multiple sources (e.g., Wikidata item X links to target entry A, which links back to Wikidata item Y, which is different from item X).
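The cross-source check described above could look roughly like this. This is a hypothetical sketch with toy in-memory mappings, not the actual Validator implementation:

```python
def check_consistency(wd_to_target, target_to_wd):
    """Flag Wikidata-to-target links that are not bi-directionally
    consistent: item X links to target entry A, but A links back to a
    different item Y, or to nothing at all."""
    issues = []
    for item, entry in wd_to_target.items():
        back = target_to_wd.get(entry)
        if back is None:
            issues.append((item, entry, "no back-link"))
        elif back != item:
            issues.append((item, entry, f"back-links to {back}"))
    return issues

# Toy example: Q2's target entry points back to a different item,
# and Q3's entry has no back-link at all.
issues = check_consistency(
    {"Q1": "A", "Q2": "B", "Q3": "C"},
    {"A": "Q1", "B": "Q9"},
)
```

Each flagged pair would then be handed to the community for curation, or to a bot where the fix is unambiguous.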
  • engagement of OCLC[53]
We think that a major problem in the involvement of external databases curators is to keep their identifiers in sync with Wikidata. We believe that soweego can enable a more effective engagement process, since it ensures live maintenance of links, via regular link ingestions, rather than spot ones that require considerable human effort.

Cheers,
--Hjfocs (talk) 17:57, 4 October 2017 (UTC)

"general-purpose disambiguation tool": Your time planning suggests that it will take 10% of the effort of the whole project to add a new database. To me, that suggests it won't be easy to run the tool against hundreds of new external ID properties. Do you think it is realistic that the tool will eventually reach a form that allows adding a new external ID in a few hours of human work? ChristianKl (talk) 16:23, 18 October 2017 (UTC)

Related works not mentioned yet[edit]

Please have a look at the work by Joachim Neubert and me, e.g. Joachim's talk and paper Wikidata as a linking hub for knowledge organization systems? Integrating an authority mapping into Wikidata and learning lessons for KOS mappings, available at https://at-web1.comp.glam.ac.uk/pages/research/hypermedia/nkos/nkos2017/programme.html, and Mappings extracted from Wikidata at http://coli-conc.gbv.de/concordances/wikidata/. I must admit that I'm a bit biased against this proposal because I prefer small tools that can independently be combined (and replaced!) over yet another general application. Nevertheless, parallel efforts towards related goals will also bring us forward. -- JakobVoss (talk) 10:42, 6 October 2017 (UTC)

References[edit]

  1. https://meta.wikimedia.org/w/index.php?title=Grants:Project/Hjfocs/soweego&oldid=17265238#Outcomes (motivations in previous version of the proposal)
  2. A quick grep-based analysis of archived messages on Wikidata mailing list from Jan 2015 onwards shows 19 messages containing Twitter or Facebook in their subject (zgrep -iE "^Subject:.*(twitter|facebook)" *.txt.gz | wc -l on dump files) and 172 mentions of these social networks (zgrep -iE "twitter|facebook" *.gz | grep -vE "http|@" | wc -l where we exclude account names and links).
  3. #Privacy_issues_with_personal_information
  4. Grants:Project/Hjfocs/soweego#Endorsements
  5. Grants:Project/Hjfocs/soweego#Identifier_database_owners
  6. cf. T2 in Grants:Project/Hjfocs/soweego#Work_package
  7. Grants:Project/Hjfocs/soweego#Where:_use_case
  8. cf. T3 and T4 in Grants:Project/Hjfocs/soweego#Work_package
  9. cf. G3 in Grants:Project/Hjfocs/soweego#Project_goals
  10. SPARQL query 1
  11. Grants:Project/Hjfocs/soweego#Work_package
  12. https://www.oclc.org/content/dam/oclc/events/2016/IFLA2016/presentations/VIAF-Reflections.pdf#page=12
  13. a b http://viaf.org/viaf/partnerpages/DNB.html
  14. a b http://www.dnb.de/EN/Service/DigitaleDienste/Datendienst/datendienst_node.html
  15. http://viaf.org/viaf/partnerpages/LC.html
  16. http://id.loc.gov/authorities/names.html
  17. Grants:Project/Hjfocs/soweego#What:_proof_of_work
  18. Grants:Project/Hjfocs/soweego#SocialLink
  19. cf. D1 in Grants:Project/Hjfocs/soweego#Output
  20. cf. D3 in Grants:Project/Hjfocs/soweego#Output
  21. cf. G1 in Grants:Project/Hjfocs/soweego#Project_goals
  22. cf. G3 in Grants:Project/Hjfocs/soweego#Project_goals
  23. d:Wikidata:Primary_sources_tool#Primary_sources_tool_uplift_proposal
  24. phab:project/view/2788/
  25. Grants:Project/Hjfocs/soweego#Why:_the_problem
  26. Grants:Project/Hjfocs/soweego#Why:_the_solution
  27. Grants:Project/Hjfocs/soweego#Implementation
  28. a b Grants:Project/Hjfocs/soweego#Where:_target_selection
  29. Grants:Project/Hjfocs/soweego#Project_metrics
  30. a b #Slightly_premature.3F, point 1, currently not very usable
  31. Grants:Project/Hjfocs/soweego#Endorsements, neutral endorsement, not (...) usable at this stage
  32. Grants:Project, round 2 2017 schedule
  33. Grants:Project/Hjfocs/soweego#Work_package, T8
  34. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Primary_Sources_Tool
  35. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal/Timeline#Overview
  36. Grants:Project/Hjfocs/soweego#Linking_Wikidata_to_identifiers
  37. #Proposal_shortcomings, point 9
  38. Grants:Project/Hjfocs/soweego#Participants
  39. #Proposal_shortcomings, point 8
  40. #Slightly_premature.3F, 2nd bullet point
  41. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Final#Local_Metrics
  42. Grants:Project/Hjfocs/soweego#Work_package, T9
  43. #Proposal_shortcomings, points 2 to 5, promulgate identifiers
  44. #My_comments, 6th bullet point, cross-import links
  45. #Slightly_premature.3F, 4th bullet point, low-hanging fruits
  46. Grants:Project/Hjfocs/soweego#Work_package, T3 to T6
  47. a b #Coverage_statistics
  48. #Proposal_shortcomings, point 1, 3, 5, estimate links to VIAF, LOC, GND
  49. #Slightly_premature.3F, 3rd bullet point, statistics given earlier
  50. Grants:Project/Hjfocs/soweego#Work_package, T1
  51. #Proposal_shortcomings, point 10
  52. Grants:Project/Hjfocs/soweego#Implementation
  53. #Proposal_shortcomings, point 7

OpenLibrary and OCLC[edit]

Sort of related: http://blog.archive.org/2017/10/12/syncing-catalogs-with-thousands-of-libraries-in-120-countries-through-oclc/ --Nemo 06:05, 13 October 2017 (UTC)

Thanks Nemo_bis for the pointer, adding this related post too: http://www.oclc.org/en/news/releases/2017/201727dublin.html --Hjfocs (talk) 08:48, 19 October 2017 (UTC)

Aggregated feedback from the committee for soweego[edit]

Scoring rubric Score
(A) Impact potential
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
8.3
(B) Community engagement
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
7.3
(C) Ability to execute
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient ?
  • Do the participants have the necessary skills/experience?
7.8
(D) Measures of success
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
6.5
Additional comments from the Committee:
  • The project fits with Wikimedia's strategic priorities as it is intended to solve one of the main problems of Wikidata - a lack of references in its statements. It has a significant potential for online impact like its predecessor - IEG/StrepHit. If successfully completed it may be sustained, scaled, or adapted elsewhere.
  • Yes, it would solve a problem of data quality and improve a project
  • The approach - based on AI - is without doubt innovative. The potential impact is significant though not without risks. The success can be reliably measured, though measures of success should be improved.
  • The submitter is an experienced programmer and already worked for similar projects.
  • The scope can be accomplished in 12 months and the budget seems to be realistic. The participants have necessary skills as I can judge from the completed IEG/StrepHit project.
  • The project has one specific target community - the Wikidata community. On the other hand there is no uniform community support. Some community members object to collection of data from social networks. So, the community engagement should be better.
  • I am happy that this application has evolved/changed in response to comments from the community, specifically related to privacy concerns. At the same time, I think the changes have made the proposal less strong as the switch to new target databases happened at the last minute. While there is some good support for the project, many in the community see it as premature. Given the high costs, I would be inclined to not fund this round but encourage the applicants to resubmit next round.
  • The project is good but the budget is too high to give a simple yes, in my opinion a better evaluation by the tech staff of WMF would be required.
  • The proposal seems to be slightly premature, though I am willing to give it a chance. The participants should think about which online databases they should use, which tools they want to use for curation of the data before they are added to Wikidata (PST, Mix'n'match or some other), and what the proper measures of success of this project are.
  • This proposal is well made, addresses a serious issue of Wikidata, has strong community support and is partially funded by another stakeholder. However, the issues raised about privacy need to be very carefully addressed. I would suggest having someone (community, WMF?) work as an ethics advisor and veto any development that would pose a threat to the respect of privacy, which is one of our core values.

This proposal has been recommended for due diligence review.

The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.


Next steps:

  1. Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
  2. Following due diligence review, a final funding decision will be announced on March 1st, 2019.

Questions? Contact us.



Response to committee feedback[edit]

We would like to warmly thank the review committee for their valuable comments. Please find below our reply, which summarizes the points needing additional clarification.

  • Budget
    • the budget seems to be realistic
    • the budget is too high to give a simple yes
    • partially funded by another stakeholder
We have tried our best to propose a solution that would be as reasonable as possible, minimizing the impact on WMF. The hosting research center will cover more than 50% of the total workload, namely 20 out of 38 person-months. In terms of human resources, this translates to 100% commitment of 2 software developers and 50% commitment of the software architect. Cf. Grants:Project/Hjfocs/soweego#Budget.
  • Privacy
    • Some community members object to collection of data from social networks.
    • this application has evolved/changed in response to comments from the community, specifically related to privacy concerns. (...) the changes have made the proposal less strong
    • the issues raised about privacy need to be very carefully addressed.
    • I would suggest to have someone (community, WMF ?) working as an ethic advisor
The initial proposal idea involved mainstream social networks for two very practical reasons: the benefit for Wikidata (enabled by the high coverage potential) and the participants' specific experience (based on the SocialLink system, cf. Grants:Project/Hjfocs/soweego#SocialLink). That said, we have acknowledged and integrated the community concerns, removing those controversial targets. As a result, we are strongly convinced that reliable non-commercial identifier databases would not pose any privacy issues, for two simple reasons: they are public and trusted. Hence, we do not see any risk that an individual unwilling to publicly disclose his or her identity would have his or her privacy breached. In any case, we still believe that this also applies to social networks like Twitter, with public accounts and a built-in mechanism for trust.
Finally, we would be indeed pleased to collaborate with a Wikimedia ethic advisor to further discuss this sensitive aspect.
  • Measures of success
    • The success can be reliably measured, though measures of success should be improved.
    • The participants should think about (...) what are the proper measures of success of this project
Please cf. Grants_talk:Project/Hjfocs/soweego#Aggregated_response_to_comments, project metrics point, where we have tried to give extra details, besides the proposal section.
  • Targets: The participants should think about (...) what online databases they should use, what tools they want to use for curation
We totally agree, and have given high priority and weight to such activities. This especially applies to the database selection, which is the first planned task and has considerable effort allocated (cf. T1 in Grants:Project/Hjfocs/soweego#Work_package). In addition to the community, we think that discussions with our technical advisor Magnus Manske will be vital for the latter activity.

Cheers,
--Hjfocs (talk) 17:04, 17 November 2017 (UTC)

Target databases scalability[edit]

  • @Hjfocs: I already asked above about the concern of scaling to new databases. I don't think that a project whose only output is creating VIAF links (and links to VIAF-linked databases) is worth the requested 75k EUR. On the other hand, we currently add a lot of identifiers for new websites. Just last week we added identifiers for CGF athlete ID, OKS athlete ID, Gymn Forum athlete ID, World of O athlete ID, HOO athlete ID, Collective Biographies of Women ID, Snooker Database player ID, EUTA person ID, ChinesePosters artist ID, Vermont Sports Hall of Fame athlete ID, Alaska Sports Hall of Fame athlete ID, Radio Radicale person ID, Mémoire du cyclisme cyclist ID and Africultures person ID. All of those point to entries about people, and our current ability to link those records to our existing data is imperfect. A tool that needs lots of time to be customized for every database is of limited use. A tool that has the potential to evolve in a way where normal users can add new databases in a reasonable timeframe would, on the other hand, provide a lot of value. We could also have community processes to review the decision to add individual identifiers to the project.
If a tool that allows community members to easily add new databases is too much for the next step, but there's good potential to develop one once this tool is finished, I'm fine with funding this proposal for building basic capabilities.
Tools that allow us to interact well with long-tail databases that aren't already well referenced are more important than tools to link to well-referenced databases like VIAF. ChristianKl (talk) 18:07, 24 November 2017 (UTC)
Thanks a lot ChristianKl for the insightful observation. We would like to highlight once again the importance of the very first planned task, i.e., database selection, which carries substantial overall weight (cf. T1 in Grants:Project/Hjfocs/soweego#Work_package). The outcome of that task should be a trade-off between big databases that can potentially yield a coverage leap and the long tail of smaller ones.
In other words, T1 will play a key role in driving the implementation of the Linker components, keeping in mind the following typical compromise: the system can either be tailored to a few targets with high performance (few but good), or made generic enough to plug in lots of targets, at the cost of performance and maintenance.
Cheers,
--Hjfocs (talk) 16:14, 1 December 2017 (UTC)
Strategically, VIAF is already a well-staffed organisation that works on getting links to everywhere, and that includes having links to us. To the extent that we host data to which they don't already have access, they have incentives to step up their matching to us. When it comes to the long tail, however, there's a lot of data that's unlikely to get linked without help from computerized tools. Even if the quality of the resulting data is lower when the system is built to handle more database sources, I think that trade-off will be worth it.
Given that Wikidata is at its core a community project, it's okay for a tool to need some human maintenance. The key is to structure the tool in such a way that the creators don't have to do the maintenance themselves, but the community can help with it. If community participants are to add databases, that in turn creates a need to program a UI for the task. It doesn't have to be a UI that's easy to use, but creating an interface would be essential for getting people to interact with the tool. ChristianKl (talk) 18:13, 11 December 2017 (UTC)

Summarized response to #Privacy_issues_with_personal_information[edit]

We are immensely thankful to all the users who have brought up the privacy topic. We would like to outline our understanding below as a list of key aspects.

  1. We do look forward to having an ethics advisor on board, as suggested by the review committee;
  2. soweego does not aim at creating new items, but rather at linking existing ones;
  3. soweego can only disambiguate relatively explicit identifier pairs. As a fictitious example, it is not possible to link Tom Cruise (Q37079) to the MusicBrainz ID DonaldDuck789, which may be the one used by Tom Cruise (Q37079) for private purposes. Consequently, the alignment would most probably only involve target identifiers that are intentionally disclosed by their owners;
  4. if the alignment itself can potentially constitute a privacy breach, the same may hold for the 6 million people identifiers that are already available in Wikidata. The Validator component can take charge of this.

Cheers,
--Hjfocs (talk) 15:37, 1 December 2017 (UTC)

When it comes to privacy, we are currently developing a draft of a Living people policy at https://www.wikidata.org/wiki/Wikidata:Living_persons_(draft) . If it gets accepted in its current form, we will have two specific classes of properties that are especially privacy-sensitive. If the soweego tool were configured in a way that it doesn't add properties belonging to those classes to living people, I think the privacy concerns would be effectively addressed. ChristianKl (talk) 18:24, 11 December 2017 (UTC)

Round 2 2017 decision[edit]


Congratulations! Your proposal has been selected for a Project Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, 75,000 Euros.

Comments regarding this decision:
The committee is pleased to support this project, with the following caveats, given the talk page discussions and per the recommendations of Leila Zia:

  • A UX designer should review the look and feel before finalizing the software, to make sure it is compatible with the needs of Wikimedians
  • Revision of your measures of success to merge two components: 20,000 curated statements done by at least 25 new primary source tool users.
    • We don’t want 2 people entering 10,000 curated statements. If that measure is achieved with only 2 people and we lose 1 of them, we lose a lot of contributions. We want to have a sustainable community of people who use this tool and create content.
  • Develop a plan for syncing when external database and Wikidata diverge. It may be beyond the scope of the project to address this issue fully, but we would like you to think about how it might be addressed.
  • Continue to assess and manage the risks of addressing people topics. If risks can’t be mitigated, avoid people topics. Minimally, make sure a human is always associated with data imports related to people topics.


Next steps:

  1. You will be contacted to sign a grant agreement and setup a monthly check-in schedule.
  2. Review the information for grantees.
  3. Use the new buttons on your original proposal to create your project pages.
  4. Start work on your project!

Questions? Contact us.