Talk:Wikidata/Strategy/2019

From Meta, a Wikimedia project coordination wiki

Thank you for wanting to provide feedback on the vision and strategy papers for Wikidata and Wikibase.

General feedback and questions

Any chance these can be put into normal wiki pages, rather than PDFs? PDFs are harder to navigate, and the format means that we can't easily fix simple typos (like "aural" in the Vision paper). --Yair rand (talk) 17:28, 27 August 2019 (UTC)[reply]

Given that the goal is to have a shared vision, a wiki also gives more opportunity to actually get shared input. ChristianKl 14:58, 28 August 2019 (UTC)[reply]
I'll see if I can find the time, but realistically I won't be able to do it anytime soon and keep two versions of four documents in sync :( --Lydia Pintscher (WMDE) (talk) 16:59, 1 September 2019 (UTC)[reply]

Feedback and questions for the vision paper

Comments by ArthurPSmith

All the papers are great at saying what Wikidata (and Wikibase) is and how it's been developing and doing wonderful things. However, I was hoping to get a bit of clarity on where things are going; the clearest spot on this seemed to me to be the "Main roadblocks" section in the vision paper - but those mostly talk about things that are NOT good, without many specifics on how to fix them. For example, when you say "responsibility of developing the software should be shared by different actors in the ecosystem, allowing for the software to capture the needs of specific contexts advancing the whole movement", is the suggestion to have other Wikimedia projects add staff to work on Wikibase/Wikidata, APIs etc., to try to engage the volunteer community more extensively in development, or perhaps to look towards other entities (Apple and Google?) to do some of the software development? There also seem to be hints of an idea to split Wikidata from Wikibase in some way? ArthurPSmith (talk) 18:23, 27 August 2019 (UTC)[reply]

Yeah, this is a bit fuzzy still because it's open for exploration. How I imagine it is that we'll see small companies or other organisations in the Wikimedia movement pop up that focus on Wikibase for specific target groups. So there could be one entity that extends Wikibase for libraries specifically, developing tools like specialized input forms. Some of that might be generally useful and go upstream for everyone to use, while other parts might be too specific for the particular context and only be used there. How that'll all play out in reality I am not sure yet. --Lydia Pintscher (WMDE) (talk) 17:04, 1 September 2019 (UTC)[reply]

Comments by ChristianKl

The Sloan Foundation recently gave money for the development of Scholia. It would be great to have an ecosystem where the Sloan Foundation and the Knight Foundation could fund Wikidata/Wikibase-related projects, but at the same time you can't directly write into a strategic vision that they should. The vision is more of a document that can be used to approach entities like them and create grants. ChristianKl 19:37, 27 August 2019 (UTC)[reply]

Yeah. Are there specific things that you think prevent this from happening at the moment that we might need to address? --Lydia Pintscher (WMDE) (talk) 17:08, 1 September 2019 (UTC)[reply]

----

One important step is getting to an agreed-upon measure for data quality, as that currently does not exist.

I don't think that we need an agreed-upon measure for data quality. Standards for data quality are very context-dependent. In biomedical academia, citing primary sources for claims is a sign of high data quality; on Wikipedia, on the other hand, primary sources are a sign of low data quality. Interestingly, the following passage appears just a few paragraphs earlier:

This means that the structures that shape Wikidata and the knowledge that is created with this tool are modeled after our “western” way. If we want to live up to our ambition of creating equity we need to create structures within our movement that allow communities across the globe to shape our tools and knowledgebase according to their needs.

Agreed standards of data quality are about enforcing a certain (and likely Western) standard of data quality on everybody. Strategy is about deciding which alternatives you want and which you don't. You can't have the advantages of central quality standards and diversity in standards at the same time. That said, I don't think that the central problem of our community structures is a reliance on Western ways. Plenty of decisions within Wikidata also need to be made centrally, and that requires discussion in a common language, for which English is the only candidate. Instead of trying to write idealistic postmodern ideas about equity into the vision, I would support dropping the current paragraph on Community Structures. ChristianKl 19:37, 27 August 2019 (UTC)[reply]

Hmmm you have a point there. I hadn't connected those two so far. Any idea how to untangle this? Because both equity and data quality are things I get told again and again are very important (and I agree). --Lydia Pintscher (WMDE) (talk) 17:08, 1 September 2019 (UTC)[reply]

Make linked external datasets part of the picture

Wikidata has grown into an indispensable linking hub for both RDF and non-RDF data on the web. This linked external data benefits greatly from Wikidata for enrichment in applications, but it also greatly enhances the value of Wikidata, and Wikidata may draw on metadata from that linked external data (license permitting). So it should be part of the overall picture, as in this rough draft:

Wikidata ecosystem with linked external datasets
In the drawing, databases (DB) stand for all kinds of linked datasets: not only databases but also repositories, catalogs, portals, etc. - whatever provides a persistent identifier and has an external-id property. I'd suggest leaving out the special GLAM box, because it spans Wikibases and other linked datasets, and because GLAM is not so special (compared to, e.g., the bioinformatics community).

--Jneubert (talk) 09:33, 30 October 2019 (UTC)[reply]

Feedback and questions for the Wikidata for Wikimedia projects paper

Editable infoboxes on Russian Wikipedia

In the Wikidata for Wikimedia projects paper there is a box: "Editable infoboxes on Russian Wikipedia: Russian Wikipedians developed a gadget to allow editing of Wikidata’s data for a limited number of infobox types directly from their Wikipedia." Are there any links or examples related to this gadget? --Zache (talk) 13:57, 31 August 2019 (UTC)[reply]

@Putnik: Do you have a useful link maybe? --Lydia Pintscher (WMDE) (talk) 17:14, 1 September 2019 (UTC)[reply]
ru:Википедия:WE-Framework - maybe this? There is also a global gadget which works on every wiki. --Zache (talk) 02:32, 3 September 2019 (UTC)[reply]
Yes that's the one! I'll add links to the next version of the paper. --Lydia Pintscher (WMDE) (talk) 16:11, 1 October 2019 (UTC)[reply]

Feedback and questions for the Wikidata as a platform paper

Comments by Csisc

I thank you for your efforts. I am honoured to add several points to the Wikidata as a platform working paper. On the one hand, I have to inform you that three important projects are missing from the list provided in the working paper:

  • QAnswer, Question answering platform based on Wikidata (Research paper)
  • Wikidata for Machine Interoperability (Research paper)
  • Medical Wikidata: I presented this project during Wikimania, and my paper about the project has been accepted for publication in the Journal of Biomedical Informatics. However, the paper is still being proofread by the editor for typesetting, which is why it is not yet published. The project is about adding support for medical information, such as symptoms and medical classifications, to Wikidata. This project is important for various purposes such as medical education and clinical decision support. What I took from a discussion with Tim Moody is that this project could be expanded to add support for medical institutions and facilities across the world, so that decisions about humanitarian actions can, for example, be assisted by computers.

On the other hand, I agree that Wikidata cannot be effectively used for important tasks until its quality is improved. From what I discussed during and after Wikimania, there are only three automated methods for that:

  • Using the APIs of reliable citation indexes, such as the PubMed Entrez API, to automatically verify and add reference support to Wikidata statements. Citation indexes include data about millions of scholarly publications; consequently, they provide a reliable overview of the sum of all human knowledge that can be used to enrich Wikidata.
  • Adding EntitySchemas for each class of Wikidata items. These EntitySchemas are built using ShEx and can be used to identify missing or wrong statements related to a Wikidata item.
  • Using the Wikidata query service to identify wrong property use.

These three methods should be discussed and further developed. --Csisc (talk) 19:54, 27 August 2019 (UTC)[reply]
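To make the third method concrete, here is a rough sketch of how a query against the Wikidata query service could flag wrong property use. This is my own illustration, not from the discussion: the helper function is hypothetical, and the pairing of drug used for treatment (P2176) with medication (Q12140) mirrors an example raised elsewhere on this page.

```python
# Illustrative sketch: build a SPARQL query that lists items using a
# property whose subject is outside the class the property expects.
# The FILTER NOT EXISTS clause keeps only items that are not instances
# (or transitive subclass instances) of the expected class.

def wrong_subject_query(prop: str, expected_class: str, limit: int = 100) -> str:
    """Return a SPARQL query listing items that use `prop` but are not
    instances of `expected_class` (via P31, following P279 upward)."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:{prop} ?value .
  FILTER NOT EXISTS {{ ?item wdt:P31/wdt:P279* wd:{expected_class} . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()

# Example: items using P2176 that are not medications (Q12140).
query = wrong_subject_query("P2176", "Q12140")
print(query)
```

The resulting string could be sent to the endpoint at https://query.wikidata.org/sparql. Items it returns are candidates for review rather than automatically wrong, since gaps in P31/P279 modelling also trigger the filter.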

Comments by ChristianKl

We are not looking for truth but verifiability.

I do care about truth. If we see patterns inside of Wikidata that result in us having data that isn't true, I think it's important to address them. Wikipedia's notion of verifiability over truth is problematic and comes with its rejection of expert knowledge. While we currently don't have a system for authenticating expertise, I wouldn't want to rule out that we will make decisions in the future where we rate the knowledge of an expert higher than the information in an article that seems to be false. ChristianKl 20:51, 27 August 2019 (UTC)[reply]

ChristianKl: I also work on methods for the automatic verifiability of data. For example, drug used for treatment (P2176) can only be used with a medication (Q12140) as its subject. It cannot be used with diseases or genes as a subject. This is evident even without expert intervention. --Csisc (talk) 08:05, 28 August 2019 (UTC)[reply]
The question is what you do when truth and what can be verified diverge. I think we should have the goal of providing the user with the truth. ChristianKl 08:33, 28 August 2019 (UTC)[reply]
ChristianKl: Definitely agree. --Csisc (talk) 08:45, 28 August 2019 (UTC)[reply]
I agree with you that we do not want untrue data. Obviously. But sometimes we can't know, and it's in those cases that we need guiding lights. And Wikimedia has decided to go with verifiability in this case. And by conceptualizing Wikidata as a secondary database, it's even pretty deeply baked into the system. Denny wrote about it here. So I think what we need to communicate is that you shouldn't come to Wikidata expecting to find the one and only truth about everything - but instead a best effort guided by references and hard work. Do we agree on that? If so, is there a wording that would cover this to update the paper? --Lydia Pintscher (WMDE) (talk) 17:22, 1 September 2019 (UTC)[reply]
Lydia Pintscher (WMDE): We discussed this point during the Wikidata and Health session of Wikimania 2019. I agree that verifiability is a better choice than truth, because it is hard to judge a statement as false if there is a reference behind it. This should be clearly shown in the report. However, there are several situations where we can easily recognize that a statement or a property use is wrong thanks to description logics. For example, if X is a drug used for treatment of Y:
  • Y is a disease and cannot be a drug
  • X is a drug and cannot be a disease
  • Y is NOT drug used for treatment of X
Such facts should be included in the paper. In practice, this can easily be applied:
  • If we can define an EntitySchema for each Wikidata class (diseases, drugs, human genes...), we can easily identify missing properties and wrong property uses.
  • As citation indexes such as Medline cover the sum of all scholarly knowledge, we can use the API services of citation indexes to verify and add reference support to Wikidata statements, and even to add missing Wikidata statements.
  • We can use the Wikidata query service to identify wrong property uses. E.g. click here.
I would like your opinion on this. --Csisc (talk) 19:20, 1 September 2019 (UTC)[reply]
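The three entailments listed above can be written down mechanically. The following toy sketch is my own (the entity names are hypothetical examples), showing how a rule-based checker might derive the required and forbidden facts from a claim "X is a drug used for treatment of Y":

```python
# Toy encoding of the description-logic checks described above: from one
# treatment claim, derive class assignments that must hold and the
# inverse statement that must not hold.

def treatment_constraints(x: str, y: str) -> dict:
    """Given 'X is a drug used for treatment of Y', return the entailed
    (subject, property, object) facts and the forbidden inverse."""
    return {
        "must_hold": [
            (y, "instance of", "disease"),  # Y is a disease, not a drug
            (x, "instance of", "drug"),     # X is a drug, not a disease
        ],
        "must_not_hold": [
            (y, "drug used for treatment", x),  # the inverse is invalid
        ],
    }

# Hypothetical example pair:
facts = treatment_constraints("metformin", "type 2 diabetes")
```

A checker would then compare these derived facts against the statements actually present on the items and flag contradictions.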
  • There are plenty of cases where newspaper articles contain wrong information. When a user with local knowledge sees such an error and tries to correct it, they get told "Truth isn't the goal but verifiability is; if you don't have a source comparably reliable to the newspaper article, we don't want your facts". I do consider that problematic. The way around this is to assume that users sometimes do have local knowledge and expertise and should be trusted to correct an error. I don't think we should copy Wikipedia's hostility toward expertise. ChristianKl 12:31, 3 September 2019 (UTC)[reply]
ChristianKl: I definitely agree with you. This should be added to the report as well. However, what I deal with are scholarly publications, not newspaper articles. Scholarly publications are published research items that have been reviewed by confirmed scientists, so there is only a limited probability that they contain wrong information. --Csisc (talk) 18:41, 4 September 2019 (UTC)[reply]
@Csisc: When it comes to scholarly publications, a lot is well reviewed. However, I expect that in many cases the information that leads to {{{}}} isn't of that nature. I could imagine cases where there's a low-quality data source that gives main subjects for specific papers, but we would still want experts on Wikidata to enter a more appropriate {{{}}}.
Information about the nature of citations is similar. We will likely have low-quality machine-generated judgements of the nature of specific citations in the future, and I could imagine that in many cases human experts could overrule low-quality sources like that. ChristianKl 19:09, 8 September 2019 (UTC)[reply]
@ChristianKl: I fully respect this opinion. Statements automatically generated by machines need to be verified by experts before inclusion in Wikidata. I will include that in my research work on the topic. --Csisc (talk) 07:49, 10 September 2019 (UTC)[reply]
@Lydia Pintscher (WMDE): Could you add the outcomes of this discussion to the report? --Csisc (talk) 07:49, 10 September 2019 (UTC)[reply]

Minor grammatical issue

On page 8: "Values: If we don’t provide a reliable, high-quality knowledge base other players will do it that will not be doing this in a way that is [...]." --abián 18:37, 1 September 2019 (UTC)[reply]

Feedback from a data re-user

Hi! I'm Connor Shea (Nicereddy on Wikimedia sites) and I'm a Wikidata data re-user :) I've also contributed back quite a bit, with around 13.5k edits between myself and my bot.

I just wanted to chime in with some of my thoughts, having spent quite a while on scripts to improve Wikidata's coverage of video games, data import scripts for my own site, SPARQL queries, working with the WikiProject Video Games community, etc.

For context, my website is called vglist (https://vglist.co/) and its intent is to let users track their video game library (think Goodreads, but for video games). It's an open source Ruby on Rails app I started working on last December. The biggest problem I had when starting the project was figuring out how I could get a starting dataset to work with. I knew Wikidata existed, but I was skeptical of its data quality when it came to video games. After researching available alternatives (MobyGames, GiantBomb, etc.) I decided that my best choice - based on the licensing of the data, the API capabilities, the accuracy of the data, and whether I could contribute improvements back to the data source - was to go with Wikidata.

However, I was still unhappy with the data quality/depth in a lot of respects. Oftentimes, platform statements were nonexistent, external identifiers were lacking, descriptions were poor and/or minimal, games were duplicated across multiple items (sometimes due to there being an item for every platform a game was on, for example), and release dates didn't have specific region or platform qualifiers (e.g. frequently games come out in Japan before they come out in the US, or come out first on a console and are later ported to Windows). So, I started working on improving it. I added PCGamingWiki IDs, and then used those as a jumping-off point to import better data from PCGamingWiki, like Steam IDs, GOG.com IDs, Discord SKU IDs, WineHQ IDs, etc. Then I used the Steam API to get the supported platforms, and did various kinds of data cleanup whenever I saw potential problems. I also wrote a bunch of SPARQL queries to help me find items that were likely incorrect or missing data (https://gist.github.com/connorshea/d813cae2bad4dd72a490efe925dfb6c2), and then went through each of the items manually.
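As an illustration of the kind of cleanup query described here (the real ones are in the linked gist; this one is my own hypothetical reconstruction): video game items that have a Steam application ID (P1733) but no platform (P400) statement.

```python
# Hypothetical cleanup query: video games with a Steam application ID
# but no platform statement - a sign that platform data is missing,
# since every game on Steam runs on at least one known platform.

MISSING_PLATFORM = """
SELECT ?game ?gameLabel WHERE {
  ?game wdt:P31 wd:Q7889 ;    # instance of: video game
        wdt:P1733 ?steamId .  # has a Steam application ID
  FILTER NOT EXISTS { ?game wdt:P400 ?platform . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 50
"""

print(MISSING_PLATFORM)
```

Running this at https://query.wikidata.org/ yields a worklist that can then be fixed manually or fed to a bot.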

So, 13.5k edits later, I'm much happier with the data quality for video games on Wikidata, but there's still plenty of room for improvement.

I actually ended up implementing a 'Wikidata Blocklist' in vglist, to prevent known bad data from leaking into my site when I ran imports. For example, Pokémon Sun & Moon are two separate games, but because they were released together at the same time, each has its own Wikidata item plus a shared item (https://www.wikidata.org/wiki/Q22954794), so there are 3 items for 2 games. Admittedly, this is mostly just due to a difference in how we approach our respective data catalogues rather than a flaw in the dataset, but I thought it was worth mentioning.

I had about 31k games in vglist when I first launched it in late March (and still have about that many today), and >99% of that data (excluding game covers) came from Wikidata.

Currently, once data is imported from Wikidata I don't really touch it again; e.g. if a game gets a new release on Nintendo Switch, that information will never be imported automatically into my site. There are a couple of reasons for this, one of the major ones being that I don't want to make 31k API requests to Wikidata on a regular basis :) I also don't want bad data to 'leak' into my site, and generally the data doesn't change often enough for this to have been a concern for me yet. The biggest source of new data is new games, which get new entries anyway, and I already do imports for new items every few weeks.
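For what it's worth, periodic re-imports would not necessarily need 31k individual requests: Wikidata's wbgetentities API action accepts up to 50 entity IDs per call. A rough sketch of the batching (my own, not from the post; the Q-IDs are stand-ins):

```python
# Batch entity fetches: wbgetentities takes up to 50 IDs per request,
# so ~31k items need ~620 requests instead of 31,000.

from itertools import islice

API = "https://www.wikidata.org/w/api.php"

def batches(ids, size=50):
    """Yield successive chunks of at most `size` entity IDs."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk

def batch_url(chunk):
    """Build a wbgetentities request URL for one chunk of Q-IDs."""
    return (f"{API}?action=wbgetentities&ids={'|'.join(chunk)}"
            f"&props=claims&format=json")

qids = [f"Q{n}" for n in range(1, 121)]       # stand-in for the site's items
urls = [batch_url(c) for c in batches(qids)]  # 120 IDs -> 3 requests
```

For full refreshes at this scale, the weekly JSON dumps are the other usual option, avoiding the live API entirely.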

I've repeatedly considered dropping the ability to edit games on my site at all, and instead just point users to Wikidata if they want to modify the data. So far, I haven't done that partially because I'm not 100% sure if I've considered all the potential drawbacks, and partially because I spent a long time making the editing work so I don't really want to throw it away. :P

I've briefly considered letting users edit on my site and then syncing those changes back onto Wikidata with a bot, but I was concerned with potential spamming and also didn't see much value in adding so much complexity to my site. Generally, whenever I find information that's incorrect on vglist I'll also update it on Wikidata myself anyway, and there aren't many people doing edits on my site besides me.

This ended up being a lot longer than I had intended, sorry about that, and sorry it's not a super cohesive comment. My main points of feedback are, essentially:

  • Thank you for Wikidata, it's been immensely helpful to me.
  • Thank you for supporting data reusers, and I hope my feedback (or, long anecdote, as it's turned out) as a data reuser is helpful. :)
  • I agree with pretty much everything in this paper, and strongly agree that improving data quality is very important to the continued success of Wikidata.
  • I'd definitely be interested in more discussion around how reusers can contribute back, either automatically or manually.

If there are any questions you'd like to ask me about my experience integrating w/ Wikidata, how my site works, etc. I'd be happy to try and answer them. :)

Thanks a ton, Nicereddy (talk)

Thanks so much for building your site on top of Wikidata, and for this experience report. That's always useful to have, and it's good to know that we're thinking in the same direction.
When we get to further research into how to make contributions from data re-users easier, I'd love to have your input. That should happen next year. --Lydia Pintscher (WMDE) (talk) 16:10, 1 October 2019 (UTC)[reply]

Feedback and questions for the Wikibase Ecosystem paper

Comments by ArthurPSmith

I was hoping to hear a bit more specifics about where we are and what's needed on federation. What I had in mind was allowing 2 (or more) Wikibase instances to cross-reference each other so that statements in one can use properties and items and other entities from the other(s) without having to make a copy of the data. Something like that has been done with Commons, right? Or enabling copy of data at a certain point in time and then some kind of later resynchronization process, for data that is flagged as copied from another instance. Is there a general place these specifics are being discussed at this point in time, or is it still a bit nebulous as it seems from these strategy papers? ArthurPSmith (talk) 18:29, 27 August 2019 (UTC)[reply]

A feature like that seems central to their being able to reuse our ontology. In particular, reusing Wikidata properties should be made easy. ChristianKl 06:44, 28 August 2019 (UTC)[reply]
I have some more ideas in my head, but those are all still rather rough and not yet completely guided by what people actually need. I'll open a page to gather input in the next few days so that I can then use that to refine the ideas and figure out the concrete next steps after statements on Commons. --Lydia Pintscher (WMDE) (talk) 17:25, 1 September 2019 (UTC)[reply]
The page to collect input is now up at d:Wikidata:Language barriers input. --Lydia Pintscher (WMDE) (talk) 18:31, 1 September 2019 (UTC)[reply]
@Lydia Pintscher (WMDE): Language barriers seems to be a very different topic than how to make it easy for other Wikibase instances to reuse Wikidata items/properties. ChristianKl 12:33, 3 September 2019 (UTC)[reply]
Urgh. Yeah :D I of course copied the wrong link... Here's the correct one: d:Wikidata:Federation input. Sorry. --Lydia Pintscher (WMDE) (talk) 18:27, 3 September 2019 (UTC)[reply]

Comments by Abián

Thanks for the paper! I would like to share some comments to complement it:

  • The pressure of some stakeholders on Wikidata is usually seen as a problem, but this pressure also hides a number of opportunities that should be wisely controlled for the benefit (and maybe benevolent hegemony) of Wikidata and free knowledge. Cultural and public institutions that offer Wikidata-based services are concerned about the quality of Wikidata's data (with greater or lesser frequency or success, they keep the data up to date, ensure that the data is accurate and free of vandalism, etc.) and have to cooperate with the rest of the ecosystem and balance the interests of other stakeholders - dynamics that result in a high-quality Wikidata that can meet the information needs of many people and provide valuable content to the Wikipedias. Without these institutions and their professionals, volunteer editors alone will not be able to keep Wikidata up to the required quality standards and, most importantly, up to date, even if the up-to-date data is maintained in federated Wikibase instances; Wikidata's data is too scattered, too abundant and growing, and preventing institutions from leaving their data here will not stop volunteers from doing the same - volunteers are unstoppable machines for adding more and more new data (although, sadly, they're not as good at maintaining existing data). The fact that some institutions and part of their finances depend on Wikidata's quality is great news and something we can aim for; we just have to reach a compromise between the pressure that Wikidata receives and the pressure that Wikidata can afford, and try to let Wikidata afford a little more pressure over the years without breaking or degrading.
  • To achieve this - to get a free and successful ecosystem of Wikibase instances and constructive relationships and, in general, for the Wikimedia movement to continue to pursue its goals - more money needs to go to Wikidata and the Wikibase ecosystem, although money, or resources in general (including data and knowledge, people's time and computing power), should not be welcome from just any source. We must not accept resources from those with whom establishing dependency relationships would put at risk the independence of the development of Wikibase or the Wikimedia movement. Too-powerful stakeholders, or stakeholders on whom the survival of the Wikibase ecosystem or the Wikimedia movement comes to depend, will end up determining, explicitly or subliminally, how the software is developed, will condition the direction of the Wikibase ecosystem and the Wikimedia movement, and will pervert our culture and values. Over 50% of the money that gave birth to Wikidata came from entities somehow related to these powerful stakeholders. This cannot be the proportion thanks to which Wikidata and the Wikibase ecosystem, the core of innovation of the Wikimedia movement, keep running and evolving in the years to come. Please prioritize Wikidata and the Wikibase ecosystem inside Wikimedia and raise people's awareness of how important these projects (and their independence) are for everyone.
  • Last but not least, it is important to get stakeholders to actively use Wikibase for the whole data lifecycle. Having Wikibase simply as a data cemetery, as one more place to dump and forget data, has almost no value; these bad and common open data models, called "linear producer-driven exclusive open data initiatives" by Francisco Javier López Pellicer (Q29401823), do no one any good. Ideally, institutions should not use Wikibase as an extra but as an indispensable part of their systems and internal operations, instead of other closed or isolated local solutions. This will not only benefit institutions (by having consistent and connected data enriched with bidirectional feedback flows with their users) and users (by having always up-to-date and, in general, quality data available), but it will also benefit the ecosystem and our movement, because implementing Wikibase will mean abandoning other closed or proprietary solutions and establishing entry barriers against them in the future.

Thanks again, and keep up the good work! --abián 13:10, 28 August 2019 (UTC)[reply]

All good points. I'll see how we can integrate them. Thanks! :) --Lydia Pintscher (WMDE) (talk) 17:28, 1 September 2019 (UTC)[reply]

Comments by Dsalo

Hi! I teach metadata and related topics, including linked data, in an iSchool to future (and often current) GLAM workers. I used Wikidata in an assignment for my most recent linked-data course, and would like to be able to use Wikibase directly as well. I very much appreciate the nod in the document toward installation in shared-hosting environments as well as demo systems, because trying to work with server-based software in academia is absolutely horrific. What I didn't see that I'd like to see is a plan for thorough usability/UX assessment of Wikibase's data-entry interfaces, and the related development work that would have to happen to address whatever the assessment turns up. Thank you for sharing this strategy work! --dsalo

Great to hear you're teaching with Wikidata! For me, the work you mention feeds into a lot of the pieces that are in there already (getting more institutions to use Wikibase, getting more contributions from a more diverse community, more acceptance in the Wikimedia projects, etc.). I'll see how to make that a bit more explicit. --Lydia Pintscher (WMDE) (talk) 17:32, 1 September 2019 (UTC)[reply]