Grants talk:Project/Future-proof WDQS

What is the problem you are trying to solve?[edit]

Under "What is the problem you're trying to solve?", you describe how WDQS should be. Does that mean that WDQS currently does not fulfil any of the items listed? For example, you write that it should "expose a SPARQL endpoint for queries" which I thought exists? Please correct me if I am wrong or misunderstand - I would have expect a description of the current problem(s) and situation in this section. --AKlapper (WMF) (talk) 20:22, 24 January 2020 (UTC)[reply]
- That is citation from a thread from the wikidata mailling list stating that WDQS does not scale as much as we would like. I added the requirement "easy for researcher and practionners to setup their own instance."--i⋅am⋅amz3 (talk) 13:21, 25 January 2020 (UTC)[reply]
- I tried to address your request for better description that should be readable by a non-tech. --i⋅am⋅amz3 (talk) 00:05, 27 January 2020 (UTC)[reply]

You have identified a problem you want to solve. For any problem, there are often several possible technical implementations to consider. You propose an implementation based on nomunofu. Can you also share about what other approaches you have considered and why you ultimately chose this implementation and technology? --AKlapper (WMF) (talk) 20:22, 24 January 2020 (UTC)[reply]
- I mainly considered the same approach based on an OKVS in other programming languages: Python, Go, Rust or Java. Python is slow. Go has a single compiler implementation, hence subject to vendor lock-in. Rust is not easy to learn. Java can be solution. I favor Scheme because there is plenty of learning material, it is easy to learn and Chez Scheme implementation is fast. On microbenchmarks, it is faster than Virtuoso. I favor the OKVS approach because it is the easiest to setup, scale, maintain and is future proof (there is several open source vendors and there companies already using OKVS and in particular FoundationDB in their products. --i⋅am⋅amz3 (talk) 13:21, 25 January 2020 (UTC)[reply]

@Iamamz3: Thanks a lot! Currently the page seems to mention "nomunofu" without explaining or linking to what that is? Plus I cannot find a public code repository, as all links that an internet search brings up are 404 errors. :( --AKlapper (WMF) (talk) 20:34, 27 January 2020 (UTC)[reply]

@Iamamz3: question: could you provide more details on the first point? (for instance, for each ideal point if WDQS is currently good, medium or bad at it ; the idea is to see more precisely where is the potential and need for improvements, caveat: I'm not an expert tech). VIGNERON

Done, but more to come --i⋅am⋅amz3 (talk) 00:05, 27 January 2020 (UTC)[reply]

Comparison with other solution[edit]

Remark: last August, KingsleyIdehen created a static demo endpoint in Virtuoso here (see the announcement on the wikidata list), did you look at it? Maybe there is some useful part behind this idea (or/and in the mail list answers). Cheers, VIGNERON * ^discut. 14:11, 25 January 2020 (UTC)[reply]

Yes I am aware of that. I made small pros/cons analysis at: https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS#What_are_other_solutions? --i⋅am⋅amz3 (talk) 00:03, 27 January 2020 (UTC)[reply]

Budget[edit]

Can you provide an itemized breakdown of your requested budget? It would be helpful to see your calculations. --AKlapper (WMF) (talk) 20:22, 24 January 2020 (UTC)[reply]
- The budget depends on a) whether I do that work full-time (current proposal) or on my free time b) whether I can rely on wikimedia hardware (see https://phabricator.wikimedia.org/T206636) --i⋅am⋅amz3 (talk) 13:21, 25 January 2020 (UTC)[reply]
- I changed the budget to only give an idea of what I expect, this must be reworked depending on the plan and applicable taxes. --i⋅am⋅amz3 (talk) 12:16, 26 January 2020 (UTC)[reply]

Under Project goals and Project impact[edit]

Under Project goals and Project impact, can you provide further narrative that explains how and why you chose these goals and how you will measure if you meet them? Currently you state that "WDQS does not scale" and I'm wondering how you came to that conclusion, as things are often not binary (maybe you meant "does not scale well" here?). --AKlapper (WMF) (talk) 20:22, 24 January 2020 (UTC)[reply]
- This is not my analysis, it was reported several time and acknowlged by User:GLederrey_(WMF)... --i⋅am⋅amz3 (talk) 00:01, 27 January 2020 (UTC)[reply]

Under "What is your solution?" you mention that "the first step will be to set up a benchmark tool" - This could be listed as a bullet point under "Project goals", together with other outcomes which describe more specifically what you plan to achieve? --AKlapper (WMF) (talk) 20:22, 24 January 2020 (UTC)[reply]
- I added a list of Activities and I address some of the problem of wdqs and wikidata in How_to_make_WikiData_scalable? . Project impact are documented, in particular, it is a MVP, it will not be iso with the current wikidata in terms of visual edition and will not be backward compatible with the existing bots. --i⋅am⋅amz3 (talk) 00:06, 29 January 2020 (UTC)[reply]
- TODO I will add a non-goals section. --i⋅am⋅amz3 (talk) 00:06, 29 January 2020 (UTC)[reply]

Request for feedback[edit]

Have you had any conversations already with people working on Wikidata Query Service about your ideas for this project? If so, can you share anything about those conversations? Or even better, perhaps you can invite them to share their feedback here. (Also see the links in the "General discussion" section on wikidata:Wikidata:Community portal.) It would be great to see people from the Wikidata community engaging directly on this talkpage so we can hear their thoughts about the proposal. --AKlapper (WMF) (talk) 20:22, 24 January 2020 (UTC)[reply]
- I have listed many requests for feedbacks at the bottom of the page. I agree it would be great to see people from Wikidata community engaging directly on this talk page... --i⋅am⋅amz3 (talk) 13:49, 26 January 2020 (UTC)[reply]

Bold[edit]

I read everywhere on wikipedia that contributors should be bold. So, I reworked the proposal to be larger. I will document a solution and address the other comments soon. --i⋅am⋅amz3 (talk) 17:38, 26 January 2020 (UTC)[reply]

It seems to me very unlikely that a complex product like Wikidata can be rewritten in a year by one person yet it seems that's your proposal. Can you explain why you consider that doable? Wikidata has a lot of features besides WDQS. ChristianKl ❪✉❫ 16:15, 28 January 2020 (UTC)[reply]

The simple answer to that is legacy. --i⋅am⋅amz3 (talk) 00:29, 29 January 2020 (UTC)[reply]

I can add a section about non-goals to make clear that the proposal will not be drop-in replacement for ALL WikiData features when 1.0.0, but it will have the necessary building blocks to work toward that goal. --i⋅am⋅amz3 (talk) 00:30, 29 January 2020 (UTC)[reply]

If the goal is to be part of a Wikidata which many legacy parts I would like to know more about the state you expect after a year and how that state will be integrated into Wikidata. ChristianKl ❪✉❫ 08:30, 29 January 2020 (UTC)[reply]

It depends. I know that I don't know. User:Addshore's diagram in the post wbstack infrastructure is very helpful. Still, it is not a wikidata service dependency diagram. Off-hand I can think of two strategies to support other wikimedia projects until those rely on a new API (based on SPARQL): A) nomunofu (or another service) will write changes to wikidata based on the stream of changes API. Existing users will be able to continue to use the services offered by wikidata. That will increase lag and limit throughput and is more brittle. B) Create services that will translate legacy API calls into new API calls. That will not limit throughput, there will be no lag, but requires more software. --i⋅am⋅amz3 (talk) 11:49, 29 January 2020 (UTC)[reply]

Any evidence that your approach scales better?[edit]

MySQL and ElasticSearch are pretty fast for their respective purposes (data storage & search); what evidence do you have that your approach would be faster on similar hardware, at Wikidata's current and likely near-future size? ArthurPSmith (talk) 23:18, 27 January 2020 (UTC)[reply]

There is also blazegraph in the infrastructure, that should be taken into account. I need more input about pain points with mysql and elasticsearch. My experience is that ES and mysql approaches to scale is more difficult. FoundationDB was meant to internet scale, and is used for example in similar contexts at Apple in the entity store. FDB is meant to scale horizontally. Based on my experience ElastisSearch scale because we throw more hardware at it. I think given the big names investigating [1][2] it is worth giving it a try. --i⋅am⋅amz3 (talk) 00:27, 29 January 2020 (UTC)[reply]

Why Scheme?[edit]

Scheme seems to me like a language that has relatively little developers. Using a more common language like Java would allow many more programmers to contribute. ChristianKl ❪✉❫ 16:13, 28 January 2020 (UTC)[reply]

I will start the argumentation to say that more developer does not mean better code. The problem is not that people do not know scheme, the problem is that there is no big project using scheme. --i⋅am⋅amz3 (talk) 00:28, 29 January 2020 (UTC)[reply]
Java does not have the necessary features, as of yet January 2020, to reproduce what is already done, in particualr it is missing lightweight threads --i⋅am⋅amz3 (talk) 12:39, 12 February 2020 (UTC)[reply]

Problem of edits that span multiple items[edit]

Currently it's not possible in Wikidata to do an atomic edit that affects multiple items. This is a problem for edits that merge two items because those merges are complicated to undo, especially when they involve items that have a lot of links. For other batch edits it would also be great to have batch edits that are atomic and can be undone as a batch. ChristianKl ❪✉❫ 08:27, 29 January 2020 (UTC)[reply]

Atomic edit that affects "multiple items" and undoing those changes will be supported (I added the 9th row in The table in "How to make WikiData scalable?" section and rollback (or reverts) will be recorded in history. --i⋅am⋅amz3 (talk) 11:23, 29 January 2020 (UTC)[reply]

Edit tab[edit]

To ease on-boarding, and offer a clear path from user to power user, item view (WYSIWYM) will be read-only. Besides the discussion tab, there will be another tab, for instance called "RDF" or simply "edit", where all the triples for the current item will be displayed in a table (WYSIWYG). The idea behind this particular choice, is to help the user learn some RDF and toward the goal of being able to create their own SPARQL queries via the textual format or visual editor (e.g. vsb). That in turn might improve "software literacy", and reduce the need for many purposefully built tools. I believe that relying on a single framework and more precisely on the SPARQL (+ change request) API will reduce the number of required software that should be created to support the needs of the community. --i⋅am⋅amz3 (talk) 11:24, 29 January 2020 (UTC)[reply]

The user facing data of Wikidata isn't triples. It's a lot more dimensional then that. For every statement you have at least a forth dimension that includes the rank. Units, Qualifiers and references add additional dimensions. ChristianKl ❪✉❫ 19:27, 29 January 2020 (UTC)[reply]

Discussion from 2018[edit]

Maybe Future of query server interests you. --Jura1 (talk) 08:36, 30 January 2020 (UTC)[reply]

bold, but not reckless[edit]

Based on fifth wikipedia pillar: I withdrawn the proposal. --i⋅am⋅amz3 (talk) 12:27, 30 January 2020 (UTC)[reply]

Some more feedback[edit]

(Adding feedback since it was requested even though the status is withdrawn.)

Thanks for putting together the proposal and trying to think through how we can help Wikidata continue to grow healthily.

As it currently stands I fear there are several fundamental issues that make me worried about it - some of them already mentioned here before:

Wikidata is a highly complex technical system. Only a part of that is due to a legacy code base. There are a lot of tools, processes, workflows and expectations built into the system that are definitely not possible to replace in one year, much less by one person.
We have relied on home-grown solutions in Wikimedia (in other areas) way too often and it has come to bite us. I am very hesitant to move to something that is just built for us and not used, contributed to and tested by several other large players unless there are other really convincing arguments that outweigh this.
Rearchitecting the whole technology stack and at the same time redoing the core of how the project works socio-technically is extremely risky at this stage of Wikidata's life.
There is no significant expertise among the people who would need to maintain this long-term in Scheme.
The building of the system is just the start. Much more important for us is the maintenance and expansion over a long period of time. This doesn't seem to be addressed so far.

I hope this helps. --Lydia Pintscher (WMDE) (talk) 13:45, 13 February 2020 (UTC)[reply]

---

Since otherwise, it would be impolite, here is my reply to all the points:

Re "complex system, tools, workflows, and expectations to replace in one year, by one person"[edit]

That is the de-facto standard argument against any change.
If you look closely, most (programming) projects are built with only two hands, and maintained by a community.
It is not clear in the proposal, but in the talk page it is written, I proposed two strategies to support in backward compatible way existing tools, processes and workflows, search for "two strategies to support other wikimedia projects".
In my opinion, "legacy code base" is poor choice of words I made, I intented to mean "debt".
I stand on the shoulders of giants, wikidata is a low hanging fruit.
What about time traveling queries, isn't that an expectation that wikidata fails to deliver?
What about down-scaling and knowledge equity, isn't that an expectation that wikidata fails to deliver?

--i⋅am⋅amz3 (talk) 10:50, 14 February 2020 (UTC)[reply]

Re "home-grown = bad, tested by several other players, not us."[edit]

Again de-facto standard argument against any change,
Research and development in software engineering is a lost art. I would expect much more nerve from the biggest encyclopedia of all the times,
It is not because it failed in the past that it will fail in the future, again "standing on the shoulders of giants",

--i⋅am⋅amz3 (talk) 10:50, 14 February 2020 (UTC)[reply]

Re "rework socio-technically is risky"[edit]

I don't understand that part, it is meant to carry some political subtext?

--i⋅am⋅amz3 (talk) 10:50, 14 February 2020 (UTC)[reply]

Re "Lack of Scheme expertise"[edit]

Again standing on the shoulders of giants,
Scheme has been 50 years in the making with loads of learning material, and concept borrowed by other programming language as recently as today, see Java Loom project.
Also, the proposal states that clues will emerge on how to go forward,
Like I replied to the related conversation in this talk page, Java does not have the necessary features to reproduce the intended design.
Stronger claim: no other programming language has the required features.

--i⋅am⋅amz3 (talk) 10:50, 14 February 2020 (UTC)[reply]

Re "building is just the start: what about maintenance and expansion"[edit]

De-facto standard argument against any change. Citation needed.

--i⋅am⋅amz3 (talk) 10:50, 14 February 2020 (UTC)[reply]

If a system is to serve the creative spirit, it must be entirely comprehensible to a single individual." Daniel Ingalls[edit]

--i⋅am⋅amz3 (talk) 10:50, 14 February 2020 (UTC)[reply]