
Talk:Community Tech/Tool Labs support/Tool Labs vision

From Meta, a Wikimedia project coordination wiki

Call for participation

I (BDavis (WMF)) am the "owner" of this page insofar as I started writing ideas down here and the general topic is directly related to my current position at the Wikimedia Foundation. The ideas here are not entirely (or likely even primarily) my own, however. Even if they were, plans on this scale (many person-months of work) need collaboration and feedback to ensure that a biased point of view does not dominate the planning. I'm seeking input from interested parties on all aspects of the content presented here.

I am seeking constructive participation. Arguments that the status quo is fine and that changes are unneeded are not constructive. Suggestions of which problems have been missed or which changes would have the most beneficial impact are constructive. Suggestions of existing FLOSS projects which might be examined and other communities that have solved certain problems are constructive. Prognostications of failure due to past performance of initiative X, Y, or Z are not constructive.

More "What Would You Do If You Weren't Afraid?" and less "Who Moved My Cheese?"[1] please. --BDavis (WMF) (talk) 05:01, 15 April 2016 (UTC)

Project roadmap

The items in the project roadmap all sound reasonable, but I'm not sure what all of them actually accomplish. I'm not suggesting that they need to be changed, just that they could use some elaboration (for people who are clueless like me) as to what each item actually accomplishes from a practical point of view. For example, what does "Build an extension for Horizon that can manage the Puppet config for Labs projects" do for us? What are the advantages of replacing the Sun Grid Engine? Does that give us better resource allocation? Just want to make sure these are clearly spelled out. Kaldari (talk) 17:36, 15 April 2016 (UTC)

Kaldari, you're right, clear answers to your questions would be helpful. What form do you propose that BDavis takes to spell out his proposal more clearly? I made an attempt to answer your questions in new topics below, which I hope keeps the conversation manageable for Bryan. -- RobLa-WMF (talk) 18:52, 15 April 2016 (UTC)
I think Kaldari's question is partly also about making the connections between the steps easier to parse. I edited the "making it better" section to sort the ideas into more general buckets, and I think the straw dog proposal could use that as well. The way I read it, steps 1, 2, 3, 5, 8 and 9 are all connected to the goal of converting Tool Labs/Wikitech to SUL. That's a little confusing, plus it's not even the correct Fibonacci sequence.
I'd suggest either breaking these projects out into their own lists (SUL project, Metadata project), or at least adding a label to make it easier to follow through the list. I'm going to be bold and add labels to the list -- Bryan, obviously feel free to take them off if they're not helpful. :) -- DannyH (WMF) (talk) 23:01, 15 April 2016 (UTC)
Thanks for those edits, DannyH (WMF). One thing that the current list doesn't make very clear is which steps truly are blockers for others. This will be easier to express when I start moving individual projects into Phabricator tasks where we can build a dependency graph with blocker tasks and parent tasks. I have moved things around a bit and added a few descriptive sub-points since Kaldari's first post here (thanks for kicking things off, Ryan!) that I hope will also help. I have a very complex narrative in my head and I'm not sure which parts people need to know to understand the bigger-picture implications of the steps that are outlined thus far. I'll try to work on a better elevator pitch for the big-picture goal. I think that will help some. --BDavis (WMF) (talk) 23:34, 15 April 2016 (UTC)

Extension for Horizon

Kaldari wrote: "what does 'Build an extension for Horizon that can manage the Puppet config for Labs projects' do for us?". I haven't learned how Puppet configs work, and I suspect the answer to that may be "make it so that 'learning Puppet' is not a precondition for using Labs", but that's just my guess. -- RobLa-WMF (talk) 18:52, 15 April 2016 (UTC)

This is one of a handful of blockers to the eventual goal of making wikitech a SUL wiki. Functions that are currently provided by Extension:OpenStackManager will need to be moved to other tools because SUL will not provide the needed authorization controls that are currently received from LDAP auth.
Luckily, the goal of incrementally diminishing the need for OSM is shared with the Labs team. Horizon is an upstream project of OpenStack which handles many of the same functions as OSM. Using a generally available FLOSS product as the basis of our IaaS solution should reduce maintenance costs in the long run. Andrew Bogott has been actively working on deploying Horizon as an OpenStack management console. Both he and Krenair have been working to extend Horizon to cover additional use cases of ours that the upstream project does not yet fully satisfy, and those changes are being upstreamed when they are generically useful to other OpenStack deployments. --BDavis (WMF) (talk) 20:12, 15 April 2016 (UTC)

Replacing the Sun Grid Engine

Kaldari wrote: "What are the advantages of replacing the Sun Grid Engine?" This is one that I'm guessing Yuvi or someone else from Ops could answer just as clearly and directly as Bryan could. Anyone care to jump in? -- RobLa-WMF (talk) 18:52, 15 April 2016 (UTC)

There are various reasons. Mostly, SGE is a product of the 90s (and maybe 00s), and of the way system administration was done in that time period. There are many manual steps in all parts of configuration, and none of the admins know SGE well enough to fix issues when they arise (see for example the regular outages in January). Next, SGE is largely unmaintained -- Debian doesn't provide security updates (so they have to be backported manually), and as a further result of the lack of maintenance, SGE has been removed from jessie altogether. This effectively means SGE locks us to Trusty (EOL in 2019). There are also reasons to prefer k8s from a user perspective, but the main reasons to switch are the ones listed above (if jessie had still packaged SGE, we likely would not have spent as much time on an alternative). Valhallasw (talk) 19:09, 15 April 2016 (UTC)
The Labs team has already selected and deployed Kubernetes (k8s) as a next-generation component for Tool Labs. It is currently being used for the PAWS tool as well as by a few alpha testers for web and bot processes. K8s is a system for deploying and managing containers spread across a grid of worker nodes. The transition from using SGE and NFS to using containers will bring benefits in the form of stability (no NFS), scalability (easy to run N copies of the same container), reproducibility (containers are versioned), and flexibility (each container is an isolated environment that can install software packages without affecting its neighbors). It will also bring new challenges in the form of workflows that differ from the ones existing Tool Labs developers are accustomed to using.
Helping solve these workflow challenges is where evaluating PaaS projects like OpenShift and Deis will be important. We can either choose to build out a workflow of our own (as was done with the existing SGE tooling like jsub) or we can look for healthy FLOSS projects that attempt to solve similar issues. The eventual solution is likely to be a mix of both upstream projects and custom tooling. The big developer-facing benefit of the new workflows will be using tools that are common with other projects and more likely to be covered by books, blogs and StackOverflow answers. This will make it easier to form community norms and best-practice standards than the existing "here's a shell prompt, figure things out" basis of interacting with SGE. --BDavis (WMF) (talk) 20:48, 15 April 2016 (UTC)
Most of the advantages of Kubernetes sound largely like a bet, but being able to upgrade Debian (as Valhallasw says) is a very clear one to explain. I suggest spelling this out as the main goal. Nemo 22:34, 16 April 2016 (UTC)
Can you explain to me on what basis the advantages of Kubernetes over SGE sound like "a bet" to you? Because they don't sound like a bet to me at all, but I guess you have a different opinion (and a very low consideration for the experience in operating large application clusters of the people involved in the choice, which includes me). GLavagetto (WMF) (talk) 10:43, 23 April 2016 (UTC)
My email to labs-announce from September 2015 probably has useful information. It also has the evaluation criteria we used in a Google Spreadsheet, where we had a column for the 'status quo' (GridEngine) as well. YuviPanda (talk) 08:26, 23 April 2016 (UTC)

Initial thoughts

I agree that there's room for improvement with the existing Tool Labs setup.

I agree that the getting started process (particularly the user registration form) is clumsy, as noted by bd808 at phabricator:T128158#2128397.

I agree that it's easy to accidentally monopolize resources with long-running queries and memory-hungry processes. Let's fix that.

I don't agree with the strong focus on integrating unified login. A lot of the current project roadmap focuses on "SUL" (which I don't think ever even gets defined in the draft). Integrating MediaWiki authentication won't suddenly make people write better documentation, it won't suddenly remove limitations on shell usernames, and it won't suddenly make generating and uploading a public SSH key easier. What benefit is there to formal "ownership" of a Tool Labs tool by a Wikimedia wiki account? Usually the person's shell name or Wikitech username is similar to their Wikimedia wiki username. If it's not, who cares?

I don't agree that (shared) hosting in a Linux environment is antiquated or unusual. The idea that Kubernetes and a Horizon-based Puppet Labs (not that Labs, the other one) configuration management system are going to be more familiar or easier to use for developers than shared hosting seems kind of crazy to me.

Regarding the project roadmap, it doesn't seem very aligned with the What does "better" look like? section. As mentioned, it focuses heavily on SUL, when if we look at the "better" section, we see:

  • Communication: questionable "ownership" followed by two indecipherable bullet points about promoting... something?
  • Guidance for new users: A.k.a. "write better documentation"; cf. bugzilla:1.
  • Metadata and discovery: "Require" metadata... how? Form field validation? And isn't this basically describing what Ryan set up on wikitech.wikimedia.org using SemanticForms that few people used?
  • Resource management: Sounds great, let's do this.

When you compare this section with the project roadmap, there's a bit of a disconnect. The project roadmap talks about SUL a whole lot and making yet another console.wmflabs.org site where you can... put your SSH keys? It's difficult for me to follow.

The actual juicy parts of the "Project roadmap" section, such as implementing a tool to manage Git repositories or deploying a "PaaS workflow replacement for SGE" (whatever the hell that means), are left weirdly undefined and non-specific. --MZMcBride (talk) 01:46, 16 April 2016 (UTC)