Community Tech/Tool Labs support/Tool Labs vision
Tool Labs is a hosted platform as a service (PaaS) system maintained by Wikimedia Foundation staff and volunteers. Its primary mission is to provide a reliable hosting platform for volunteer-developed software ("tools" and "bots") that supports and augments on-wiki content creation and curation activities.
The core services offered by Tool Labs include:
- Multi-maintainer access to hosted software
- Static web page hosting
- Dynamic web application hosting with PHP, Python 2, Python 3, Ruby, Java, and other implementation languages
- Distributed ("grid") job processing for one-time, recurring, and continuous jobs
- Access to real-time replicas of Wikimedia production databases
- Access to per-project MySQL/MariaDB databases
- Shared file system storage
- Shared usage of wikitech.wikimedia.org wiki for end-user and co-maintainer focused documentation
The current Tool Labs system was modeled after the services offered by the Toolserver platform that it replaced. This means that the system designs can trace their roots back to the technologies and practices prevalent in shared hosting services in common use circa 2006. Various improvements were made both at Toolserver and Tool Labs over the last ten years, but the fundamental services offered and the means of accessing them have remained largely the same.
Deploying and operating software on the Tool Labs platform requires an understanding of the Linux operating system and various technologies commonly associated with Linux and other Unix-derived systems. Most operations require interacting with the Linux command line in a shell session. Creating a shell session requires using an ssh client and a public/private key pair for authentication. Even something as basic as the process of creating the necessary accounts and configuration to access the servers is cumbersome for a new user.
Software deployment, configuration management, source code versioning, and many other details are problems left to be solved by the individual tool developers. These processes and procedures can vary dramatically from project to project and very little guidance is offered to new users of the platform. This makes it difficult to promote best practices in the community. The lack of integrated version control also makes enforcing Terms of Service clauses about open data and licensing problematic.
The shared hosting underpinnings of Tool Labs are most often felt when a badly behaving process causes resource starvation for other tenants. Starvation can manifest in many ways and for various reasons, but it is very seldom caused by a malicious actor. Instead, it is trivially easy to monopolize a shared resource accidentally. Very few services in Tool Labs enforce automatic limits or quotas for CPU, RAM, or disk usage. This is convenient for experienced power users, but it can have widespread consequences when a less experienced developer applies a naive technique that works well for a small amount of data to a much larger data set.
The "home wiki" for Tool Labs is wikitech, which is convenient only because the wikitech user account is stored in an LDAP directory. That same LDAP information is also used for authentication and authorization by the technologies that power the Labs infrastructure that Tool Labs builds upon. The lack of SUL account integration creates a barrier for communication between tool maintainers and their on-wiki audiences. This friction leads to one of three outcomes: some tools are documented on the home wiki of their maintainers (and thus not easily discoverable); other tools are documented on wikitech (and thus require creating a new account to edit their documentation and talk pages); the remaining tools are not documented on any wiki (and thus collaboration on their documentation and support is difficult). The disconnect between SUL accounts and wikitech accounts also complicates contacting the maintainers of tool accounts.
Discoverability of existing tools and bots by people interested in using or helping maintain them is also problematic in the current Tool Labs environment. When a new tool account is created, the only piece of information that is automatically collected is a short name for the tool. This name has to be suitable for use as a Unix user name and thus has many limitations. There are solutions available for a tool maintainer to provide additional information, but they require additional steps beyond the initial account creation. Neither current solution is collaboratively accessible to the larger community of tool maintainers or users.
What does "better" look like?
Tools are developed by volunteers. Creating and maintaining a tool competes for time with all of the other things that a human might want or need to do in their life. The process of using Tool Labs should be optimized with a clear understanding that time is a scarce resource. Workflows should be designed to minimize the amount of required reading and knowledge of the local systems that is needed before the tool developer can get to the interesting and useful work of creating and deploying their custom business logic for the benefit of the Wikimedia movement.
Once created, a tool is in constant danger of being abandoned due to competition with other opportunities. Workflows in Tool Labs should be designed to guide developers towards making choices that will make adoption of tools by multiple maintainers easy and commonplace. Code should be published publicly, managed with a version control system, and distributed under an OSI approved license. The necessary steps to deploy and operate a tool should be published in an easily discovered location. Communication channels for discussing defects, improvements, and maintenance of the tool should be available and accessible to the general Wikimedia community.
The shared hosting techniques and interfaces of the early Internet era have become quaint and archaic in comparison to more modern PaaS offerings. There will always be a desire for classical shared hosting practices in Tool Labs, but it should be possible to treat those systems as the special case for power users that they are, rather than as the default experience that all users must conform to. By providing systems that promote standardized practices and simplified user experience, we can expand the reach of Tool Labs services to a wider audience of contributors.
A re-imagined platform for hosting tools and bots should address these areas:
- Allow tools and bots to be "owned" by SUL accounts so that wikitech can be a SUL wiki and more connected to the other project wikis.
- Promote communication between tool maintainers, tool users, and Tool Labs administrators via well-defined channels.
- Promote best practices for developing and operating tools and bots in a manner that allows collaboration.
- Provide easy access to a public version control system for tool source code.
Guidance for new users
- Provide step-by-step guidance for creating any special credentials that are needed in addition to a SUL account.
- Provide step-by-step guidance for creating new shared tool accounts.
Metadata and discovery
- Require associating useful metadata with tools (link to version control; OSI approved license; basic description).
- Allow collaborative maintenance of tool metadata, including facilities for moderation and versioning.
- Allow easy searching of metadata on existing tools.
- Provide public resource utilization and usage information for tools.
- Provide reasonable resource isolation so that a malfunctioning tool is unlikely to negatively impact other tools operating on the platform.
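The metadata requirement in the list above (link to version control, OSI-approved license, basic description) could be modeled roughly as follows. This is a minimal sketch, not a description of any existing Tool Labs schema: the `ToolMetadata` class, its field names, the `ASSUMED_OSI_LICENSES` allow-list, and the `search` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical allow-list for illustration only; a real system would
# presumably check against the full set of OSI-approved licenses.
ASSUMED_OSI_LICENSES = {"MIT", "Apache-2.0", "GPL-3.0-or-later", "BSD-3-Clause"}


@dataclass
class ToolMetadata:
    """Illustrative record of the metadata a tool would be required to carry."""

    name: str         # short tool name chosen at account creation
    description: str  # basic human-readable description
    license: str      # license identifier (assumed to be SPDX-style)
    repository: str   # link to the public version control repository

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list if none)."""
        found = []
        if not self.description.strip():
            found.append("description is empty")
        if self.license not in ASSUMED_OSI_LICENSES:
            found.append("license %r is not in the assumed OSI list" % self.license)
        if not self.repository.startswith(("https://", "http://")):
            found.append("repository is not an http(s) URL")
        return found


def search(tools: List[ToolMetadata], term: str) -> List[ToolMetadata]:
    """Case-insensitive substring search over tool names and descriptions."""
    needle = term.lower()
    return [
        t for t in tools
        if needle in t.name.lower() or needle in t.description.lower()
    ]
```

Keeping validation as a list of problems rather than raising on the first failure is one possible design choice: it lets a submission form report everything that is missing in a single pass, which fits the goal of minimizing the time a volunteer spends on setup.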
Planning for change
With the vision outlined above for a better user experience, we come to the difficult work of drafting a roadmap of projects that will incrementally move us from the current reality to the hoped-for future. These projects need to accommodate several constraints:
- Existing Tool Labs users need to be allowed ample time to adjust to breaking changes, as their time for working on tools and bots is a limited and precious resource that should not be squandered.
- The resources available from the current developers and system administrators involved with the Tool Labs project are limited. The number of changes that can be worked on simultaneously is thus limited.
- Workflow changes generally need to have periods of overlapping functionality to allow for socialization. This will also allow for testing of new workflows to determine if they actually are superior to the previous methods.
- When possible, new systems should be constructed from external FLOSS projects. Working with an upstream project to add changes needed for the Tool Labs environment will be more cost effective in the long run than building custom solutions that require long term maintenance.
The work done needs to fit with the goals of the Labs team as well. They have an independent set of projects and timelines for improving the common Labs infrastructure as well as the Tool Labs services. These goals include migrating existing OpenStackManager functionality to Horizon, and replacing the current Sun Grid Engine deployment with Kubernetes.
Here's a straw dog proposal of changes that could be made including some rough dependency ordering:
Done Build a tool that allows associating a wikitech/Labs/Tool Labs LDAP account with a SUL account. (Workflows)
- Let's call this tool "Striker" for now, because we are going to add more things to it in this roadmap. It will fill in the bits that OSM provides for wikitech now which can't or shouldn't be migrated to Horizon.
- Eventually this tool will provide nice step-by-step workflows for common tasks such as setting up a Tool Labs user account and creating a new shared tool account. We will get to that point with incremental additions of functionality so that there isn't too much change all at once and so that we don't spend months building things without anyone using them and giving feedback on UI and workflow improvements.
- This can't actually be a Tool Labs tool. It will need to be hosted in the WMF production environment (although maybe in the Labs VLAN). This might mean it should be under wikimedia.org instead of wmflabs.org.
- The only way we have to authenticate ownership of an LDAP account is auth-bind, so the tool should probably use LDAP as primary authentication and then do an OAuth action to authenticate the SUL account ownership.
Done Extend Striker to manage git repositories associated with tool accounts. (Workflows)
- This is the first compelling reason for a user to interact with the new console tool and create the link between their wikitech LDAP account and a SUL account.
- Self-service Diffusion git repo creation and rights management via conduit API.
Done Build an extension for Horizon that can manage the Puppet config for Labs projects. (SUL; Workflows)
- There are other blockers for separating wikitech from labs, including:
- Various other things in LDAP: User SSH keys, sudo policies
- Allowing Horizon control over project membership/roles (blocked on upstream)
- Done Evaluate FLOSS tools that could be used to replace Sun Grid Engine (SGE) for new projects. (Workflows; Resource mgmt)
Done Extend Striker to allow creating new LDAP accounts that are associated with existing SUL accounts. (SUL)
- This would be a good time to add management of SSH keys and two-factor auth (2fa) as well. SSH keys are needed for Labs ssh access and 2fa is needed for Horizon. Wikitech provides both services today.
- Many of the workflow problems with the current account creation method will be solvable in this tool.
Done Extend Striker to manage tool account (service group) creation. (Metadata & discovery)
- This is where we will also add metadata management for the tool account. That metadata will create the data source that is needed later for a nice tool discovery interface.
- Done Set up process / criteria for taking over abandoned tools
- Doing... Develop evaluation criteria for comparing PaaS (platform as a service) solutions.
- Create an interface for discovering existing tools and bots and connecting users and new developers with existing tool developers. (Metadata & discovery)
- Evaluate Kubernetes-based PaaS workflow replacement options for SGE
Select, customize, and deploy PaaS workflow replacement for SGE that works with Kubernetes. (Workflows)
- Evaluate FLOSS PaaS offerings.
- Document and promote new PaaS workflow to new and existing Tool Labs developers. (Workflows)
- Build bridge solutions for migrating legacy projects that are closely tied to SGE to new PaaS. (Workflows; Resource mgmt)
Build system to collect and publish utilization metrics for Kubernetes hosted applications. (Resource mgmt)
- Done Collect requests per unit time for web tools accessed through the Tool Labs http(s) proxy
- Collect database utilization data (requests per unit time? rows stored in tool owned tables?)
- Collect CPU and memory utilization for containers run on Kubernetes
- Change the primary authentication means on Striker to OAuth so that moving from a SUL wiki to it is seamless for new users. (SUL; Communication)
Convert wikitech to a SUL wiki and complete migration of existing LDAP wiki accounts to SUL. (SUL; Communication)
- Only the accounts used for wikitech access will be migrated.
- ... keep looking for things that can be improved ...
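The roadmap item on utilization metrics mentions collecting requests per unit time for web tools accessed through the proxy. A minimal sketch of that aggregation follows, assuming request events are already available as (tool name, timestamp) pairs. The event format and the `requests_per_minute` function are illustrative assumptions, not part of any existing Tool Labs proxy tooling.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def requests_per_minute(
    events: Iterable[Tuple[str, datetime]]
) -> Dict[Tuple[str, str], int]:
    """Count requests per tool per one-minute bucket.

    `events` is an iterable of (tool_name, timestamp) pairs. How those
    pairs are extracted from proxy logs is left out and is an assumption.
    """
    counts = Counter()
    for tool, when in events:
        # Truncate the timestamp to the minute to form the bucket key.
        bucket = when.strftime("%Y-%m-%dT%H:%M")
        counts[(tool, bucket)] += 1
    return dict(counts)
```

The one-minute bucket is an arbitrary choice for the sketch; the same approach works for any unit of time by changing how the bucket key is derived from the timestamp.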
- phab:T128158#2128397 - Tools web interface for tool authors (Brainstorming ticket) - discussion starting 2016-03-16
- phab:T114560 - provide easier way to contact people abusing resources
- Phab:T87279 - Make OpenStack Horizon useful for production labs
- phab:T106475 - Evaluate a 'cluster solution' for use on Tool Labs
- phab:T107993 - Evaluate kubernetes for use on Tool Labs
- [Labs-announce] [Tools] Kubernetes picked to provide alternative to GridEngine