Toolhub will be built over the course of multiple milestones. The goal of the first milestone is to establish a standard data model for describing tools. The goal of the second and third milestones is to bring Toolhub to feature parity with Hay's Tool Directory. The goal of the subsequent milestones is to add new features that improve Toolhub's usability. The first milestone should be reached by the end of Fiscal Year 2017–18, while the subsequent milestones should be reached by the end of Fiscal Year 2018–19.
The idea is that tool cataloguing is fundamentally a data problem. In order for the right tools to be discovered by the right people, they need to be described and organized in the right way. Everything else follows from there.
What this page is and is not
This roadmap is an attempt to prioritize and organize the work that will go into developing Toolhub. The roadmap is meant to reflect the development team's current understanding of how to best approach the problem of tool discovery. It also represents a vision of what Toolhub should be, based on an effort to synthesize the research and learn from it. This is the plan as of this writing, but it is not meant to be permanent or inflexible. The plan can change as new situations arise and lessons are learned. As the plan changes, this page should be updated as well.
We invite you to get involved in this process by following our updates and discussing the project on the talk page. You can also leave feedback privately by emailing James Hare. If you suspect something is going in the wrong direction, it is better to speak up sooner rather than later.
Milestone 1: Data model
- Status: Done
The data model refers to the pieces of information used to describe tools. This includes the basic facts about the tool, such as its name and who created it, as well as information that helps people who are looking for it.
Our starting point will be the data model used by Hay's Tool Directory, a volunteer-developed directory service. This data model is a good start, but it has some drawbacks. Keywords are freeform text, meaning that there is significant term duplication, with English and non-English terms treated as separate concepts rather than the same one (image vs. images vs. media vs. multimedia vs. fotografia). There is also no guidance as to what kind of information to provide, or how. This hinders tool discovery.
Tool records in Hay's directory are represented through JSON files provided by the tool developers. Each tool record contains a unique tool ID, tool name, description, URL, keywords, author, and a link to the repository where the code is hosted. These files, called toolinfo files, are placed somewhere on the web and then downloaded by the directory software to populate the directory. This approach assumes that records are not centralized. Despite this assumption, over 60% of tool descriptions are provided through two central sources: the Toolforge admin console and an on-wiki list meant to be community maintained. This suggests that even though decentralized tool metadata is valuable, in practice the data is significantly centralized. In fact, the creation of a wiki page to serve as an all-purpose toolinfo record suggests there is at least some demand for the ability to create tool descriptions directly, without hosting a separate file. Because of this, the data model will be split into two parts:
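For illustration, a toolinfo record carrying the fields listed above might look like the following. The field names are indicative only, based on the conventions of Hay's directory as described here; they are not a normative schema.

```json
{
  "name": "example-tool",
  "title": "Example Tool",
  "description": "Finds articles that are missing images.",
  "url": "https://example-tool.example.org/",
  "keywords": "images, media",
  "author": "Example Author",
  "repository": "https://github.com/example/example-tool"
}
```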
- Basic information, including the tool's name, authors, and license. This data can be submitted through a toolinfo file that is crawled by Toolhub, or the information can be submitted directly through Toolhub.
- Tool annotations that cannot be submitted through toolinfo files and can only be submitted or modified on Toolhub.
The Wikimedia Tool JSON schema will include basic information expected of each tool. For reference, the current data model used by Hay's directory has been codified as version 1.0.0 of the Wikimedia Tool JSON schema. In May 2018 a draft of version 1.1.0 was published for public comment. Version 1.1.0 will expand the current data model while being completely backwards compatible, creating a smooth migration path. The final version will be published by 30 June 2018. There may also be work on a backwards-incompatible version 2.0.0, depending on project needs. Note that although this data is described using a JSON file, it will be possible to instead submit it directly to Toolhub, without the need to create a JSON file.
Once Toolhub has ingested a tool record, either through crawling a toolinfo file or by receiving data directly, it will be possible to add additional tool attributes. These attributes will make use of a controlled vocabulary, reducing the risk of keyword duplication and making it possible to translate terms into different languages. Hosting the data in Toolhub itself ensures that the community will always be able to edit it, even if the tool description is otherwise contained in an un-editable toolinfo file. It will also help keep the toolinfo files small and abstract away implementation details that would make the files difficult to edit by hand.
(Note that while the data model may be in two parts, the UI will "hide" this distinction for ease of use.)
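As a toy sketch of the controlled-vocabulary idea above, annotation keywords could be validated against a fixed term list before being stored. The vocabulary terms here are invented for illustration; the real vocabulary has not yet been defined.

```python
# Toy controlled-vocabulary check for tool annotations.
# The terms below are invented examples, not an actual Toolhub vocabulary.
VOCABULARY = {"multimedia", "maintenance", "citations", "anti-vandalism"}


def validate_keywords(keywords):
    """Reject freeform terms that are not in the controlled vocabulary,
    so variants like 'image', 'images', and 'fotografia' cannot
    proliferate as separate concepts."""
    unknown = [k for k in keywords if k not in VOCABULARY]
    if unknown:
        raise ValueError(f"unknown keywords: {unknown}")
    return keywords
```

Because each accepted term is a single canonical identifier, translations into other languages can then be attached to the term itself rather than stored as competing keywords.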
Milestone 2: Initial API and UI; toolinfo crawler
Central to Toolhub will be the API, on top of which the user interface will be built. Aside from it being good practice to separate data transactions from data presentation, this also reflects our priority of gathering, organizing, and distributing high quality data. Part of realizing this vision is being able to submit and request information about tools in more places than just the official UI. For instance, once the API is online, a Wikidata gadget could be written to create an automatically updated list of Wikidata gadgets, allowing Wikidata users to learn about helpful tools without leaving the site. One hundred percent of Toolhub business will be possible through the API alone.
The API and the user interface will be developed in tandem; as API methods are implemented, so will parts of the user interface. The top priority is to achieve feature parity with Hay's directory. This includes:
- Submitting a URL of a toolinfo.json file to be crawled
- Retrieving a list of URLs of toolinfo.json files that are regularly crawled
- Retrieving a list of all tools
- Retrieving records about an individual tool
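The four operations above can be sketched as a toy in-memory model of the API surface. Method names and record shapes here are illustrative assumptions, not the final API design.

```python
class ToolRegistry:
    """Toy model of the Toolhub API operations described above.
    Method names and data shapes are illustrative only."""

    def __init__(self):
        self.crawl_urls = []  # toolinfo.json URLs crawled regularly
        self.tools = {}       # tool ID -> tool record

    def submit_crawl_url(self, url):
        """Submit a toolinfo.json URL to be crawled."""
        if url not in self.crawl_urls:
            self.crawl_urls.append(url)

    def list_crawl_urls(self):
        """Retrieve the list of regularly crawled toolinfo.json URLs."""
        return list(self.crawl_urls)

    def ingest(self, record):
        """Store a tool record, as the crawler would after fetching it."""
        self.tools[record["name"]] = record

    def list_tools(self):
        """Retrieve a list of all tools."""
        return sorted(self.tools)

    def get_tool(self, tool_id):
        """Retrieve the record for an individual tool."""
        return self.tools[tool_id]
```

In the real service each of these methods would correspond to an HTTP endpoint, with the crawler populating records in the background.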
Milestone 3: Annotations
Annotations are tool metadata that extend the basic metadata provided during initial tool registration. They're additional pieces of information that aid in tool discovery. Because it is important to be able to make edits to annotations, they cannot be submitted through toolinfo files; they must be managed directly through the Toolhub API or UI. (However, the UI will abstract away this distinction.) This milestone will focus on the addition of API methods and corresponding UIs for editing tool annotations.
Before annotation support is added to Toolhub, annotation data itself may be collected through an adjunct tool.
Milestone 4: Search
This milestone will focus on implementing Elasticsearch support in Toolhub, allowing tools to be searched based on certain criteria. This includes exposing a search API and adding a search box to the UI.
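As a hedged sketch, a search request might be translated into an Elasticsearch query body like the following. The `description` and `keywords` field names are assumptions about the eventual index mapping, not a confirmed design.

```python
def build_tool_search(text, keyword=None):
    """Build a hypothetical Elasticsearch query body for tool search.
    The 'description' and 'keywords' fields are assumed, not confirmed."""
    must = [{"match": {"description": text}}]
    if keyword:
        # Filter on a controlled-vocabulary keyword annotation.
        must.append({"term": {"keywords": keyword}})
    return {"query": {"bool": {"must": must}}}
```

A search API endpoint would accept the user's criteria, build a body like this, and forward it to Elasticsearch, while the UI search box would call that same endpoint.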
Milestone 5: Direct tool registration
Once the submission of annotations is supported, the next step is to allow tools to be registered directly through Toolhub (as opposed to being crawled from the web). This includes API methods for registering tools, editing tool records (excluding those ingested from crawled toolinfo files), and deleting tool records.
There are additional ideas for making Toolhub useful. At this point they are just ideas, with no commitment to deliver them. We are interested in knowing what you think of them, and we also invite you to suggest other ideas.
- Endorsements: a simple, lightweight way of expressing approval of a tool
- Metrics about tools, including usage
- Recommendations for tools to use based on criteria such as endorsements and metrics
- Requests for new tools to be developed
- Lists, including personal lists that serve as bookmarks, as well as public lists curated by members of the community
- Automatically generated tool records based on web crawling activities, with workflows to integrate "crawled" tools into the directory proper