Our core responsibility is to provide data services at large scale that are more comprehensive, reliable, secure and fast, in partnership with large scale users where that aligns with our mission and principles. We aim to improve the user experience of our indirect users, increase the reach and discoverability of our content, and improve awareness of and ease of attribution and verifiability for our largest content reusers.
There is a very high barrier to entry for using Wikimedia data, across all projects, outside of the common use cases of reading or editing because this content is hard for machines to segment & understand. This affects how far our data reaches beyond our own ecosystem and the scale of impact it has.
It's well known that a few massive companies use our projects' data, making millions of server requests every day. Those companies recognise that without the Wikimedia projects, they would not be able to provide as rich and reliable an experience to their own users. Heavy usage of APIs and data services also impacts community usage of those services. OpenStreetMap faces similar challenges from massive-scale usage of its resources. There has long been a feeling among community members that these companies should do more to reinvest in the Wikimedia communities in return for the benefits they gain from the content and resources they use.
This led to the idea of developing a new approach that is more sustainable in the long term and provides a much clearer and cleaner relationship. Any financial benefit would likely come only from a very small handful of heavy commercial users, with any benefits feeding back into the Wikimedia movement.
As this idea developed, it became clear that it carried a responsibility to democratize our data for organizations without the resources of these largest users, to ensure we are leveling the playing field and not reinforcing monopolies, and to help foster a competitive and healthy internet. The benefits shouldn't just be for startups or alternatives to the internet giants, but also for universities and university researchers; archives and archivists; and non-profits like the Internet Archive, along with the wider Wikimedia movement. The creation of a new data services offering should benefit everyone.
Objectives and Key Results will be published following approval of the Wikimedia Foundation annual plan.
Okapi’s focus is on volunteers, teams, and organizations who want to reuse our content in contexts beyond the article space on our projects, typically at scale: knowledge graphs, search, voice assistants, maps, news reporting, community tools, third-party applications and full-corpus research studies. Augmenting Wikimedia’s many datasets to put structure behind our unstructured content will allow all our content reusers to meet their individual requirements, while also setting us up to build new tools and services in the future, available to everyone. Reusers of our content are looking for three critical components:
- Frequency: Regular current snapshots of Wikimedia projects
- Reliability: Dependable, accessible infrastructure
- Quality: Vandalism-free, “best last revision”
This is in contrast to the Wikimedia API team, whose focus is on volunteers, teams, and organizations looking to access and, most importantly, interact with our datasets. That work includes the majority of community editing tools and will be largely out of scope for Okapi. For more information on improvements to the Wikimedia API, see the project page on the API Gateway.
- Content: Make our movement's content freely available in machine-readable formats for all researchers and re-users.
- Resource-load: Reduce the need for the high-intensity site-scraping that the biggest regular re-users currently direct at our production servers.
- Fundraising: Provide a clearer and more consistent way for the largest re-users to reinvest derived benefits back into the movement, rather than relying only on their occasional altruistic donations, which are variable in size.
- Lane Becker (Business Development Manager) - Advancement
- Ryan Brounley (Product Manager) - Product
- Mat Nadrofsky (Director of Engineering) - Technology
- Naïké Nembetwa Nzali (Senior Project Manager) - Technology
- Joseph Seddon (Senior Community Relations Specialist) - Advancement
In addition to the core team, we are supported by Speed & Function, who are providing contracted engineers. At this early stage in the project, we are not yet sure of the long-term engineering needs, and we want to thoroughly assess the project's ability to become self-sustaining. This way we hope to avoid excessively disrupting current WMF projects or diverting resources from elsewhere.
- Discovery: Understand in depth how companies are currently using WMF data, what they need, and what steps we should take.
- Community: Ensure that the values of the communities and movement are at the core of our activities.
- Business Models: Find an approach that propels our mission, serves WMF, and helps us achieve our goals.
- Create: Build new features that will propel the movement both financially and with useful resources.
- June 2020: Onboarding, researching, and building an experimental prototype to familiarize ourselves with the environment.
- July 2020: Initial discovery research interviews. Work on HTML dumps.
- August 2020: Finalise initial alpha HTML dumps.
- October 2020 and onwards: TBD
Given the nature of the project, primary decision making will rest with the Wikimedia Foundation. We will be seeking extensive community input, in particular from the technical community, throughout the lifetime of the project. Product input and feedback will be gathered through collaboration with colleagues at the Wikimedia Foundation, industry and research partners, technical partners across the movement and, finally, the broader technical communities via Phabricator. Input into the funding development side of the project will follow a similar pattern. We will be gathering input via research interviews, focus groups and feedback here on Meta.
Why are we doing this?
The Wikimedia Foundation offered paid data services from shortly after its inception, providing feeds that enabled third parties to host their own local databases. The creation of this service was what led to the initial hire of Brion Vibber, and it was used to bootstrap the Wikimedia Foundation in its early years. The service was closed to new customers in 2010 and finally decommissioned in 2014, mainly due to lack of maintenance; banner fundraising rapidly supplanting it as a source of funds for the movement; changes in the approaches that large tech companies took to querying our data; and the expiration of the contracts that had been made.
Whilst banner, email and major donor fundraising have successfully grown to support the movement in its work, it has long been recognised that our banner fundraising is highly susceptible to changes in reader behaviour within the broader internet. One of the biggest changes is that an increasingly significant proportion of interactions with Wikimedia content no longer takes place on the Wikimedia websites themselves. Increasingly, our users benefit from our content being integrated into other websites and delivered via virtual personal assistants. Since 2015 the Wikimedia Foundation has identified these trends as something that could severely impact the movement's ability to effectively support itself in its work.
Where has this previously been discussed?
Revisiting large-scale data services to help ensure the success of the movement, irrespective of changing discovery methods of Wikimedia content, was floated as a possible avenue for exploration in 2015 and was discussed on Wikimedia-l in 2016. The idea was put forward by two separate working groups during phase 2 of the movement strategy process, and work on improving third-party API usage was identified twice in the final strategic recommendations. Extensive feedback has been provided during the iteration phases of the strategy process and is forming the basis of our work.
We don't expect to charge for access to our APIs for most, if not all users
The service will be free forever for most users, as it is now with our current APIs. Our current thinking is that high-volume customers place the highest value on reliability, support, high frequency, scale, etc., as opposed to pure access. The aim is to charge for services related to a product, rather than for the product itself. This is a classic open-source business model, and it keeps us aligned with the mission: that access to knowledge remains free for all. The details of what that looks like are uncertain at the moment, but as we continue to move through discovery we hope to define this all a little better.
Where the money will go
Unequivocally to and for the Wikimedia movement, in the same way that the money from banners, email, corporate giving, and the endowment goes to further the Wikimedia mission. All funds collected will either fund Wikimedia programs or help grow the Endowment to support Wikimedia in perpetuity.
How will this be structured?
For now, the Foundation has set up a single-member, wholly-owned US limited liability company (LLC) to provide these services. This lightweight structure allows us to test this service model and gather information from customers without incurring significant start up costs or creating undue reporting burdens. The LLC's activities will be reflected in the Foundation's Form 990 (our annual US tax filing for those unfamiliar) as well as our audited financial report. As this service and our understanding of customer needs evolve, we may pivot to a more robust structure to better protect the Foundation’s operations from various liability exposures. At every stage of development, we will ensure that our legal structure reflects our commitment to reinvest the proceeds from this service into the Wikimedia movement.
Will the community be able to provide input?
Yes. We won't be using a traditional consultation structure as has been utilised in the past. To get community input into the project at the earliest possible date, and to permit the team to move and iterate rapidly, we will be using a series of "sprints" to gather feedback frequently throughout the project, enabling input ahead of key decision points. Feedback will be gathered via a series of interviews, surveys and standard developer feedback mechanisms. Interviews with community members will both feed directly into the project and be used to inform and guide survey questions.
What has already been decided?
What you see on this page, on the technical page on mediawiki.org, and in the associated Phabricator tasks. Beyond that, everything else is still in flux. We hope that community input into this project will be equal and equitable with the other inputs feeding into our discovery, planning and implementation work.