Talk:API Policy Update 2024

From Meta, a Wikimedia project coordination wiki
Latest comment: just now by Sannita (WMF) in topic Comment period is now over

Sublicensing

The only part of this that I could see as being at all controversial is sublicensing. It's not entirely clear to me where the line is between white-labeling the API and "republication of Wikimedia content in accordance with the free licenses that content is licensed under". Maybe the FAQ could be expanded to include some examples of things that would be over the line on that point, and some examples of things that are in accordance with that section. Bawolff (talk) 12:40, 29 August 2024 (UTC)Reply

Thanks, I just saw this question and I will respond when I can. I might have one or two follow up questions as well. SSpalding (WMF) (talk) 14:14, 29 August 2024 (UTC)Reply
There are a handful of different ways to address this question, but maybe I'll try the most straightforward explanation and see if it opens up more or less confusion.
"Republication of Wikimedia content in accordance with the free licenses" is not governed by the sub-licensing section. Instead, this section is about third parties making express or implied promises about access to Wikimedia's servers. As one concrete example, Operator A who makes a tool that uses the APIs can't contractually promise their User B that the tool will function with 100% uptime. Hopefully, this makes intuitive sense since the Wikimedia APIs aren't themselves guaranteed to have 100% uptime for every user at every moment of the year.
That last sentence, starting with "For avoidance of doubt...", is there as a clarification just in case readers believe this section is related to content reuse, which it is not. SSpalding (WMF) (talk) 18:35, 29 August 2024 (UTC)Reply
I'm also struggling with this part, I don't understand why Operator A wouldn't be able to promise Operator B 100% (or whatever) uptime, isn't it Operator A's responsibility to implement caching and fallback mechanisms in the case that Wikimedia is down/unavailable/etc.? What's the motivation behind this part?
Like, I have Toolforge tools that scrape data from the API, cache it, and then present it to users. Pretending I was interested in selling this tool to someone (I'm not) - why is it a bad thing that I host it on my own server and say it'll have 100% uptime? Isn't that my problem?
More realistically, does this mean OpenAI/Anthropic/etc. cannot sell a "Wikipedia plus AI" type plugin that makes API calls to Wikipedia? Legoktm (talk) 22:52, 29 August 2024 (UTC)Reply
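The "caching and fallback mechanisms" mentioned above could look roughly like the following. This is only a minimal sketch, not how any existing Toolforge tool works: `StaleFallbackCache`, its `fetch` callable, and the status strings are hypothetical names invented for illustration. The idea is that the tool serves a fresh copy while the upstream API is reachable and falls back to the last cached copy when it is not, so any uptime the operator promises rests on their own cache rather than on Wikimedia's servers.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StaleFallbackCache:
    """Serve fresh data when possible and stale data when the upstream fails."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical callable that queries the upstream API
        self._ttl = ttl_seconds    # how long a cached value counts as fresh
        self._clock = clock        # injectable so the behaviour can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[Optional[str], str]:
        """Return (value, status); status is 'fresh', 'stale', or 'unavailable'."""
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1], "fresh"
        try:
            value = self._fetch(key)
        except Exception:
            # Upstream is down or refusing requests: fall back to the old copy.
            if cached is not None:
                return cached[1], "stale"
            return None, "unavailable"
        self._store[key] = (now, value)
        return value, "fresh"
```

Returning the status alongside the value lets the tool tell its own users when they are looking at stale data, which fits the point about being honest regarding how a tool actually operates.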
Good question. Maybe there are two ways to answer this.
First, this is a restatement of things that already are true and have been true about the way the Wikimedia SRE team managed the APIs and the way Legal thinks about the APIs. If there has not been any issue with the tool across the history of it existing, then it is unlikely that there would be an issue with the use in the future under this written version of the formerly unwritten policy.
Second, to address the hypothetical, maybe a general proposition first.
This policy is about abuse, particularly: (A) better locating abusive operators (a reminder about compliance with existing user agent naming rules), (B) making it easier to stop a specific abuser without inadvertently blocking non-abusive users (for example, making it more likely we can turn off the specific user agent doing the harm without needing to restrict whole IP ranges or to collect more data about users to identify users).
A white labeled use can easily obscure abusive operators and makes it harder to figure out how to stop the abusive operator without stopping the tool completely. Hopefully, we don't have to stop anything completely that is being operated in good faith. It seems important that other third parties who may rely on the tool aren't (erroneously) made to believe (through legally binding promises from the abuser) that the tool can never be limited. This harms the operators who rely on the tool to the benefit of the abuser. The provision is written in a way to highlight that tool creators should be honest as to the source of data and how a tool actually operates.
If this sounds like it's attending to abstract edge cases, it partially is. So to address your other question, what is the motivation for this? Two things: abuse is an edge case by nature. We do not block or limit most normal API traffic. So what seems prominent in this written policy doesn't reflect a problem that happens very often in practice but also still needs an organized and principled way of dealing with it.
More directly, here's an example that might resonate more. Abuser builds Tool which calls the API an unlimited number of times per second. Non-malicious Users believe the express promise that the Tool can be used an unlimited number of times per second because they aren't made aware that the underlying engine of the Tool is the Wikimedia APIs which, although broadly permissive, are not resourced for unlimited calls per second. In this scenario, it's much easier to intervene at the Abuser level and ask them to fix the tool as well as be honest about the nature of how the tool works (and its real world limits) than it would be to try to find and remedy all the Non-Malicious Users who were promised that they aren't doing anything wrong by using the Tool.
Finally, to address the AI portion, the written version of the existing policy was conceived of before the 2023 AI frenzy. It is meant to be a restatement of what has been happening internally when SRE decides it needs to block a problem. I don't think simply having a tool that calls the APIs is a problem under this provision, whether commercially or not commercially, and AI or not. I hope readers interpret the text as narrowly as the text is written because it's meant to mean exactly what it says. SSpalding (WMF) (talk) 02:33, 30 August 2024 (UTC)Reply
I appreciate the detailed explanation. I think the part that still feels like a disconnect to me is why "sublicense, lease, assign, or guarantee" is the behavior that's problematic rather than the "obscures the identity of the ultimate service provider", which I think has always been the focus, e.g. historically we've had informal(?) policies against live mirrors.
A non-hypothetical example could be T273741, there was a mobile app that was abusing Wikimedia sites (arguably image thumbnails are an API), and didn't reveal to users that it was using Wikimedia sites. SRE blocked it, and my recollection was that there was no discussion on whether sublicensing was taking place :)
IANAL, but I would suggest deleting the first sentence entirely and just replace the section with something like "API clients may not obscure the identity of the ultimate service provider of the APIs (the Wikimedia Foundation), i.e. white labeling is not permissible. For the avoidance of doubt, this term does nothing to limit the use and republication of Wikimedia content in accordance with the free licenses that content is licensed under." Legoktm (talk) 09:58, 31 August 2024 (UTC)Reply
Fair enough! The last thing I'd say is that "sublicense, lease, assign, and guarantee" are all terms that have specific definitions from a legal perspective. They are legal "terms of art," and each word in that first sentence has differently defined characteristics that we hope to highlight. As an example, consider the term "assign". According to Black's law dictionary, assignment is "The act by which one person transfers to another, or causes to vest in that other, the whole of the right, interest, or property which he has... More particularly, a written transfer of property, as distinguished from a transfer by mere delivery."
By saying assign, it allows us to functionally mean all of that above without describing it in a way as confusing as what's written above. Since this provision limits something that rarely comes up except in situations like the non-hypothetical one you brought up, it seems unlikely this would even apply to what most operators / developers do. Therefore, our hope is that these legal terms of art will only matter to the people who are intentionally violating them while trying to exploit a loophole in the language. And for them, I hope the specificity gives them some pause. SSpalding (WMF) (talk) 16:53, 3 September 2024 (UTC)Reply
To ask a slightly more pointed question, is Wikimedia Enterprise currently doing anything that would be prohibited by this policy were it to be done by anyone else? AntiCompositeNumber (talk) 00:15, 30 August 2024 (UTC)Reply
I don't speak for Wikimedia Enterprise, so they can weigh in, but I think this might be a complete answer as-is.
The Wikimedia Enterprise APIs and the public Wikimedia APIs are technically separate things. The Wikimedia APIs are administrated by WMF's SRE team. The Wikimedia Enterprise APIs are administrated by Wikimedia Enterprise.
What's written here is a statement of the practices of the WMF SRE team. SSpalding (WMF) (talk) 07:04, 30 August 2024 (UTC)Reply
Hi @AntiCompositeNumber, I’m not from the Legal department, but I am with the Enterprise team, and since you ask about that specifically I thought I could weigh in.
As @SSpalding (WMF) mentioned, this policy document covers the public APIs and it doesn't relate to the Enterprise APIs. It is worth adding that with the Enterprise APIs, it is possible to negotiate for the service needed for your [organisation’s] use case. This can include speed and data-volume limits, and SLA minimums that far exceed what would be considered “very high” on the public API because they use completely different infrastructure. If a company wants extreme amounts of data, extremely high uptime obligations etc. then they can negotiate to receive that on the Enterprise API. [Although the Enterprise APIs are not relevant to this policy, it is worth noting that when extreme use case users move from using the public APIs to the Enterprise APIs, it allows the public APIs to be more resilient because now they won't disrupt the public version for everyone else.]
The purpose of the “sub-licensing” section of this policy for the APIs is to state that users can’t make promises about the WMF’s infrastructure (e.g. about uptime, speed...) to their own users “downstream”. This is a fact of law anyway - you can’t promise something that you don’t control - but it is in here as well to re-emphasise the point. The contracts for the Enterprise API customers are functionally similar in that regard. LWyatt (WMF) (talk) 16:27, 30 August 2024 (UTC)Reply
These seem to be answers to a different question than the one that AntiCompositeNumber asked. Without saying I agree or disagree, but in the interest of calling a spade a spade, I think the real question here is: Is this clause an attempt to give Wikimedia Enterprise an effective monopoly on re-selling Wikimedia APIs that they previously would not have enjoyed? This is potentially a hot-button issue, because from a certain perspective it can be viewed as the start of an Enshittification process at Wikimedia. A very small start to be sure; the fear would be that the slope is slippery. Bawolff (talk) 20:11, 5 September 2024 (UTC)Reply
Sorry for the long-winded answer to what feels like it's a direct question. Here we go!
This policy, as currently written, isn't an attempt to "give" anyone, Wikimedia Enterprise or otherwise, anything new. This is generally a restatement of the current status quo of SRE. Specifically, this sub-licensing provision is a restatement of the current state of the law.
Said differently, it's already improper by default for developer X to promise things that X has no right to promise to user Y. If X currently does that to Y, then Y would have a legal cause of action against X. I'll call this the "Guarantee Problem" because I'll refer to it later.
What writing this down in this policy is hopefully making clear is that Wikimedia has a right to intervene against X in this situation. As I described in another response -- it's much easier to deal directly with X who is intentionally creating a problem than it would be to deal with an unbounded number of Y users who actually don't know they're creating a problem because they were lied to. Wikimedia is in a unique situation because (a) people's ability to use the Wikimedia APIs is already so broadly permissive (as compared to many other API providers), and (b) we generally don't like blocking uses when we don't have to (whereas other API providers often empower themselves to block any downstream use for any reason).
To start trying to answer your question as directly as possible: this type of thing is generally left unsaid by API providers because if X did this outside of the Wikimedia context, then that would obviously just get X banned and all of their users permanently blocked once found out. We currently don't do things like that if we can help it, so here Wikimedia is trying to repeat the obvious in the hope that the type of people who would intentionally violate this will look at it and realize that this is actually something that we care about too, even if it won't get X instantly banned from a technical perspective.
This is the important part: the reality is that bad faith developers sometimes assume that the only true limits to respect are the overreaches that get them technically blocked. If we don't say anything to contradict that out loud like what's written here, then that assumption might not be challenged.
We tried very purposefully to word this in a narrow way that would mirror a non-controversial position: Developer, please do not lie and say you control the APIs when they are here to benefit the community. This is bad because every downstream user you lie to creates a potentially unbounded number of new, individual site reliability problems from people who don't think they're doing anything wrong. This is the point of this section from a legal point of view.
As part of the Legal department, I can only discuss aspects of this that touch the Legal department. I can say that the current, standard Enterprise contract also doesn't allow buyers to be dishonest about the nature and functionality of their own Enterprise APIs to downstream users. Hopefully, this tracks common sense because this "Guarantee Problem" is a universal problem that no-one who provides an API that gets repurposed downstream would want to deal with.
I can also say definitively that this document governs the public APIs. The Enterprise APIs are governed by the Enterprise terms of use (for free users) and negotiated contracts for paid users. The best way to understand what is allowable or not using the Enterprise APIs is to sign up for the free tier of Wikimedia Enterprise. From a legal entity point of view, Wikimedia Enterprise is a separate company from the Wikimedia Foundation. Enterprise users contract with Wikimedia Enterprise.
Enterprise is (1) wholly owned by WMF, and (2) I am one of the people on the WMF Legal team that shares responsibilities working with Enterprise. As part of that team, I help negotiate the contracts to maintain legal, ethical, and policy consistency across the entities.
I don't know if this addresses your question or simply opens up new ones, but I'm trying to answer in good faith in a way that doesn't center on theoretical speculations about Enterprise. SSpalding (WMF) (talk) 13:03, 6 September 2024 (UTC)Reply
I think we are all on the same page that whatever agreements Wikimedia enterprise has with the clients of Wikimedia enterprise, is irrelevant to this policy. The part I think is relevant is wikimedia enterprise as a client of SRE apis and not wikimedia enterprise as an API provider. Wikimedia enterprise white-labels SRE apis, if I understand the term correctly. Presumably they do this via special agreement between WMF and Wikimedia enterprise. If some randoms on the internet created their own version of Wikimedia enterprise, backed by WMF SRE apis in the same manner as Wikimedia enterprise is, under this policy, they would not be allowed to do so, if I understand correctly(?) Perhaps this was always the case, so there is nothing new here. In any case, such a restriction does seem very beneficial to Wikimedia enterprise.
I agree 100% that downstream users should not make promises on behalf of WMF about WMF uptime or anything else really. The API is free, it is only fair that WMF is not responsible for downtime or changes to the API. Downstream users should use it at their own risk.
I think the part I am truly worried about is the distinction between "white-labelling" the API, and simply using the results of it in your product.
To make this concrete, let's take a real example that is currently out there in the world. There is a software product I am aware of, that among other things, lets you build an image library. In addition to using your own images, it has an option to expand your images with images from other websites in the world, including Wikimedia Commons. It does this via the Wikimedia API. If the API goes down, some things may be cached, but generally all the commons images will disappear in short order. The software shows all appropriate copyright information of the image. It also shows a message saying that the image comes from Wikimedia Commons (It does not say that the API used is operated by the "Wikimedia Foundation", it simply says that the image came from "Wikimedia Commons") and links to the image page on commons.
Is this permissible under this policy? I would generally assume yes. While the API going down may make the userbase of this software mad, ultimately they don't make any promises of availability so I think the no assign/lease/guarantee/sublicense part is fine. It does not disclose the operator of the original API, but it at least credits Wikimedia Commons as the source, which I presume is close enough but that is pretty unclear from this policy (?)
This software package also has an API. That API reports information about images the software knows about, both the user's own images as well as those from Wikimedia Commons. In essence it repackages the imageinfo part of WMF's API. Is that still ok? It does add a field for the source of the image to distinguish between the user's own images and those coming from commons, but otherwise essentially replays the results of WMF apis related to images without mentioning that it is doing so.
This is what that piece of software does in its default configuration (assuming commons is enabled as an image source, which it isn't by default). There is an alternative configuration, while not as common, still is used a fair bit where instead of saying the image is coming from Wikimedia Commons, it simply says that it came from a shared repository. Is this still ok? The word wikimedia is no longer mentioned anywhere in the user interface, although there is still a link back to the image's page on commons. Bawolff (talk) 20:54, 6 September 2024 (UTC)Reply
Some observations have direct answers and some don't. I'll try to address the background in parts 1 and 2 and respond directly in parts 3 and 4.
1. The narrow situation that this white labeling language anticipates: a tool that both white labels and "obscures the identity of the ultimate service provider." Obscure is an important word here. It is used to purposely suggest that an intent to confuse or cover up is part of the problem. And that is a real problem for the Wikimedia Foundation as managers of API resources. It's also a real problem for the community because if end users are purposely misled about the nature of where the data originates from by a developer, then the end user can't follow the downstream open license reuse rules correctly.
There are an indefinite amount of services that integrate the public APIs as a mostly-invisible part of their core functionality. I say indefinite because there's no true way to definitively track this. That, in itself, is not white labeling. I don't speak entirely for SRE, but from my understanding, the only time that would ever be a notable activity is when it is breaking something.
2. Are there exceptions to the policies written here? Yes and no. No, because we would like people to follow these rules. That's why we wrote them down here.
But yes, insofar as all of the prohibitions are relatively subjective. Therefore there's always a question around what's the context these prohibitions are being enforced in. For example, before limiting something SRE generally asks themselves, is X use actually causing a real-world problem or not?
Also yes, insofar as many of the prohibitions have a "scienter" (intent) element to them. So white labeling is only a problem when it "obscures," and it's worded this way because generally, people without ill intent have no problem being honest about what they're doing. If the Legal Department were looking to enforce this or not, we would ask the question about whether ill intent is being demonstrated and that would inform our final position.
Finally, yes there are exceptions because the idea of exception requests is written into the policy. If anyone believes they are doing something so extreme that it will be a noticeable resource problem, then there's a written invitation in the policy for the developer to tell us about it. It is helpful for us to get notice about what is happening so we can make sure we can actually help make it work and not break something.
3. So to answer your question using the framework above, Enterprise can call the public APIs to collect openly licensed data to then dole out in its ecosystem in the same way others can. Subjectively, what Enterprise does doesn't create problems with the normal administration of the APIs. There's also no ill intent behind their calls to use the free licensed content. And finally, Wikimedia SRE has notice about what Wikimedia Enterprise does.
4. I asked Enterprise to give feedback on your example, and this is what they said:
"So the image example he gives says if the Commons images goes down except for the cached images nothing new will show. That is exactly how it will work with us too. If the Foundation APIs are removed, Enterprise will not have access to new data as well and will only be able to show cached/stale data." SSpalding (WMF) (talk) 03:01, 10 September 2024 (UTC)Reply

What is a "high rate"?

In the section that states "When using Wikimedia APIs, an operator must not...Request data at a high rate, far beyond common use cases" - it's unclear what "a high rate" might be, and I don't think we can expect a reader to know what "common use cases" are. The information at API:Etiquette#Request_limit explains this somewhat, so it would be good to either link there from this section of the policy, or provide the same type of explanation in this policy doc. Namely: "The exact rate limit being applied might depend on the type of action, your user rights and the configuration of the website you are making the request to. The limits that apply to you can be determined by accessing the action=query&meta=userinfo&uiprop=ratelimits API endpoint." TBurmeister (WMF) (talk) 15:40, 29 August 2024 (UTC)Reply
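For readers who want to see what the quoted endpoint looks like in practice, here is a minimal sketch of querying it with Python's standard library. The query parameters come from the API:Etiquette text quoted above; the `format=json` parameter, the choice of wiki in `API_URL`, the `USER_AGENT` string, and the helper names are assumptions made for illustration, and the response layout handled by `extract_ratelimits` reflects my understanding of the Action API output rather than anything stated on this page.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.mediawiki.org/w/api.php"  # assumption: any wiki's api.php would do
# Hypothetical tool name and contact details, for illustration only.
USER_AGENT = "ExampleTool/0.1 (https://example.org/tool; tool-owner@example.org)"


def build_ratelimits_url(api_url: str = API_URL) -> str:
    """Build the URL for action=query&meta=userinfo&uiprop=ratelimits."""
    params = {
        "action": "query",
        "meta": "userinfo",
        "uiprop": "ratelimits",
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_ratelimits(response: dict) -> dict:
    """Pull the ratelimits mapping out of a decoded response (assumed layout)."""
    return response.get("query", {}).get("userinfo", {}).get("ratelimits", {})


def fetch_ratelimits(api_url: str = API_URL) -> dict:
    """Ask the wiki which software-defined rate limits apply to the caller."""
    request = urllib.request.Request(
        build_ratelimits_url(api_url), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return extract_ratelimits(json.load(reply))
```

Calling `fetch_ratelimits()` makes one network request; the result depends on whether the caller is logged in and on that wiki's configuration, so no sample output is shown here.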

I appreciate the feedback. I think referencing the API etiquette policy is the closest thing that we might get to a generalizable description of what a "high rate" might look like in practice. We can do that in the final draft. The type of "high rates" that have historically caused resource issues are exceptions to the normal orderly use of the APIs, so it's hard to capture outliers in a simple way. Also, even if it were possible, since this is written focused on abuse, people tend to get very creative and unpredictable when it comes to devising new ways to abuse.
SSpalding (WMF) (talk) 18:14, 29 August 2024 (UTC)Reply
I think it would be useful to minimally define "high rate" and "common use cases" somewhat formally/in more technical language in this document. I think we can all agree calling the edit API once is a "common use case" that does not request data at a high rate. However, what about a long running query in Toolforge, or a bot that lists all members of a particularly large category? Would these be considered in violation of the terms of use? Sohom (talk) 03:15, 30 August 2024 (UTC)Reply
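To make the "bot that lists all members of a particularly large category" case concrete, here is a sketch of how such a bot typically pages through results one request at a time using the Action API's continuation mechanism. Whether this pattern counts as a "high rate" is exactly the open question in this thread, so nothing here should be read as a statement of policy. The `fetch` callable is a hypothetical stand-in for whatever HTTP client the bot uses, and the batch size is an assumption.

```python
from typing import Callable, Dict, Iterator


def iter_category_members(fetch: Callable[[Dict[str, str]], dict],
                          category: str,
                          batch_size: int = 500) -> Iterator[str]:
    """Yield page titles in a category, following 'continue' blocks serially.

    `fetch` is a hypothetical callable that sends the given query parameters
    to a wiki's api.php and returns the decoded JSON reply.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(batch_size),
        "format": "json",
    }
    while True:
        reply = fetch(dict(params))  # pass a copy so callers can log requests safely
        for member in reply.get("query", {}).get("categorymembers", []):
            yield member["title"]
        continuation = reply.get("continue")
        if not continuation:
            break
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params.update(continuation)
```

Because each request waits for the previous reply, the request rate is bounded by round-trip time; a bot that fans the same work out across many parallel connections would look quite different from the server's side.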
There was some debate as to whether to include numbers but for a variety of reasons we chose not to do so. By far, the most important reason is that this is a statement of current practices. And under current practices, these aren't qualities that have formal definitions internally. I am in the Legal department, so I do not speak directly for SRE when I say this, but problematic usage often makes itself known because it looks like an obvious outlier. The APIs are resourced to be comfortably used by some of the largest websites in the world, so one has to be engaging in something very anomalous to be considered high use in that context.
That said, we are interested in anything that makes what's written here easier to read or understand because it will make it more likely that it will get followed, and less likely that it will have a chilling effect on normal uses. SSpalding (WMF) (talk) 06:21, 30 August 2024 (UTC)Reply
It's probably worth noting that the software-defined ratelimits are a hard limit, not an entitlement (see also phab:T370304). AntiCompositeNumber (talk) 21:07, 29 August 2024 (UTC)Reply
Agreed in principle. I'll think about the phrasing as we make a final draft. SSpalding (WMF) (talk) 06:01, 30 August 2024 (UTC)Reply
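One way a client can treat the limits as a ceiling rather than an entitlement is to slow down whenever the server pushes back. The following is a sketch under assumptions, not a description of what Wikimedia's servers do: it assumes the client sees a rate-limit or overload response that may carry a `Retry-After` header with a number of seconds, and the function name and default values are hypothetical.

```python
import random
from typing import Optional


def retry_delay(attempt: int,
                retry_after: Optional[str] = None,
                base: float = 1.0,
                cap: float = 300.0,
                jitter: float = 0.0) -> float:
    """Return how many seconds to wait before retrying a rejected request.

    A numeric Retry-After value from the server takes priority. Otherwise the
    delay doubles with each attempt (exponential backoff), capped at `cap`,
    plus optional random jitter so many clients do not retry in lockstep.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # not a plain number (e.g. an HTTP date); fall back to backoff
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, jitter)
```

A caller would sleep for `retry_delay(attempt, header_value)` seconds after each rejected request and give up after a fixed number of attempts instead of retrying indefinitely.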

There is mw:API:Ratelimit and mw:API:Ratelimit/Wikimedia sites and also mw:API:Etiquette. It would be nice to have a see-also section in the new API Policy and links going both ways. Even if it is intentionally not linked outside of Foundation wiki, maybe you could add a soft redirect to mw:API:Ratelimit?

Nux (talk) 20:23, 30 August 2024 (UTC)Reply

Seems useful! I will work with the Product and Tech team to coordinate on whether those pages are up to date and reflect current realities. After that, I'll determine if referencing them will make things easier to understand here. SSpalding (WMF) (talk) 01:02, 31 August 2024 (UTC)Reply

Retiring APIs

"The Foundation may provide notice regarding updates and deprecations of APIs to the contact information that is provided per the User Agent requirements."

In line with mw:Stable_interface_policy, and its predecessors, this needs to change. This needs to be "The Foundation must provide notice regarding updates and deprecations of APIs to Tech/News. The Foundation may provide these notices to the contact information that is provided per the User Agent requirements."

Snævar (talk) 20:14, 2 September 2024 (UTC)Reply

@Snævar Thanks for the feedback. What's written here currently is not all-inclusive, so it should be read in the context of other written documents. This is meant to complement what already exists, not to supersede it.
With that in mind, by saying that "we may provide notice" here, that doesn't mean there will be any change in providing notice on Meta / to that list if that is what is currently being done.
Instead, the very narrow thing that is being said here is that WMF can give notice of updates and deprecations directly IF the developer provides contact information to allow us to do so. This is not taking away anything from whatever the current status quo is. Instead, it's trying to highlight an existing incentive of providing an accurate and well described user-agent. SSpalding (WMF) (talk) 02:40, 3 September 2024 (UTC)Reply
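As an illustration of "an accurate and well described user-agent" that includes contact information, here is a small sketch of a helper that assembles one. The layout (client name and version, then contact details in parentheses, then an optional library token) follows my understanding of the Wikimedia User-Agent policy; the function name, the validation rule, and the example values are hypothetical.

```python
def build_user_agent(tool: str, version: str, contact: str, library: str = "") -> str:
    """Assemble a descriptive User-Agent string that carries contact details.

    Produces: "<tool>/<version> (<contact>) <library>"
    The contact check is a deliberately loose assumption: it only requires
    something that looks like an email address or a URL.
    """
    if not tool or not version:
        raise ValueError("tool name and version are required")
    if "@" not in contact and "://" not in contact:
        raise ValueError("contact should include an email address or a URL")
    user_agent = f"{tool}/{version} ({contact})"
    if library:
        user_agent += f" {library}"
    return user_agent


# Hypothetical example values:
# build_user_agent("ExampleBot", "1.2",
#                  "https://example.org/bot; bot-owner@example.org",
#                  "python-urllib/3")
```

The resulting string would be sent as the `User-Agent` header on every request, which is what makes it possible for someone to reach the operator directly instead of blocking a wider range.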

Before disallowing "Request data at a high rate" please enable acquiring data dumps

Not only is it unclear what is meant by that (see thread above), it's also a problem for many useful applications that require e.g. large parts of Wikimedia Commons to be downloaded. I don't think this is a good policy until there is a way for people to acquire physical dumps instead of making many requests at a high rate, and this could require payment (payment above costs for that would be fine), maybe as a new part of Wikimedia Enterprise. Please see the proposal at m:Community Wishlist/Wishes/Physical Wikimedia Commons media dumps (for backups, AI models, more metadata). Prototyperspective (talk) 22:46, 3 September 2024 (UTC)Reply

First off, I think there's value in tackling that Commons wishlist item that you brought up. I'm also sure there are lots of other issues like that which make users reliant on aggressive scraping and other inefficient ways of trying to access data. Ultimately though, the wishlist is the place for suggesting more efficient ways for people to access data since the Legal department doesn't control those things.
This policy is currently a statement of what WMF already does. It's not written to disallow new things. So thankfully, the kinds of useful applications that you mention that exist right now which download large amounts of data have been able to navigate these existing rules.
Finally, unrelated to this policy, I personally agree with your suggestion that Wikimedia Enterprise is a place where the Wikimedia Movement can potentially find novel ways of increasing accessibility to the content on the projects including Commons. I've passed along your observation and the link to the wishlist item to them. SSpalding (WMF) (talk) 04:56, 5 September 2024 (UTC)Reply
Alright, thanks for that and for clarifying! Prototyperspective (talk) 09:56, 5 September 2024 (UTC)Reply
I feel it would be easier to make policy around that specific usecase if there was actually a concrete example of someone who wanted to do this. All the discussion around this is very theoretical. It is unclear if anyone actually exists who wants this. Bawolff (talk) 20:14, 5 September 2024 (UTC)Reply

Comment period is now over

Hi all, first of all thank you very much for your invaluable opinions and contributions to the present page, and to all users who requested clarifications.

The comment period on this page is now over, since no new conversations were had in the last few days and a suitable time has now elapsed since its start. In the next few days, we will publish the relevant content of the page to the relevant Foundationwiki page, after considering how to integrate suggestions from here.

Thank you again for your interventions! Sannita (WMF) (talk) 13:59, 13 September 2024 (UTC)Reply