Talk:Wikimedia Enterprise

From Meta, a Wikimedia project coordination wiki

The following Wikimedia Foundation staff monitor this page:


This note was updated on 06/2025

Is there a discussion group for support?

Hi, I'm new to the enterprise APIs, and just did a project download for enwiki_namespace_0 but many of the articles contained are not the latest versions. One that you can see since it's near the top is for Athi,_Kenya. When I look at https://en.wikipedia.org/w/index.php?title=Athi,_Kenya&action=history I see there's a new version which adds the redirect where the project download has this version: "date_modified": "2023-01-30T04:42:08Z",

This is the project download API I'm calling:

NAMESPACE=enwiki_namespace_0
curl -L -H "Authorization: Bearer $WIKIPEDIA_ACCESS_TOKEN" \
 "https://api.enterprise.wikimedia.com/v2/snapshots/${NAMESPACE}/download" \
 --output "${NAMESPACE}"

If there's a better place to ask, let me know. Thanks! Rcleveng (talk) 15:39, 7 February 2025 (UTC)
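One way to check which revision a downloaded snapshot holds for a given article is to scan the archive for its "date_modified" value. The following is only a minimal sketch under assumptions: it supposes the tar.gz contains newline-delimited JSON files with one article object per line, and that each object carries its title in a "name" field. Only "date_modified" is confirmed by the post above; the helper name `find_date_modified` is hypothetical.

```python
import json
import tarfile


def find_date_modified(snapshot_path, title):
    """Return the "date_modified" string for `title`, or None if not found.

    Assumes (not confirmed by this thread) that the archive holds
    newline-delimited JSON files and that each article object stores
    its title under the "name" key.
    """
    with tarfile.open(snapshot_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and other non-regular entries.
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            for raw_line in handle:
                # Ignore blank lines between records.
                if not raw_line.strip():
                    continue
                article = json.loads(raw_line)
                if article.get("name") == title:
                    return article.get("date_modified")
    return None
```

With the file saved by the curl command above, a call such as `find_date_modified("enwiki_namespace_0", "Athi, Kenya")` would return the timestamp stored in the snapshot, provided the title is spelled the way the snapshot stores it.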

Hello @Rcleveng - I hope you're finding the "snapshots" dataset useful for your needs.
You can find the public helpcenter for technical enquiries about the Enterprise API on its dedicated website: helpcenter.enterprise.wikimedia.com. In the "What do you receive in the Snapshot API?" answer it specifies that:
Snapshot API will return a tar.gz snapshot file of a project as it was at midnight UTC the day before the request and, for free accounts, refreshes twice-monthly on the 2nd and 21st of every month. It contains all of the current articles in each supported project at the time of file creation.
The various formats and refresh rates of data that are available at no cost are described on our meta page under "Access".
Finally, for direct technical support, you can log in at https://dashboard.enterprise.wikimedia.com/dashboard and create a new support ticket.
-- LWyatt (WMF) (talk) 16:01, 7 February 2025 (UTC)
Thank you @LWyatt (WMF)! I'll raise a ticket there.
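The twice-monthly schedule quoted above (the 2nd and 21st of every month for free accounts) determines the newest data a free-tier snapshot can contain. The sketch below works out the most recent refresh date on or before a given day. It assumes the refresh lands on exactly those calendar days in UTC and is available the same day, which the help center text does not spell out. The function name `last_free_refresh` is hypothetical.

```python
from datetime import date

# Free-tier refresh days of the month, as quoted from the help center.
FREE_REFRESH_DAYS = (2, 21)


def last_free_refresh(today):
    """Return the most recent free-tier refresh date on or before `today`."""
    # Refresh days of the current month that have already been reached.
    reached = [d for d in FREE_REFRESH_DAYS if d <= today.day]
    if reached:
        return today.replace(day=max(reached))
    # Before the 2nd: fall back to the last refresh day of the previous month.
    if today.month == 1:
        return date(today.year - 1, 12, max(FREE_REFRESH_DAYS))
    return date(today.year, today.month - 1, max(FREE_REFRESH_DAYS))


if __name__ == "__main__":
    # The download in this thread was discussed on 7 February 2025.
    print(last_free_refresh(date(2025, 2, 7)))  # prints 2025-02-02
```

Under these assumptions, an edit made after the returned date would not be expected to appear in a free-tier snapshot until the next refresh.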

Latest Release: Parsed Wikipedia References with Quality Scoring Models

The new Parsed References feature in Structured Contents provides parsed inline citations and references from Wikipedia articles in a consistent JSON format. The parser's output maintains a strong connection between the citation and the content it references by linking them at the paragraph level, ensuring context is preserved.

Additionally, references are structured where possible while preserving the text as it appears on the page, offering flexibility for reusers to adapt the data to their specific needs.

The new Reference Models feature delivers two Machine Learning scores for Wikipedia References: Reference Risk and Reference Need. When an article is updated, the ML models calculate a score to help editors and reusers understand more context about the article changes and how they affect the article’s overall verifiability and reliability.

Learn more about this release on our blog: https://enterprise.wikimedia.com/blog/parsed-references-with-scoring-models/. Wikimedians can also access this beta release via their accounts on Wikimedia Cloud Services. SDelbecque-WMF (talk) 21:03, 27 March 2025 (UTC)

Quarterly product update

Hello everyone! If you're interested in exploring Enterprise's latest launches, I've just published the Quarterly update for Jan-March, 2025. We invite you all to check it out here --JArguello-WMF (talk) 14:28, 1 April 2025 (UTC)

Missing dumps for 2025-04-01

I noticed there are no recent enterprise dumps available: https://dumps.wikimedia.org/other/enterprise_html/runs/20250401/ (folder is empty).

Will they be available any time soon? (ping @LWyatt (WMF)) Prof.DataScience (talk) 11:09, 8 April 2025 (UTC)

Hi @Prof.DataScience - there's an updated information text on that dump's information page which notes that:
"...as of 03/24/2025, are no longer replicated here. If you are in need of recent runs, dumps of article change updates, or the ability to query individual articles from the dumps, visit Wikimedia Enterprise to sign up for a free account. Alternatively use your developer account to access APIs within Wikimedia Cloud Services."
The "20250401" run folder shouldn't exist; I'll get that blank page removed so as not to cause future confusion. LWyatt (WMF) (talk) 11:40, 8 April 2025 (UTC)
Empty dirs are still there. Could they be removed please so that it is easy to find the latest archived version? Mitar (talk) 07:46, 5 May 2025 (UTC)
(There are now two. I hope nothing is being deleted because of those empty dirs being created.) Mitar (talk) 07:47, 5 May 2025 (UTC)
These folders are simply appearing automatically, and in error. The Enterprise team does not control this page but we have logged the issue as a bug needing fixing. LWyatt (WMF) (talk) 07:55, 5 May 2025 (UTC)

Dataset published on Kaggle

[TL;DR – the beta dataset is being shared in a new place but is neither new nor a reaction to AI scrapers. We still want people to give feedback on it though!]

Last week, we released our “Structured Contents” dataset on Kaggle (our blog post announcement; Kaggle’s announcement). This is part of an early beta release that we’re proactively sharing with test partners and across open platforms to engage a broad range of commercial, academic, and volunteer users. Our goal is to gather feedback while refining the dataset for future production release.

This dataset was first openly published in September 2024 on Hugging Face (blog post announcement; talkpage notice), alongside the announcement of expanded free accounts. That update increased access to include 5,000 monthly On-demand API requests, replacing the old trial version that offered only a limited number of free requests. The Structured Contents articles endpoint is included as part of this free access.

Since last week, the Kaggle data release has garnered some media attention (including [1], [2], [3], [4], etc.). The media stories led to additional awareness of the Enterprise API services – we saw our biggest ever traffic day as a result! However, many of the media articles wrongly conflated this release with a different blog post from two weeks ago – which discussed the heavy toll on WMF infrastructure by bots scraping data to train LLMs. Unfortunately, no journalist actually confirmed this connection with us before publishing their story. As we continue to submit correction requests (with varying degrees of success), we have had to clarify several misconceptions that arose in the media narrative:

  • Although the dataset is indeed useful for training AI models, re-publishing this beta dataset on Kaggle (which was already available on Hugging Face for 6 months) is not a reaction to the impact of scraping on our infrastructure, nor is it an attempt to “fend off” AI scrapers or “get [them] off our back”. Equally, this re-publication is neither “because” of that scraping activity nor “a solution” to it. It is trying to help developer communities through cleaner and more efficient data procurement of an early beta format.
  • Kaggle is not “paying for the data” as one article previously stated. The dataset allows developers to access Wikimedia data in a new machine-readable format. However, the content continues to be both freely-licensed and freely-accessible (gratis and libre). Since last year, the Wikimedia Enterprise team has been testing this product via external releases, to help us get wider feedback and identify use cases to aid in development decisions to improve the product.

We hope to use this increased interest in the beta release to gather more useful feedback on the data structure itself as we turn the beta into a production release, increase the number of Wikipedia language editions it covers, and ensure its utility for more of our service’s users (at the free and paid tier alike). LWyatt (WMF) (talk) 09:08, 23 April 2025 (UTC)

FY25-26 Annual Plan for Enterprise & Tech Partnerships

Hello all, As part of the overarching 2025-26 WMF Annual Plan, there are also a lot of extra details specifically about Wikimedia Enterprise and also the newly formed Tech Partnerships teams which will not be included in that document – simply for reasons of brevity!

So, in order to still give visibility/transparency on the details about these specific areas of Wikimedia Foundation work, we've published this at:

Wikimedia Enterprise/FY25-26

This includes the Strategy & Goals, as well as the Objectives & Key Results of both teams.

Feel free to put questions/comments about this information here. LWyatt (WMF) (talk) 13:42, 14 May 2025 (UTC)

Who are Wikimedia Enterprise's customers, and how much did they pay in the previous 2 years? Ganesha811 (talk) 03:55, 15 May 2025 (UTC)
Hi @Ganesha811,
With regards to the first question, I refer you to our FAQ item: "who are the customers?". The short answer is that Google was announced as the first customer a few years ago, but since then we don't generally make an announcement. Sometimes we make a blogpost because of a specific new category of customer or use-case that's interesting for that industry to know about. You can see those here. Consistent with the privacy culture of the way WMF treats both donors (anyone can give money anonymously) and readers/editors (anyone can read or contribute to Wikimedia content anonymously), it is the choice of the customer if they wish to remain anonymous. Nevertheless, all large potential customers have precisely the same requirements as large donations [requiring specific WMF Board of Trustees oversight] as per the "WMF Board Statement on Wikimedia Enterprise revenue principles".
For the second question, you might be interested in the details provided on our FAQ item: "How much money will this raise?". But more specifically, you can find all the annual financial reports here. The most recent report declared annual revenue of $3.4M and the one before that was $3.2M. There is significantly more nuance and detail in the report itself, but it's worth quoting this sentence specifically: "We are pleased to share that projections for fiscal year 2024-2025 show that revenue from new customer contracts are likely to both exceed projected expenses and repay the initial investment from previous fiscal years." That next report is due in approximately November.
I hope this helps.
LWyatt (WMF) (talk) 10:18, 15 May 2025 (UTC)
So there are 5 known customers (Google, Internet Archive, Ecosia, Pleias, & ProRata.ai), but in general, the WMF has chosen not to disclose its customers or their individual payments, nor will it in the future. I ask because the WMF is a non-profit organization whose work is made possible by millions of volunteers, and yet Enterprise is effectively a commercial B2B service. I don't object to it existing and I understand why it was created, but it seems to me it should be more transparent than other parts of the WMF, not less. Ganesha811 (talk) 11:28, 15 May 2025 (UTC)
There's also "Yep.com" which was listed in one of those annual reports. The financial transparency provided by the Enterprise project specifically is significantly more than is either i) required by law, or ii) provided by other individual specific sections of the WMF's revenue. We made sure in the team principles document to include "The publication of overall revenue and expenses, differentiated from those of the Wikimedia Foundation in general, at least annually". Obviously the Enterprise revenue/expenses are also included in the overall WMF audit report etc, but this extra detail is specifically to allow everyone to see what is happening about this part of WMF finances as easily as possible. So, yes, to your point – this is "more transparent than other parts of the WMF, not less". As for itemising individual payments: as I said in the previous message, we can only enforce transparency upon ourselves, we cannot enforce transparency upon others. Especially since that would conflict with the whole Wikimedia movement's culture of privacy: Anyone can read & contribute to Wikimedia projects, and/or give money to WMF or Affiliates, anonymously - if they so wish. LWyatt (WMF) (talk) 13:38, 15 May 2025 (UTC)
Thanks for the answer. The culture of privacy makes perfect sense where people are making a charitable contribution. But these are not charitable contributions - they are commercial transactions. Is there any reason to believe any company paying for Wikimedia Enterprise would object to the world knowing it - and even if they did, shouldn't the WMF take action to disabuse them of an expectation of secrecy? I take your point that this may be somewhat more transparent than other parts of the WMF finances, but there's good reason for it to be maximally transparent - the WMF is selling Wikipedia data, generated by its volunteer editors. What are the anticipated downsides to simply publishing a list of Enterprise clients every year or quarter? Ganesha811 (talk) 19:49, 16 May 2025 (UTC)
I would like to quickly clarify your statement that "the WMF is selling Wikipedia data". The WMF is not the owner of the contents of Wikimedia projects – its collective authors are – and no one can sell something they don't own. What is being sold here is access to a dedicated service that is designed to suit the needs of high-volume data users of Wikimedia content. Those users (e.g. search engines, AI companies) are already and will continue to be using Wikimedia content for their commercial purposes, and this is their right in accordance with the CC BY-SA 4.0 and GFDL etc. licenses. The Wikimedia Enterprise project is an API/Dataset service: it sells access to dedicated "pipes"; but the "water" is the same, and that water is free for everyone.
Yes – we have always, and will continue to, state our preference for public acknowledgement of any customer contract of the Wikimedia Enterprise API service. This is consistent with how major donors to the WMF [some are individuals, some are philanthropic organisations/foundations, some are commercial companies] are also encouraged to publicly show their support. But some of them choose not to, and that is their right. Nevertheless, as I have stated above, all large donations and commercial contracts are subject to precisely the same oversight rules by the Board of Trustees. LWyatt (WMF) (talk) 20:35, 18 May 2025 (UTC)
But my point is that large donations and commercial contracts are different, and should be treated differently and subject to different rules. I know that's not up to you personally - it seems like a question for the Board. As to whether Enterprise is "selling Wikipedia data", your metaphor seems like a distinction without a difference. Managing the update process (software), flow and visibility of Wikipedia data is what the WMF does, at its core. Again, I don't mind Enterprise existing, but it should be maximally transparent. Ganesha811 (talk) 18:09, 2 June 2025 (UTC)
The distinction between a contract for access to a service, and a content license, is indeed quite significant – legally (both in terms of contract and copyright law) and culturally, in terms of the way that the whole Wikimedia ecosystem operates. It is true that many potential customers begin their conversation with the WMF asking for a "license to reuse wikipedia" within their own product/service. We have to gently explain that they already have that permission – as part of CC. The things that we can negotiate in a paid-for contract are things like an SLA (Service Level Agreement).
Yes I [personally] agree it would be nice if multinational commercial organisations would be more transparent about their finances – especially their taxes! – but we [professionally] can only enforce transparency upon ourselves. As mentioned above the WMF policy, and the Enterprise procedure/practice, is for the disclosure of the customer relationship as part of our default contracts. LWyatt (WMF) (talk) 10:30, 3 June 2025 (UTC)
Bearing in mind that neither of us are lawyers (as far as I know), why couldn't the WMF require companies to disclose their customer relationship with the WMF as a condition of buying Wikimedia Enterprise at all? i.e. if you want our pipeline, please disclose your purchase publicly or allow us to. Thanks for engaging in good faith throughout this discussion - I appreciate your answers. Ganesha811 (talk) 16:50, 4 June 2025 (UTC)
I too am not a lawyer (though I did do a masters of IP law, back in the day...!) but in this area I've come to appreciate a couple of things since being part of the Enterprise team:
- In general, when a smaller supplier to a larger company wants to publicly name that company as its client, it is because they want to use it in their own advertising. "Our product is trusted by X and Y so you should buy it too!" etc. But companies don't generally like their name (and therefore their brand reputation) being used by someone else as an endorsement. Depending on your perspective, our publicly stating that Google is a customer (in this press release) could be seen as "advertising" or as "transparency" - perhaps both are true!
- In general, big companies' motivation for keeping trade-secrets is not to avoid "the public" knowing, but to avoid their competitors knowing. The problem of course is that they can't be transparent to the public without also giving their competitors an advantage! So, like in the other point above, the strongly-held default position of lawyers of big companies is to not approve this kind of disclosure. The fact that Google's lawyers approved that press release (above), and the other companies I've named earlier in this conversation, is therefore worthy of thanking them. We continue to use these blogposts/comments in our annual report/press-releases as the examples to show how and why we argue for transparency. This is trying to build commercial relationships of mutual trust over time, rather than merely making blanket conditions. It's perhaps messier, and slower, but I believe it's a more sustainable way for Wikimedia to remain embedded as "the essential infrastructure of the ecosystem of free knowledge" in the commercial landscape of the internet over the long term. LWyatt (WMF) (talk) 19:43, 4 June 2025 (UTC)
Thanks for the detailed explanation of these considerations. Ganesha811 (talk) 18:45, 5 June 2025 (UTC)

┌─────────────────────────────────┘
Note: For anyone who might have been following this conversation here, Ganesha811 has continued the conversation over on Jimmy Wales' talkpage here. LWyatt (WMF) (talk) 10:28, 12 June 2025 (UTC)