Talk:What we use the money for
- 1 Fundraising with companies
- 2 Cost reduction, data redundancy
- 3 Lastschriftverfahren
- 4 CPU usage
- 5 long-term outlook
- 6 Zend Accelerator
- 7 Network computing
- 8 Remote Squid caches and bandwidth offers
- 9 Distributed wiki / fault tolerance
- 10 "When we're fast we grow until we're slow"
- 11 Time to outsource caching?
- 12 Promo text
- 13 Donating "add audience"
- 14 Database overhead
- 15 platform platform
- 16 french version
- 17 Category
- 18 BitTorrent model incorporated to ease bandwidth costs?
- 19 Title
- 20 Why 200?
- 21 See also
- 22 Numbers
Fundraising with companies
Some issues for companies could be solved with Wikimedia technology and knowledge in which companies are interested, e.g. for research questions, phonebooks, capability maturity models, fact sheets, databases, intranet solutions, ... For companies the cost would be small. Hein de Jong--188.8.131.52 20:15, 11 Nov 2004 (UTC)
Cost reduction, data redundancy
It worries me that the costs should be so high - the Wikipedia shouldn't be vulnerable to economic crisis, or any other crisis.
How about moving to a voluntary distributed system where anyone could donate a percentage of diskspace and bandwidth to the Wikipedia, with the load distribution hardware auto redirecting the traffic anywhere in the world. That would also provide extreme redundancy in terms of protecting "..the sum total of all human knowledge." from an Alexandria Library type disaster such as an earthquake or other accident at your co-location facility, and reduce your costs to merely maintaining the auto-load redirection gear (something that also would need to be redundant and distributed) at the co-location facility.
- That's potentially interesting when it comes to image/media delivery. Security is a concern though. We are currently scanning uploaded images to block some known exploits and we'd like to know that this wouldn't allow a cache operator to replace images with virus/trojan containing images. There are some interesting alternative networks which may deliver the images, provided we wouldn't swamp them or suffer long delays because of their network, which would make people unhappy. Have to investigate those. For serving, what we're certainly going to do is put Squid cache servers in many places around the world, at places offering bandwidth and hosting. For the databases, the database dumps are available and guarantee that the projects won't vanish. You'll find quite a few of those with search engine searches and the number is sure to increase. Jamesday 13:14, 30 Sep 2004 (UTC)
Lastschriftverfahren
My English is not the best. I live in Germany and I would like to give some money to Wikipedia, because I think it is a very good institution. But it is too complicated to send money to Wikipedia. You should look at the newspaper www.taz.de. There it is very easy: you can do it online by "Lastschriftverfahren" (direct debit). You only have to enter your details (Kontonummer and Bankleitzahl (BLZ), i.e. account number and bank code) in a form. That's all. But I think Wikipedia should have a bank account in Germany.
Greeting - Josua
Maybe one should protect this page.
184.108.40.206 23:43, 12 Jul 2004 (UTC)
- Done. --Daniel Mayer 01:25, 13 Jul 2004 (UTC)
- Hello Josua, I think I have already answered your question elsewhere, but to be safe, once more here: the German association (Verein) is working on it. Unless unexpected problems come up, there will be a corresponding option in a few weeks. However, I cannot promise that we will keep the system as open as the taz does (a simple form without further verification), because incorrect details can quickly lead to considerable costs for the association (bounced direct debits are expensive). We will clarify this with the bank. -- Arne (akl) 18:35, 20 Jul 2004 (UTC)
CPU usage
Why do the web servers utilize so much CPU? I find that highly unusual. I've run a few large-scale websites in the past (the old members.xoom.com, which was once the 7th most visited site on the web, for instance) and the CPUs on our web servers were basically idle most of the time. Some other resource usually ended up being the bottleneck. I'm guessing that the heavy usage of PHP is probably the difference between the Wikimedia servers and those I've administered in the past (I've been using PHP since 1997, but not on the high traffic sites).
But just to be sure, I have some questions:
- What does the Apache configuration look like?
- What other modules are running?
- What other services are running on these machines?
- Has any kernel tuning been done?
(I'll go look for some more appropriate page to post this to, as well.)
--Wclark 03:44, 13 Jul 2004 (UTC)
- You're right that PHP makes the difference. PHP is by far the dominant user of CPU. The two main CPU-intensive tasks performed in PHP are wikitext parsing (that is, converting it to HTML), and calculating diffs. Parsing of wikitext is based on an on-demand model with various levels of caching. Currently we have a cache hit ratio of about 85%: 78% from the squids and 7% from the parser cache. 15% of page views is still much more than the edit rate, so perhaps there's some room for improvement there. The reason we haven't switched to a C++ parser is because MediaWiki now has many non-Wikimedia users who don't have full access to their servers. A pure-PHP application is very useful for them. The parser is large and complicated, so maintaining it in two separate languages would be a chore.
- Diffs use about one fifth as much CPU time as parsing, however it seems to me that this time can very easily be slashed. I have an idea for a fast diff method which should require less than a day of development time, including debugging. If you're a programmer looking to help out and this sounds like fun, contact me and I'll tell you about it.
- For the record, a working C++ diff which is 5-6 times as fast as the current one is now in CVS and can quite easily be enabled. (I have a version that is 9-10 times as fast now, but it's quite a lot uglier and I'm not sure if that is worth it; I guess the database etc. becomes much more the bottleneck at that point.) It's largely untested, though. --Sesse
- Our apache configuration looks big and ugly, w:memcached is also running on some apaches, plus we occasionally use their hard drives for backups. I believe some kind of kernel tuning has been done, but I'm not the one to ask about that. -- Tim Starling 04:22, 14 Jul 2004 (UTC)
- Could you add a pointer to the code that is the big bottleneck? I might be interested in doing an experimental reimplementation as an external service. Especially valuable would be the test suite for the code. Thanks, William Pietri 04:38, 14 Jul 2004 (UTC)
- Parser.php and part of Skin.php, if you can call 3000 lines of code a "bottleneck". The important function is Parser::parse(). A number of people have re-implemented the parser in other languages at varying levels of sophistication, notably Waikiki in C++. We haven't adopted Waikiki because of the maintainability aspects mentioned above, and because it was designed as a standalone application rather than a drop-in replacement for the PHP parser. -- Tim Starling 00:56, 15 Jul 2004 (UTC)
- Thanks. I take it there's no test suite? Rewriting this isn't a small project, so I'll have to leave my experiments until after the election.
- But this comment worries me: "We haven't switched to a C++ parser is because MediaWiki now has many non-Wikimedia users who don't have full access to their servers." As a donor, this makes me more reluctant to donate money. Although I think it's great that others want to use this software, it seems weird to spend thousands of dollars on additional hardware for Wikipedia just because some people are unwilling to get a deluxe hosting package. --William Pietri 17:04 20 Jul 2004 (UTC)
- Apache and general server configuration is currently best documented on the admin wiki. Off site so we don't lose it if the site is down. Still very much a work in progress. Jamesday 07:37, 14 Jul 2004 (UTC)
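The layered caching described in this thread (a Squid hit, then a parser cache hit, then a full parse) can be sketched roughly as follows. This is only an illustration: every function and variable name here is hypothetical and none of it is MediaWiki's actual code. The 78% and 7% figures are the ones quoted above; treating logged-in views as bypassing the Squid layer is an assumption made for the sketch.

```python
# Illustrative sketch only: names are hypothetical, not MediaWiki's real API.
from collections import Counter

squid_cache = {}    # stands in for the Squid HTTP caches (rendered pages)
parser_cache = {}   # stands in for the parser cache (parsed HTML)
stats = Counter()   # counts which layer answered each page view


def parse_wikitext(wikitext):
    """Placeholder for the expensive wikitext-to-HTML parse step."""
    return "<p>" + wikitext + "</p>"


def render(title, wikitext, logged_in=False):
    """Serve a page view, trying the cheapest layer first.

    Assumption for this sketch: logged-in views skip the Squid layer.
    """
    if not logged_in and title in squid_cache:
        stats["squid_hit"] += 1
        return squid_cache[title]
    if title in parser_cache:
        stats["parser_cache_hit"] += 1
        html = parser_cache[title]
    else:
        stats["full_parse"] += 1
        html = parse_wikitext(wikitext)
        parser_cache[title] = html
    if not logged_in:
        squid_cache[title] = html
    return html


# Share of page views that still need a full parse, from the quoted figures.
squid_hit_pct, parser_hit_pct = 78, 7
miss_pct = 100 - squid_hit_pct - parser_hit_pct  # 15
```

Only the views that fall through both layers pay the parsing cost, which is why raising either hit rate directly reduces web server CPU load.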
long-term outlook
Is there any chance of a longer-term outlook on the need for capacity and implied costs? I know it's difficult, but there have been several loops around the "we just need this and it will be ok" and then it all bottlenecks again. --AndrewCates:talk 10:34, 13 Jul 2004 (UTC)
Not really. The equipment requirements depend on how rapidly use grows and how effective the ongoing work to improve the programs is. Neither of those is easy to predict. If you want me to suggest something which I'm fairly sure would still be sufficient in six months, it looks like this:
- Three database servers like Ariel, each holding 1/3 of the site. $12,300 each for two more.
- Three to six read only database query servers (to offload the main ones). Say 3U dual Opteron 244s with 4GB of RAM (4 modules, can upgrade to 8GB), a pair of 200GB IDE drives, room for four more drives. The RAM is key here - saves much of disk load. $4,000 each (the RAM is about $1,000 of that). These avoid having to buy more of the very expensive disk setups in the main database servers.
- Three to six read only database search servers. Same as query servers. $4,000 each.
- Three storage servers for images/video and server logs. Dual Xeon 2U SATA RAID 5 with 6 of 12 bays filled for 1,000 GB of disk space each. $5,000 each. These should last a year or two beyond the end of the six months, with added drives.
- Ten to twenty web servers. Dual Opterons in 1U cases with a gigabyte of RAM (for memcached caching mainly). $2,600 each.
- Three to six Squid cache servers. Move the current 1U web servers to this job, raise RAM to 4GB for about $1,000 each and replace with the web servers above, 1 per two moved.
- Gigabit ethernet between core systems. Call it $5,000. May be less.
- Two 32 port terminal servers with cabling at $2,200 each. These let the developers see what happens if a computer which is restarted doesn't boot properly. If they do boot and connect to the network, there are already ways to do it. Saves asking Jimbo or the colo staff to do it at 4AM when they don't necessarily know what is or isn't normal.
- Remote power controllers to allow restarting hung systems. Ballpark $3,000, may be more or less.
Total bill for all of this comes to around $165,000. The database side is probably more than necessary to cope with the load the web servers can handle and I expect that it could handle doubling the number of web servers - but I don't know which parts of the database setup will be hot spots first, so if you want a fixed budget I have to assume all of them. And I also have to ignore the programming improvements I'm hoping for, because they might not happen or might take longer than expected. Jamesday 07:37, 14 Jul 2004 (UTC)
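As a rough check on the total, the itemized list above can be added up. The sketch below uses only the prices quoted in the list. Reading "1 per two moved" as one extra web server for every two machines converted to Squid duty (rounded up) is an assumption, as are the variable names. Under that reading, taking the upper end of every range gives $165,800, in line with the quoted figure of around $165,000.

```python
import math

# (description, unit price in USD, minimum count, maximum count)
items = [
    ("additional database servers like Ariel", 12_300, 2, 2),
    ("read-only query servers", 4_000, 3, 6),
    ("read-only search servers", 4_000, 3, 6),
    ("storage servers", 5_000, 3, 3),
    ("web servers", 2_600, 10, 20),
    ("RAM upgrades for Squid cache servers", 1_000, 3, 6),
    ("gigabit ethernet (lump sum)", 5_000, 1, 1),
    ("32-port terminal servers", 2_200, 2, 2),
    ("remote power controllers (lump sum)", 3_000, 1, 1),
]

low = sum(price * lo for _, price, lo, _ in items)
high = sum(price * hi for _, price, _, hi in items)

# Assumption: each pair of web servers moved to Squid duty is replaced
# by one new $2,600 web server, rounding up.
low += math.ceil(3 / 2) * 2_600
high += math.ceil(6 / 2) * 2_600

print(f"low estimate:  ${low:,}")
print(f"high estimate: ${high:,}")
```

The gap between the low and high totals comes almost entirely from the ranges given for web, query and search servers.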
- What we will really do is go for what we can predict as necessary for two or three months - storage server (not having one is costing a lot of developer time), search server, use Suda as query server, two web servers per month - and see where the hot spots are. Then ask for what is necessary to address the hot spots as new ones develop. Jamesday 07:37, 14 Jul 2004 (UTC)
I think part of the problem is that when we buy the servers we need and the site speeds up, we attract more visitors and edits, meaning we need even more servers. When you think about it, it is a good thing as long as we can get the money, because we have the potential to grow very quickly. Looking at the PayPal donation totals it looks like we have gotten $1,500 in the last day. It will probably decrease as people get used to seeing the box asking for donations, but it is still a lot of money that should help us with our server problem. Jeff8765 21:29, 13 Jul 2004 (UTC)
- Right. The dual Opteron web servers we want to buy cost about $2,500. Testing with Suda (dual Opteron) as a web server has shown that they are more efficient than two Pentium 4s for our web servers. Dual Xeons might be better - won't know until we have checked. Jamesday 07:37, 14 Jul 2004 (UTC)
Zend Accelerator
- Not Zend specifically. Wikipedia currently uses Turck MMCache as a PHP accelerator. Please see PHP caching and optimization for more details of our experiences with and use of PHP accelerators. Cache strategy may also be of interest. Jamesday 20:01, 17 Jul 2004 (UTC)
Network computing
Is there a way to reduce the server need via network computing?
We've seen so many networked solutions (BitTorrent, P2P, Comparison Shopping, computing@home) that speed up computation and maximize downloads at low cost. Isn't there a way to use this for a website?
(This is not a suggestion but more a question; I am not a server expert.)
--220.127.116.11 11:21, 13 Jul 2004 (UTC)
- Is there any info on the amount of traffic the database dumps generate? I would think these are only seldom downloaded. If not, I could look into setting up a BitTorrent tracker for the dumps. -- 18.104.22.168 17:46, 14 Jul 2004 (UTC)
- The traffic for those isn't great, compared to the whole site. So far it seems unnecessary. Jamesday 14:07, 15 Jul 2004 (UTC)
Remote Squid caches and bandwidth offers
Similarly, are you guys interested in other people running Squid caches? I'd be willing to pay for 1 Mb/s of bandwidth on an idle server, and perhaps more. --22.214.171.124 04:31, 14 Jul 2004 (UTC)
- We're discussing this. It depends mainly on whether our current bandwidth donor would like assistance. Jamesday 14:07, 15 Jul 2004 (UTC)
- Doesn't it also depend on hardware? I mean, external Squid caches would not only help bandwidth, but would also take some load off Wikipedia's squids (possibly requiring fewer of them, letting us/you afford more web servers etc.). --Sesse 23:41, 10 Aug 2004 (UTC)
- Yes, it does. We now have three donated Squid cache servers, donated hosting and donated bandwidth in France and will be testing those in service soon. If they work well we may well expand that and accept similar offers elsewhere. Jamesday 03:46, 26 Aug 2004 (UTC)
Distributed wiki / fault tolerance
Some people are starting to talk about "distributed wiki" / "fault tolerant wiki" at http://wikifeatures.wiki.taoriver.net/moin.cgi/FailSafeWiki . If you have good ideas for a distributed wiki, especially if they would be inappropriate for wikipedia, that would be a good place to discuss them. --DavidCary 20:36, 14 Jul 2004 (UTC)
- I added a description of what Wikimedia does and is considering doing over there. Jamesday 14:07, 15 Jul 2004 (UTC)
If I think it would stay I would love to put three small pictures in a row at the top of this article - something like a pic of a sports car, a pic of some beer, and a pic of some scantily clad lasses. Can I do it? Can I? Puhhh-leeeeze! 126.96.36.199 21:26, 17 Jul 2004 (UTC)
"When we're fast we grow until we're slow"
- Our growth is pretty simple: when our servers are fast, we grow to use available capacity until we're slow again.
- There is still no sign of us hitting the limit on demand,
- so it appears that we'd have no problem finding a larger audience if we had another $50,000-100,000 to spend
Where does the 50K-100K number come from? It appears that any number could have been used in this sentence, given the earlier information.
- - our ballpark growth estimates suggest that we'd end up doing that by the end of the year if we stayed fast until then.
End up doing what? Spending 50-100K? Finding a larger audience? 188.8.131.52 20:26, 30 Jul 2004 (UTC)
They are my estimates, being conservative about the growth I expect and suggesting some numbers for the spending required to get and stay fast through the growth I expect to see if we did stay fast. It's there to give people some idea of the approximate amount of money it would cost. Jamesday 17:11, 6 Aug 2004 (UTC)
Time to outsource caching?
From the content page:
"We would like to acquire and set up equipment now to handle our anticipated growth and reliability needs through the end of the year. Web servers are currently the hot item we're after - those reach 100% CPU use for many hours at peak times. This quarter (1 July 2004 - 30 September 2004) alone we have provisionally budgeted $31,000 toward the purchase of new servers."
Before your web server farm expenditures grow out of control, have you considered an outside caching provider like Akamai? There are several benefits: elimination of crippling spikes in traffic, better response time for users (due to fewer hops from hitting geographically closer servers and the distributed nature of the content), vastly reduced loads on your own source web servers resulting in less hardware and maintenance costs on your part, etc. This would let you focus on the back end stuff and let someone else worry about the heavy lifting of dishing up pages to an ever-growing number of visitors.
I don't know if you've reached that point yet -- it may not be cost effective depending on Akamai's rates -- but it's worth investigating. There are a few other providers in this space but Akamai is generally regarded as the largest, strongest, and most likely to survive. Full disclosure: I worked briefly for Akamai's internal, corporate IT dept. a few years ago but was not involved at all with sales, development, or their network ops team, and I otherwise have no ties with them.
If it matters, Akamai runs all of their caching servers on GNU/Linux, mostly various versions of Red Hat I believe.
I realize there is something to be said for retaining control over your own destiny (maintaining your non-profit, public service vision; what if Akamai gets bought or goes under?, etc.), but needing constantly increasing infusions of cash tends to be even more problematic.
Just an idea.
Chris Mansfield, Seattle, WA
- Thanks for the suggestion. For the moment we have a Squid cache hit rate of about 78%. Those handle most page views, particularly high surges from media attention. They also insulate random visitors from the effects of web and database server load. We have a donated set of three Squid servers, hosting and bandwidth in Paris being prepared for test use. If that goes well we may continue that pattern, accepting hosting, bandwidth and hardware offers, or buying the hardware, to provide internet-close caches for anonymous viewers near high volume load points and in places where international bandwidth is costly for the user. The load which typically causes us trouble is search (stresses the database), views from logged in contributors (web servers and database) and editing pages (web servers and database). We can't offload that to Akamai and don't want to discourage people from doing lots of editing. For those issues, we're working on more efficient code - better caching, optional PHP helpers of various sorts and fine-tuning the database queries which we see are slow (the latter being my current area of interest). For search, we offload it to Google and Yahoo whenever searches cause the database servers trouble. Dedicated search servers will help with this. Jamesday 03:46, 26 Aug 2004 (UTC)
Promo text
I'm not sure where else to put this, but the promo that lists at the top of most pages has a comma in it that on my screen is right after a line break. So you have "page" at the end of the first line and the comma at the beginning of the second line. It just looks unprofessional; thought you might want to change it. There's also a space before the period at the end of the sentence which could go. -- Beland
- Thanks. Jamesday 17:13, 6 Aug 2004 (UTC)
Donating "add audience"
Hi, I really enjoy Wikipedia. I read the donation page. Why not ask people if they want to volunteer to get ads on their pages instead of asking them for cash? I personally wouldn't mind seeing a few non-intrusive Google Ads (or other) on my pages, especially if I know that the money goes to improve the service!
- Ads have proved to be a controversial issue, with some people strongly opposed to them. My personal opinion is that they would provide a useful way for those who view pages to effectively donate tiny amounts of money, too small to be worth paying directly. The sort of approach which gets relatively low (but still significant) opposition involves Google text ads which anyone can turn off just by clicking a button. Limiting even those to only as often as needed for the purpose would help as well. So would making it a per-wiki choice. But operating a site free of ads is good. We'll see how the future turns out. For now, no plans for ads and it's likely to stay that way so long as donations do the job. Jamesday 03:46, 26 Aug 2004 (UTC)
- Jamesday, I think this user was suggesting opt-in ads, while you seem to be talking about opt-out ads. Has that been discussed at all? Do you expect there would be strong objection to it? There could be a setting on your preferences page, "Show advertisements", which would default to "Off". And then you could mention it on the "Donations" page. I think I would be willing to turn advertisements on, and as long as there are never any "Please turn on advertisements!!" notices on inappropriate/random pages, I'd be pretty comfortable with it. -- Creidieki 01:16, 12 Nov 2004 (UTC)
Database overhead
I'm a bit unsure if this is the right article to comment on, but has anybody considered other databases than MySQL? There are multiple free databases out there (PostgreSQL and Firebird, for instance) that are known to scale better than MySQL (especially as you get a lot of concurrent access), although I know too little about your workload to say anything for sure, of course. :-)
- Yes, there's ongoing work to add a database abstraction layer which will allow plugin use of different database servers (once their database-specific and tuned queries are in place). Whether Wikimedia will switch would depend on tests once that work has been completed. The biggest gotcha for MySQL seems to be relatively slow fulltext search followed by a one index per query limit (which can sometimes be worked around with self joins). Peak database loads are in the 2,000 queries per second range, low somewhere under 900 and average around 1,200. Jamesday 03:46, 26 Aug 2004 (UTC)
Ganglia stats must change to http://download.wikimedia.org/ganglia3/?c=unspecified&m=&r=day&s=descending&hc=3
- Thanks. That was a temporary settings change and I see that someone (you?) has updated the links to point to both the usual and temporary places. Thanks to whoever did it.:) Jamesday 13:14, 30 Sep 2004 (UTC)
Lately the rank has gone past 225. You might want to put that in, since you protected the page.
I wrote a helpful message for people who aren't experienced with this alternate brand, and settle some myths and unknown facts, on an eBay forum. The debate about it and the competition has been long, but good research and sense can end it quickly. Even though I'm an intermediate amateur, unfamiliar with high-end appliances, the message does deal with the consistent experience with hardware and software, and their economy, which anyone should be interested in. As for servers, I'm sure you can find comparison charts given by the company. :) What is the BEST computer to buy? lysdexia 00:58, 27 Oct 2004 (UTC)
BitTorrent model incorporated to ease bandwidth costs?
- As the site becomes more media-heavy, might there be utility in somehow incorporating something in the vein of BitTorrent's model for distributing content within this wiki?
- Something toggle-able, so that if a visitor wants to help, one thing they could do is install the BT/BT-variant which has been incorporated into MediaWiki?
This is an interesting idea, but it would probably be difficult to implement. I'd be interested in seeing an attempt made, though. ~Mr. Billion
Title
This title ends in a preposition. Shouldn't it be How we use the money or How the money is used? ~Mr. Billion
- Yeah, never use a preposition to finish a sentence with!
- This is the sort of pedantry up with which I shall not put!
Why 200?
The very first sentence says "The Wikimedia sites include Wikipedia, one of the 200 most visited websites in the world (as tracked by Alexa.com)". Then you click the link and see "Traffic Rank for wikipedia.org: 64". Maybe it's time to change the text to "one of the 100 most visited websites in the world", at least. KissL at 184.108.40.206 10:27, 15 July 2005 (UTC)
Just as an FYI, Wikipedia is now the #35 site (looks a lot cooler than "top 100"), but since I can't change that, would someone else mind doing so? -Mysekurity 08:42, 10 December 2005 (UTC)