Upgrade discussion April 2004

This is an archived page preserved as part of Wikipedia's technical history.

The archived discussion on this page covers the architecture when Wikipedia was using a single database server with a borrowed backup slave for redundancy. This page predates most of Wikipedia's tremendous growth. This update was written in August 2006.

Geoffrin was unreliable and was returned to the supplier and replaced by Ariel, early in a long and continuing purchasing relationship from Silicon Mechanics.

Suda had a battery-backed disk controller, and its disk performance improved when it was switched from RAID 5 to RAID 10 in 2005. More recent purchases use RAID 0 because there are always sets of at least three database servers with all data. Tests with RAID 10 using both 10K SCSI and 10K SATA as well as with 15K SCSI led to Wikipedia using 15K SCSI for database servers.

Back to the history, which was one of JamesDay's (first DBA) earliest bits of analysis work for Wikipedia.

For April 2004 there is discussion of the possibility of upgrades. Discussion about findings of fact and postulated solutions should all go on this page so all interested parties with data or proposals can share information. For background information see Wikimedia servers, Wikimedia live traffic graphs and Ganglia cluster stats for Wikimedia servers.

Observed performance

If you have data to contradict or confirm any observation please say so - the idea is to understand the current system, then propose changes based on that knowledge.

Overall performance is excellent compared to last year and Alexa is currently reporting 0.9 second response time, with 77% of sites being slower.

Squid

Hit rates are commonly in the high 60s to 70% range, so adding 5% to this hit rate would reduce web server and database load by about 15%.
Jamesday speculated that Squid might be limited in disk use for performance reasons, because of blocking on disk I/O in the past.
- There are about 10 disk fds in use on average, vs. up to 1200 network fds when the apaches are slow (connections pile up). Disk access is not blocking anymore (we're using aufs now), but squid uses a lot of cpu when using ~> 1100 fds. Squid 3 with epoll (2.5 uses poll) improves this, but its current stability remains to be seen.
- ESI can increase hit ratios drastically by allowing logged-in browsing to be cached. Also Vary can be dropped for most fragments which further increases the hit ratio. Purging is also simplified, Main_Page updates will be automatic for example. There's development to add etag and gzip compression to squid. Esi uses more cpu than plain squid, so more squids might be needed when used. Ram size can be smaller though because the total size of cached items is reduced (no duplicate variants of the same item based on vary etc). See Skins for more info on ESI.
Squid is critical to site performance and any improvements to cache hit rate have a high probability of delivering the best performance value for money.
We had a third Squid on loan from Bomis, now used for database replication. A third at all times seems like a useful redundancy choice, as well as being of use for ESI later, but is not urgent.
Asked to recommend any hardware solutions to increase current cache hit rates, Gwicke had no suggestions.
Two weeks later, Ganglia statistics show that browne is near 100% CPU for much of the day and coronelli around 80% for much of the day at the >180/s loads we recently had.

When the backend (mainly the DB) is slow, connections pile up at the squid, and CPU usage is very high if >1000 connections per squid are open (poll doesn't scale as well as epoll in Squid3/linux2.6). This is a symptom of the slow backend: when the db is quick, the load on the squid is lowish at the same request rate.
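The hit-rate estimate at the top of this section can be checked with a little arithmetic. The sketch below assumes that backend (web server and database) load is proportional to the number of Squid cache misses, which is a simplification; the function name is made up for illustration.

```python
def backend_load_reduction(hit_rate: float, gain: float) -> float:
    """Fractional drop in requests reaching the backend when the
    cache hit rate rises from `hit_rate` to `hit_rate + gain`.

    Assumes backend load is proportional to the miss rate.
    """
    new_hit_rate = hit_rate + gain
    if not (0.0 <= hit_rate < 1.0) or not (0.0 <= new_hit_rate <= 1.0):
        raise ValueError("hit rates must lie between 0 and 1")
    miss_before = 1.0 - hit_rate
    miss_after = 1.0 - new_hit_rate
    return (miss_before - miss_after) / miss_before


if __name__ == "__main__":
    # Hit rates in the range quoted above, each raised by 5 points.
    for hit_rate in (0.65, 0.68, 0.70):
        reduction = backend_load_reduction(hit_rate, 0.05)
        print(f"hit rate {hit_rate:.0%} -> backend load down {reduction:.1%}")
```

Because the miss rate is only about a third of all requests, a small gain in hit rate removes a much larger share of the backend work, which is why the figure of about 15% follows from a 5 point gain.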

Web servers

Shaihulud noted some high CPU periods for vincent and Isidore and all show some periods around 80% CPU (Ganglia CPU load). These match the times when the Squids are known to be CPU-loaded because of open connections and may be for the same reason. This needs to be evaluated somehow - adding or removing servers and/or checking after the new database server is in place to confirm that removing the need to wait for the database removes the load. We currently have too little weekday Ganglia data - we'll have more confidence in peak CPU load levels after about noon Eastern time on Monday, since Monday is typically the day with consistently highest peak load.
- Data from Monday was unclear. The response time with one server out of rotation was slower but the picture was complicated by database access for compression and a holiday this week in France may have been reducing traffic levels. Insufficient information to draw conclusions from because of all the new factors. Does seem clear that there is no urgent need at present, though.
There is an intermittent DNS-related glitch which causes a single web server to receive many hits and can sometimes deliver unusually slow performance. This is unrelated to the general web server load. Addressing whatever on Zwinger is causing this is the likely solution. It's on the to do list.
At periods of extreme site overload (TV appearances) it is possible to connect to the Squids and web servers but a database connection error is reported.
It was mentioned on wikitech-l that AMD CPUs seemed significantly more efficient for our general type of work by someone who had done somewhat similar things. We should try one system of this sort at some point, so we know which is best for future purchases.
JeLuF noted that CPU load on Suda is low and suggested running an Apache web server on it to use some of that CPU power. That appears to be a useful test.
Two weeks later, Ganglia statistics for the month and day show that web server CPU loads tend to peak no higher than 80%, suggesting that web server CPU capacity isn't currently a bottleneck.

Database

There are pressing disk space issues. Disk will be full in about 20 days from April 11th at the current db growth rate of between 1gb and 1.4gb per week.
On Sunday April 11 compressing of individual old revisions on de was started. On Monday all wikis were switched to compressing individual old revisions. This combination should have temporarily freed some space and slowed ongoing growth rate. Shaihulud plans to start compressing fr then ja etc. when de compression has finished.
At times of extreme high load (TV), database connection limits are reached.
Speculated: at times of normal high load (European evening, US morning), the database load seems to be the bottleneck, as the web servers wait for database service.
The current database server is not as capable as Geoffrin because MySQL is limited to 2GB on the running 32-bit kernel. An upgrade to 64-bit is planned but has the potential for disaster with no second db.
Suda CPU load was reported as about 20% at 3PM Eastern, suggesting that CPU isn't the key performance limitation. Ganglia stats for the last week show CPU usually between 10% and 20% with one period of up to 40% for about 24 hours. Does anyone know what it was doing during those 24 hours?
More RAM could lower bi (blocks read in from disk, as reported by vmstat) through disk caching.
A 64-bit kernel will enable MySQL to use more memory.
See this disk I/O data showing Suda at consistently high disk rates, Zwinger at low but higher than the rest, other systems at minimal disk load. Weekly Ganglia stats also show consistently high disk loads.
There seems to be agreement (discussion with gwicke, JeLuF, jeronim, Jamesday actively talking) that Wikipedia transfer sizes are small, implying that striping is not a major factor in performance for this site, because it helps larger reads most.
Disk activity is reported to be about 5:1 read:write ratio. More RAM would reduce the reads but not the writes. This ratio suggests that there is merit in disk solutions which offer more reads than writes. RAID 1 mirroring or RAID 10 offers about twice the number of read seeks as write seeks (each drive or stripe seeks independently). RAID 5 does not offer more read seeks than a single drive can deliver, unlike RAID 1 or RAID 10, because each stripe is on all disks and all must seek together to get the data. In addition, in RAID 5 writes are slowed because at least one read is required to get the parity data unless it has been cached.
The databases are currently split into one ~4gb file, one ~14gb file, one ~39gb file for a total of 57gb. Brion reports that it does not seem unduly burdensome to split these across multiple arrays/mirrored pairs, nor burdensome to split the logs to get a total of three or four sets of spindles spreading the seek loads.
There is some debate whether 15K SCSI or RAM is the way to go. Testing with 8GB of RAM should be done, perhaps using RAM borrowed from another machine or purchased in advance for use in a third Wikimedia Squid server. 15K SCSI drives are available in sizes up to 72GB and 10K in sizes of up to 146GB. Ignoring inefficiencies and using growth rates before compression was enabled, three mirrored pairs of 72GB 15K drives would last for about 120 weeks before being full. Three mirrored pairs of 146GB 10K drives would last for about 280 weeks. Assuming the 39GB database chunk grows at 1GB/week, putting it on one 72GB mirrored pair would allow 33 weeks before it becomes full, allowing about 8 months for larger 15K SCSI models to be introduced. Another pair of 72GB drives, if the only choice available at that time, would gain another 72 weeks.
There is talk of combining and compressing multiple revisions of articles into one record to significantly reduce the size of history. This should significantly lower both database size and growth rate but has no estimated completion time and may not happen. At present compression of individual old records is supported. History on en is currently about 16GB. It seems reasonable to expect compression of multiple revisions together to reduce that to the 1-3GB range, based on testing of compression which has been done.
There is a new database layout in planning which may improve performance.
There is significant interest in evaluating SATA drive solutions using many drives to increase the number of available seeks per second by splitting the database over lots of drives, while simultaneously reducing drive cost. Planning experiments in this area, including buying required drives, seems worthwhile so future purchasing decisions can be based on practical experience with this site.
Excellent database performance is critical to site performance.
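The capacity estimates in this section (weeks until a set of drives is full) reduce to simple arithmetic. The sketch below is an illustration only: it assumes the usable capacity of a mirrored pair equals one drive, ignores filesystem and other inefficiencies as the discussion does, and uses the 57GB total and the 1 to 1.4GB per week growth range quoted above. The function name is invented for the example.

```python
def weeks_until_full(capacity_gb: float, used_gb: float,
                     growth_gb_per_week: float) -> float:
    """Weeks of growth that fit in the remaining space at a constant rate."""
    if growth_gb_per_week <= 0:
        raise ValueError("growth rate must be positive")
    free_gb = max(capacity_gb - used_gb, 0.0)
    return free_gb / growth_gb_per_week


if __name__ == "__main__":
    total_db_gb = 57.0  # current size of all database files

    # Three mirrored pairs: usable space is three drives' worth.
    for label, drive_gb in (("72GB 15K", 72.0), ("146GB 10K", 146.0)):
        usable = 3 * drive_gb
        slow = weeks_until_full(usable, total_db_gb, 1.0)
        fast = weeks_until_full(usable, total_db_gb, 1.4)
        print(f"3 pairs of {label}: {fast:.0f} to {slow:.0f} weeks")

    # The 39GB chunk alone on one 72GB mirrored pair at 1GB/week.
    print(f"39GB chunk on one 72GB pair: "
          f"{weeks_until_full(72.0, 39.0, 1.0):.0f} weeks")
```

The result is sensitive to the growth rate assumed, so the figures quoted in the discussion sit within the range this prints rather than matching a single value.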

Does Suda have a caching disk controller with battery backed up RAM? If it does, is write caching turned on to exploit that capability? There appear to be no negatives, other than purchase cost, from using this capability.
Is it readily practical to have Suda as master for half the project load and slave for the rest, while Geoffrin does the opposite? This would let us fully use the resources to deliver performance instead of having one machine mostly idle.

NFS upload space

Zwinger is short of disk space.
Zwinger is usually very low CPU, sometimes as high as 40% (Ganglia). Running Apache on it to use the CPU was suggested but there were some objections based on its involvement in the slow web server issue and it being a single point of failure at present, suggesting a conservative approach.

Network

We're using a shared switch with Bomis and it is full.
All new Wikimedia machines have gigabit ethernet and LiveJournal has found that gigabit reduces latency compared to 100 megabit, so there seems to be merit in gigabit if the cost isn't unreasonable.

Proposals

Squid

Software changes (perhaps caching some things for logged in users) appear at present to be the only viable way to increase cache hit rate.
A third Squid for reliability and possibly ESI in the future has some merit; one was on loan from Bomis.
Proposed: no hardware change at present

Web servers

There are times of 80% load which need to be evaluated further. They should be evaluated by adding another Apache server (perhaps on Suda) and/or removing one, to determine the effect on performance. The effect of the new database server on this also needs to be evaluated, since it is speculated that the load comes from managing the accumulated connections created while waiting for the database server.
Proposed: no hardware change at present, evaluate and reconsider in one to two months

Database

It appears that the database is currently a bottleneck at peak times. A replacement for Geoffrin is being considered and that appears likely to eliminate the current bottleneck.
Suda would benefit a lot from more RAM, but a kernel upgrade is necessary to make use of it, and the risk of trying this upgrade without a second db is too high. Switching to a 64 bit OS appears to be a good option as soon as a new machine is in place and has been serving the main site load for long enough to be proved to be reliable.
Data confirms that Suda is highly limited by disk performance, so investment in high speed disks and other methods to increase disk seek rate capability are clearly a good choice.
Proposed: buy the replacement for Geoffrin, specs discussed below, concentrating on superb disk performance and biggest possible ram

NFS upload space

Maybe buy a pair of large IDE drives for Zwinger, since NFS is not a very high seek rate task. Perhaps two of 200GB to be a mirrored pair, about the current price/capacity sweet spot for IDE? Or larger to reduce number of hardware changes in the future, at higher $/GB now?
LiveJournal seems to have a mirroring solution which puts the drives on different machines, to further increase redundancy. No data on what they are using for this. Offers the possibility of increased redundancy. They use gigabit ethernet to reduce latency on their internal network.
Proposed: two of 200GB SATA or EIDE drives, any well known brand

A backup nfs would be a good idea. Actually our nfs server is a single point of failure, if it goes down, all wikipedia is down. Shaihulud 20:49, 9 Apr 2004 (UTC)

A better network file system like coda would benefit wikipedia a lot, because it provides automatic failover, disk caching, syncing after server failure etc. It would also provide a way to use the apache's hd's in a more useful way. However, setting this up isn't trivial as none of the current admins have experience with it.

Replacement for Geoffrin discussion

Disks

Some discussion of RAID type appeared on the list. RAID 10 approximately doubles sequential data rate and read (but not write) seek rate compared to a single drive. RAID 5 makes more efficient use of disk space and increases sequential read rate but not seek times or write rates compared to a single drive. LiveJournal is switching from RAID 5 to RAID 10 and is a write-speed-limited site, suggesting that they have found a benefit to RAID 10 over RAID 5 in that environment.

Geoffrin was 4 of 10K SCSI in RAID 10 and very fast. Suda is 3 disk RAID 5 and slow. This suggests that RAID 5 is a less than ideal choice for our use, but it could be some detail of that setup, like the limited number of disks. However, the difference is so great that it appears unlikely that adding one disk is really the difference.

There is some discussion of how important the striping part of Geoffrin's RAID 10 setup was. The performance could have been from the extra seeks, the striping or both and there appears to be no way to know their relative significance without testing. One proposal would use a mirrored pair now and expand that mirrored pair when capacity requires. That switch from mirroring to mirroring plus striping could provide useful data for future decision-making.

Seek times are critical to database performance. Seek times can be improved by using faster RPM or by splitting the work between two or more disk arrays, which can approximately double average seek speed if the work is a 50-50 split between the two arrays (commonly discussed in terms of increasing the number of spindles able to seek independently at the same time, with more in different clusters/arrays being better for performance). En/the rest might produce a fairly even split for the peak site load period (evening in Europe, morning in US). Splits are harder to manage. Putting log files on non-redundant or simple mirrored storage can decrease the number of seeks and improve performance relatively easily and without as much management time overhead as splitting the main databases, but probably with less performance gain. General database performance guidance is to maximise the number of independent spindles as far as possible, to maximise seek rates.
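The seek-rate reasoning above can be put into a rough model. This is a simplified sketch, not a benchmark: it assumes each drive delivers a fixed number of seeks per second, that a mirrored pair can serve a read from either drive but must perform every write on both, and that work divides evenly across independent arrays. The 100 seeks per second figure is a placeholder, not a measured value for any drive discussed here.

```python
def mirrored_pair_ops(seeks_per_drive: float, read_fraction: float) -> float:
    """Operations per second one mirrored pair can sustain.

    A read costs half a seek per drive on average (either drive can
    serve it); a write costs a full seek on each drive.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must lie between 0 and 1")
    seeks_per_op_per_drive = read_fraction / 2 + (1.0 - read_fraction)
    return seeks_per_drive / seeks_per_op_per_drive


def split_ops(seeks_per_drive: float, read_fraction: float,
              pairs: int) -> float:
    """Total operations per second with an even split over `pairs` arrays."""
    return pairs * mirrored_pair_ops(seeks_per_drive, read_fraction)


if __name__ == "__main__":
    per_drive = 100.0        # placeholder seeks/second for a single drive
    read_fraction = 5 / 6    # the 5:1 read:write ratio reported for Suda

    print(f"single drive:       {per_drive:.0f} ops/s")
    print(f"one mirrored pair:  "
          f"{mirrored_pair_ops(per_drive, read_fraction):.0f} ops/s")
    print(f"two pairs, 50-50:   "
          f"{split_ops(per_drive, read_fraction, 2):.0f} ops/s")
```

Under these assumptions a mirrored pair gains less than a factor of two at a 5:1 ratio because writes still hit both drives, while a second independent pair with an even split doubles whatever one pair delivers. An uneven split would gain less.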

Splitting the logs to a different set of drives looks to be painless and probably useful. Seek rates for log drives are probably low, so 15K SCSI probably isn't necessary. Something faster than 7200RPM IDE probably is. Western Digital 10K RPM IDE or 10K or higher RPM SCSI is probably the best choice, to ensure that this doesn't become a bottleneck on a performance-critical system. If Suda has such drives, buying 15K and putting them in Suda, then using Suda's old drives for this may be a good option, since it would help Suda performance as the fallback system.

Use of large and inexpensive IDE drives, possibly on more than one machine for redundancy, has been discussed. These could hold logs and other backup-type data on something cheaper than costly very high seek rate SCSI drives because seek rates for this type of activity are low. At least one pair in the 200GB range of commodity IDE drives seems very likely to be useful. Just a single drive if redundancy of this data is considered insufficiently important. Perhaps put two such drives, one each in two web servers, for redundancy?

A raid 10 with 4 (as geoffrin) or 6 disks and 1 hot spare would be a good idea. One of the faster systems. 15000 rpm disks would be fine, but they are a bit expensive. 10 000 rpm should be enough, if we use raid10 with 6 disks :). around 200G of logical disk space would be fine. Shaihulud 20:46, 9 Apr 2004 (UTC)

Not really keen on a hot spare. With mirrored sets we can tolerate one failure per pair and we have at least one complete spare system. I'd rather save that money and use it for another system. I'm also considering suggesting not using redundant drives but going for a deliberate complete spare machine policy instead, because we can spread load over those machines with database splits and get a higher ratio of RAM to database size. The catch is that we'd need automatic failover from machine to machine or we'd end up with too many service interruptions. The network file system LiveJournal uses has this but I don't know if their database drives are split over their network or not. If they are and it is fast, this would be quite an interesting way of getting redundancy, since every web server could also be serving part of the database job. Imagine a three disk mirrored set with each disk in a different machine. It would take failures in three different computers to lose that disk and interrupt service. Since the local RAM can cache, that also offers some interesting potential levels of caching compared to putting the space in just one machine with comparatively limited RAM capacity. Perhaps something to investigate in backup and then image storage before database, though, since it is comparatively new and unknown technology here.:) Jamesday 15:53, 12 Apr 2004 (UTC)

Squid discussion

Somebody should finally switch the dns to zwinger so that the squids are load balanced as intended. More squids wouldn't improve the situation significantly however as the backend is just overloaded at peak times. Putting the secondary db into read-only use might be an option for the near future (if the load balancing code is functional), but of course getting the new DB server as quick as possible will help as well. Reactivating the db-based parser cache (storing pre-rendered html in the db so that only messages and link status needs to be updated) would take some load off the apaches as well.

Order details

Links to the order or alternative system proposals, as they become available.

Jamesday proposes

Database

CPU load is low, so avoid the most costly CPUs but get motherboards which can take them later if it turns out that we remove the disk bottleneck and CPU becomes a factor.
Dual CPU system - there's a lot of multitasking going on.
- Isn't this a contradiction to the above? CPU load is low, but two are needed? Looking at ganglia charts, CPU doesn't seem to be a bottleneck.
  - What I'm trying to do is avoid finding that the disk side has become so fast that the CPUs become the bottleneck and we end up having to replace the whole box instead of just upgrading the CPUs. So, I suggest the slowest available dual CPU setup, which can be upgraded if necessary. I don't mind suggesting running a web server on the database server once we have two database servers and that will use the CPU for something. The price difference between single CPU and dual CPU in the SM-2280 system is US$262 and I'm comfortable enough with that - it's cheaper than getting the same CPU in a box of its own. I'm also dodging the possibility that the motherboard might be different or non-dual-rated CPUs or untested pieces might be in a single CPU system with an empty CPU socket. This way we know that all but any upgraded CPUs work once burn-in is complete, so we can be sure we do have the options we think we have.
System with 8 RAM sockets
System with 6 drive bays or more - needed to get capacity beyond one year in the future at current growth rates.
Either 4GB of RAM and borrowing 4GB to find out whether and how much 8GB helps or buying 8GB immediately.
One mirrored pair of 72GB 15K RPM SCSI drives
One mirrored pair of 146GB 10K RPM SCSI drives
One pair of drive bays left free for expansion.

Discussion by Jamesday

Test with all database files on each mirrored pair in turn so we know the impact of the disk RPM and can plan with that knowledge in the future. This disk capacity is sufficient for one year or so at current growth rates.
After testing, split the database chunks and logs over each pair so we get two sets of seeks and maximum performance.

Comments from others

NFS/Zwinger

Buy the pair of drives.

Other storage

Buy two IDE drives around 200GB each and put in two web servers so developers can use them for longer term log storage and other utility purposes, offloading this duty from database and other key systems.
For future purchases, buy largest economical IDE size instead of smallest available IDE size, so we have an easily available pool of inexpensive scratch and utility storage.
NFS/Zwinger redundancy: designate one web server as NFS backup system, put one of the two extra drives proposed in it and copy files from NFS to that drive periodically so we have a fallback system ready if Zwinger dies.

O.k., so for now what I'm reading here is that we need a knockout database server as good as geoffrin was (except, functional!) or better, and we need also some general disk space pronto. Since we have enough money to do it, my plan at the present time is to max out a nice opteron from Silicon Mechanics.

Would it be more sensible to have 4x73 15K disks plus 2x146 10K disks, or to have 6x73 15K disks? For either of those configurations, which MegaRAID makes the most sense?

Anyhow, decked out, this looks like about $12,000 -- we can afford it, we'll grow into it, and I actually think we'll grow into it *soon* at the rate things are going.

--Jimbo Wales

If you don't want to find out whether 15K SCSI makes a difference and is worth the money, I suggest only buying two 146GB SCSI drives for now. If we grow no faster than we are now, that's space for about 60 weeks of growth. I hope and expect we'll grow faster than that but doubt it'll be less than six months before we need more. Since prices fall and capacities rise, it's better to defer the spending on more drives until later. That will maximise the use of the available disk bays and the storage the money buys.

If it turns out that the pair of drives can't deliver the speed, we can add another pair to increase the number of seeks per second available. But we can wait on that until we know, since two solves the capacity problem.

For controller, LSI MegaRAID SCSI 320-1 - 1 Channel U320 RAID 64MB CACHE w/ BBU. The battery lets us turn on write caching. Seems to be agreement that we're seek rate rather than transfer rate limited, so two channels don't seem worth it unless you want them for a bit more redundancy in case one fails. I'm inclined not to spend that money and to rely on database replication instead.

For Suda getting a quick disk space upgrade, the best they could suggest was a SCSI hot swappable drive, which could even be plugged in without turning it off, though doing that is a bit safer. All 6 bays in the system are connected (they checked the last order to confirm). We'll need the space anyway if we're honest about it being a fully capable backup server, so I suggest a 146GB 10K SCSI and then seeing how to integrate it into the existing array - may have to be standalone drive relying on replication for the moment, or may be easy enough to merge it - I don't know the setup well enough to say. Adding IDE anything would require using an internal 5.25" bay and they estimated it taking an absolute minimum of 30 minutes with an hour or more being more realistic. So I asked them to quote a price for that if you want to go with it.

Asked them to tell us the price for putting a 200GB ATA drive in the internal bay so we can use it for scratch storage and as a last ditch safety measure if we have a similar space crunch in the future. We can try that for database logs, which probably don't need much speed. Up to you whether you want to do it - just getting the price so you know the cost. I'm inclined to say yes and do it for both systems.

See [1] this quote for a minimal sort of setup on the disk and CPU side. Seems to be agreement that we don't really need high end CPUs for this system. Not shown there are these special instructions for that system and the quote as a whole: "Please show Ken H as followup to our phone call earlier - I answered my own question about controller RAM - wanted to know if it could take more - no. Itemise price to add 200GB 7200RPM ATA in the internal 5.25" bay (leaving all 6 SCSI hot swap bays free). Also itemise price of kit to add the same internal ATA to our existing SM-2280S (ordered by jwales@boomis.com around Jan 13, 2004). Disk and CPU configuration of our eventual order may change; read end of http://meta.wikipedia.org/wiki/Upgrade_discussion_April_2004 to see the discussion; feel free to edit to reply if you wish but note it's a public site. Need to confirm that the 5.25 internal bay doesn't use any of the SCSI bays also. Please also quote a price for a Seagate Cheetak 10K 146GB U320 10kRM SCA SCSI to add to our existing SM-2280S in one of the SCSI bays." If the quote shows $7352 don't believe it - they will need to get back to me by email with the real pricing for the options.

Please include a request for permission to make the specs, including pictures, of the systems we buy from them available here under the GFDL (and worth doing that for every order). No problem getting it for our site alone on the phone but confirming in the call today they say they do prefer a written request for full GFDL permission. You can point them to [2] to show them what we have in mind for ourselves. Jamesday 22:04, 14 Apr 2004 (UTC)

Just made the request via a custom quote. I was asked to use standard quote earlier, not custom, but with no response I'm assuming that no human looks at them quickly. Jamesday 00:25, 15 Apr 2004 (UTC)

No news. At this point I'm assuming that you're talking with them or will do so yourself. Jamesday 02:21, 15 Apr 2004 (UTC)

Longer term I see that LiveJournal seems to be using a network file system (not NFS) to spread its drives across multiple systems. Nobody here has experience of that yet but it looks like a good way to get better reliability and avoid really costly database systems in the future. Looks as though at least one developer is interested in trying it, so we might end up going with that first for utility storage, then NFS and finally, if it's all working really well, for database. Then we'd end up having single unit database boxes with the RAM being the main difference between them and the others, with the disks distributed around the system. Lots of fault tolerance ... but can't do it today.

They seem to use AFS, the thing Coda is based on. Did some experiments with coda, but don't have real experience with it. It's much better for the task than nfs of course, but somebody needs more experience with it. Also note that the Linux Virtual Server project recommends Coda for HA applications. -- Gwicke 21:39, 14 Apr 2004 (UTC)

Sounds interesting. I suppose I should encourage you to experiment with it if there's nothing else you would prefer to experiment with.:) Jamesday

Coda / AFS are not a reliable solution for mysql AFAIK. Lustre ( http://www.lustre.org/ ) might be interesting but I'm not sure about the feasibility for database stuff - probably not useful for that either. A dream would probably be a Single System Image Cluster ( http://openssi.org/ ) - but the time for this might just not be ripe yet and it's surely something that sounds like it needs lots of dedicated people's time to be deployed.

Hot off the press database clustering which sounds ideal in this case: MySQL AB Launches MySQL Cluster – the First Open Source Database Clustering Solution -- 14th April 2004 press-release: http://www.mysql.com/news-and-events/press-release/release_2004_14.html --Eagletm 17:26, 15 Apr 2004 (UTC)

MySQL Cluster is an in-memory-database. Including old, the DB is about 60 GB I'm told. We don't have boxes with that much memory. -- JeLuF 07:33, 17 Apr 2004 (UTC)

Mid May update

Squid: Experience with a failed Squid has shown that site performance is affected significantly with one failed Squid out of two, producing an Alexa ranking drop of perhaps 50 places. It seems prudent to use one of the new machines for Squid. If the database drops the load enough switching back to two seems possible but three probably remains the most prudent choice.
Web servers: The web servers are showing CPU load near 100% at peak times. More machines processing web requests are clearly needed unless the database upgrade somehow has an effect on their load levels.
Database: A machine for this was purchased. As of mid May it has not yet been delivered, so the results are unknown.
NFS: Drives for this were purchased and have significantly helped with the space issues.

Purchased machines

Details of the equipment purchased can be found at Hardware order May 2004