The hardware order in May 2004 was for four 1U multipurpose machines and a 2U database machine to replace Geoffrin. It was based on the upgrade discussion in April 2004. Delivery started on 21 May 2004 when three 1U machines arrived and a fourth 1U followed a few days later. Memory tests on the four 1U machines started on 26 May 2004, shortly after adding them had caused an overload of the 30 amp power supply to the rack, which shut down most machines briefly. The rack power supply was upgraded to two 30 amp circuits. Total vendor cost for the five machines was US$20,047.
General purpose machines
Zwinger file storage replacement
One, named Will, has a hardware RAID controller and a pair of 200GB hard drives added so it can replace Zwinger for file storage, adding an extra measure of data protection.
The RAID controller is a 3ware 8006-2LP half-length, half-height SATA RAID controller with two ports for two drives total, supporting RAID levels 0, 1, JBOD. Manuals are here, see the 8000 series. Drivers are here. The controller has a 64 bit PCI bus but can also run in a 32 bit slot.
Notes from a phone call with 3ware: look in lib/modules/kernel/drivers/scsi for the 3ware 3x.xxxx.0 driver (3w.80006.0, or perhaps 3w.8000.0 if there is no 8006-specific version) and load it with insmod. This should be similar for all Linux flavors. Get the latest driver from the web site.
- Main Board: AccelerTech HDAMA (ATO2161) [Manual]
- Dual AMD Opteron 248 - 2.2GHz - 1MB L2 Cache
- Memory: 8GB (8x1GB) PC2700 Registered ECC
- HDD (6): Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI (ST373543LC, the fastest, largest 15K RPM Seagate sold at this time)
- SCSI Controller: LSI MegaRAID SCSI 320-2 - 2 Channel U320 RAID 64MB Cache w/BBU (battery backup, so write buffering can be turned on safely - MySQL strongly recommends write buffering)
- 460W hotswap redundant power supplies (to be placed on two different circuits!)
Vendor order summary
The following shipped on: May 14, 2004
3 of SM-1151SATA
- CPU: Intel P4 2.8GHz - 1MB Cache - HT - 800 FSB
- RAM: 4GB (4 x 1GB) Unbuff ECC DDR 400 Interleaved
- HDD 1: Seagate 80GB 7200.7 (7.2Krpm-8MB Cache) SATA
- Floppy: 1.44MB Floppy
- O/S: Fedora Linux Core 1 - Preload, No Media
- Rail Kit: 2 Piece Ball Bearing Rail Kit
- WARRANTY: Standard 3 Year - Return to Depot
- Price: $2199.00 each, $6597.00 total

The following shipped on: May 18, 2004
1 of SM-1151SATA
- CPU: Intel P4 2.8GHz - 1MB Cache - HT - 800 FSB
- RAM: 4GB (4 x 1GB) Unbuff ECC DDR 400 Interleaved
- HDD 1: Seagate 80GB 7200.7 (7.2Krpm-8MB Cache) SATA
- Floppy: 1.44MB Floppy
- O/S: Fedora Linux Core 1 - Preload, No Media
- Rail Kit: 2 Piece Ball Bearing Rail Kit
- WARRANTY: Standard 3 Year - Return to Depot
- Price: $2199.00 each, $2199.00 total

The following shipped on: May 25, 2004
1 of SM-2280S
- CPU: Dual AMD Opteron 248 - 2.2GHz - 1MB L2 Cache
- Memory: 8GB (8 x 1GB) PC2700 Registered ECC
- HDD 1: Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI
- HDD 2: Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI
- HDD 3: Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI
- HDD 4: Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI
- HDD 5: Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI
- HDD 6: Seagate Cheetah 15K.3 73GB U320 15KRPM SCA SCSI
- SCSI Controller: LSI MegaRAID SCSI 320-2 - 2 Channel U320 RAID 64MB CACHE w/BBU Low Profile
- CD-ROM: Slimline 24X CD-ROM
- Floppy: 1.44MB Floppy
- Power Supply: 460W Hotswap Redundant
- O/S: Fedora Linux Core 1 - Preload, No Media
- WARRANTY: Standard 3 Year - Return to Depot
- NOTES: 219GB HW RAID10 (Split over 2 Channels)
- Price: $11251.00 each, $11251.00 total
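As a quick cross-check, the per-shipment totals in the order summary can be summed to confirm the overall vendor cost quoted at the top of the page. The sketch below uses only the quantities and unit prices listed in the summary; the variable names are illustrative.

```python
# Quantities and unit prices (US$) taken from the vendor order summary.
order_lines = [
    ("SM-1151SATA, shipped May 14", 3, 2199),
    ("SM-1151SATA, shipped May 18", 1, 2199),
    ("SM-2280S, shipped May 25", 1, 11251),
]

# Total for each shipment, then the machine count and grand total.
line_totals = [qty * unit_price for _, qty, unit_price in order_lines]
total_machines = sum(qty for _, qty, _ in order_lines)
total_cost = sum(line_totals)

print(line_totals)
print(total_machines, total_cost)
```

The result agrees with the five machines and US$20,047 stated in the introduction.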
Suggested RAID controller settings
|FlexRAID PowerFail||on||resuming reconstruction after power failure is good.|
|Disk Spin up timings||10 sec per disk||Avoids stressing rack or power supply load limits, particularly if many systems are powering up at once; maximum power draw is during power-on spin-up. If many systems are down, turning them all off by hand, restoring power and then turning them on at ten second intervals will also help stay within power limits.|
|Chip Set Type||whatever it is|
|Cache flush timings||10 seconds||With battery backup, longer times increase the chance of repeat writes to the same sector being eliminated, cutting disk load. Also helps to maximise the chance of elevator seeking rearranging activities for best performance.|
|Rebuild rate||something slow||no redundancy during this but we don't want to kill the site. Can use higher setting by hand when site is in low load situation|
|Alarm control||off||inappropriate to use audible alarm in a colo|
|Auto rebuild||on||has it rebuild as soon as we replace a failed drive. Easier for colo staff to do this and less to go wrong.|
|Force boot||on||want the system up again after power failure so we don't have to phone colo.|
Configuration utility advanced submenu
|Stripe size||16k per drive for 1 drive stripe, 8k for 2, 8k for 3||InnoDB writes pages of default size 16k and allocates chunks of 64 consecutive pages to reduce fragmentation. Default MySQL random read size is 256k, sequential 128k. These stripe sizes are chosen to read/write as much as MySQL is trying to read/write at once - basically, trying to pick sizes which are evenly divisible by the sizes in use. Would be nice to change InnoDB to use 24k pages so 1, 2 or 3 disk stripes divide evenly into that size.|
|Write policy||write-back||Default is write-through. We have battery-backed up write cache so write-back is safe. MySQL recommended battery-backed up write cache as an important performance option so the controller could return immediately instead of making the system wait for the real write.|
|Read-ahead||Adaptive||Lets the controller try to work out if it should use read-ahead. Default is normal, don't try to work it out.|
|Cache-policy||Direct I/O||MySQL does lots of read buffering; assumed that this option would just duplicate that and it's better to let the RAM be used for writes. Should be tested both ways.|
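The stripe-size reasoning in the table can be checked with a little arithmetic. This is a rough sketch, assuming that an "N drive stripe" means N striped data drives and that one full stripe covers the per-drive stripe size times N; the helper name is made up for illustration.

```python
INNODB_PAGE_KB = 16  # default InnoDB page size, as stated in the table

# Per-drive stripe size (KB) suggested in the table, keyed by drives per stripe.
suggested_stripe_kb = {1: 16, 2: 8, 3: 8}

def full_stripe_kb(drives, per_drive_kb):
    """Data covered by one full stripe across `drives` striped drives."""
    return drives * per_drive_kb

for drives, per_drive_kb in suggested_stripe_kb.items():
    width = full_stripe_kb(drives, per_drive_kb)
    whole_pages = width % INNODB_PAGE_KB == 0
    print(f"{drives} drive(s): {width}k full stripe, whole 16k pages: {whole_pages}")
```

Under this assumption the one and two drive cases give a 16k full stripe, matching the InnoDB page, while the three drive case gives 24k, which is the reason for the remark about 24k pages.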
At the time of the April discussions the databases were split into one ~4GB file, one ~14GB file and one ~39GB file, for a total of 57GB. Because many wikis have had their old tables uncompressed in preparation for UTF8 conversion, the large file is now about 65GB. That is close to the 73GB drive size, so even though the space will be freed for expansion after conversion and recompression, there is some concern about being so close to the limit.
Suggest two physical arrays, each with one logical drive.
- Array 1/ logical drive 1, 4 drives 146GB in RAID 1+0. 8k stripe size.
- Array 2/ logical drive 2, 2 drives 73GB in RAID 1. 16k stripe size.
Make sure to split the drives in the arrays evenly between the two SCSI channels - want one half of each on each channel to keep channel contention as low as possible.
Array 1 can hold OS and big database, array 2 the smaller ones. Experiment where to put logs and temporary files - put them on whichever is least loaded - probably the smallest one. The 146GB array leaves no pressure to partition the tables into chunks until those doing that are free and persuaded that it's a good thing, while the mirrored pair does some load splitting to increase the number of available seeks.
Could set array 2 to 8k stripe size and use logical drive sizes which don't match the physical drive sizes via spanning. But the physical sizes seem like a fair choice for the database chunk sizes we have.
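The usable sizes of the two suggested arrays follow from mirroring halving the raw capacity. A minimal sketch, using the nominal 73GB drive size and ignoring formatting overhead; the function name is illustrative.

```python
DRIVE_GB = 73  # nominal size of each Cheetah 15K.3 drive

def mirrored_capacity_gb(drive_count, drive_gb=DRIVE_GB):
    """Usable space of a RAID 1 or RAID 1+0 array: half the raw capacity."""
    if drive_count < 2 or drive_count % 2:
        raise ValueError("mirrored arrays need an even number of drives")
    return drive_count // 2 * drive_gb

array1_gb = mirrored_capacity_gb(4)  # RAID 1+0 over four drives
array2_gb = mirrored_capacity_gb(2)  # RAID 1 over two drives
largest_file_gb = 65                 # current size of the large database file

print(array1_gb, array2_gb)
print(f"large file fills {largest_file_gb / array1_gb:.0%} of array 1")
print(f"large file would fill {largest_file_gb / DRIVE_GB:.0%} of a single mirrored pair")
```

The same calculation for all six drives gives 219GB, which matches the "219GB HW RAID10" note in the vendor order.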
There was discussion about whether the controller would be clever enough on its own, or whether forcing it to do the right thing with array splits was necessary. It turned out not to be required: the controller does the sensible thing and spreads the work over all drives in the array.
MySQL my.cnf setting: the innodb_thread_concurrency default is 4 or 8 depending on version. The commonly recommended setting is CPUs plus disks plus 1, which is 9 for this machine (2 + 6 + 1). Use many threads to try to keep the disk and CPU queues full. Use SHOW INNODB STATUS to look for many threads waiting for semaphores. If many are, reduce the setting. If none are, raise it until it happens sometimes; how often is a judgement question.
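The rule of thumb and the adjustment loop can be written down as a small sketch. The function names and the waiting-thread input are hypothetical; in practice that number would be read by eye from the SHOW INNODB STATUS output, and how much waiting is acceptable remains a judgement call.

```python
def starting_concurrency(cpus, disks):
    """Common starting point: CPUs plus disks plus 1."""
    return cpus + disks + 1

def adjust_concurrency(current, waiting_threads, many=5):
    """One tuning step based on threads seen waiting for semaphores.

    `many` is an arbitrary illustrative threshold, not a MySQL setting.
    """
    if waiting_threads >= many:
        return max(1, current - 1)  # many waiters: back off
    if waiting_threads == 0:
        return current + 1          # no waiters: try a higher value
    return current                  # occasional waiters: leave as is

# Dual Opteron with six disks, as in the database machine.
setting = starting_concurrency(cpus=2, disks=6)
print(setting)
```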
Changing RAID setup
This is a followup for Jimmy Wales on some discussion.
If the RAID controller doesn't support a neater way to do it, here is a step by step guide to redistributing space from two mirrored pairs of 73GB each to a RAID 1+0 array of 146GB.
- Check that the controller manual does not say that a change in arrays will erase all data. It is highly unlikely that it will in any high end controller but it's not completely impossible.
- Start out with two mirrored pairs consisting of drives AB and CD.
- Cancel the mirroring. You now have independent drives A, B, C and D.
- Have the controller combine B and D into a 146GB striped array.
- Copy all data from A and C to the BD striped array. You may need to boot into Linux to do this. Triple check everything to be certain that you have all of the data - the next step will destroy anything you miss.
- Tell the controller to make A a mirror slave of B and C a mirror slave of D. This will cause it to copy the data from B to A and from D to C.
- At this point you have RAID 1+0 with BA and DC the two mirrored halves of a 146GB array.
It should be fairly simple to test this prior to putting any real data on the disks. That could provide detailed notes on the steps required in the controller firmware.
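Before touching the controller, the order of the steps can be walked through with a toy model. This is only a sketch: the dictionaries and the data labels "db1" and "db2" are invented for illustration, and it models nothing about the real controller firmware.

```python
# Start: two mirrored pairs, so each drive in a pair holds the same data.
singles = {"A": {"db1"}, "B": {"db1"}, "C": {"db2"}, "D": {"db2"}}

# Cancel the mirroring, then combine B and D into a striped array.
# Their old contents are treated as lost once the stripe is created.
del singles["B"], singles["D"]
stripe = {"drives": ["B", "D"], "data": set()}

# Copy all data from A and C to the BD striped array.
stripe["data"] |= singles["A"] | singles["C"]

# Check that nothing was missed: the next step overwrites A and C.
missing = (singles["A"] | singles["C"]) - stripe["data"]
if missing:
    raise RuntimeError(f"not yet copied to the stripe: {missing}")

# Make A a mirror of B and C a mirror of D, giving RAID 1+0.
raid10 = {"halves": [["B", "A"], ["D", "C"]], "data": set(stripe["data"])}

print(raid10)
```

The check before the final step corresponds to the "triple check everything" warning: A and C are the only remaining copies until the stripe is verified.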