Wikimedia budget/Hardware efficiencies

Or "how to avoid spending One Billion Dollars on hardware in 2009" (projection).

I. Isolate the core of desired functionality. Scale the most important 80% of functionality and allow the rest to lag. (We already do this to an extent by not allowing certain Special page queries to run.)

  • Example 1: Most visitors read but don't write. Give them blindingly fast servers, db- and cache-optimized for serving pages, and redirect queries like "History" pages, or even less central requests like "Talk" pages, to less efficient servers (see the first sketch after this list).
  • Example 2: Most cycle-intensive hits come from
    1. Search requests
    2. Recent Changes
    3. Long page histories for main pages (probably ~500 pages in all)
    4. Other in-the-know users visiting various Special pages (many currently disabled)
    5. Devs running live db queries (almost never done any more) & updates
So we can set aside a certain short period of time each day when sloth/downtime may take place, and try to cut any needed downtime for db updates into small chunks; we can arrange special caching solutions for the most active page histories and special pages (requiring active effort, and a deprioritized db handle, to get an up-to-the-minute result); and we can run a separate, optimized search server.
  • Example 3: Certain groups of users (New pages patrol, welcoming committee, volunteer fire dept) will want frequent access to many varieties of RC, new pages, & new user registrations. Design different tools & interfaces to give them what they need without unduly taxing the db or web servers.
  • Example 4: DDoS efforts: Sometimes editing can grind to a halt thanks to a combination of vandals & bots. Set aside a separate db server for bots & anon/new accounts, so that this sloth won't affect everyone (see the second sketch below).
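
Example 1's routing idea could look roughly like the minimal sketch below. This is not an existing Wikimedia or MediaWiki component; the pool names, URL patterns, and the choose_pool/route_request helpers are invented for illustration, assuming a front-end router that sees the request URL before it reaches the web servers.

```python
# Sketch: send ordinary page views to a read-optimized pool, and less central
# requests (history, talk pages, special pages) to a slower pool.
# Pool names and URL conventions are illustrative only.
import random
from urllib.parse import urlparse, parse_qs

READ_POOL = ["web1.fast", "web2.fast"]   # db- and cache-optimized for page views
SLOW_POOL = ["web9.slow"]                # history, talk, special pages, etc.

def choose_pool(url: str) -> list:
    """Pick a backend pool based on the kind of request."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    title = query.get("title", [parsed.path.lstrip("/").removeprefix("wiki/")])[0]
    action = query.get("action", ["view"])[0]

    if action != "view":                           # history, diffs, edits, ...
        return SLOW_POOL
    if title.startswith(("Talk:", "Special:")):    # less central namespaces
        return SLOW_POOL
    return READ_POOL

def route_request(url: str) -> str:
    """Return the backend that should serve this request."""
    return random.choice(choose_pool(url))

if __name__ == "__main__":
    print(route_request("/wiki/Main_Page"))                        # fast pool
    print(route_request("/w/index.php?title=Foo&action=history"))  # slow pool
    print(route_request("/wiki/Talk:Foo"))                         # slow pool
```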
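
In the same spirit, Example 4's split could be as simple as choosing a database handle by user class. The server names, the User record, and the four-day "new account" threshold below are all made up for the sake of the sketch.

```python
# Sketch: route writes from bots, anons, and brand-new accounts to a separate,
# lower-priority database server so a vandal/bot storm cannot slow everyone down.
# Server names, the User record, and the age threshold are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PRIMARY_DB = "db-primary.example"   # established, logged-in editors
LOWPRIO_DB = "db-lowprio.example"   # bots, anons, brand-new accounts

NEW_ACCOUNT_AGE = timedelta(days=4)

@dataclass
class User:
    name: str
    is_bot: bool = False
    is_anon: bool = False
    registered: Optional[datetime] = None

def db_for(user: User, now: Optional[datetime] = None) -> str:
    """Pick which database server should take this user's writes."""
    now = now or datetime.utcnow()
    if user.is_bot or user.is_anon:
        return LOWPRIO_DB
    if user.registered is not None and now - user.registered < NEW_ACCOUNT_AGE:
        return LOWPRIO_DB
    return PRIMARY_DB

if __name__ == "__main__":
    print(db_for(User("SomeBot", is_bot=True)))                             # db-lowprio.example
    print(db_for(User("LongTimeEditor", registered=datetime(2003, 1, 1))))  # db-primary.example
```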

II. Identify the right tool for the bottleneck, then find out how to acquire it in bulk.

  • Examples: finding a high-NFS-op machine to cope with an NFS bottleneck, getting machines that can effectively use 16 GB+ of RAM to cope with RAM bottlenecks, and choosing toolchains that all rely on the same elements to maximize the economies of scale generated by these hardware decisions.
  • Examples: developing a relationship with bulk server providers, or bulk RAM providers, instead of going through intermediaries. On the plus side, we have Wikimedians (and may even be able to offer suitable publicity) in whichever parts of the world produce our target tools.

III. Make efficient use of requests for in-kind donations.

  • Asking NetApp to donate a nice machine is easier than asking them to donate $10,000. Ditto for db donations from Oracle, server donations from IBM, etc.

September 2004 analysis of the above:

  • Most visitors have the majority of their work served by the Squid cache servers, and we buy more of those as needed to keep up with demand. High traffic peaks are served very efficiently and quickly in this way. Where the Squids don't have a copy, memcached is used to reduce load and improve performance (see the first sketch after this list).
  • Cycle-intensive (really disk-database intensive) hits include
    • Search, which is being addressed with a revised table design, a much more efficient query, and a dedicated search server, planned to scale to more servers as required.
    • Recent changes is an inexpensive query.
    • Long page histories are OK; old page histories are not. This is being addressed by a table design change and by a plan to move the text portion off the primary database servers, onto a server which can inexpensively handle the growth in size of this data collection.
    • Special pages are cached and updated by a batch job which is run several times a week at low-load times (see the second sketch after this list).
    • Devs are good at working out what is likely to be slow and at running those queries at off-peak times. They also have access to the backup servers and can use those at times.
  • Various tools and enhancements in this area have been and are being developed, from recent changes bots in IRC to an indication of which changes have already been examined.
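
The Squid-plus-memcached pattern described in the first bullet above amounts to read-through caching. Here is a minimal sketch, assuming the third-party pymemcache client and a placeholder render_page() function; the key scheme and TTL are not taken from the real configuration.

```python
# Sketch: read-through caching for renders that the Squid layer misses.
# Assumes the third-party `pymemcache` client; render_page(), the key scheme,
# and the TTL are placeholders, not the real MediaWiki/memcached setup.
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def render_page(title: str) -> bytes:
    """Placeholder for the expensive parse and database work."""
    return f"<html><body>{title}</body></html>".encode()

def get_page(title: str, ttl: int = 3600) -> bytes:
    key = "page:" + title.replace(" ", "_")   # memcached keys cannot contain spaces
    cached = cache.get(key)
    if cached is not None:
        return cached                         # cache hit: no parse, no db query
    html = render_page(title)                 # cache miss: do the real work once
    cache.set(key, html, expire=ttl)          # keep it for the next reader
    return html
```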
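
The batch-updated special pages mentioned above follow a simple precompute-and-serve pattern, sketched below. The query names, run_query() stub, and cache directory are invented for illustration; the real job runs MediaWiki's own maintenance queries.

```python
# Sketch: precompute expensive special-page listings in a batch job run a few
# times a week at low-load times, and serve the stored results in between.
# The query names, run_query() stub, and cache directory are invented.
import json
import time
from pathlib import Path

CACHE_DIR = Path("/var/cache/specialpages")   # hypothetical location

EXPENSIVE_QUERIES = {
    "Wantedpages": "SELECT ...",    # placeholder SQL; the real queries live in MediaWiki
    "Deadendpages": "SELECT ...",
}

def run_query(sql: str) -> list:
    """Placeholder for running a slow query, ideally against a backup/replica server."""
    return []

def refresh_all() -> None:
    """Regenerate every cached listing; meant to be invoked from cron off-peak."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    for name, sql in EXPENSIVE_QUERIES.items():
        rows = run_query(sql)
        payload = {"generated_at": time.time(), "rows": rows}
        (CACHE_DIR / f"{name}.json").write_text(json.dumps(payload))

if __name__ == "__main__":
    refresh_all()
```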