Talk:February 2005 server crash

From Meta, a Wikimedia project coordination wiki

I still get occasional crashes and corruption:

Gerrit 21:18, 23 Feb 2005 (UTC)

Errors 404

Before the crash, this page was redirected; it now returns Error 404. The same holds for the statistics pages. Gerrit 15:55, 24 Feb 2005 (UTC)

Pretend Wikimedia had an unlimited budget

How can Wikimedia's services be scaled "infinitely" and designed to survive all but the most catastrophic failures? Through massive global replication.

Expensive part: maintaining a consistent version of the dataset across the replicated facilities.

Solution: eliminate the single master database. Build decentralized modules that are accessed over a "virtual" private network, load-balanced by DNS.

Without knowing the particulars of MediaWiki, you could divide it into a user module, an article module, a search module, a history module, a discussion module, a session module, etc. For each module, come up with a domain name and develop a protocol for accessing it (xmlrpc, nc, whatever): user.wikimedia.vpn, article.wikimedia.vpn, etc. Let's also pretend Wikimedia has 5 colo spaces around the globe.
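As a rough sketch of the naming scheme described above (the module names, colo prefixes, and the `.wikimedia.vpn` domain are all illustrative, not real infrastructure), each module gets a round-robin alias backed by one hostname per colo:

```python
import itertools

# Hypothetical service modules, each exposed under its own VPN domain.
MODULES = ["user", "article", "search", "history", "discussion", "session"]

# Five illustrative colo prefixes (a-e), one per facility.
COLOS = ["a", "b", "c", "d", "e"]

def hostnames(module):
    """All per-colo hostnames behind a module's round-robin alias."""
    return [f"{colo}.{module}.wikimedia.vpn" for colo in COLOS]

def round_robin(module):
    """Cycle through a module's colo servers, as round-robin DNS would."""
    return itertools.cycle(hostnames(module))

rr = round_robin("user")
print(next(rr))  # a.user.wikimedia.vpn
print(next(rr))  # b.user.wikimedia.vpn
```

In real DNS the rotation would come from multiple A records on the alias rather than a client-side cycle; the sketch only shows the shape of the namespace.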

Start with "user.wikimedia.vpn". This does not describe one server, but is rather a round-robin alias that points at any one of a.user.wikimedia.vpn, b.user.wikimedia.vpn, c.user.wikimedia.vpn, d.user.wikimedia.vpn, e.user.wikimedia.vpn (for the respective 5 colos). Any one of these servers can handle a user account query/update, and they make decisions based on some kind of internal protocol as to whether or not the update needs to happen synchronously. A synchronous operation may be as simple as having a server take a request, contact all of the other servers, and if they all agree with the request, it's committed. The number of truly synchronous requests for user management could be rare, such as "create user account" and "lock user account". The rest are details that can probably fall behind a few seconds or (in the worst case), get lost.
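The "contact all of the other servers, and if they all agree, it's committed" path could be sketched like this. Everything here is a placeholder under the assumptions above: the peer list, the set of synchronous operations, and `peer_agrees` (which a real system would implement as a network call over xmlrpc or similar):

```python
# Peers behind the hypothetical user.wikimedia.vpn alias, one per colo.
PEERS = [f"{c}.user.wikimedia.vpn" for c in "abcde"]

# Operations that must be applied synchronously everywhere, per the
# examples above; everything else may lag or (worst case) be lost.
SYNCHRONOUS_OPS = {"create_account", "lock_account"}

def peer_agrees(peer, op, payload):
    """Placeholder vote from one peer; a real implementation would make
    a network call and time out on unreachable servers."""
    return True  # assume healthy peers accept well-formed requests

def handle_request(op, payload, async_queue):
    """Commit synchronous ops only if every peer agrees; defer the rest."""
    if op in SYNCHRONOUS_OPS:
        if all(peer_agrees(p, op, payload) for p in PEERS):
            return "committed"
        return "rejected"  # any dissenting peer vetoes the change
    # Non-critical updates fall behind a few seconds via a local queue.
    async_queue.append((op, payload))
    return "queued"

queue = []
print(handle_request("create_account", {"name": "example"}, queue))  # committed
print(handle_request("set_preference", {"skin": "classic"}, queue))  # queued
```

Requiring unanimity keeps the rare critical operations strictly consistent at the cost of availability: if any colo is unreachable, account creation blocks until the resolvers drop that server from the rotation.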

Each body of data has idiosyncrasies that can be leveraged to make it effective in ways that you just can't do with a single gigantic master database.

If one goes down, your resolvers know this within a minute and have broadcast updates to all of the colos to remove the server from the round-robin. The user database is also the kind of thing that's small enough that, if there's a catastrophe, you can just blast new versions to all of the servers quickly, so the integrity controls don't need to be that tight.
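The failover step above amounts to pruning dead servers from the rotation once a health check fails. A minimal sketch, with hostnames and the health-check policy as stand-ins for what the resolvers would actually do:

```python
def prune_rotation(rotation, healthy):
    """Keep only servers that pass the health check, as the resolvers
    would after broadcasting the removal to all colos."""
    return [server for server in rotation if healthy(server)]

# Hypothetical rotation: one user-module server per colo (a-e).
rotation = [f"{c}.user.wikimedia.vpn" for c in "abcde"]

# Suppose colo "c" just went down.
down = {"c.user.wikimedia.vpn"}
rotation = prune_rotation(rotation, lambda s: s not in down)

print(rotation)  # the four remaining servers, c removed
```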

Obstacles: a massive rewrite of the MediaWiki code. Lots of money needed. Lots of testing needed. Reinventing lots of wheels. Lots of debugging nightmares.

Idealistic goal: bliss.

Dead downtime link

I'm getting a 404 at this address linked to in the text: