Grants:PEG/WM DE/Improve toolserver reliability

Funded
This submission to the Wikimedia Foundation Grants Program was funded in the fiscal year 2009-10. This is a grant to an organization.

IMPORTANT: Please do not make changes to this page now. They will be reverted.

  • This project has been funded, completed, and a project report has been reviewed and accepted by WMF Staff.
  • To review a list of other funded submissions by fiscal year, please visit the Requests subpage, and to review the WMF Grants Program criteria for funding please visit the Grants:Index.


Legal name of chapter
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Grant contact name
Daniel Kinzler
Grant contact user-name or e-mail
User:Duesentrieb
Grant contact title (position)
Software developer / Only Technology Officer
Project lead name
Daniel Kinzler
Project lead user-name or e-mail
User:Duesentrieb
Project lead title (position), if any
Software developer / Only Technology Officer
Full project name

Improve toolserver reliability

Amount requested (in USD)

40000 USD (ca. 30000 EUR)

Provisional target start date

July 2009

Provisional completion date

Deployment complete by end of August 2009

Budget breakdown

The grant would be spent on hardware, specifically:

  • Three Sun Fire X4250 servers with 32 GB RAM and 16 × 146 GB disks
    • ca. 10000 USD (8000 EUR) each
    • 30000 USD (24000 EUR) total
  • Setting up an additional server rack and equipping it with a terminal server and a switch:
    • ca. 7800 USD (6000 EUR)

This comes to a total of 37800 USD; the remaining 2200 USD would be used for shipping and insurance.
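The arithmetic behind these figures can be checked with a short sketch. The variable names are illustrative only, and the amounts are the approximate USD figures quoted in the breakdown.

```python
# Approximate USD figures taken from the budget breakdown.
GRANT_USD = 40000
SERVER_COST_USD = 10000   # per Sun Fire X4250 (approximate)
SERVER_COUNT = 3
RACK_COST_USD = 7800      # rack, terminal server and switch (approximate)

servers_total = SERVER_COUNT * SERVER_COST_USD
hardware_total = servers_total + RACK_COST_USD
remaining = GRANT_USD - hardware_total   # left over for shipping and insurance

print(f"Servers:                {servers_total} USD")
print(f"Hardware total:         {hardware_total} USD")
print(f"Shipping and insurance: {remaining} USD")
```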

Project scope

The scope of this project is deploying a redundant set of database copies of the three main database clusters, s1, s2, and s3. At the moment, the Toolserver maintains only one copy of each. This project aims to provide two copies of each, in order to improve reliability, availability and performance.

Project goal

The goal of this project is to improve the toolserver's reliability, availability and performance. The toolserver's core function, and also its weak spot, is live replication of the three main database systems, s1, s2, and s3. Whenever replication fails for one of these for too long, for instance because of a broken disk, we need to create and import a fresh dump before replication can be restarted. This takes a few days if all goes well, and weeks if not. During that time, most tools are not usable with the wikis served by the database system in question.

Because of this, we had full availability of the toolserver cluster for less than 90% of the last year, which comes to more than a month of downtime for at least one set of databases. By keeping a redundant copy, we hope to improve availability to 99%, that is, to reduce downtime to less than four days per year.
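The relation between the availability percentages and the downtime figures quoted here can be sketched as follows. The function name is hypothetical, and a 365-day year is assumed.

```python
DAYS_PER_YEAR = 365  # assumption: non-leap year


def downtime_days(availability_percent: float) -> float:
    """Days per year during which at least one database set is unavailable."""
    return (100 - availability_percent) / 100 * DAYS_PER_YEAR


# Availability levels mentioned in the text: last year's and the target.
for percent in (90, 99):
    print(f"{percent}% availability: about {downtime_days(percent):.2f} days of downtime per year")
```

At 90% availability this gives roughly 36.5 days, which matches "more than a month"; at 99% it gives roughly 3.65 days, which matches "less than four days".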

Without this grant, our budget will merely cover the cost of sustaining operation at the current level. We are facing several challenges:

  • Increasing database size (the wikis are growing)
  • Increasing user base (more people get and use toolserver accounts)
  • Increasing number of requests (more people use more tools on the toolserver, more often)

We will be investing in hardware to deal with these issues. But without a grant, we will not be able to provide redundant database copies in the foreseeable future.

Non-financial requirements

We would need Mark to set up the hardware at the Amsterdam cluster.

Fit to strategy

By improving the stability of the toolserver project, the Wikimedia Foundation would help to:

  • Keep the tools hosted on the toolserver working. Many of these have become invaluable in the everyday tasks of wiki maintenance and administration.
  • Provide reliable access to wiki data to researchers and enthusiasts who use it directly on the toolserver.
  • Keep the data of all the Wikimedia projects safe (see below).

Other benefits

Keeping redundant off-site copies of wiki content safeguards against a disaster in the Tampa data center. If the servers there are lost, we would have to rely on the copy in the toolserver cluster. Should one of the copies be offline at that point, all edits made after it went offline would be lost. Thus, keeping two redundant off-site copies of each database cluster would improve data safety.

Another benefit is that, with two servers for each database set, load can be balanced between them, or one server could be reserved for the many short queries that result from web-based tools while the other is used for long-running research and analysis queries.
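The second option, splitting traffic by query type, could look roughly like the sketch below. It is only an illustration: the host names, the role labels and the `pick_replica` function are hypothetical, and the proposal does not specify how such routing would actually be implemented.

```python
# Hypothetical host names; the real server names are not given in this proposal.
CLUSTERS = ("s1", "s2", "s3")
REPLICAS = {
    cluster: {
        "web": f"{cluster}-a.example",       # short queries from web-based tools
        "analysis": f"{cluster}-b.example",  # long-running research/analysis queries
    }
    for cluster in CLUSTERS
}


def pick_replica(cluster: str, long_running: bool) -> str:
    """Return the host that should serve a query for the given cluster."""
    role = "analysis" if long_running else "web"
    return REPLICAS[cluster][role]


print(pick_replica("s1", long_running=False))
print(pick_replica("s1", long_running=True))
```

A plain load-balancing setup would instead pick either host of a cluster for any query; the split shown here keeps slow analysis queries from delaying interactive tools.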

Measures of success

The prime measure of success is the availability of all database copies to toolserver users. If we can significantly reduce the time during which the data for some wikis is not available in real time to toolserver users, the project will have been successful.

A secondary measure would be performance, that is, reduced load and better query times on the database servers. When evaluating this, any increase in use has to be taken into account, of course.