Distributed Media Storage/Internship report/Task description

Context

Multimedia content within the fast growing Wikimedia projects, like Commons and Wikipedia is posing one of the most difficult challenges with scaling. The method currently used for storing these multimedia items is an often used, but not very sophisticated and well-suited one: all multimedia files are stored on a single network file server used by all other servers. This method has several severe disadvantages, which will be discussed briefly.

Lack of scalability

A single NFS server doesn't scale, i.e. it's hard or impossible to grow required capacity with increased load. While other, scalable systems allow increasing capacity by just adding more hardware and software, that's not possible with NFS as the NFS protocol is inherently tied to just one server. Using multiple separate NFS servers is possible, but that increases complexity for all client systems and their management.

This is posing a problem for the extraordinary growth of the Wikimedia projects that heavily depend on multimedia content and static files, like Cascading Style Sheets, as they are effectively limited by the capacity of a single server - a bottleneck.

Vulnerability

Because of the infeasability to increase capacity, this system - and consequently the whole site - is very vulnerable. That means for example that it's very easy to grind the entire site to a halt by simply overloading this single server. As all projects are in one way or another dependent on static content on this server, this effectively is the achilles heel of the entire system that endangers the availability of the Wikimedia projects.

Another consequence is the difficulty of making backups of the content on the server. Starting a backup job immediately overloads the server, rendering the entire site slow or unreachable. Therefore backups are made rarely. A hardware defect concerning this server would thus mean data loss and extended unavailability of the entire site. Making all content available to third parties as required by the GFDL license, is difficult as well.

Difficulty of scaling to multiple clusters

In the current situation it's impossible to extend the system to multiple, geographically separated clusters, like the Kennisnet cluster in Amsterdam, and the Yahoo cluster in Korea to be set up soon. Replicating the multimedia content to other systems in other clusters is unfeasible, rendering these alternative clusters less effective and increasing dependability on the Florida cluster.

Alternative solutions

It appears that the few alternative solutions in existence are either only available under a commercial license, or not stable/mature enough for usage by Wikimedia, or have other disadvantages, making them unsuitable for Wikimedia.

The assignment

Wikimedia needs a long term solution for storing multimedia content in a distributed, scalable and redundant way. The assignment comprises of designing and implementing a system that takes care of storing and retrieving arbitrary binary objects (e.g. multimedia files), with emphasis on scalability and redundance.

Software requirements

The systems needs to be able to store and retrieve arbitrary binary objects (8 bit clean) (R1), identified only by a single binary number(R2). Storage should be divided over N systems, where N is a configurable number, $2\leq N<100$ (R3). A front-end needs to be provided for communication with other systems (mostly webservers) through a specified API, preferably using a standard protocol like HTTP(R4). The front-end has to support the following operations:

PUT: Storing an object in the system (R4.1)
GET: Retrieving a previously stored object (R4.2)
DEL: Removing a previously stored object from the system (R4.3)
TRAVERSE: Traversing (part of) the content of the system, facilitating backups (R4.4)

The system needs to have the following properties:

scalability - the system needs to scale to tens of systems (R5)
redundancy - objects need to be stored redundantly, i.e. loss of a single node should not cause any data loss (R6)
atomic operations - an operation will either be completed fully, or not executed at all (R7)
failover - the loss of a single node should not cause unavailability of the system (R8)
monitoring - operation of the system can be monitored through logfiles, and possibly using other monitoring interfaces (R9)
extendability - the system needs to have a clean and extendable design, and be prepared for at least these extensions: (R10)
- alternative load balancing algorithms - the original system will likely have only very minimal and simple load balancing. Integrating alternative load balancing algorithms must be easily possible. (R10.1)

The system will be developed under a free software license, likely the GNU General Public License, guaranteeing availability and extendability of the system by other developers, after the internship is over. (R11)