Distributed Media Storage/Internship report/Comparison to current system


Model and description of current situation

[Figure: Current situation]

Wikimedia's current image/media storage architecture consists of three components:

  • The front-end caching proxy servers (Squids)
  • The application servers running the MediaWiki software (Apaches)
  • The image servers, storing all images on their filesystems and delivering them through their own webservers (HTTP, NFS)

To illustrate, the paths of a typical upload request and a typical retrieval request for an image through this architecture are described below.

Image upload

Users can upload files through an HTML form that is part of the MediaWiki software. They can select a file on their local computer, and enter a destination filename and accompanying wikitext as description and metadata. Upon submission, the file is transferred as part of an HTTP POST request to the Wikimedia cluster.

The POST request reaches one of the caching proxy servers, selected by a transparent load balancer. POST requests are inherently non-cacheable, so the request is forwarded to one of the MediaWiki application servers (Apaches), again selected by a transparent load balancer.

The MediaWiki software accepts the incoming uploaded data, checks the file's characteristics (allowed/banned file type, virus checks), and stores some metadata (file type, size, etc.) in a central SQL database that also drives the rest of the wiki. The file itself is then stored on the filesystem of one of the image servers; this filesystem is remotely mounted from the image servers over NFS.

The MediaWiki software also sends a purge message to all caching proxy servers, notifying them that the uploaded image and the corresponding image pages have changed and should be removed from the cache. After this, the upload request is complete, and the user is redirected to the newly created image page, where they can see the result.
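The destination-path and purge steps above can be sketched as follows. The hashed directory layout and the `/images` URL prefix are assumptions modeled on MediaWiki's conventions, and the proxy host names are hypothetical:

```python
import hashlib

def storage_path(filename: str) -> str:
    # Hashed directory layout modeled on MediaWiki's convention:
    # /<x>/<xy>/<filename>, where x and xy are the first hex digits
    # of the MD5 hash of the filename.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"/{digest[0]}/{digest[:2]}/{filename}"

def purge_urls(filename: str, squids: list) -> list:
    # One purge URL per caching proxy server; after an upload, a purge
    # request is issued for each of these so stale copies are dropped.
    return [f"http://{host}/images{storage_path(filename)}" for host in squids]
```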

Image download

A typical image request is part of a larger HTML/wiki page in which the image is embedded, although the actual HTML and image files are of course retrieved with separate HTTP GET requests.

The GET request for the wiki page will end up on one of the caching proxy servers. If the requesting user is not logged in to the wiki system (meaning the page has been rendered for an anonymous user, for whom its content is always identical), and the page is not an inherently dynamic page, the corresponding HTML text may have been cached by the proxy server during an earlier request. In that case, the proxy server retrieves it directly from its memory or disk cache and sends it back to the requesting user. This happens for about 60-80% of all requests, which is the hit rate of the Squid proxy servers.

If the request has not been cached by the caching proxy server, it will be forwarded to one of the MediaWiki application machines. The MediaWiki software renders the wiki page using its normal procedures, which are outside the scope of this description. However, whenever it encounters an image link, where an image (or other media) will be embedded, it needs to look up some metadata, like the size of the image. First, it tries to gather this metadata from Memcached, a fast memory cache shared by all application servers. If the relevant data has not been stored there by a previous request, it gathers it from the source media file itself, on one of the mounted NFS filesystems of the image servers.

The metadata is then cached in the shared memory cache and used to render the rest of the HTML page, which is sent back through the Squids to the requesting user. While parsing the HTML text, the user's web browser encounters the media links and issues extra HTTP GET requests to retrieve these media files.
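This lookup-then-populate pattern can be sketched as follows; a plain dictionary stands in for the shared Memcached pool, and the loader callback represents the slow NFS read:

```python
class MetadataCache:
    """Cache-aside lookup of media metadata, as done by the
    application servers (names and structure are illustrative)."""

    def __init__(self, cache, load_from_file):
        self.cache = cache           # stands in for the shared Memcached pool
        self.load = load_from_file   # slow path: read the media file over NFS

    def get(self, name):
        meta = self.cache.get(name)
        if meta is None:             # cache miss: fall back to the source file
            meta = self.load(name)
            self.cache[name] = meta  # populate the cache for later requests
        return meta
```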

These GET requests again end up on the caching proxy servers, which may have cached the files before; media files are static rather than dynamic and don't change very often. In case of a cache miss, the request is forwarded, though not to the application servers but directly to one of the image servers.

The image servers run HTTP servers and can serve part of the image files directly from disk. Images not local to the particular image server are retrieved over NFS from the other image server.

Model and description of future situation

[Figure: Possible future situation]

A possible new architecture using the new distributed media storage system is one where the image servers are replaced by the new system, while the rest stays mostly the same. However, instead of using an NFS interface between the application servers and the media servers, there will now be an HTTP interface. Furthermore, the application servers can talk to only a subset of the nodes (configured as such) participating in the node web of the storage system.

The HTTP interface between the application servers and the storage system can be abstracted by a custom-written library, which hides the underlying HTTP protocol and can also take care of load balancing and possibly some metadata abstraction. Because the protocol used is standards-compliant HTTP, it's also possible to use suitable, generally available products that support HTTP.
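A minimal sketch of such a client library, assuming hypothetical front-end host names, an assumed `/objects/` URL scheme, and a simple round-robin balancing policy (the real library could use any strategy):

```python
from itertools import cycle

class DMSClient:
    """Hypothetical client library that hides the HTTP protocol of the
    distributed media storage from the MediaWiki application code."""

    def __init__(self, frontends):
        self._nodes = cycle(frontends)  # trivial round-robin load balancing

    def url_for(self, object_id):
        # Pick the next front-end node and build the request URL for it.
        return f"http://{next(self._nodes)}/objects/{object_id}"
```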

Image upload

The POST request for the image upload passes through the same path as in the current situation, up to the point where the MediaWiki software on the application servers wants to store the file. Instead of writing it to the remotely mounted NFS filesystem, it hands the file over to a client library. This library contacts one of the media storage front-end servers and issues a PUT request, handing over the data. Depending on the HTTP return code delivered by the storage system, the transaction is either regarded as completed successfully, after which it's the storage system's responsibility to maintain reliable storage of the object, or regarded as failed, in which case an error is returned to the user.
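The success/failure decision based on the HTTP return code might look like this; treating any 2xx status as a committed write is an assumption for illustration, not a documented part of the protocol:

```python
def put_succeeded(status: int) -> bool:
    # Interpret the storage system's HTTP return code for a PUT: any
    # 2xx status means the object is now the storage system's
    # responsibility; everything else is a failure that MediaWiki must
    # report back to the uploading user.
    return 200 <= status < 300
```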

Image download

Image download (HTTP GET) requests go through the caching proxies as usual, which may answer requests from their cache. Cache misses are forwarded to the media storage front-end servers, which accept the GET request directly and deliver the corresponding objects. In case of errors, system unavailability, or missing objects, a regular HTTP error message is returned and delivered to the user.

Because the distributed media storage only takes care of key/data storage, it's MediaWiki's responsibility to find the corresponding object identifiers for the requested objects by examining its own database with metadata. Should more user-friendly URLs be wanted, then this can be accomplished by an additional layer between the caching proxy servers and the storage system (possibly affecting performance), or a relatively thin mapping within the storage system itself.
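The identifier lookup described above can be sketched as follows, with a dictionary standing in for MediaWiki's metadata database; the `/objects/` URL scheme and host name are hypothetical:

```python
def object_url(title, id_map, frontend):
    # Resolve a user-facing title to a storage object identifier using
    # MediaWiki's own metadata (a dict stands in for the SQL database).
    oid = id_map.get(title)
    if oid is None:
        return None  # no such object: translates to an HTTP 404
    return f"http://{frontend}/objects/{oid}"
```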


This section discusses the most important differences between the new and old architectures regarding performance, reliability and manageability.

Performance aspects

The most important performance aspect of the new system is that it attempts to scale better than the current architecture, because it has been designed to work with multiple servers/nodes from the start. In principle this is possible with the old architecture as well, but its manageability problems prevent that, restricting it to just one or at most a few servers.

The storage load of the complete working set can be scaled to almost arbitrary numbers of participating nodes, depending on the available I/O resources and storage space per node. The distribution of objects over nodes is very dynamic, and it makes little difference to the performance of the system whether objects are distributed over a few or tens of nodes. The administrative overhead of extra nodes is very small and insignificant compared to the number of objects stored.
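The report does not fix a concrete distribution algorithm; consistent hashing is one well-known scheme with the property claimed above, namely that adding a node moves only a small fraction of objects. A sketch under that assumption:

```python
import hashlib
from bisect import bisect

def ring(nodes, vnodes=100):
    # Build a consistent-hash ring: each node gets `vnodes` points on a
    # circle keyed by MD5, so objects spread roughly evenly and adding a
    # node only claims the key ranges adjacent to its new points.
    return sorted(
        (int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16), node)
        for node in nodes
        for i in range(vnodes)
    )

def node_for(points, key):
    # Walk clockwise from the object's hash to the next node point.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    idx = bisect([p for p, _ in points], h) % len(points)
    return points[idx][1]
```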

The request load of client requests can be distributed over several (or all) nodes participating in the system as well. All nodes answer requests for objects they hold a replica of, but some nodes additionally handle requests from clients. The number of front-end nodes for this task can be scaled as necessary and as resources permit. Because no additional administrative overhead is involved, this makes no difference to the rest of the system.

The component that scales least in the new system is obviously the master SQL index, which handles all queries and index updates of all nodes. This can be mitigated somewhat by using SQL slave replication for some index read queries, the same strategy as used by MediaWiki within the Wikimedia cluster today. Another possible solution might be the use of technologies like NDB, MySQL's clustered in-memory database, for the central SQL index. As explained in the final design notes, the system has been somewhat prepared to be extended with alternative index components, like a fully distributed hash table based model.
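The read/write split over a replicated index can be sketched as follows; host names are hypothetical and the statement classification is deliberately simplistic:

```python
import random

class IndexRouter:
    """Route index queries: writes go to the master database, reads may
    go to any replication slave (a sketch, not the actual component)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def host_for(self, query):
        # Writes must hit the master so replication stays consistent.
        if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master
        # Reads can be spread over the slaves, if any exist.
        return random.choice(self.slaves) if self.slaves else self.master
```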

Most of all, the new architecture can make better use of already available but unused resources. The Wikimedia cluster currently consists of over a hundred servers, most of which are application servers that make very little use of their local disk storage and I/O resources. By making these resources available for media storage without enormous manageability and performance overhead, Wikimedia effectively gains a lot of resources for free.

Reliability aspects

Reliability is one of the biggest problems of the current architecture. All data is stored only once, on one of the two image servers. Although these servers use redundant RAID setups, they are still very prone to failure; a single failing component can cause the unavailability or complete loss of half or all media objects.

On top of this, due to the lack of scalability and the consequently overloaded state of these servers, making backups is very hard. The extra I/O load caused by traversing all directories immediately makes them too slow for other production service, and the entire site suffers. Backups are only possible in a very careful and slow process, meaning that they take multiple days to finish and cannot be run very often.

In case of a disaster (which a simple, single component failure could cause), it's likely that all files uploaded since the most recent image dump backup run, which can be weeks old, will be lost. This is not acceptable in the context of Wikimedia's exponentially growing projects.

Redundancy is one of the key aspects that distinguishes the distributed storage system from many other solutions. The system automatically monitors the redundancy level of all objects and restores it once it drops below a certain threshold, during which the service remains fully functional and available.
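The redundancy monitor's core decision, finding objects whose replica count has dropped below the threshold, can be sketched as:

```python
def under_replicated(replica_counts, threshold):
    # Given a mapping of object id -> current replica count, return the
    # objects that must be copied to additional nodes to restore the
    # configured redundancy level; the service keeps running meanwhile.
    return sorted(obj for obj, n in replica_counts.items() if n < threshold)
```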

The fact that nodes participating in the system can be geographically separated is also of great help for reliability. Disasters happening to one cluster/datacenter can no longer destroy all data, because even recently added objects are present on nodes in other clusters.

Manageability aspects

For big clusters with very large datasets that are required to work around the clock, manageability of data and services becomes very important.

For a long time, Wikimedia used a single image storage server, exporting its NFS filesystem to all application servers. Once this server proved unable to handle the entire load, another server was added as a quick hack. The scalability and management problems of this approach immediately became obvious: all application servers now had to mount the filesystems of two file servers, and all data had to be split in half and moved to the extra server. This was implemented by splitting the top-level directory in half and moving half the subdirectories to the new server. Furthermore, because both image servers needed access to all data, they needed to mount each other's exported filesystems as well, and use symbolic links to make this transparent.

It was quite evident that this is not a feasible approach in the long run, when even more servers are added. It's cumbersome and inefficient to mount all remote filesystems of all image servers on all application servers, creating a full mesh with a lot of accompanying state. Also, maintaining an even distribution of objects over the separate file servers by hand is unmanageable, especially because it requires downtime for every move, and the configuration of the MediaWiki software and/or the filesystem structure needs to be altered to map to the new media object locations.

These aspects have been taken into account in the design of the distributed media storage system. Not only is dynamic (re)distribution of objects over nodes more reliable, it also helps greatly with manageability. Individual nodes can be taken down or added as needed without affecting general operation, offering automatic failover. Object distribution can be adapted flexibly to demand, and the redundancy of individual object classes can be tuned to needs dynamically.

A coupling between all application servers and multiple image storage nodes still exists, but it is of a very different sort than with NFS mounts: temporary HTTP client connections, load balanced over the available DMS front-end servers, with no additional state overhead. Moreover, no reconfiguration of application servers or the filesystem structure is necessary when things need to change.