Move Text to Filesystem

Another approach that might improve Wikipedia performance is file-based storage of article text. I'm unsure how well the various filesystems would handle storing millions of files, but there are very large Squid proxy installations that seem to hold up quite well. (For example, Rabbit could be stored at /wikipedia/r/a/b/rabbit.)
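
A rough sketch of how such a path scheme might be computed, in PHP (the function name and base directory are made up for illustration; real titles would also need sanitizing for characters like "/"):

 <?php
 // Hypothetical sketch: map an article title to a sharded path,
 // using the first three letters as directory levels so that
 // "Rabbit" lands at /wikipedia/r/a/b/rabbit.
 function articlePath( $title ) {
     $key = strtolower( $title );
     $dirs = array();
     for ( $i = 0; $i < 3 && $i < strlen( $key ); $i++ ) {
         $dirs[] = $key[$i];
     }
     return '/wikipedia/' . implode( '/', $dirs ) . '/' . $key;
 }
 
 echo articlePath( 'Rabbit' );  // prints /wikipedia/r/a/b/rabbit
 ?>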

  • For storing the text, or for storing attributes? (Timestamps, author attributions, link relationships, edit comments, etc.) A database has obvious advantages over a traditional filesystem for keeping track of things we want to search, group, and sort on. The text itself is another matter; currently we use the database for the fulltext search, but that search already runs against a copy of the text munged to remove markup, kept exclusively for that purpose. Hypothetically we could put the regular text of articles (and likewise cached HTML) into files which only the web server needs to read/write. (That would put us somewhere around a million files at present, with an average file size of a couple of kilobytes, and we should expect it to grow.) Would this be more efficient for some purposes? How would this complicate or ease mirroring, etc.? --Brion VIBBER 20:36 24 Mar 2003 (UTC)
    • Good chance this would be considerably faster for storing the text. Let's face it - databases just weren't meant for storing large amounts of freeform text (that is what filesystems are for). I do know that there are quite a large number of programs out there built for searching HTML files; it might be best to build this index at night. Using ReiserFS for this purpose would probably be a good idea (this is what it is built for). In fact, you don't even need the directory separation with ReiserFS, but it would probably be good for portability to other systems (Solaris/UFS, for example). You could probably stop using the database for determining whether a page exists (you can just do a stat(2) on the files, or file_exists() might be even faster; see the sketch below this thread). As far as backup or mirroring goes, I imagine it wouldn't be any more or less complicated than MySQL - although distributing data could be difficult (but it is difficult with a database, too). --Marumari 22:22 24 Mar 2003 (UTC)
    • MySQL seems to try to load all the records into memory; it's likely that the underlying operating system will do a much more efficient job at retrieving freeform text. Dwheeler 20:53 2 May 2003 (UTC)
      • The OS also tries to load all accessed files into memory, it's called a disk cache. :) --Brion VIBBER 00:04 23 May 2003 (UTC)
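
A minimal sketch of the filesystem existence check suggested above, assuming the hypothetical articlePath() mapper from the earlier sketch:

 <?php
 // Hypothetical sketch: decide whether a page exists by probing
 // the filesystem rather than querying the database.
 function pageExists( $title ) {
     return file_exists( articlePath( $title ) );
 }
 ?>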

As of mid-May 2003: we use the filesystem to cache rendered, ready-to-output pages for anonymous users. Yee-haw.
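
A simplified sketch of what such an anonymous-user file cache might look like (the file layout and function names here are invented for illustration; they are not the actual implementation):

 <?php
 // Hypothetical sketch: serve a pre-rendered page straight from the
 // filesystem for anonymous users, falling back to normal rendering.
 function tryServeCached( $title, $loggedIn ) {
     $cacheFile = '/cache/' . md5( $title ) . '.html';
     if ( !$loggedIn && file_exists( $cacheFile ) ) {
         readfile( $cacheFile );  // send the ready-to-output HTML as-is
         return true;
     }
     return false;  // caller falls through to the normal render path
 }
 ?>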