Wikimedia cache strategy evolution during 2003

Note: the content here is pretty much out of date, see Wikimedia servers for more up-to-date information on the status quo. -- Gwicke 18:06, 9 Mar 2004 (UTC)

Wikipedia is big, popular, and hit by bajillions of googlers each day. Our lovely dual Athlon server is under a bit of strain, and the wiki can run rather slow, particularly during peak hours (day in US through evening in Europe).

What are the main causes of lag?

There are a number of things still to be done, such as:

Separate database and web servers
Summary tables for special pages
Upgrade to MySQL 4.0 (but is it stable?)
One-pass parser
Kernel upgrade?
PHP caching and optimization
Compiler optimizations
Move Text to Filesystem (an option, no commitment to do this)

And, of course, a coherent cache strategy. No matter how fast we make the parser, it's still going to take longer to parse wikitext and check for the existence of linked pages than it would be to spit out a pre-built blob of HTML -- or avoid sending any data at all. And it should be fast enough to maintain those pre-built blobs of HTML that it's an overall win.

Client-side caching

Web browsers pretty much all have caches built into them. When a user requests a previously visited page, the browser can show the version it already has instead of fetching it all over again, which saves network bandwidth and avoids loading the server.

If a second request is sent to the server, we can compare the last-modified time sent with the request against the current page and, if it hasn't changed, avoid rendering and sending the data again.

Issues:

How to force validation of pages consistently on many browsers?
- On every view? Or just on reloads / post-edit visits?
  - I can't think of a way around this. If we don't force revalidation on every view, the browser will happily skip the times when we wish it would check. Validation is fairly speedy, so not too big a deal.
Need a second invalidation timestamp field that can be updated when linked pages are created/destroyed, so that red links are updated.
- Presently running on all servers.
Need a per-user invalidation timestamp so that pages are reloaded with preferences after login.
- See above; current test code invalidates on login, change of preferences, and hopefully change of newtalk status -- but in theory this could be refined to skip preference changes that won't alter the look of pages.
- What about logout?
Need a site-wide invalidation timestamp for changes to the software that affect rendering
- Partially implemented; doesn't yet work for client-side, and is set in LocalSettings
Invalidate on new messages flag set/cleared... may be tricky to clear for anons
- Session variables may be suitable here, for those with cookies enabled at least.

Status: currently enabled for all browsers except IE 5.0. Mostly works, but there are some cases not caught right.

Server-side caching

When multiple people request the same page, but the page hasn't changed in that time, we could save a little time by caching the rendered HTML on the server, and sending that saved material out the second time.

The most frequent case is probably random people visiting from google. These are anons, and are probably going to spike on a relatively small number of popular topics. It would be worth doing some analysis of logs to confirm this.

Various issues of cache consistency...
- Invalidating caches of linking pages; see also above
Some user options alter the html output of the page itself. And of course there's variation throughout the skin.
Anonymous users all have the same defaults -- we could even reuse the skin with a little templating for the IP.

Storage:

In the database?
As files? (cf separate database and web servers)
Using 404 handler caching?

Status:

Running on test.wikipedia.org and www.wikipedia.org
Storing full HTML of pages in files, for anons on plain pages only. Have removed the IP address and IP talk link from the skin for anons; we could template it if these are desired.

Here are some comments from my perspective as a non-developer, partly to give some background as to why this is so important.

Over the past year, the English Wikipedia has grown roughly fourfold in every dimension - four times as many articles, four times as many edits per day, four times as many users. Most people have still never heard of Wikipedia, and while we can't be certain, it's not inconceivable that this growth could continue at a similar rate. We could quite possibly have a site 16 times bigger, with 16 times the activity, in just a couple of years. It is that kind of usage level that we need to think about and target.

I would imagine that this kind of growth will be difficult to keep up with on the software side - noone has had to build a wiki like this before. Slow software has been a big constraint on use of the site before, and is likely to be again in the future as growth occurs so quickly.

So, site performance should be of the highest priority, and take precedence over features that are not vital, but merely useful, and have a performance cost. If the cache strategy makes some current features not work so well, we should tolerate that. For example, I think it would be perfectly OK for red links not to be updated to blue as soon as articles are created, if this has a performance cost (provided that an update occurs every few hours/days).

Lastly, thanks to the developers for your hard work! --Enchanter

Stop showing red links for anonymous people

Here is my proposal. Can we stop checking whether the page exists or not to render red links when an anonymous user send a page request? If we stop discriminating links we can simply render links <a href=bahabaha>. It seems to me that it will improve the performance significantly. Or am I missing something important? -- TakuyaMurata 01:42 13 Mar 2003 (UTC)

I'm sorry, I'm not quite sure what you're referring to. Obviously we check if the page exists or not when we load it! If you mean marking broken links; not doing that would negatively impact the usability of the wiki. --Brion VIBBER 02:51 13 Mar 2003 (UTC)

Exactly. I wonder what I was thinking. Anyway, this should make sense now. -- 140.190.66.4 04:09 13 Mar 2003 (UTC)

The fastest thing would of course be to not have to render anything for most anonymous users, since we can simply read a cached full-page file off the disk and spit it out. The ideal fast path is roughly:

(load php)
check database connection (should be persistent, already open)
check cookies - find we are not logged in
ask the db if there are any new messages on this IP's anonymous talk page (it responds that there are none)

user has asked to view a current revision of the page; ask the database for its last modified / cache invalidation timestamps
check cache directory, find that there is a prerendered html file which is still valid
pass the file through to output and exit.

This would require only two simple queries on the database. However, I've not done any stats to see how often we can expect this best case scenario, vs how often we'd need to regenerate pages in their entirety. For maximum efficiency we'd have to remove the IP address from the corner of the screen (if it doesn't change, we can even pre-gzip the pages and send compressed pages to browsers that support it without a performance penalty on subsequent sends), though for minimal effort we could use a placeholder string in the cached copy and replace it on the fly.

Remember, no matter how fast we make the parser, it will always be faster to send a static file. --Brion VIBBER 21:53 19 Mar 2003 (UTC)

Another common case would be logged in users who are not sysadmins and who have the default preferences set. Perhaps we could encourage people to use the default preferences by promising them a more responsive wikipedia experience?

Just musing...
Why is there the need to re-parse the article for each request? - Because the existence of articles linked-to might have changed.
Now, new pages are a relatively rare event (compared to the number of requests made). Thus, it should be cheaper to maintain a flag in each article, which is set if the article has been updated, if a previously broken link has been filled, or if a linked-to article has been deleted.
Thus, whenever a change to the article itself, or a change to the existence of an article is made, a number of flage (in the affected articles) is set.
When someone requests the article, the flag is checked. If the flag is set, parse the article, generate 'generic HTML' (if possible) that is usable by all skins (or alternatively generate pages for all skins). -- (Or just the default page for anonymous users, or...)
If the flag is not set, a page will exist already.
Please check one of the following: The idea

is stupid
not practical
could work

-- 142.176.177.106 13:36 21 Mar 2003 (UTC)

Yes, that's pretty much covered by the above "Need a second invalidation timestamp field that can be updated when linked pages are created/destroyed, so that red links are updated." --Brion VIBBER 17:43 21 Mar 2003 (UTC)

No, that is a client side flag, I am talking about a global database flag, thus requests by any user would be directed to the db, check the flag, and then generate the file if the flag is set, otherwise send the existing file. This does not address the question of network bandwidth, only server processing time. -- 142.176.177.106 19:52 21 Mar 2003 (UTC)

It seems we are in heated agreement. This flag serves both uses, as comparison with the timestamp of a cached version (either on the client or on the server) would show if the cached version is out of date and a new copy needs to be generated/sent.

To clarify: this would add a timestamp field to the article database (say, cur_touched) which would be updated whenever the article was changed, and whenever an article linked from it is created or deleted. On receiving a request from a web client which contains an If-Modified-Since header, we check if it is identical to cur_touched. If so, there has been no change and we inform the client of this: no need to generate or send the page. Otherwise (new request, client that doesn't support caching, or page has changed since client's cache) we move on: now we check our cache table/filebase/whatever and see if there is a cached copy of the page. If there is, we check it's timestamp against cur_touched: if it's identical, then the cached copy is valid and we sent it to the client with minimal need for additional processing. Otherwise (no cache, or invalid cache), we generate HTML from the wikitext, save it in our cache, and send it to the client.

(Identical comparisons probably aren't necessary; with the last touched timestamp properly updated for things like moves and redirects and, it should be fine to use the timestamps of generation to store on the cache and send as Last-Modified to the client. A check that they're equal to or later than cur_touched should be sufficient to ensure consistency.) --Brion VIBBER 21:47 21 Mar 2003 (UTC) (Thanks -- 142.176.177.106 22:10 21 Mar 2003 (UTC))

Quick question. Does Wikipedia seriously do a SELECT for every single link in an article? So, say we are looking at Rabbit - are you saying it does 28 SELECTs to verify the existence of those links? --Marumari 22:07 21 Mar 2003 (UTC)

At present, it does one SELECT which gets the IDs of all existing linked pages (link relationships are stored in a 'links' table). Any links that don't go to existing pages are checked one by one (but if you link a nonexistant page a zillion times, it's only checked once per render). That does sound silly on second reading. ;) They're double-checked as the link database might not be reliable, such as on rerendering a page after it's saved: this process creates the contents of the link tables, and as such obviously can't rely on their contents to say what the new page contents do and don't link to. It could just do a second SELECT on the 'brokenlinks' table, but at this point filled links are much more frequent than broken links and I never bothered to put in the second preload. (Easy to do though if anyone cares to.) On previews and first saves, there can be more checking of course as the link tables won't cover all possibilities. See LinkCache.php in the source for our dreadful shame. ;) --Brion VIBBER

That sounds terrible. Aren't there some 100000+ rows in that table? And it does this SELECT on every page load, eh? And that's one of those things that is going to get slower and slower as the Wikipedia grows (if I understand you correctly). I've been doing a lot of thinking about this, and I almost feel like a SQL database isn't the best way to store the Wikipedia. A true link cache is going to be hard to implement in such a way that it takes less overhead than no link cache at all, but maybe I'll be pleasantly surprised. --Marumari 16:43 22 Mar 2003 (UTC)

It's indexed (and a mind-numbingly simple tuple between names and ID numbers), so the size doesn't make too much difference. An SQL database is very VERY good for making arbitrary queries. --Brion VIBBER 18:37 22 Mar 2003 (UTC)

I understand that it probably doesn't place much strain on the database, but what I am curious about is strain on the memory subsystem of the server, and some possible future problems with Separate_database_and_web_servers. Let us say that the average length of a name/ID tuple is 20 bytes (which is probably conservative). If there are indeed 111085 articles in the English Wikipedia, then every time this query is run, 111085 * 20 / 1024 / 1024 = 2.12MB of data is transferred between PHP and MySQL. This might not be a problem with Linux's domain sockets, but it could definitely be an issue with a 100Mbit connection between the two servers. Or maybe I'm just misunderstanding the SQL query... --Marumari 19:33 22 Mar 2003 (UTC)

That would only be true if every article linked to every other article. Thankfully, this is not the case so far. ;) Here's the query:

SELECT HIGH_PRIORITY cur_id,cur_namespace,cur_title

FROM cur,links

WHERE cur_id=l_to AND l_from='{$dbkeyfrom}'

($dbkeyfrom being the normalized title of the page being rendered. It probably ought to just foreign key cur_id like the l_to field does, I'm not quite sure why it uses the text string -- which is still indexed, I hasten to add.) We may be stupid, but we're not so stupid that we don't use where clauses. :) --Brion VIBBER 19:44 22 Mar 2003 (UTC)

Oh, that's much better, then. I was worried for a bit. :) See my comments in the MySQL page for a couple ideas. --Marumari 16:24 24 Mar 2003 (UTC)

Is there really any reason for all language wikis to be hosted on the same machine? What approximate percentage of CPU time is the English wiki responsible for? 68.80.194.255 17:54 25 Mar 2003 (UTC)

I'm betting most of it. :) The regular Wikipedia doesn't allow inline interlanguage links, so it doesn't even validate the existence of pages in other language Wikipedias. However, at this point, I don't think moving the non-English Wikipedias to other machines would help much in comparison to Seperating database and web servers. Feel free to correct me, Brion. --Marumari 18:41 25 Mar 2003 (UTC)

Yes, I thought everyone already agreed on that. But that many not end all our troubles for all times :-)

What's more, separating the other-language wikis would have the advantage that each wiki could be run on a machine located geographically close to most of its users. 68.80.194.255 19:10 25 Mar 2003 (UTC)

Well, one thing that a lot of people have been asking for for some time is a single-login, so you only need to create one account and use one login session across all languages and meta. (See thoughts on language integration.) This would be most easily implemented if the entire wiki, all languages and the meta section, are in one big happy database. It could also be done with some sort of remote authentication, where local servers talk to a central user database to handle logins.

Similarly issued are the interlanguage links; we'd like to be able to improve it such that we can easily keep track of what exists, and mark links as existing or potential, and provide links to alternate languages even for pages that don't yet exist in the presently-viewed language (wouldn't that be nice!) Again, easiest if everything is in one big database and can talk to itself (what do the voices say? they tell me to kill kill KILL -9 rogue processes!!). But, a way of exchanging information on pages available for interlinking could enable this to be done remotely. (Cf Wiki:SisterSites) Magnus has some experimental code for same-server-different-databases interlanguage link management, but it needs some work still.

Another possibility is to have read-only local mirror sites which slurp updates from the big central server and transparently direct all submissions and user changes to the center. This wouldn't apply just for the smaller languages; a couple of spare servers to balance the load on the English side could be a great help too. Of course, this is all theory. ;)

Anyway, it's best for everyone to make the system as efficient as it can be, we've still got a ways to go. --Brion VIBBER 23:17 25 Mar 2003 (UTC)

Since PHP doesn't support LDAP connection pooling, it probably isn't the best place to store actual data, but it is certainly designed for storing information about users and their preferences. It would be a great way to integrate users for all of the various Wikis, and support world-wide authentication. Given anymore thought to kernel upgrades, and/or filesystem storage at all, Brion? It would be nice to have some kind of IRC discussion about this, as this is terribly non-realtime.

--Marumari 02:00 26 Mar 2003 (UTC)

Just some quick notes on things to fix:

[ ] image pages need to be invalidated when the image is changed
[ ] there may be oddities with talk page links
[ ] client-side cache should have a maximum time-out, probably
[ ] store an invalidation time in the session variables; this should
    catch logouts for client-side caching
[ ] there seem to be some oddities with redirects
[ ] make sure that undeletion is properly reflected
[ ] add an easy way to flush the server-side cache for a page
    without editing it
[ ] make HTML out sufficiently 'clean' that we can save it for users
    in place of / in addition to the anon full-page cache

Though obviously there are hardware costs, it seems to me that to achieve scalability, there must ultimately be several web servers, connecting to just one database server (or perhaps more than one, but in any case, fewer database servers than there are web servers). This can scale arbitrarily through round-robbin DNS addressing so that users are assigned to a random server. To be sure, caching HTML would help, but it provides a one-time improvement rather than the foundation for future scalability.

In time it may become fruitiful to direct queries from casual users to a separate database that contains a lazily-updated cache from the master. Not sure about MySQL, but most of the commercial databases have built-in support for this sort of thing. That would allow considerable scaling by exploiting the fact that casual users are chiefly interested in viewing existing articles and would be unaffected by a database that is even as much as an hour out of date.

Kat 16:26, 11 Aug 2003 (UTC)

Certainly. Several points:

I'm currently starting on support for memcached, a distributed in-memory object store which can be used to cache nonvolatile or more or less volatile data outside the database. It was developed for http://www.livejournal.com/ and is now being used on other fairly high-traffic sites such as http://slashdot.org/
To use multiple front-end web servers we'd have to make sure the upload directories are shared. A simple NFS mount should be enough.
PHP's session support might be broken by a round-robin system. I think you can write a custom handler, though, so it ought to be possible to put the session data into memcached, which can be shared among the web servers. Or, NFS-sharing the session tmp file directory.
MySQL does have replication support: http://www.mysql.com/doc/en/Replication.html and it would be fairly easy to use an alternate database connection for certain operations. (Right now there's temporary code on the English Wikipedia to use an alternate search index database, which isn't being updated and doesn't join with the cur database, avoiding locking trouble.)

--Brion VIBBER 22:06, 11 Aug 2003 (UTC)

I've summarized a possible caching strategy using a distributed Squid hierarchy and Super Sparrow at http://www.aulinx.de/oss/code/wikipedia/.

Gwicke 21:45, 4 Jan 2004 (UTC)

Wouldn't it be a good idea if the wiki-software created static files in the form of wikipedia.org/wiki/r/a/i/d if i wanted to access "raid" ? or am i missing something?

do you have experience with squid-servers in freebsd-jails?