Wiki database design

From Meta, a Wikimedia project coordination wiki
Jump to: navigation, search

Contents

[edit] What data do we have

  • article versions (lot of data)
  • uploaded files (lot of data)
  • user accounts (little data)

The interesting thing is that neither article versions nor uploaded files are ever modified - the new versions are uploaded, while the old stay in the archive.

[edit] Read operations

These are record-level:

  • get newest version of an article
  • get contents of newest version of some uploaded file
  • get specific version of an article
  • get history of an article
  • get diff between two versions of the article
  • get user preferences

These are not:

  • status of links
  • recentchanges
  • watchlist
  • search
  • list of articles that are longest/shortest/newest/orphaned/etc.
  • general statistics

While record-level read-only operations are going to be quite fast in any sane implementation, the others will need some special structures for reasonable performance.

[edit] Write operations

  • uploading new version of an article
  • uploading file
  • modifying user preferences

[edit] Acceleration of RecentChanges

[edit] sane

Special SQL table, just for that

[edit] insane

We could store RecentChanges as a in-memory table (recreated on database restart), using zero-locking queue.

[edit] Acceleration of Watchlist

[edit] sane

An index on creation time for versions.

[edit] insane 1

One insane-RC-style queue for each user.

[edit] insane 2

Store last requested version of each watchlist in cache. Use some smart updating when the watchlist is requested again in short time. Will improve performance mostly for people who use whatchlist frequently.

[edit] Acceleration of Search

[edit] sane 1

Use MySQL FullText index.

[edit] sane 2

Use external spider. No locking, but the data is going to be less up-to-date and we will have less control over the results.

[edit] sane 3

Use Google. Even less work for us, and even less up-to-dateness and control.

[edit] insane

Hack MySQL replication (with binary log of changes) to allow searches being handled by separate MySQL process. Almost real-time ( delay measured in seconds) and no locking involved.

[edit] Acceleration of link tables

[edit] sane

2 SQL tables:

  • links
  • brokenlinks
Personal tools

Variants
Actions
Navigation
Community
Beyond the Web
Print/export
Toolbox