T1wiki

From Meta, a Wikimedia project coordination wiki

T1wiki is a codename for some ideas of how a much better wiki software could look like and code that presents these ideas. It's not specifically aimed at Wikipedia, and in particular it doesn't have aim to become Phase IV software (but who really knows).

Parsing[edit]

See: Wikipedia lexer

Version system[edit]

Version system should allow 100%-reliable reading not affected by what is currently going on with data.

atomic version replacement algorithm[edit]

One idea is to use structure like that:

  • wikidb/versions/
  • wikidb/cur/
  • tmp/
#!/usr/bin/perl -w

use POSIX;
use Fcntl qw{:flock};
use Digest::MD5 qw{md5_hex};

# Configuration

my $wikidir = "/home/taw/t1/";

# Aux

sub random_filename
{
    sprintf "%07X%07X", (floor (rand(0x10000000))), (floor (rand(0x10000000)))
}

# Main

my ($name, $contents, $comment) = @ARGV;
die "Not enought arguments" unless $#ARGV == 2;

do {
    $tfn = "tmp/".random_filename();
} until sysopen (VERSIONFILE, $tfn, O_WRONLY | O_CREAT | O_EXCL);

print VERSIONFILE $contents;
close VERSIONFILE;

$vfn = $wikidir . "versions/" . md5_hex($contents);

if (rename ($tfn, $vfn)) {
    # nothing
} else {
    die;
}

do {
    $sfn = "tmp/".random_filename();
} until symlink ($vfn, $sfn);

$version = md5_hex($contents);

sysopen (CLFILE, "cl/".$name, O_WRONLY | O_CREAT | O_APPEND);
flock CLFILE, LOCK_EX;
# check if last entry of log file is consisnent, if not fix it
# write attempt to log file
if (rename ($sfn, "cur/".$name))
{
    # finish writing, success
    flock CLFILE, LOCK_UN;
    close CLFILE;
} else {
    # remove last entry from changelog
    flock CLFILE, LOCK_UN;
    close CLFILE;
    die;
}

What could possibly go wrong:

  • if could break before, during, or immediately after creation of tmp file containing data. cron job will clean such things once an hour or so, it doesn't break database.
  • it could break on rename. as rename is atomic it can't break during rename.
  • Because version name is md5 if it breaks immediately after rename we just have unused version file in versions/ directory. Unless first md5 clash in history happens it could collide only with identical version, so it's not a problem. Some more complex cron job could remove such files.
  • creation of symlink in temporary directory is safe
  • file opening is safe
  • if it breaks between open and flock it's still ok
  • now we must check consisnetcy of log. It's explained in detail later
  • we write into changelog that we attemp to install new version LOG IS INCONSISTENT
  • we replace symlink with new symlink LOG IS INCONSISTENT
  • we write to log that we succeded LOG IS INCONSISTENT BEFORE END OF IT
  • we unlock
  • we close

Lock is automatically released if we die.

Log consistency[edit]

Log entries should have constant length. This way we can check consistency by single fstat.

Log entry structure may look like this:

struct log_entry {
    md5 old_version; // do we need it ?
    md5 new_version;
    ...
    byte finished;
};

So if size of last entry is less that sizeof(struct log_entry-1) then it must have broke during writing about attempt. So we just truncate file, and it's fixed. We could even ignore that as we would overwrite it with own entry anyway.

It size is equal to sizeof(struct log_entry-1) then we need readlink call. If link point to old version we can fix it (log is no more and no less consistent than before). Then we write final byte. Log fixed.

Problems[edit]

  1. temporary inconsistency of log file
  2. RecentChanges
  3. broken links data
  4. search


Doing RecentChanges by scanning all files doesn't seem reasonable. Finding which links are broken by checking if their files exists neither.

On the other hand edit conflicts fit this scheme nicely. We just do one more readlink.

temporary inconsistency of log file[edit]

Because we are commiting logs anyway if they went to the point of replacing symlink, that won't be a problem.

RecentChanges[edit]

While watchlist could be implemented by scanning changelogs of all files listed in it, that wouldn't scale for RecentChanges.

We could use single log file for all changes which would work like individual log files. See below.

Broken links data[edit]

We need very fast way of checking which links work and which don't. Checking files is too slow. We could use links file which would be symlink to current version of links database in a form that allows very fast checking. See below.

Problems with this approach to RC and BL[edit]

They create single bottleneck. While I think it would be acceptable with BL, as it is only used in relatively rare article creation and very rare deletion and keeps very simple data (tree of md5(title) would be enough), it would be greater problem with RC.

Creating more RC files, for example 16 (first 4 bits of md5(title)), could allow more parallelism at expense of slower reading of RC.

Speed requirements priorities[edit]

  1. Reading articles (must be very fast)
  2. Modifying articles (must be fast)
  3. Uploading files (must be fast)
  4. RecentChanges (must be fast)
  5. Creating new articles (must be fast, may be a little slower than modifying)
  6. Passive admin actions like popularity, list of broken links, etc. (some may be slow)
  7. Active admin actions - deletion etc. (may be slow)