User:TomLord/distributed wikis


[Note: Oddly enough, although I'm the author of GNU Arch which is mentioned on the Versioning and distributed data page, I think the idea of using Arch for distribution is a red herring in this case (mostly).]

A Clear Data Model is Key

The vague idea which sparks this discussion is for a kind of p2p take on wikis. I should be able to selectively import articles from your wiki into mine, edit them locally, and upload my changes. If I have a specialized wiki of my own, you should be able to merge it into your more general wiki -- that kind of thing.

Taking the idea from vague to specific means getting really clear about exactly what the "atomic units of sharing" are -- what kinds of subsets can be shipped from one wiki to another; what does it mean to extract an article, perhaps with links to non-extracted articles, into a different wiki? How do higher-level global structures, such as categories and namespaces, make sense across wiki boundaries? Templates?

One of the advertised use cases is:

To coordinate a small local wiki about water purity with the Wikipedia category on the same subject?

That suggests good examples, I think, of the semantic issues. For example, perhaps categories in the local wiki, which is dedicated entirely to water, should be subcategories or sub-sub-categories in the more general Wikipedia. It's also not clear whether it would make more sense to copy (sub*)category pages to Wikipedia or to somehow merge them with categories already there. What if Wikipedia and the WaterWiki both have images of the same name, but different contents?


Problems defining a clear data model

MW seems to have grown largely by accretion of features in response to demand -- which is neat -- but mostly based on the assumption of a central server and a particular set of global indexes in an RDBMS -- which makes achieving really nice distribution at this stage a little tricky.

This history is reflected in the lack of a concisely expressed data model. Editors, on the one hand, seem to learn the data model as a series of tricks ("here is how you add an image; here is how you add a page to a category"). Readers, on another hand, learn searching tricks. Developers, on the third hand, are guided by the low-level database schema and a sense of what kinds of changes to the codebase are currently easy enough to pull off.

Missing from this picture is a simple, abstract data model which makes sense to all three groups (when expressed in their respective languages). An analogy: if I go buy a physical encyclopedia, odds are I can find (hopefully in the introduction, the index volume, or similar) a clearly stated explanation of the document's structure and how it relates to typesetting conventions. "This is a topic heading; at the end of the article are the names of other topics; cross references look this way; here is the index of figures; .... ; period." MW is a bit weak for not having a clear central model like that.

That universal model is key to distribution, too. We're talking about making subsets of a wiki "portable" between multiple environments. They have to be isolatable in some sense. The conceptual model of structure is the most important place to explain what it means to transport data around in the ways we vaguely have in mind.


Pushing on anyway -- I'll consider a simplified model

I have to defer to the experts on the handling of hard issues like namespaces, images, and categories (mostly). To make progress in suggesting a technological approach, I'll (have to) make up a slightly oversimplified model and hope that discussion can lead to closing the gaps. I'll try to make my model suggestively "user oriented" -- considering the needs of editors and readers ahead of developers (and the current code base).

My simplified taxonomy of the MW world initially has just these elements:

articles

...small but possibly multi-page, multi-media documents. The primary unit of editing; editors update articles.

cross reference links

...relative links; ordinary links from one article to another, by (possibly namespace-qualified) title. A cross reference link is resolved by looking within the host wiki for the named article. Intra-article links can be lumped into this category.

bibliographic links

...absolute links; links from an article that unambiguously identify another article or external resource, regardless of whether it is present in the current wiki or not. These should not be confused with URL links, which are only a subspecies of bibliographic links. A URL link is a bibliographic link that names a network service; other kinds of bibliographic link might name specific documents without reference to network topology (e.g., a link to http://some-reference-book.org vs. a link to The XYZZY Professional Society Handbook, 1987 edition).

discussions

...every article is linked to a corresponding discussion: a roughly mailing-list-style forum for community commentary about the article.

We assume that it is easy to compute, given an article, the list of cross reference and bibliographic links it contains.
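
To make that taxonomy concrete, here is a minimal sketch of the simplified model in Python. The class and function names, and the link-extraction patterns, are illustrative assumptions of mine, not anything in the current MW code:

 from dataclasses import dataclass, field
 from typing import List, Union
 import re
 
 @dataclass
 class CrossReferenceLink:
     # Relative link: resolved against the host wiki by (possibly namespace-qualified) title.
     title: str
 
 @dataclass
 class BibliographicLink:
     # Absolute link: unambiguously names a document or network service,
     # e.g. a URL or a full bibliographic citation.
     reference: str
 
 @dataclass
 class DiscussionPost:
     author: str
     body: str
 
 @dataclass
 class Article:
     title: str
     body: str                                   # wiki markup, for now
     discussion: List[DiscussionPost] = field(default_factory=list)
 
 def extract_links(article: Article) -> List[Union[CrossReferenceLink, BibliographicLink]]:
     """Compute the cross reference and bibliographic links an article contains."""
     links: List[Union[CrossReferenceLink, BibliographicLink]] = []
     # Treat [[Title]]-style links as cross references ...
     for title in re.findall(r"\[\[([^\]|]+)", article.body):
         links.append(CrossReferenceLink(title.strip()))
     # ... and bare URLs as (one subspecies of) bibliographic links.
     for url in re.findall(r"https?://\S+", article.body):
         links.append(BibliographicLink(url))
     return links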


General character of expected use cases

We expect editors to edit one article at a time and, in the most noted applications, wish to discourage "bulk edits" that would alter many articles at once.

We have deployments in the range of O(100k) articles and anticipate growth. As we contemplate the idea of many distributed wikis feeding into a few "collector" wikis (such as Wikipedia), we should anticipate millions of articles and peak periods of tens if not hundreds of thousands of article edits per day. In spite of these scales, interactive responsiveness for readers and editors is a critical concern and a perpetual focus of improvements to the code.

Wikipedia especially, and other deployments besides, are noteworthy for the social significance they have taken on. Continuity of service availability is critical.

Social concerns also suggest that, morally, the information in many articles belongs, effectively, to us all -- collectively and individually. This highlights the desirability of a clean rather than ad hoc approach to distribution features: the costs of failing to cleanly achieve the goal include locking too much data down in dead-end formats and hoarding too much data on just a few politically contentious central servers. As much as anything else, the distribution efforts are about ensuring the democratization of data contributed to Wikipedia and other MW deployments.

I take it as given that the current format of articles -- simple text files containing a particular flavor of wiki markup -- is not the final word. Vastly better wiki syntaxes (by multiple metrics) already exist in other contexts. Potential editors, especially experts, often create documents that would be suitable as articles except that they are not formatted as single files or in the current wiki syntax. I presume that, eventually, MW will live up to its name and support a very wide range of article formats (not an arbitrary range, of course).

Links (both bibliographic and cross-reference) are normally created by hand, by users operating within a non-global "domain of reference". In other words, while it's conceptually simple to imagine a global namespace of names for use in links, in fact, users want to type short, human-friendly names that make sense in their local environment but that may need translation when transported to a different environment.

Modelling MW as a broadcast medium

Because of their dynamic nature, I suggest that it is useful to understand each article (each topic) as a 'broadcast channel' for discrete packets.

Each packet on the channel is either a new version of the article, submitted by an editor, or a contribution to the discussion, contributed by anybody.
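
As a rough sketch of what a packet on such a channel might carry (the field names and the Python rendering are my own illustrative assumptions, not an existing format):

 from dataclasses import dataclass
 from typing import Literal
 
 @dataclass
 class Packet:
     topic: str        # the "channel": a topic / article title
     kind: Literal["article-version", "discussion-post"]
     packet_id: str    # globally unique id (see the netnews discussion below)
     author: str
     payload: str      # the new article text, or the discussion contribution
 
 # A channel is then nothing more than the stream of packets sharing a topic.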

In this view, what is http://wikipedia.org? One could compare it to the web services which collect USENET articles, index them five different ways from Tuesday, and make them web browsable.

As with USENET, in this view, anybody could set up a site to collect wikipedia articles, perhaps mixing and matching sources -- selectively choosing topics and authorized editors.

A MediaWiki installation, in this view, is a highly structured newsreader. An editor or discussion participant, a poster.

One beauty of this conceptualization is that the technical solutions are well known, at least in the abstract. People have accumulated decades of experience managing the kind of distributed database that a news system comprises. All that remains is to reinterpret this experience in a way that is MW / wiki* specific.

Another beauty of this conceptualization of the problem is scale: it is well understood how to make a net-news-style distributed database with fuzzy domain borders scalable to millions of users.

Yet another beauty of this conceptualization is democracy: news-style approaches are intrinsically democratic and intrinsically distributive.


Elaborating details of a "netnews-style" approach

How divergent is this approach from the current user model, reader model and developer model? How significant would the impact be on the existing code base? An elaboration of some details of what this approach implies can help to answer these questions:


What if it were 1983?

Suppose it were about 1983 and, given what tech was deployed at that time, we wanted to invent Wikipedia.

Actually, there's no need to suppose. It happened. It happened in an uncoordinated, spontaneous way -- but it happened.

The Wikipedia was invented around that time in the form of newsgroups and newsgroup-specific FAQs. The equations are, roughly:

 newsgroup == topic (aka title)
 faq == article
 other newsgroup traffic == discussion
 newshub == distributed "wikimedia" site

In a very real sense, all improvements since then have been incremental improvements to namespaces, syntax, user interface, and marketing. And in a real sense, the abandonment of the p2p architecture of USENET is, if anything, a regression.

The point of this comparison isn't to say that MW and Wikipedia are somehow "not innovative" (of course they are quite innovative) -- only to point out that some vintage technology suggests a simple approach to solving the distribution goals for MW.

The question now is how to reunify the technology developed then with the social participation currently going on.


General characteristics of netnews-style distribution

A p2p graph for sharing discrete packets with unique ids

Netnews systems comprise a graph of servers. News articles propagate along the edges of this graph. Each article includes a head, which contains routing information and message identity information, and a body, which contains the message content.

Each message is assigned, at the time of its creation, a globally unique id (at least probabilistically so). These textual ids can be used as protocol elements in "have/need" negotiations between servers. The unique ids are critical to preventing "loops" -- servers sending the same news article back and forth.

Messages may also be assigned server-specific, per-newsgroup sequence numbers, which can be substituted for message ids as protocol elements. Message numbers impose a convenient ordering on messages, allowing systems to simply and systematically perform tasks such as "fetch the 5 most recent messages".

Source: http://www.dsv.su.se/~jpalme/e-mail-book/usenet-news.html
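
To make the mechanism concrete, here is a toy sketch, building on the Packet sketch above, of a node's store, the have/need exchange, and per-topic sequence ordering; the class and method names are invented for illustration, not drawn from any real news server.

 from typing import Dict, List, Set
 
 class NodeStore:
     """A node's local store of packets, keyed by globally unique id."""
 
     def __init__(self) -> None:
         self.by_id: Dict[str, Packet] = {}
         self.sequence: Dict[str, List[str]] = {}    # topic -> packet ids, in arrival order
 
     def offer(self, ids: Set[str]) -> Set[str]:
         """'Have/need': given the ids a peer has, answer with the ids we still need."""
         return {i for i in ids if i not in self.by_id}
 
     def accept(self, packet: Packet) -> None:
         """Store a packet once; duplicates are dropped, which is what prevents loops."""
         if packet.packet_id in self.by_id:
             return
         self.by_id[packet.packet_id] = packet
         self.sequence.setdefault(packet.topic, []).append(packet.packet_id)
 
     def most_recent(self, topic: str, n: int = 5) -> List[Packet]:
         """Sequence ordering makes 'fetch the 5 most recent messages' trivial."""
         return [self.by_id[i] for i in self.sequence.get(topic, [])[-n:]]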


Topic-based routing

The concept of routing is usually used to describe systems for point-to-point communication such as email: a packet bears a destination address which, in context (such as DNS), determines how to deliver the packet to the desired endpoint.

Netnews systems use a different kind of routing, one that seems well suited to MW applications: topic-based routing.

A topic-based route (e.g., a list of newsgroups to post to) does not imply a specific end-point to which to deliver a packet. Instead, throughout the graph, each node imposes its own site-specific translation of topic-addresses to routes (e.g., your small flycasting equipment manufacturing business might decide to drop all packets for alt.fan.some-scifi-show and concentrate mostly on rec.fishing).
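
In code, a node's site-specific policy might be nothing more than a predicate over topic names. The sketch below reuses the Packet sketch above; the names (including the example policy) are purely illustrative.

 from typing import Callable, Iterable, List
 
 # A routing policy is just a per-node predicate: "do I carry packets for this topic?"
 RoutingPolicy = Callable[[str], bool]
 
 def fishing_shop_policy(topic: str) -> bool:
     # Drop the sci-fi fan traffic; concentrate on fishing.
     if topic.startswith("alt.fan."):
         return False
     return topic.startswith("rec.fishing")
 
 def route(packets: Iterable[Packet], policy: RoutingPolicy) -> List[Packet]:
     """Keep only the packets this node's policy accepts."""
     return [p for p in packets if policy(p.topic)]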



Missing elements in the current data model

So, if you're with me so far, we started with a problem: a vague idea of a p2p goal for MW.

Now, after careful reasoning and analysis, we have two problems! The original vague idea and some new vague ideas about how to implement it ("uh, make it like netnews").

Now that's progress.

A good next question is: What is missing in the current structure of MW articles that would be needed for a USENET-style approach to p2p goodness?

A global, unambiguous topic namespace; domain-labeled articles

Topic names are used as titles, for searching, and in links. Under the suggestions above, topic names are also relevant to topic-based routing in a P2P realization of MW.

Topic names are selected and typed by users and are purely symbolic. By "purely symbolic" we mean that, when used in links, topic names do not bind to specific content but rather to whatever happens to be stored under that name at a given point in time.

In the case of a shared, singular MW installation (such as Wikipedia) this is no problem at all: the community forum creates exactly the environment needed for editors to negotiate the topic->content mapping; features such as disambiguation links make differences of opinion a relatively low-stakes game.

In the case of P2P wikis, topic names become a little touchier. For example, continuing the "water purity" use case (see Versioning and distributed data), a user of a water-specific wiki could easily and reasonably choose the title Sierra-Nevada snow pack for an article that relates that topic to water supplies. In a general-purpose encyclopedia, that title should probably link to a much more general commentary about snow conditions in that mountain range.

What becomes, then, of a link like [[Sierra-Nevada snow pack]], occurring in a water article, when that article is placed in the general encyclopedia? For that matter, when a user configures a p2p node for articles, which articles do topic-based routing rules for "Sierra-Nevada..." apply to?

What seems to me to be needed here, though proposing a detailed solution is out of scope for this position paper, is an unambiguous way to name domains of discourse and a definitive way to label each article with the name of the domain of discourse to which links contained in the article are relative.

Two design constraints I would like to strongly suggest for topic-domain names are that:

  1. They should not refer to externally contentious namespaces
  2. They should not depend on people to generate unique strings

For example, it would be a mistake to use an internet domain name as a disambiguating component of a topic domain name: the actions of ICANN and similar bodies should have no semantic impact on MW-hosted content. And it might make sense to use cryptographic checksums or signatures to generate, from a user-supplied string, a genuinely (albeit probabilistically) unambiguous domain name.
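
As an illustration of the second constraint, here is a sketch that assumes a SHA-256 digest of a user-supplied string plus some locally generated entropy would be acceptable as the unambiguous part of a topic-domain name; the function and the naming scheme are hypothetical, not a proposal for a specific format.

 import hashlib
 import os
 
 def make_topic_domain(human_name: str) -> str:
     """Derive a (probabilistically) unambiguous topic-domain name from a human-friendly
     string, without relying on ICANN or any other external registry."""
     salt = os.urandom(16)                                   # local entropy, no central coordination
     digest = hashlib.sha256(salt + human_name.encode("utf-8")).hexdigest()
     return f"{human_name} [{digest[:16]}]"                  # readable part + unambiguous part
 
 # Two independently created "WaterWiki" domains will (almost surely) get different names,
 # and neither depends on anyone having registered anything anywhere.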

Isolatable, multi-part, format-neutral articles

This discussion is ostensibly about distribution, versioning, and semantic issues related to those. At first glance, then, it may seem gratuitous to bring up questions about the current wiki syntax and the current rather limited format of articles.

I argue that it is not gratuitous for three reasons:

  1. distribution strategies should not preclude future article format changes
  2. distribution and cross-wiki merging raise the stakes: a "big-tent" approach should encourage content contributors, as far as practical, to use the formats they are most comfortable creating (and are already creating, albeit for different publication media)
  3. many articles are, logically, multi-part entities including such things as images and equations -- the P2P protocol needs a clean way to bundle these related parts which suggests a broader conception of article format than the current code reflects

When we talk about a USENET-style architecture for distribution, a key question to answer is: what are the discrete packets? Simplistic extrapolation from the original USENET suggests a text file divisible into head and body. Simplistic extrapolation from MW today suggests a text file of wiki content. The argument above suggests that neither is sufficient. "Fixes" (such as invoking MIME), while not irrelevant, are inadequate: MIME is very general -- what constrained subset is appropriate?

As a starting point, I suggest something simple: an article is simply a tree of directories and files, with some tree formats being recognizable by MW for special-case display. For example, if an article is simply a file with the filename extension ".mw", it is processed as articles currently are. On the other hand, if the article is a directory containing subdirs and files, perhaps the presence of a top-level "index.html" is significant.
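
A sketch of how a node might classify an article shipped as a single file or as a directory tree, following the examples just given; the function and the format labels are hypothetical.

 from pathlib import Path
 
 def classify_article(path: Path) -> str:
     """Guess how to display an article delivered as a single file or a directory tree."""
     if path.is_file() and path.suffix == ".mw":
         return "wiki-markup"                    # processed as articles are today
     if path.is_dir() and (path / "index.html").exists():
         return "html-bundle"                    # multi-part article with an HTML entry point
     return "opaque"                             # stored and shipped, but not specially rendered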

There may not be a rush to expand MW to handle these other formats in many modules, but there is no reason to preclude them when designing and building support for distribution.

Less server burden -- greater client reliance -- finite protocols

In the brave new world being contemplated here, a given MWUA ("MediaWiki User Agent" -- a web browser) may find itself speaking to multiple servers operating at a very wide range of scales from huge (Wikipedia) to tiny (a MW installation on my dinky laptop that I use to maintain the articles I contribute to most).

As distribution and P2P features take off, and people therefore find more and more applications for MW functionality, divergent MW implementations are likely if not inevitable.

As a consequence of such considerations, a slight philosophical shift is called for in how new MW features are conceived, implemented, and deployed:

The attitude that new features are, by default, provided by new server-side extensions is a dead-end, incompatible with a P2P conception. Must every node load up every extension called for by any article it happens to import?

Instead, MW design should focus more on what can be done client-side and on containing, explicitly defining, and locking down the protocols between user-agent and MW server. The closer a full MW server can come to a simple, mostly passive state machine that buffers articles and provides only a basis set of indexes, the better. The more fancy features that can be pushed off to client-side computations, the better MW is suited for the future.
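
As a sketch of that philosophy, a server might shrink to something close to the following interface, with everything fancier pushed to the MWUA. This reuses the NodeStore and Packet sketches above and is purely illustrative -- it is not the current MW API.

 from typing import List, Optional
 
 class MinimalWikiServer:
     """A mostly passive state machine: buffer packets, answer a few index queries.
     Rendering, fancy search, categories, etc. are left to the client."""
 
     def __init__(self) -> None:
         self.store = NodeStore()                # the have/need store sketched earlier
 
     def publish(self, packet: Packet) -> None:
         self.store.accept(packet)
 
     def latest(self, topic: str) -> Optional[Packet]:
         """The most recent article version for a topic, if any."""
         for pid in reversed(self.store.sequence.get(topic, [])):
             packet = self.store.by_id[pid]
             if packet.kind == "article-version":
                 return packet
         return None
 
     def topics(self) -> List[str]:
         return sorted(self.store.sequence.keys())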

At the same time, it is important to preserve the property that a minimally functional MWUA, with limited client-side computational resources, is nevertheless a useful tool for accessing the pool of MW-ized resources.

Picking technologies for implementation

TBA

(I have a bunch of unreleased code that could be helpful.)

A Clear Data Model

Note: I time-limited my writing of this document and ran out of time before writing this section. Perhaps this is a topic for discussion during the meeting assuming we agree on large parts of what's written above. -t

Versioning in the face of Very Asynchronous Edits

Note: I time-limited my writing of this document and ran out of time before writing this section. Perhaps this is a topic for discussion during the meeting assuming we agree on large parts of what's written above. -t

Pleasant side effects and possible futures

Note: I time-limited my writing of this document and ran out of time before writing this section. Perhaps this is a topic for discussion during the meeting assuming we agree on large parts of what's written above. -t