GNE Architecture


This is part of the Historical Wikipedia pages collection.


My (MikeWarren) current favourite name is GNE: GNE's Not an Encyclopedia! (Dan Geiser's suggestion.)

In any case, my ideas for how this should all work:

OVERVIEW

Since the goal of GNE is to keep the article repository almost completely inclusive (barring completely obvious spam), some method of classification is needed. Since almost everyone has suggested a different way of classifying articles, it makes the most sense to keep the classification information separate and allow for multiple classifiers.


THE BACK END ARTICLE REPOSITORY

Hence, only information absolutely essential to the article should be kept in the actual article repository. I think keeping this in XML has some advantages: it is readily human readable; some simple semantic hints can be included by the author if she chooses (here I mean things like <date>Jurassic</date> or <name>Mike Warren</name>); changing the DTD/Schema can be quite easy in many cases (unlike changing the schema of a database). Unique IDs will need to be assigned to each article, so that the classifiers can reference them. Anything from the really simple (sequentially-assigned 128-bit integers) to the complicated (MD5 or similar hashes of the content) can be employed for this purpose.
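
To make the two ends of that range concrete, here is a minimal sketch of both ID schemes. The function names are my own invention for illustration, and the 32-digit hex rendering of the sequential IDs is an assumption, not something decided above.

```python
import hashlib
import itertools

# Hypothetical in-memory counter for the "really simple" scheme.
# A real repository would have to persist the last ID handed out.
_counter = itertools.count(1)


def next_sequential_id() -> str:
    """Return the next sequential ID as 32 hex digits (128 bits)."""
    return format(next(_counter), "032x")


def content_hash_id(article_xml: bytes) -> str:
    """Return an ID derived from the article bytes using MD5."""
    return hashlib.md5(article_xml).hexdigest()
```

Note that a content hash changes whenever the article bytes change, so under that scheme an ID names one particular version of an article rather than the article as a whole.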

Using this method to store the articles, it seems to make sense to just use an existing Web server like Apache to serve these articles, which can then be accessed easily:

   http://www.gne.org/article/unique-id-12345.xml

If a single directory becomes insufficient (as seems likely at some point), then the first bits of the unique IDs can be used to make sub-directories, and the URL re-writing ability of Apache (and presumably other Web servers) can be used to change the above URLs into the actual URL. This has the advantage that existing free software is employed to implement the back-end and mirroring the data is extremely easy (just tar it, or rsync, or FTP). It will also be necessary to include an index of all articles which exist on the server. This need be nothing more than a complete list of all the unique IDs of the articles, enabling easy access by the classifiers (see below).
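
A rough sketch of the sub-directory mapping and the index might look like the following. The two-character prefix, the article/ directory name and the one-ID-per-line index format are all assumptions made for the example; the Web server's URL re-writing would have to apply the same mapping.

```python
from pathlib import PurePosixPath


def shard_path(article_id: str, prefix_len: int = 2) -> PurePosixPath:
    """Map an ID to a sub-directory named after its first characters."""
    return PurePosixPath("article") / article_id[:prefix_len] / f"{article_id}.xml"


def build_index(article_ids) -> str:
    """Return the plain-text index: one unique ID per line, sorted."""
    return "\n".join(sorted(article_ids)) + "\n"
```

With these assumptions, the ID 12345 would live at article/12/12345.xml, and mirroring still only needs the directory tree plus the index file.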

Versioning of the articles is also desired. This can be simply:

   http://www.gne.org/article/unique-id.version.xml
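
As a sketch of how a classifier or mirror could find the newest version under this naming scheme, assuming (my assumption, not decided here) that versions are plain increasing integers and that IDs contain no dots:

```python
import re

# Matches names of the form <id>.<version>.xml with a numeric version.
_VERSIONED = re.compile(r"^(?P<id>[^.]+)\.(?P<version>\d+)\.xml$")


def latest_versions(filenames):
    """Map each article ID to the highest version number found."""
    latest = {}
    for name in filenames:
        match = _VERSIONED.match(name)
        if match:  # names that do not fit the pattern are skipped
            version = int(match["version"])
            article_id = match["id"]
            latest[article_id] = max(version, latest.get(article_id, version))
    return latest
```

Comparing the versions as integers rather than strings matters once an article passes version 9.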

Whichever method of assigning the IDs is used, many have expressed the need/desire to digitally sign the articles. Since the article itself needs to be signed, it makes little sense to include the signature in the actual XML of the article (it would then need to be removed to check the signature, which raises questions like "exactly which bytes do I remove to take out the signature?"). The above scheme for serving the articles fits in well here: signatures can just go in well-known signature files:

   http://www.gne.org/article/unique-id-12345.xml.asc

for the above example. If such a file doesn't exist, the article is not signed.
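
The naming rule is simple enough to sketch against a local mirror of the repository. The helper names are hypothetical, and this only checks whether a signature file is present; actually verifying the signature would need a separate signing tool and is left out.

```python
from pathlib import Path


def signature_path(article_path: Path) -> Path:
    """Return the detached-signature path: the article path plus '.asc'."""
    return article_path.with_name(article_path.name + ".asc")


def is_signed(article_path: Path) -> bool:
    """An article counts as signed only if its .asc file exists."""
    return signature_path(article_path).is_file()
```

Because the signature lives in its own file, the article bytes being checked are exactly the bytes that are served.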


THE CLASSIFIERS

No actual user or client program should be accessing the repository directly, in all likelihood. A number of classification databases will exist, which will index all the articles in the repository according to their own criteria. One of the simplest ones (which the GNE project itself will likely supply) will be a simple author and title index. Every night (say) this classifier will request the article index from one of the back-end repositories. Looking through the list of unique IDs, it will note any IDs which are not already in its own database. For these, it will request the XML file from the repository and parse it, extracting whatever information is relevant to this classification (in this example, just the author(s) and title). If this classifier is more complicated (perhaps it's the Nupedia one), then these new articles will be sent to mailing lists for comments about which category they should go in (or whether they should be excluded from the classification, by putting them in an "ignore" category).
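
The nightly run of the simple author and title classifier could be sketched roughly as below. The element names author and title, the one-ID-per-line index and the fetch_article callable are all assumptions for the sake of the example, since the DTD/Schema has not been decided.

```python
import xml.etree.ElementTree as ET


def sync_classifier(index_text, fetch_article, known):
    """Index every article in the repository index that is not yet known.

    index_text:    repository index, one unique ID per line (assumed format)
    fetch_article: callable returning the article XML string for an ID
    known:         dict mapping ID -> (authors, title); updated in place
    Returns the list of newly indexed IDs.
    """
    new_ids = [i for i in index_text.split() if i not in known]
    for article_id in new_ids:
        root = ET.fromstring(fetch_article(article_id))
        # Hypothetical element names; the real schema may differ.
        authors = [a.text or "" for a in root.findall("author")]
        title = root.findtext("title", default="")
        known[article_id] = (authors, title)
    return new_ids
```

In a real deployment fetch_article would make an HTTP request to one of the repository mirrors; passing it in as a parameter keeps the sketch independent of which mirror is used.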

The classifiers can do absolutely anything imaginable, from providing a kids-only view of the information, to a keyword search capability, to more complex things. As classifications become irrelevant or unused, they can simply be deleted; nothing needs to change in the article repository. If someone is dissatisfied with the current classification schemes, they can take some base software provided by GNE and modify it to suit their needs; classifiers based on voting of users, voting of "experts" or many other schemes can be created.

These classifier systems should probably store much of their information in a database like MySQL since they will be accessing it a lot.
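
As an illustration of what such a classifier database might hold, here is a small sketch for the author and title index. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL so that the example is self-contained, and the table layout is my own guess.

```python
import sqlite3


def open_classifier_db(path=":memory:"):
    """Open the database and create the (hypothetical) entries table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " article_id TEXT, author TEXT, title TEXT,"
        " PRIMARY KEY (article_id, author))"
    )
    return db


def add_entry(db, article_id, authors, title):
    """Insert one row per author so author lookups are a plain SELECT."""
    db.executemany(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
        [(article_id, author, title) for author in authors],
    )
    db.commit()


def ids_by_author(db, author):
    """Return the IDs of all articles credited to the given author."""
    rows = db.execute(
        "SELECT article_id FROM entries WHERE author = ? ORDER BY article_id",
        (author,),
    )
    return [row[0] for row in rows]
```

Only IDs and extracted metadata are stored; the article text itself stays in the repository.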


CONVERTERS/CLIENTS

No matter what classifier is used, the user needs to actually see the articles. This needs conversion from the XML format into some other suitable format, like HTML or DVI. The GNE project will produce such converters, which will be employed by the classifiers when showing an article to an actual user. It might also be useful (in the future) to make clients which interact (using some known protocol) with the classifiers and then allow the user to choose from whatever formats the client software knows to convert to. I wouldn't suggest doing such a thing until well after the classifier software is stable, if ever.
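
A very small XML-to-HTML converter might look like this. The element names title, author and content are assumptions (the DTD/Schema is still undecided), and the HTML produced is deliberately bare so that a classifier could wrap it in its own design.

```python
import html
import xml.etree.ElementTree as ET


def article_to_html(article_xml: str) -> str:
    """Render a (hypothetical) article document as a minimal HTML page."""
    root = ET.fromstring(article_xml)
    title = html.escape(root.findtext("title", default="Untitled"))
    authors = ", ".join(html.escape(a.text or "") for a in root.findall("author"))
    content = root.find("content")
    # itertext() flattens the content: semantic hint tags are dropped,
    # but the text inside them is kept.
    body = html.escape("".join(content.itertext())) if content is not None else ""
    return (
        f"<html><head><title>{title}</title></head><body>"
        f"<h1>{title}</h1><p><em>{authors}</em></p><p>{body}</p>"
        "</body></html>"
    )
```

A converter for LaTeX or another format would follow the same shape, with a different escaping function and template.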


DIAGRAM

So, here's what will happen:


                                 +---------------------+
                                 | Classifier Projects |
                                 +---------------------+
                                          . . .                                    +---------+
  +------+                       +---------------------+                           | Backend |
  | user | <--- Web browser ---> | Dewey Decimal Sys.  | <--- fetches article ---> +---------+
  +------+                       +---------------------+                           |12345.xml|
                                 | Library of Congress |                           |33213.xml|
                                 +---------------------+                           |77662.xml|
                                 | Children-only       |                           |  ...    |
                                 +---------------------+                           +---------+
                                          . . .                                               
                                                                                              
  Web browser/GNE client          Database-driven lookup                      Directory(s) of 
                                       mechanism                              XML files, with
                                                                              index & optional
                                                                              signature


SOFTWARE BY GNE

The software which GNE would need to write includes:

  • A PARSER which will take almost-plain-text and convert it into XML in the DTD/Schema decided upon for the back-end articles. This will include author, revision and content. It will not include a digital signature or any classification information whatsoever. The content can contain optional semantic hint tags intended for use by the classifiers.
  • BASIC CLASSIFIER software, which will ease the task of writing some classifier. As a minimum, this should include options to send articles to mailing lists and some basic hierarchical classification mechanism (since many classifiers will likely be hierarchical).
  • CONVERTERS to take XML content in our DTD/Schema and produce LaTeX, HTML, DVI or other formats, for use by the classifiers when serving content to users. This means no load on repository servers, and flexibility for the classifiers, which might like to change the HTML to conform to their design style.
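
To give a feel for the PARSER item above, here is a minimal sketch. The input convention (Key: value header lines, a blank line, then the body) and the element names are invented for the example; the real almost-plain-text format and DTD/Schema are still to be decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fields the parser recognises.
_FIELDS = {"title", "author", "revision"}


def text_to_article_xml(source: str) -> str:
    """Convert 'Key: value' header lines plus a body into article XML."""
    header, _, body = source.partition("\n\n")
    root = ET.Element("article")
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        name = key.strip().lower()
        if sep and name in _FIELDS:
            ET.SubElement(root, name).text = value.strip()
    # The body is stored as plain text; ElementTree escapes &, < and >.
    ET.SubElement(root, "content").text = body.strip()
    return ET.tostring(root, encoding="unicode")
```

This sketch treats the whole body as text, so any semantic hint tags typed by the author would be escaped rather than kept as tags; a real parser would need a rule for recognising them.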


CONCLUSION

A simple, efficient and almost completely inclusive back-end article repository is indexed by a number of classifier projects, with which the user interacts to get at content they're interested in. These classifiers are limited only by the imagination, and don't affect what is stored in the article repository; no classifier group can "censor" the GNE project, since another classifier group could start up to correct the wrong (in their opinion) classifications. No articles are rejected from the back-end repository (barring completely obvious spam), which is easy to mirror, easy to maintain and presents no major load on the server beyond its duties as a Web page server. Classifier databases can run on any system, anywhere in the world, and use any mirror of the repository.