GNE Project Design

From Meta, a Wikimedia project coordination wiki

This is part of the Historical Wikipedia pages collection.

More thoughts by MikeWarren about GNE. See also GNE Architecture and Wikipedia:GNE Project Files.

As discussion rages over moderation and back-end design, I thought I'd write a more cohesive version of my thoughts. There should be one article repository (with potential for many mirrors of it) and many different classifications, hopefully by different groups of people.



Authors may like to take advantage of editors. Volunteers who wish to edit work can organize themselves onto one or many editing mailing lists, much like Nupedia has done. Authors can submit their articles to a list and, if they see fit, make the changes that editors suggest.

Once an author is satisfied with her work, she can then submit it to the article repository.

Article Repository


There should exist a large group of moderators (hopefully everyone in the project) for the repository. When an article arrives, they will vote on whether it is spam or not. Unless there is unanimous consent that the article is completely useless, it will go into the repository. This can prevent abuse of the system (e.g. someone sending in random binary files) while still keeping the process completely open and allowing all authors to put their work into the pool.
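The acceptance rule above can be sketched in a few lines of Python. This is only an illustration: the function name `accept_submission`, the representation of votes as booleans, and the handling of an empty vote list are all assumptions, not something the project has decided.

```python
def accept_submission(votes):
    """Decide whether an article goes into the repository.

    `votes` is a list of booleans, one per moderator who voted;
    True means "this is spam". The article is rejected only when
    every vote says spam (unanimous consent).
    """
    if not votes:
        # Assumption: if nobody has voted, the article is accepted.
        return True
    return not all(votes)
```

A single dissenting moderator is enough to let an article in, so `accept_submission([True, True, False])` accepts and `accept_submission([True, True])` rejects.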

The only stipulation, obviously, is that the work be licensed under the GNU FDL.

There seems to be consensus that XML will be used to mark up the articles; submissions that do not conform to the chosen DTD (TEI seems to be the front-runner currently) will be subject to the brutal treatment of various conversion scripts. The resultant XML document will be assigned a unique ID and placed in the repository. There has also been talk of using a Web-based form submission service; this is also a good idea.

The repository will be kept simple and will utilize the capabilities of modern Web servers by being a simple hierarchy of directories with the .xml files inside them. The directories will not be a classification of the content, but merely a way to get around limits on the number of files in a single directory. Each document will simply be called unique-id.[version].[language].xml. If it has been digitally signed, there will be a corresponding file unique-id.[version].[language].xml.asc.

Version numbers can be anything, really, but simply sequentially increasing them seems like the best course. Language can correspond to the LOCALE meanings. So, one might have a directory like:

    123456.1.en.xml
    123456.2.en.xml
    123456.3.en.xml
    123456.3.en.xml.asc

This means that there are three versions of the article with ID 123456 and that the last one (version 3) has been digitally signed.
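To make the naming scheme concrete, here is a minimal Python sketch that reads such file names and reports the latest version of each article and whether that version is signed. It assumes numeric unique IDs and sequential integer version numbers; the names `FILENAME_RE` and `summarize` are made up for this example.

```python
import re
from collections import defaultdict

# unique-id.[version].[language].xml, optionally followed by .asc
# Assumption: IDs and versions are plain integers.
FILENAME_RE = re.compile(
    r"^(?P<id>\d+)\.(?P<version>\d+)\.(?P<lang>[A-Za-z_]+)\.xml(?P<sig>\.asc)?$"
)

def summarize(filenames):
    """Map (id, language) to the latest version and its signed status."""
    versions = defaultdict(set)   # versions that have an .xml file
    signed = defaultdict(set)     # versions that have an .xml.asc file
    for name in filenames:
        m = FILENAME_RE.match(name)
        if m is None:
            continue  # ignore anything that does not follow the scheme
        key = (m["id"], m["lang"])
        version = int(m["version"])
        if m["sig"]:
            signed[key].add(version)
        else:
            versions[key].add(version)
    return {
        key: {"latest": max(vs), "signed": max(vs) in signed[key]}
        for key, vs in versions.items()
    }
```

Given the names `123456.1.en.xml`, `123456.2.en.xml`, `123456.3.en.xml` and `123456.3.en.xml.asc`, `summarize` returns `{("123456", "en"): {"latest": 3, "signed": True}}`.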

The repository server will keep a list of all the unique IDs of all the documents it contains. This will allow the classification systems to easily update themselves with new documents by requesting this list (e.g. http://www.gne.org/articles/index). For the above example, the index would just contain:




Everybody seems to have their own favourite way of classifying articles, from voting to Dewey Decimal to Library of Congress. All have merit, and there are probably lots of users who would find each approach useful.

It seems like a good idea, then, to allow for multiple classification systems. Users would interact with the repository through one particular classification. Hence, the classification systems will be doing all the searching and indexing that users might want; storing the information they use in a database makes a lot of sense.

How might this work? Taking the above example again, a fresh classifier which just lists articles by author and title downloads the index and notices that there are three versions of a single article. Since it doesn't know about this article yet (it looks in its database and finds nothing about the unique ID 123456) and only cares about the latest version of each article, it asks for the file http://www.gne.org/articles/123456.3.en.xml. The repository server rewrites the URL into http://www.gne.org/articles/a/123456.3.en.xml and sends the XML file back. The database updating program parses the file and extracts the author and title information, putting these into the database with the unique ID 123456. This particular classifier doesn't care about digital signatures, so it never requests the .asc file to see if there is one.
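The update step of such an author-and-title classifier might look roughly like the Python sketch below. Several things here are assumptions made for illustration: the index is taken to be already parsed into a mapping from unique ID to latest version number (the index format is not settled), the language is fixed to `en`, the table layout is invented, and `SAMPLE` is a simplified TEI-like header rather than a complete, valid TEI document. The `fetch` argument stands in for whatever HTTP client the classifier uses.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified stand-in for an article; not a full TEI document.
SAMPLE = (
    "<TEI.2><teiHeader><fileDesc><titleStmt>"
    "<title>Example Article</title><author>A. Author</author>"
    "</titleStmt></fileDesc></teiHeader>"
    "<text><body><p>Body.</p></body></text></TEI.2>"
)

def update_classifier(db, index, fetch):
    """Add articles the classifier has not seen yet.

    db    -- an open sqlite3 connection
    index -- dict mapping unique ID -> latest version (assumed format)
    fetch -- callable taking a URL and returning the XML as a string
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(id TEXT PRIMARY KEY, version INTEGER, author TEXT, title TEXT)"
    )
    for uid, version in index.items():
        known = db.execute(
            "SELECT 1 FROM articles WHERE id = ?", (uid,)
        ).fetchone()
        if known:
            continue  # already classified; this sketch ignores newer versions
        # The classifier asks for the flat URL; the server rewrites it.
        url = f"http://www.gne.org/articles/{uid}.{version}.en.xml"
        root = ET.fromstring(fetch(url))
        author = root.findtext(".//author", default="")
        title = root.findtext(".//title", default="")
        db.execute(
            "INSERT INTO articles VALUES (?, ?, ?, ?)",
            (uid, version, author, title),
        )
    db.commit()
```

With an in-memory database, `update_classifier(sqlite3.connect(":memory:"), {"123456": 3}, lambda url: SAMPLE)` stores one row for ID 123456. The .asc file is never requested, matching the classifier described above.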

Next, a user visits this classifier's Web site and opens the author list, which shows a single entry: the author of the article with unique ID 123456.

In a similar manner, other classifiers might do much more complicated things, such as sending the article to a mailing list of peer-reviewers (again, like Nupedia) or applying any number of other schemes to classify the article in question.



So, what software does GNE need to write? If we use TEI as the representation format, the project can start receiving submissions immediately; a cursory reading of TEITools indicates that it can convert TEI to HTML, TeX and RTF. All that needs creating is a method of getting the submission into TEI in the first place, which can be a Python (or whatever) script which accepts plain text and makes guesses at what things should be (e.g. bold face, references, and so on). Then, editors can make sure this makes sense and the author can give the final okay.
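As a very rough idea of what such a guessing script could look like, the Python sketch below turns blank-line-separated plain text into a TEI-style body, treating `*word*` as bold face. The `*...*` convention and the function name `text_to_tei_body` are assumptions for this example, and the output is only a fragment: a real submission would also need a TEI header and would have to be checked against the DTD.

```python
import re
from xml.sax.saxutils import escape

def text_to_tei_body(plain_text):
    """Guess a TEI-style <text><body> fragment from plain text."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [
        p.strip() for p in re.split(r"\n\s*\n", plain_text) if p.strip()
    ]
    out = ["<text>", "<body>"]
    for para in paragraphs:
        # Collapse line breaks inside a paragraph, then escape &, < and >.
        safe = escape(" ".join(para.split()))
        # Guess: *word* was meant as bold face (assumed convention).
        safe = re.sub(r"\*([^*]+)\*", r'<hi rend="bold">\1</hi>', safe)
        out.append(f"<p>{safe}</p>")
    out += ["</body>", "</text>"]
    return "\n".join(out)
```

Because the script only guesses, its output is meant to go to the editors and the author for checking, not straight into the repository.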

We can serve submissions live from the XML repository using Apache and TEITools. Classification projects can thus begin work immediately; they will be the major programming work of the project (besides building better anything-to-TEI conversion scripts). Splitting interested parties into groups for this can begin immediately as well.



I propose that four groups be formed immediately:

'Backend' :: This group will set up and manage the back-end server. It should do nothing more than accept submissions in DTD-compliant XML, accept revised versions of the same document and accept signatures for existing documents. From this, it should make the above-mentioned index file and serve XML pages to the classifiers. This group should also set up the moderation system which rejects things which are unanimously decided to be spam. This group should also determine whether or not multimedia will be inline in the XML or served as separate files.

'Editing' :: This group will provide editing services for authors. Any author can submit their article to the group for comments, although this will obviously not be a requirement.

'Classification' :: This group will write the first generic classifier project, which is intended as a template for other, more specific classifier projects to use.

'Conversion' :: This group will work on methods and programs for efficiently converting submitted articles into TEI. Emphasis should be on making it easy for (especially) academic groups to submit articles, so LaTeX might be a good first choice after plain text.

See also : Wikipedia:GNE Project Files