Offline Reader/RequirementSpecs

Status: DRAFT

Definitions

Mediawiki: A PHP-based implementation of the wiki concept, in which everyone can read and contribute content to a set of web pages via a web browser.

Wikipedia: A collaboratively maintained encyclopedia based on Mediawiki.

the software: The software to be created as described in this document.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Rationale

Wikipedia is usually accessed and edited via the Mediawiki software, a web application. However, Mediawiki is not suitable for offline use, e.g. for distribution on CD and DVD media, since it requires both a running database server and a running web server.

Moreover, an offline copy of Wikipedia has entirely different target groups compared to the online version:

Users who have no permanent access to the internet (e.g. dial-up users)
Users who cannot afford access to the internet, such as inhabitants of developing countries
Users who need advanced full-text search and spelling-tolerant search, which are currently unavailable on wikipedia.org

Thus, software is needed that can be shipped on such media and runs on all common operating systems (Windows, Linux/Unix, Mac OS X).

Components

The software consists of several components:

Generator

Interprets a Mediawiki dump and generates an archive suitable for distribution on CD and DVD media. A GUI for this application is OPTIONAL. The generator MUST collect and store at least the following data:

All articles of a given Mediawiki SQL dump
Graphics and media data from the Mediawiki SQL dump
Graphics and media from a dependent image dump (e.g. Commons)
A list with all authors that have contributed to the respective articles (as required by e.g. the GFDL).
A set of metadata for the given Mediawiki SQL dump. This MUST include:
- Title of the collected data (e.g. "Deutschsprachige Wikipedia")
- Date of snapshot
- License of volume contents. If the license differs per page, all contained licenses MUST be listed. Additionally, the correct license must be displayed when viewing an article or media file. Note that images are not always licensed under the GFDL, so their license status must be shown separately.
- All authors that contributed to an article or media file
- Indexable keywords in the article
- Category scheme
- Personendaten (person metadata, as used on de.wikipedia.org)
- Lexicographically correct lemma for articles about a person (e.g. "Torvalds, Linus")
OPTIONAL additions to the above are:
- Timestamp of last edit
- Metadata for articles marked as "brilliant prose", "good article", "cleanup"
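To make the list above more concrete, the following is a minimal sketch of how the generator could hold the collected metadata in memory. All class, field, and function names are hypothetical, the on-disk archive format is not specified here, and the lemma conversion naively assumes that the last word of a name is the family name.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Per-article data collected by the generator (hypothetical layout)."""
    title: str
    license: str                              # license of this article or media file
    authors: List[str]                        # all contributors
    keywords: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    sort_lemma: Optional[str] = None          # e.g. "Torvalds, Linus" for persons
    last_edit: Optional[str] = None           # OPTIONAL: timestamp of last edit
    quality_marks: List[str] = field(default_factory=list)  # OPTIONAL: e.g. "good article"


@dataclass
class VolumeMetadata:
    """Volume-wide metadata: title, snapshot date, and the article records."""
    title: str
    snapshot_date: date
    articles: List[ArticleRecord] = field(default_factory=list)

    def licenses(self) -> List[str]:
        """Return every distinct license contained in the volume, sorted."""
        return sorted({article.license for article in self.articles})


def person_lemma(display_name: str) -> str:
    """Turn 'Given Family' into 'Family, Given' (assumes the last word is the family name)."""
    parts = display_name.split()
    if len(parts) < 2:
        return display_name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```

The `licenses()` helper covers the case where the license differs per page: the volume-wide list is derived from the per-article records, so the two cannot drift apart.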

Reader

A graphical frontend to browse the created archive. It MUST provide:

a full-text search, which MUST appear to be instant (minimum hardware to be defined)
an index-based search
an alphabetical search
a search over the extracted metadata (e.g. "search in 'good articles'")
a reader window for articles which MUST
- show text
- be able to show/play all media content used in Wikipedia
- be able to browse external links (may open an external browser)
- show the text in a structured form, including normal text, bold text, headlines, tables, and quotations
an option to export and print single articles.
compliance with the license requirements of the GFDL. The names of the authors of each article have to be accessible, the original file source should be linked, and the GFDL should be noted at the bottom of the article.
The implementor MUST use the Qt 4 toolkit.
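As an illustration of the full-text and alphabetical searches required above, here is a minimal sketch of an in-memory inverted index with a sorted title list. It is written in Python for brevity rather than against Qt, the class and method names are hypothetical, and a real archive would need a precomputed on-disk index to feel instant on the minimum hardware.

```python
import bisect
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ArchiveIndex:
    """Hypothetical search index over a mapping of article title -> article text."""

    def __init__(self, articles: Dict[str, str]) -> None:
        # Titles sorted case-insensitively, plus their lowercase keys for bisecting.
        self.titles: List[str] = sorted(articles, key=str.lower)
        self._keys: List[str] = [title.lower() for title in self.titles]
        # Inverted index: word -> set of titles whose text contains that word.
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        for title, body in articles.items():
            for word in tokenize(body):
                self._postings[word].add(title)

    def fulltext(self, query: str) -> List[str]:
        """Return the titles of articles containing every word of the query."""
        words = tokenize(query)
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits, key=str.lower)

    def alphabetical(self, prefix: str) -> List[str]:
        """Return all titles starting with the prefix, in alphabetical order."""
        p = prefix.lower()
        lo = bisect.bisect_left(self._keys, p)
        hi = bisect.bisect_left(self._keys, p + chr(0x10FFFF))
        return self.titles[lo:hi]
```

Both lookups avoid scanning article text at query time: the full-text search intersects precomputed word sets, and the alphabetical search is two binary searches over the sorted titles.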

Non-MUST items:

OPTIONAL: other kinds of searches
RECOMMENDED: Make use of KDE libraries (KHTML, etc)
MAY: list of articles that contain a given search string, ranked by "relevance"
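This document does not define "relevance". One simple interpretation, sketched below purely as an assumption, scores each article by how often the query words occur in its text. The function name is hypothetical, and a production reader would more likely use a weighted scheme evaluated during the search-engine milestone.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def rank_by_relevance(articles: Dict[str, str], query: str) -> List[Tuple[str, int]]:
    """Rank articles by the total number of occurrences of the query words.

    Returns (title, score) pairs, highest score first; ties are broken by title.
    Articles that contain none of the query words are omitted.
    """
    terms = re.findall(r"\w+", query.lower())
    scored = []
    for title, body in articles.items():
        counts = Counter(re.findall(r"\w+", body.lower()))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((title, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```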

Milestone Plan

Target platforms evaluated, minimal hardware specified
Generator formats specified
Search engines / algorithms evaluated, formats adjusted
First reference implementation of reader
Extended search options implemented
Product ready for extensive testing
Delivery of finished product