Static version tools

This is a central repository for sharing software - scripts and other tools - for use in static (offline) releases of Wikimedia projects such as Wikipedia 1.0. Several different language groups are working on similar scripts, so it makes sense for us to share the best that we have. The following needs have been identified:

  • Interactivity and interfaces: front-ends to read and interact with different snapshot formats.
  • Reducing text: summarizing, auto-excerpting
  • Ranking text: bot-assisted reviewing/vetting/rating, metric analysis (apsp, grank, hit-popularity, edit-popularity, expertise, writing style, &c)
  • Metadata: bot-assisted annotation (audience, type, categorization)
  • Spellchecker, grammar checker
  • Copyvio checker
  • Image resizing & compression
  • Metadata extraction
    • History metadata (list of users, freshness, &c)
    • Image/media metadata
  • Index generation (for browsing)
    • Category tree generation

Some of these sections are represented below by actual scripts & other tools - please add more as you find appropriate.

Tools to do all the work

File storage

File readers

See one user's evaluation of these systems in these personal blog posts.

Tools for assessing & cleaning up articles

WP_1.0_bot is used on the English Wikipedia to collect assessment information on articles from WikiProjects. For example, a member of the Chemistry WikiProject will manually assess a chemistry article for its quality and importance and record the information in a talk-page project template. This information is compiled by the bot, which generates output such as tables, a statistics table and a log, all of which are valuable for the WikiProject. A complete list of participating WikiProjects and task forces (around 1400 as of August 2008) is available at the Index and Index2, along with a global statistics table. The information compiled by this bot is then used as the basis for an automated selection for offline releases (see the selection section below). An improved version of the bot is being discussed (August 2008).

The French Wikipedia also uses a similar bot, written independently; this includes some features not available in the English bot code.

Tools for selecting articles

See w:Wikipedia:Version 1.0 Editorial Team/Article selection

Tools for assembling the CD

See also Manual:Using content from Wikipedia. The following is a summary of w:User:Wikiwizzy/CDTools.

Provided with an article list, a category list, or both, the task is to create a static HTML dump that is browsable directly from the CD.

The raw dumps of all language Wikipedias are available as XML at Data dumps. These dumps can be manipulated using the MWDumper Java program.[1] MWDumper accepts a --filter switch that can be used to pick out only a defined selection of articles, producing a similar but much smaller and more manageable XML dump, wpcd-trim.xml.
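
For example, a minimal sketch (in Python, driving MWDumper as an external process) of producing wpcd-trim.xml; the jar name, dump file name and titles file are placeholders, and the --filter=list syntax should be checked against your MWDumper build:

  import subprocess

  # Run MWDumper over the full dump, keeping only the pages named in titles.txt
  # and writing the trimmed XML to wpcd-trim.xml (file names are placeholders).
  subprocess.run(
      [
          "java", "-jar", "mwdumper.jar",
          "--filter=list:titles.txt",       # keep only the selected article titles
          "--format=xml",                   # emit XML rather than SQL
          "--output=file:wpcd-trim.xml",    # the trimmed dump used in later steps
          "enwiki-latest-pages-articles.xml.bz2",
      ],
      check=True,
  )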

This is a good time to remove unwanted content, such as interwiki links and "External links" sections, if desired.
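
A rough Python sketch of such trimming, applied to each article's wikitext; the regular expressions are simplified assumptions rather than an exhaustive rule:

  import re

  # Interwiki language links on their own line, e.g. [[fr:Paris]] or [[pt-br:...]].
  INTERWIKI = re.compile(r"^\[\[[a-z]{2,3}(-[a-z]+)?:[^\]]+\]\]\s*$", re.MULTILINE)
  # An "External links" section, up to the next heading or the end of the page.
  EXTERNAL_LINKS = re.compile(r"^==\s*External links\s*==.*?(?=^==|\Z)",
                              re.MULTILINE | re.DOTALL | re.IGNORECASE)

  def trim_wikitext(text: str) -> str:
      text = EXTERNAL_LINKS.sub("", text)   # drop the External links section
      text = INTERWIKI.sub("", text)        # drop interwiki language links
      return text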

Ideally, we would like to create the HTML dump directly from this XML, but the need for category support and for a tool to convert MediaWiki markup to HTML means that, at present, loading the content into a MediaWiki installation seems the best way to go.

An empty MediaWiki installation (including MySQL and Apache) can then be loaded with the article subset, giving a 'Wikipedia' containing only the required articles, with the unwanted sections trimmed. However, category links will not work yet, as the category pages are stored in a different XML dump at Data dumps.
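
A sketch of that loading step, assuming MediaWiki's standard importDump.php and rebuildall.php maintenance scripts and a placeholder installation path:

  import subprocess

  MEDIAWIKI = "/var/www/mediawiki"   # placeholder path to the empty installation

  # Import the trimmed article subset.
  subprocess.run(["php", f"{MEDIAWIKI}/maintenance/importDump.php", "wpcd-trim.xml"],
                 check=True)
  # Rebuild the link tables so internal links and category membership resolve.
  subprocess.run(["php", f"{MEDIAWIKI}/maintenance/rebuildall.php"], check=True)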

To load the category information, the wpcd-trim.xml file is read again and every included article is scanned for its categories. All categories containing at least three of the selected articles are then extracted from the complete category dump and loaded into the MediaWiki installation.
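
A Python sketch of the scanning step, counting [[Category:...]] links in the trimmed dump and keeping those used by at least three articles; the export namespace URI varies with the dump's schema version:

  import re
  from collections import Counter
  from xml.etree import ElementTree

  NS = "{http://www.mediawiki.org/xml/export-0.10/}"   # adjust to the dump's schema
  CATEGORY = re.compile(r"\[\[Category:([^\]|#]+)", re.IGNORECASE)

  counts = Counter()
  for _, elem in ElementTree.iterparse("wpcd-trim.xml"):
      if elem.tag == NS + "text" and elem.text:
          for name in CATEGORY.findall(elem.text):
              counts[name.strip()] += 1
      elem.clear()   # keep memory bounded while streaming through the dump

  wanted = {name for name, n in counts.items() if n >= 3}
  print(len(wanted), "categories with at least 3 member articles")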

Now the dumpHTML.php script from the MediaWiki software can be run to create the static HTML dump.
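
For example (the script's location and its -d destination option may differ between MediaWiki versions, as dumpHTML later moved into the DumpHTML extension):

  import subprocess

  # Write the static HTML tree to a placeholder destination directory.
  subprocess.run(["php", "/var/www/mediawiki/maintenance/dumpHTML.php",
                  "-d", "/var/www/static-html"],
                 check=True)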

The Wikipedia Offline Server was released publicly in 2007; it allows you to browse the pages of any language HTML dump (wikipedia-*-html.7z files) on your localhost. It consists of a small Ruby script with an embedded web server, and uses 7-Zip to selectively extract contents. (We are working on improving 7-Zip to make this faster.) See the initial announcement at http://reganmian.net/blog/2007/02/15/wikipedia-offline-server-02/.
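
A rough Python analogue of that design, i.e. a tiny local web server that extracts one page from the .7z HTML dump per request by shelling out to 7z (the archive name is a placeholder):

  import subprocess
  from http.server import BaseHTTPRequestHandler, HTTPServer

  ARCHIVE = "wikipedia-en-html.7z"   # placeholder dump file

  class DumpHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          member = self.path.lstrip("/")
          # Extract the requested file from the archive to stdout (-so).
          proc = subprocess.run(["7z", "x", "-so", ARCHIVE, member],
                                capture_output=True)
          if proc.returncode != 0 or not proc.stdout:
              self.send_error(404)
              return
          self.send_response(200)
          self.send_header("Content-Type", "text/html")
          self.end_headers()
          self.wfile.write(proc.stdout)

  HTTPServer(("localhost", 8000), DumpHandler).serve_forever()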

Spotting vandalised sections

There is a useful tool for listing all of the "bad words" that are often a red flag for vandalism - the Perl script is available.
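
The linked tool is a Perl script; purely as an illustration of the idea, here is a minimal Python sketch that flags articles whose text contains words from a placeholder badwords.txt list:

  import re

  with open("badwords.txt") as f:                      # placeholder word list
      bad_words = [line.strip() for line in f if line.strip()]
  pattern = re.compile(r"\b(" + "|".join(map(re.escape, bad_words)) + r")\b",
                       re.IGNORECASE)

  def flag_vandalism(title: str, wikitext: str) -> None:
      hits = sorted(set(m.lower() for m in pattern.findall(wikitext)))
      if hits:
          print(title + ": suspicious words: " + ", ".join(hits))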

wiki2cd software

This software takes a list of topics and automates everything needed to create a local repository ready to distribute on CD. It was used for creating the Malayalam Wikipedia selected articles CD. More details are available at http://wiki.github.com/santhoshtr/wiki2cd/

Tools for reading files offline

Main article: Offline Projects

Kiwix

Kiwix brings internet content to people without internet access. It is free as in beer and as in speech.

As an offline reader, it is designed above all to make Wikipedia available offline, but technically any kind of web content can be stored in a ZIM file (a highly compressed open format) and then read by the app: several hundred different collections are currently available in more than 100 languages, ranging from Wikipedia, Wikiquote and the Wiktionary to TED conferences, the Gutenberg library, Stack Exchange and many others.

Tools for adding pictures to HTML dump

For the English Wikipedia, there is a corresponding picture dump that contains the full-size version of all pictures referenced in articles (Commons ??). This dump runs to about 100 gigabytes. Alternatively, the pictures can be fetched from the live Wikipedia, but this should probably be done by an approved bot or it risks being blocked as a spider. User Kelson has tools to do either of these.
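
A hedged sketch of the second option, fetching a placeholder list of image URLs with a descriptive User-Agent and a pause between requests, in the spirit of the bot policy:

  import time
  import urllib.request

  HEADERS = {"User-Agent": "WPCD-image-fetcher/0.1 (offline release build)"}

  with open("image_urls.txt") as f:          # placeholder list of image URLs
      for url in (line.strip() for line in f):
          if not url:
              continue
          req = urllib.request.Request(url, headers=HEADERS)
          filename = url.rsplit("/", 1)[-1]
          with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
              out.write(resp.read())
          time.sleep(1)                      # be gentle with the live servers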

Search tools

The amount of information on the CD or DVD calls for some method of searching. Search is normally done via a server, usually as a database search; on the CD we have neither, and the only computer in sight is the browser. The only generally portable browser language is JavaScript, and there is a solution called ksearch-client-side[2] that allows a JavaScript program to be built to act as the index. It contains two arrays and a program (a sketch of building the index data follows the list below):

  1. a list of all articles, with the first 15 words of each article.
  2. a list of words, with the indexes of all articles that contain each word.
  3. a simple form that allows a search of the second array and returns the summary from the first array.
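
A minimal Python sketch, assuming ksearch's general scheme, of generating those two arrays as JavaScript from a mapping of article titles to plain text:

  import json
  import re

  def build_index(articles: dict) -> str:
      titles = list(articles)
      summaries = []        # array 1: [title, first 15 words] per article
      word_index = {}       # array 2: word -> indexes of articles containing it
      for i, title in enumerate(titles):
          words = re.findall(r"[a-z0-9']+", articles[title].lower())
          summaries.append([title, " ".join(words[:15])])
          for w in set(words):
              word_index.setdefault(w, []).append(i)
      return ("var articles = " + json.dumps(summaries) + ";\n"
              "var wordIndex = " + json.dumps(word_index) + ";\n")

  with open("search-index.js", "w") as out:
      out.write(build_index({"Example": "Example article text goes here"}))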

WikiMiner

A dedicated search engine has been created for the DVD edition of the Polish Wikipedia. Its documentation is available.

Alternative parsers

See Alternative parsers on MediaWiki.

Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools

Here are a number of new ideas of mine that may interest you, as well as anyone interested in the project of creating a Wikipedia CD/DVD and in the need to better patrol the contents, to work better in teams with supervisors, and to enforce copyright and national legal restrictions.

These new concepts concern ALL MediaWiki projects, not just those hosted by the Foundation and not just Wikipedia.

See the discussion just started here:
Wikipedia on CD/DVD#Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools.

Most probably a great new project to build.

verdy_p 12:10, 16 November 2008 (UTC)

References