TomeRaider 3 conversion script

From Meta, a Wikimedia project coordination wiki

Wikipedias in six languages (de, en, eo, fr, nl, pl) are available in TomeRaider format (TomeRaider is commercial software) for offline browsing on a handheld (Pocket PC, Palm OS or EPOC) or on a Windows desktop or notebook PC. All articles are preserved in the database. However, all meta information (user pages, upload info, discussions, etc.) has been discarded to save memory, which is at a premium on handhelds.

Screenshots and more info about Wikipedia for TomeRaider:

TomeRaider 3 was released on October 18, 2004. It adds support for images and categories and offers even better search facilities.

A Wikipedia to TomeRaider 3 conversion script will be released for beta testing very soon. It supports the new image and category features, and it has been completely reworked to meet the new, very strict input requirements, patching many thousands of user input errors (HTML syntax). Support for TomeRaider 2 has not been discontinued yet: a new set of Wikipedia databases for TomeRaider 2 will also be online at the Wikipedia server very soon.

The new script can generate either a text-only database or a version with images included. The text-only version is already large (Oct 2004 English Wikipedia: 450 MB), but the version with images is far larger (Oct 2004 English Wikipedia: 1.5 GB). For many owners of a handheld/PDA, buying a new memory card to accommodate the image-enhanced version is not a decision they will make lightly.

The new conversion script makes it possible to add custom code that selects images only for certain article categories. In addition, some templates that are irrelevant or of minor interest for an offline read-only Wikipedia will be omitted. This page is meant to collect ideas, code snippets and lists of templates/categories that will help to minimize the memory requirements of the TomeRaider 3 Wikipedia databases.

Templates to omit

You can help to collect templates that are of little or no interest for offline read-only versions of the Wikipedia, or occur on so many pages that their inclusion would impact the file size noticeably. Every now and then the conversion script will be updated to filter these templates from the database.

To see the actual content of the listed templates, copy the tags to a sandbox article on the Wikipedia concerned and click preview.

  • Dutch Wikipedia: stub: {{beg}}, disambiguation page: {{dp}}
  • English Wikipedia: {{stub}}, see also: .. {{wikiquote}}, creative commons license: {{cc-nc}}
  • Esperanto Wikipedia:
  • French Wikipedia:
  • Esperanto Wikipedia: stub: {{stumpo}}
  • Polish Wikipedia:
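
As an illustration of how such a filter might work, here is a minimal Perl sketch that removes the listed tags from article text. The function name StripTemplates and the table %omit_templates are hypothetical; they are not taken from the conversion script. The sketch only handles tags without parameters, written exactly as listed above.

```perl
use strict;
use warnings;

# Hypothetical per-language lists, seeded from the templates collected above.
my %omit_templates = (
    nl => [ 'beg', 'dp' ],
    en => [ 'stub', 'wikiquote', 'cc-nc' ],
    eo => [ 'stumpo' ],
);

# Remove every parameterless {{name}} tag listed for $lang from $text.
# Languages without a list are returned unchanged.
sub StripTemplates {
    my ($lang, $text) = @_;
    for my $name (@{ $omit_templates{$lang} || [] }) {
        # \Q..\E quotes the name so characters such as '-' are matched literally
        $text =~ s/\{\{\s*\Q$name\E\s*\}\}//gi;
    }
    return $text;
}

print StripTemplates('en', "Orion is a constellation. {{stub}}\n");
```

Templates that take parameters would need a more careful parser than this single regular expression.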

Images to select

The conversion script provides a hook: a call to a function 'SelectImage' in a language-specific file (where XX = language code), in which custom image selection and resize criteria can be coded.

The function is called once for each image, with the following parameters:

  • 1: Image file name
  • 2: Image display size specified in the article (if any, mostly for thumbnails)
  • 3: Article title
  • 4: Full article text (passed to the function as title + || + article text)
  • 5: List of all categories to which this article belongs

The function returns three values:

  • 1: A flag indicating whether to include this image at all
  • 2: Image display size to be used
  • 3: An optional extra category tag for this article, e.g. for debugging

You might use this, for instance, to include only PNG and GIF images, which covers most maps and diagrams, or to include only images from certain categories (remember that the category feature is relatively new in Wikipedia, hence not all pages are categorized yet).

The Perl code for this function looks like this:

sub SelectImage
{
  my $image    = shift ; # image file name
  my $width    = shift ; # suggested size (part of image tag: ..px)
  my $title    = shift ; # article title / underscores have been changed to spaces
  my $entry    = shift ; # $title + || + $article
  my @categories = @_ ;  # zero or more categories attached to this article

  # $true is assumed to be set by the main conversion script
  my $select   = $true ; # select this image for inclusion in output file
  my $category = "" ;    # add a category, e.g. for debugging

# test code:
# if ($image =~ /constellation/i)
# { return ($true, 99999, "! Constellation") ; } # select, use original size, add category name

  return ($select, $width, $category) ;
}
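
The PNG/GIF idea mentioned above could be sketched as follows. This is an illustration only, not part of the conversion script. It assumes that the main script defines $true and that a matching false value (here called $false) makes the script skip the image; both are set locally so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Assumption: the main script provides these globals; defined here for a standalone run.
our ($true, $false) = (1, 0);

sub SelectImage {
    my $image      = shift;  # image file name
    my $width      = shift;  # suggested display size, if any
    my $title      = shift;  # article title
    my $entry      = shift;  # title + || + article text
    my @categories = @_;     # categories attached to this article

    # Keep only PNG and GIF files, which covers most maps and diagrams.
    if ($image =~ /\.(?:png|gif)$/i) {
        return ($true, $width, "");
    }
    return ($false, $width, "");
}

# Example call with made-up arguments
my ($select, $size, $tag) =
    SelectImage("Orion_map.png", 180, "Orion", "Orion||Orion is a constellation.", "Constellations");
print "selected at width $size\n" if $select;
```

The test is on the file extension only, so a photograph stored as PNG would still be included.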

One way to generate a much smaller TomeRaider 3 Wikipedia file would be to collect all categories for a broad cross-section of Wikipedia articles, e.g. all categories that relate to sports or science, and to use the function SelectImage to select only images from articles that belong to these categories.
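
A category-based filter along these lines might look like the sketch below. The category names in %wanted are placeholders chosen for illustration, and $true/$false are again assumed to mirror globals of the main script. The comparison ignores case, on the assumption that category spelling may vary between articles.

```perl
use strict;
use warnings;

# Assumption: the main script provides these globals; defined here for a standalone run.
our ($true, $false) = (1, 0);

# Hypothetical whitelist of categories whose articles keep their images.
my %wanted = map { ( lc($_) => 1 ) } ('Astronomy', 'Constellations', 'Physics');

sub SelectImage {
    my ($image, $width, $title, $entry, @categories) = @_;

    # Select the image as soon as one of the article's categories is whitelisted.
    for my $category (@categories) {
        return ($true, $width, "") if $wanted{ lc $category };
    }

    # Uncategorized articles, or articles outside the whitelist, lose their images.
    return ($false, $width, "");
}

my ($select) = SelectImage("Orion.jpg", 250, "Orion", "Orion||Orion is a constellation.",
                           "Constellations", "Greek mythology");
print "selected\n" if $select;
```

Because not all pages are categorized yet, a filter like this also drops images from every uncategorized article, which may or may not be desirable.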

Would it be a good idea to collect these categories here, or is there already a good place in Wikipedia to look for such collections? Erik Zachte 22:01, Oct 24, 2004 (UTC)