Static version tools
This is a central repository for sharing software - scripts and other tools - for use in static (offline) releases of Wikimedia projects such as Wikipedia 1.0. Several different language groups are working on similar scripts, so it makes sense for us to share the best that we have. The following needs have been identified:
- Interactivity and interfaces : Front-ends to read and interact with different snapshot formats.
- Reducing text : summarizing, auto-excerpting
- Ranking text : bot-assisted reviewing/vetting/rating, metric analysis (apsp, grank, hit-popularity, edit-popularity, expertise, writing style, &c)
- Metadata : bot-assisted annotation (audience, type, categorization)
- Spellchecker, grammar checker
- Copyvio checker
- Image resizing & compression
- Metadata extraction
- History metadata (list of users, freshness, &c)
- Image/media metadata
- Index generation (for browsing)
- Category tree generation
Some of these sections are represented below by actual scripts & other tools - please add more as you find appropriate.
- 1 Tools to do all the work
- 2 Tools for assessing & cleaning up articles
- 3 Tools for selecting articles
- 4 Tools for assembling the CD
- 5 Tools for reading files offline
- 6 Tools for adding pictures to HTML dump
- 7 Search tools
- 8 Alternative parsers
- 9 Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools
- 10 References
Tools to do all the work
Tools for assessing & cleaning up articles
WP_1.0_bot is used on the English Wikipedia to collect assessment information on articles from WikiProjects. For example, a member of the Chemistry WikiProject will manually assess a chemistry article for its quality and importance and record the information in a talk page project template. This information is compiled by the bot, which generates output such as tables, a stats table and a log, which are all valuable for the WikiProject. A complete list of participating WikiProjects and task forces (around 1400 as of August 2008) is available at the Index and Index2, along with a global statistics table. The information compiled by this bot is then used as the basis for an automated selection for offline releases (see the selection section below). An improved version of the bot is being discussed (August 2008).
The French Wikipedia also uses a similar bot, written independently; this includes some features not available in the English bot code.
Tools for selecting articles
SelectionBot is beginning to be used on the English Wikipedia to select articles based on quality and importance. It depends on WikiProjects providing extensive metadata via the WP_1.0_bot (see above); as of August 2008, such data are available for 1.4 million articles. Preliminary test code for SelectionBot is available here, and test output here, but please note that these are only at the testing stage (as of August 2008).
Tools for assembling the CD
- See also Manual:Using content from Wikipedia and Extension:DumpHTML. The following is a summary of w:User:Wikiwizzy/CDTools
Given an article list, a category list, or both, the task is to create a static HTML dump that is browsable from the CD.
The raw dumps of all language Wikipedias are available as XML at Data dumps. These dumps can be manipulated using the MWDumper Java program. MWDumper accepts a --filter switch that can be used to pick only a defined selection of articles, outputting a similar but much smaller and more manageable XML dump, wpcd-trim.xml.
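To illustrate what such a title filter does, here is a minimal Python sketch. It is not MWDumper and does not reproduce its options; the function names `strip_ns` and `trim_dump` are made up for this example, and the output drops the export namespace for simplicity. It streams the dump and copies only the pages whose titles are in the selection.

```python
import xml.etree.ElementTree as ET


def strip_ns(elem):
    """Remove '{namespace}' prefixes from elem and all of its descendants."""
    for node in elem.iter():
        node.tag = node.tag.rsplit("}", 1)[-1]


def trim_dump(source, wanted_titles, dest):
    """Copy only the <page> elements whose <title> is in wanted_titles.

    source: path or binary file object of a MediaWiki XML export.
    dest:   text file object that receives the trimmed dump.
    Returns the number of pages kept.
    """
    wanted = set(wanted_titles)
    kept = 0
    dest.write("<mediawiki>\n")
    # iterparse streams the file, so the whole dump is never held as one tree.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        strip_ns(elem)
        if elem.findtext("title") in wanted:
            dest.write(ET.tostring(elem, encoding="unicode"))
            kept += 1
        elem.clear()  # drop the page's children to limit memory use
    dest.write("</mediawiki>\n")
    return kept
```

A call such as `trim_dump("pages-articles.xml", titles, open("wpcd-trim.xml", "w", encoding="utf-8"))` would produce the trimmed file; the input file name here is only an example.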
This is a good time to remove unwanted sections, like interwiki links and External links, if desired.
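As a rough sketch of this trimming step, the Python below removes a level-2 "External links" section and interlanguage links from an article's wikitext. The regular expressions are approximations chosen for this example (real language codes come from the wiki's interwiki table), and `trim_wikitext` is a hypothetical helper name.

```python
import re

# Interlanguage links sit on their own line, e.g. [[fr:Eau]].
# Assumption: a language code is 2-3 lowercase letters, optionally with
# hyphenated suffixes (e.g. zh-min-nan).
INTERWIKI = re.compile(
    r"^\[\[[a-z]{2,3}(?:-[a-z]+)*:[^\]\|]+\]\][ \t]*$\n?", re.MULTILINE
)

# The section runs from its heading to the next level-2 heading or end of text.
EXTERNAL_LINKS = re.compile(
    r"^==[ \t]*External links[ \t]*==[ \t]*$.*?(?=^==[^=]|\Z)",
    re.MULTILINE | re.DOTALL,
)


def trim_wikitext(text):
    """Return text without the External links section and interlanguage links."""
    text = EXTERNAL_LINKS.sub("", text)
    return INTERWIKI.sub("", text)
```

Category links such as `[[Category:Liquids]]` are left in place, since the later steps rely on them.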
Ideally, we would create the HTML dump directly from this XML. However, categories are needed, and so is a tool to convert MediaWiki markup to HTML, so at present setting up a MediaWiki installation seems the best way to go.
An empty MediaWiki installation (including MySQL and Apache) can then be loaded with the article subset, giving a 'Wikipedia' that contains only the required articles, with the unwanted sections trimmed. However, category links will not work yet, as they are stored in a different XML dump at Data dumps.
To load category information, the wpcd-trim.xml file is read again, and all needed articles are scanned for their categories. All categories that have at least 3 articles in them are extracted from the complete category dump and loaded into the MediaWiki installation.
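The category scan can be sketched in Python as follows. This is an illustration under assumptions, not the CDTools code: it counts category membership only among the selected articles, it recognises only the English "Category" prefix, and the names `categories_of` and `categories_to_load` are invented for the example.

```python
import re
from collections import Counter

# Matches [[Category:Name]] and [[Category:Name|sort key]]; a leading colon
# ([[:Category:Name]]) is an ordinary link and is deliberately not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]\|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def categories_of(wikitext):
    """Return the set of category names that one article belongs to."""
    return {name.strip() for name in CATEGORY_LINK.findall(wikitext)}


def categories_to_load(articles, min_articles=3):
    """articles maps title -> wikitext.

    Returns the categories used by at least min_articles of the articles.
    """
    counts = Counter()
    for text in articles.values():
        counts.update(categories_of(text))
    return {name for name, n in counts.items() if n >= min_articles}
```

The resulting set is the list of category pages to pull from the complete category dump.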
Now the dumpHTML.php script from the MediaWiki software can be run to create a static HTML dump.
The Wikipedia Offline Server has been released publicly. It is still under heavy development, but it already allows you to browse the pages from any language HTML dump (wikipedia-*-html.7z files) on your localhost. It consists of a small Ruby script with an embedded web server, and uses 7zip to selectively extract contents. (We are working on improving 7zip to make this faster.) See the initial announcement at http://reganmian.net/blog/2007/02/15/wikipedia-offline-server-02/.
- Spotting vandalised sections
There is a useful tool for listing all of the "bad words" that are often a red flag for vandalism; the Perl script is available here. The English Wikipedia also plans to use the WikiTrust approach to identify good versions of articles.
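The idea behind a bad-word scan can be sketched in a few lines of Python. This is not the Perl script mentioned above: the word list is a placeholder invented for the example, and `flag_articles` is a hypothetical name.

```python
import re

# Placeholder list for illustration only; the real script's list is not
# reproduced here.
BAD_WORDS = {"stupid", "sucks", "poop"}

WORD = re.compile(r"[a-z']+")


def flag_articles(articles, bad_words=BAD_WORDS):
    """articles maps title -> plain text.

    Returns {title: sorted list of bad words found} for articles with hits.
    """
    bad = set(bad_words)
    flagged = {}
    for title, text in articles.items():
        hits = sorted(set(WORD.findall(text.lower())) & bad)
        if hits:
            flagged[title] = hits
    return flagged
```

A hit is only a red flag, so flagged articles still need a human check: an article about profanity would legitimately contain such words.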
- wiki2cd software
This software takes a list of topics and automates the creation of a local repository ready to distribute on CD. It was used to create the CD of selected Malayalam Wikipedia articles. More details are available at http://wiki.github.com/santhoshtr/wiki2cd/
Tools for reading files offline
Kiwix brings internet content to people without internet access. It is free as in beer and as in speech.
As an offline reader, it is primarily intended to make Wikipedia available offline, but technically any kind of web content can be stored in a ZIM file (a highly compressed open format) and then read by the app. Several hundred content packages are currently available in more than 100 languages, from Wikipedia, Wikiquote and Wiktionary to TED conferences, the Gutenberg library, Stack Exchange and many others.
Tools for adding pictures to HTML dump
For the English Wikipedia, there is a corresponding picture dump that has the full-size version of all pictures referenced in articles. (Commons ??) This dump runs to about 100 gigabytes. Alternatively, the pictures can be fetched from the live Wikipedia, but this should probably be done by an approved bot, or it risks being blocked as a spider. User Kelson has tools to do either of these.
Search tools
- a list of all articles, with the first 15 words of each article
- a list of words, with the article index of all articles that contain each word
- a simple form that allows a search of the second array and returns the summary from the first array
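The three parts listed above can be sketched in Python as follows. This only illustrates the data structures; it is not the code used on any CD, and the function names and the rule that a query must match all of its words are assumptions made for the example.

```python
import re

WORD = re.compile(r"\w+")


def build_index(articles, summary_words=15):
    """articles is a list of (title, text) pairs.

    Returns (summaries, word_index):
      summaries[i]  = (title, first summary_words words of article i)
      word_index[w] = ascending list of indices of articles containing w
    """
    summaries = []
    word_index = {}
    for i, (title, text) in enumerate(articles):
        summaries.append((title, " ".join(text.split()[:summary_words])))
        for word in set(WORD.findall(text.lower())):
            word_index.setdefault(word, []).append(i)
    return summaries, word_index


def search(query, summaries, word_index):
    """Return the summaries of articles that contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(word_index.get(words[0], []))
    for word in words[1:]:
        hits &= set(word_index.get(word, []))
    return [summaries[i] for i in sorted(hits)]
```

Both structures are built once when the CD is assembled, so a search at reading time needs only dictionary lookups and no server.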
A dedicated search engine has been created for the DVD edition of the Polish Wikipedia. Its documentation is available here.
Alternative parsers
See Alternative parsers on MediaWiki.
Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools
Here are some new ideas of mine that may interest you and anyone involved in creating a Wikipedia CD/DVD. They address the need to patrol content better, to work better in teams with supervisors, and to enforce copyright and national legal restrictions.
These new concepts concern ALL MediaWiki projects, not just those hosted by the Foundation and not just Wikipedia.
See the discussion just started here:
Wikipedia on CD/DVD#Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools.
This is most probably a great new project to build.
verdy_p 12:10, 16 November 2008 (UTC)