Static version tools
This is a central repository for sharing software - scripts and other tools - for use in static (offline) releases of Wikimedia projects such as Wikipedia 1.0. Several different language groups are working on similar scripts, so it makes sense for us to share the best that we have. The following needs have been identified:
- Interactivity and interfaces: front-ends to read and interact with different snapshot formats.
- Reducing text: summarizing, auto-excerpting
- Ranking text: bot-assisted reviewing/vetting/rating, metric analysis (apsp, grank, hit-popularity, edit-popularity, expertise, writing style, etc.)
- Metadata: bot-assisted annotation (audience, type, categorization)
- Spellchecker, grammar checker
- Copyvio checker
- Image resizing & compression
- Metadata extraction
- History metadata (list of users, freshness, etc.)
- Image/media metadata
- Index generation (for browsing)
- Category tree generation
Some of these sections are represented below by actual scripts & other tools - please add more as you find appropriate.
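As a toy illustration of the "ranking text" idea, popularity and edit-activity signals could be folded into a single score. This is only a sketch with made-up weights; it is not any of the metrics (apsp, grank, etc.) named above.

```python
import math

# Hypothetical weights -- a real metric would be tuned against reviewer ratings.
WEIGHTS = {"hits": 0.5, "edits": 0.3, "inlinks": 0.2}

def rank_score(hits, edits, inlinks):
    """Combine hit-popularity, edit-popularity and incoming-link counts
    into one score. Log-scaling keeps a handful of hugely popular pages
    from dominating the ranking."""
    signals = {"hits": hits, "edits": edits, "inlinks": inlinks}
    return sum(w * math.log1p(signals[k]) for k, w in WEIGHTS.items())
```

In practice such a score would only pre-sort candidates for human review, not replace it.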
Tools to do all the work
Tools for assessing & cleaning up articles
WP_1.0_bot is used on the English Wikipedia to collect assessment information on articles from WikiProjects. For example, a member of the Chemistry WikiProject will manually assess a chemistry article for its quality and importance and record the information in a talk page project template. This information is compiled by the bot, which generates output such as tables, a stats table and a log, which are all valuable for the WikiProject. A complete list of participating WikiProjects and task forces (around 1400 as of August 2008) is available at the Index and Index2, along with a global statistics table. The information compiled by this bot is then used as the basis for an automated selection for offline releases (see the selection section below). An improved version of the bot is being discussed (August 2008).
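The compilation step amounts to scanning talk-page wikitext for WikiProject banner templates and pulling out the class and importance parameters. A minimal sketch follows; the real WP 1.0 bot handles many more template variants and parameter spellings than this does.

```python
import re

def parse_assessment(talk_wikitext):
    """Extract quality class and importance from a WikiProject banner,
    e.g. {{WikiProject Chemistry|class=B|importance=High}}.
    Returns None if no banner is found; only the simplest case is handled."""
    banner = re.search(r"\{\{WikiProject[^{}]*\}\}", talk_wikitext, re.IGNORECASE)
    if not banner:
        return None
    def param(name):
        m = re.search(r"\|\s*%s\s*=\s*([A-Za-z]+)" % name,
                      banner.group(0), re.IGNORECASE)
        return m.group(1) if m else None
    return {"class": param("class"), "importance": param("importance")}
```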
The French Wikipedia also uses a similar bot, written independently; this includes some features not available in the English bot code.
Tools for selecting articles
SelectionBot is beginning to be used on the English Wikipedia for making a selection of articles based on quality and importance. It depends on WikiProjects providing extensive metadata, via the WP_1.0_Bot (see above), but as of August 2008 such data are available on 1.4 million articles. Preliminary test code for SelectionBot is available here, and test output here, but please note that these are only at the testing stage (as of August 2008).
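SelectionBot's actual scoring tables are not reproduced here, but the general idea of turning class/importance ratings into a selection cut-off can be sketched with hypothetical point values:

```python
# Hypothetical point values; SelectionBot's real tables may differ.
QUALITY_POINTS = {"FA": 500, "A": 425, "GA": 400, "B": 300, "Start": 150, "Stub": 50}
IMPORTANCE_POINTS = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}

def selection_score(quality, importance):
    return QUALITY_POINTS.get(quality, 0) + IMPORTANCE_POINTS.get(importance, 0)

def select_articles(assessments, threshold=600):
    """assessments: {title: (quality, importance)}. Returns the sorted list
    of titles whose combined score reaches the threshold."""
    return sorted(t for t, (q, i) in assessments.items()
                  if selection_score(q, i) >= threshold)
```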
Tools for assembling the CD
- See also Manual:Using content from Wikipedia and Extension:DumpHTML. The following is a summary of w:User:Wikiwizzy/CDTools
Provided with an article list, a category list, or both, the task is to create a static HTML dump, browsable off the CD.
The raw dumps of all language Wikipedias are available as XML at Data dumps. These dumps can be manipulated using the MWDumper Java program. MWDumper accepts a --filter switch, which can be used to pick only a defined selection of articles, outputting a similar but much smaller and more manageable XML dump, wpcd-trim.xml.
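If MWDumper is not to hand, the same title-list filtering can be approximated with Python's standard XML tools. A sketch, streaming the dump so that even full-size files fit in memory:

```python
import xml.etree.ElementTree as ET

def filter_dump(in_path, out_path, keep_titles):
    """Copy only the <page> elements whose <title> is in keep_titles from a
    MediaWiki XML dump; returns the number of pages kept."""
    keep, pages = set(keep_titles), []
    for _event, elem in ET.iterparse(in_path, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] != "page":    # ignore the export namespace
            continue
        title = next((c.text for c in elem.iter()
                      if c.tag.rsplit("}", 1)[-1] == "title"), None)
        if title in keep:
            pages.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()                                 # keep memory use flat
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("<mediawiki>\n%s</mediawiki>\n" % "".join(pages))
    return len(pages)
```

Note the output wrapper here is minimal; a dump destined for reimport would need the full export-format header.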
This is a good time to remove unwanted sections, like interwiki links and External links, if desired.
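For example, a level-2 section and interwiki links can be stripped from wikitext with two regular expressions. This sketch assumes well-formed "== Heading ==" markup and a naive language-code pattern:

```python
import re

def strip_section(wikitext, heading):
    """Remove a level-2 section (e.g. 'External links') up to the next
    level-2 heading or the end of the page."""
    pattern = re.compile(
        r"^==\s*%s\s*==\s*$.*?(?=^==[^=]|\Z)" % re.escape(heading),
        re.MULTILINE | re.DOTALL)
    return pattern.sub("", wikitext)

def strip_interwiki(wikitext):
    """Remove interwiki links such as [[fr:Exemple]]."""
    return re.sub(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]+\]\]\n?", "", wikitext)
```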
Ideally, we would like to create the HTML dump directly from this XML, but the need for category support and for a tool to convert MediaWiki markup to HTML means that, at present, creating a MediaWiki installation seems the best way to go.
An empty MediaWiki installation (including MySQL and Apache) can then be loaded with the article subset, giving a 'wikipedia' with only the required articles and trimmed sections. However, category links will not work yet, as they are stored in a different XML dump at Data dumps.
To load Category information, the wpcd-trim.xml file is read again, and all needed articles are scanned for their categories. All categories that have at least 3 articles in them are filtered out of the complete category dump, and loaded into the mediawiki installation.
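The "at least 3 articles" filter is a simple counting pass. A sketch, assuming you already have each kept article's category list:

```python
from collections import Counter

def categories_to_keep(article_categories, min_articles=3):
    """article_categories: {title: [category, ...]} for the selected articles.
    Returns the categories used by at least min_articles of them."""
    counts = Counter(cat for cats in article_categories.values()
                     for cat in set(cats))
    return {cat for cat, n in counts.items() if n >= min_articles}
```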
Now the dumpHTML.php script from the MediaWiki software can be run to create a static HTML dump.
The Wikipedia Offline Server was recently released publicly. It is still under heavy development, but it already allows you to browse the pages from any language HTML dump (wikipedia-*-html.7z files) on your localhost. It consists of a small Ruby script with an embedded webserver, and uses 7zip to selectively extract contents. (We are working on improving 7zip to make this faster.) See the initial announcement at http://houshuang.org/blog/2007/02/16/wikipedia-offline-server-02/ (the houshuang.org site is gone as of January 2008).
- Spotting vandalised sections
There is a useful tool for listing all of the "bad words" that are often a red flag for vandalism - the perl script is available here. The English Wikipedia plans also to use the Wikitrust approach to identify good versions of articles.
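The idea behind such a list is whole-word matching against each article revision. A minimal sketch follows; any real word list is far longer, and serious tools like Wikitrust weight evidence by context rather than flagging single words.

```python
import re

def find_bad_words(text, bad_words):
    """Return (word, offset) pairs for every whole-word match, case-insensitively.
    A non-empty result flags the revision for human review, not automatic rejection."""
    lowered = text.lower()
    hits = []
    for word in bad_words:
        for m in re.finditer(r"\b%s\b" % re.escape(word.lower()), lowered):
            hits.append((word, m.start()))
    return sorted(hits, key=lambda h: h[1])
```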
- wiki2cd software
This software takes a list of topics and automates all the steps needed to create a local repository ready to distribute on CD. It was used to create the Malayalam Wikipedia selected articles CD. More details are available at http://wiki.github.com/santhoshtr/wiki2cd/
Tools for reading files offline
- TntReader (as of August 2008) is a tool for reading Zeno files. These files are used in German offline releases and are being adopted elsewhere; they consist of compressed Wikipedia articles plus an index to access them. Also see tntzenoreader.
KIWIX in a Nutshell
Kiwix is an offline reader for web content, designed especially to make Wikipedia available offline. It does this by reading project content stored in the ZIM file format, a highly compressed open format with additional metadata.
- Pure ZIM reader
- Content and download manager
- Case and diacritics insensitive full text search engine
- Bookmarks & Notes
- kiwix-serve: ZIM HTTP server
- PDF/HTML export
- Multilingual (UI in more than 110 languages)
- Search suggestions
- ZIM indexing capacity
- Support for Android / MacOSX / Linux / Windows / Sugar
- DVD/USB launcher for Windows (autorun)
Why offline matters
We're featuring a quote here from the UN Broadband Commission's September 2013 report, because it is the easiest, most pragmatic and straightforward way to show the importance of disseminating knowledge and information offline, complementary to all the activities that we do online: “While more and more people are coming online, over 90% of people in the world’s 49 Least Developed Countries remain totally unconnected.”
Projects that involve KIWIX
KIWIX in Jail
Since March 2013, prisoners who request it can have access to Wikipedia offline. The idea is to stimulate or support an interest in education among prisoners, most of whom are serving long sentences. After a three-month pilot phase, the project is a success: of the 36 prisoners at the Bellevue prison in Gorgier, 18 own or rent a computer, and all of them requested that Wikipedia offline be loaded onto their PC. For security reasons, Swiss prisoners have very restricted access to the internet. Feedback has been unanimously positive: Wikipedia is seen as an improvement to the education and information activities in jail. The follow-up to the project aims to use Wikipedia in prisoner training programs: using Wikipedia in classes, organizing general-knowledge contests, and even training new Wikipedia editors. The goal is also to extend the project to other Swiss prisons and to detention centers for minors. The partnership between Wikimedia CH and the prison's management is intended to be lasting: Wikimedia CH installed the Kiwix files and trained the prison's IT team, which can now install the software for every new prisoner who requests it.
Canada, France and Belgium also have projects that involve KIWIX in prisons. More info on this will follow soon.
To get information on the Afripédia project of Wikimedia France, you can go to the Afripedia page here on Meta.
Enciclopedia de Venezuela
Wikipedia for Schools
“At SOS Children, we wanted to bring this fantastic resource to children without internet access around the globe. So we began work on an ambitious project to get the very best content from Wikipedia into a self-contained selection which could be distributed on a CD. We checked every article for child friendliness and structured the content around the national curriculum. Today, Wikipedia for Schools is in its fourth incarnation, and the new version is ready to go - this time on USB. At EduWiki 2013, we will show you how the project has benefited students and teachers here in the UK, and in countries across the developing world. With the help of others, we have distributed copies globally, and we have had an amazing response from the people who count. In the UK, Wikipedia for Schools has been a great classroom companion for students and teachers alike.”
- (English) (Français) (Español) Official Web site
- (English) RSS/Atom Planet
- (English) Follow our latest improvements...
- translatewiki:Translating:Kiwix for localisation
- Wikimedia endorsement (recent)
- KIWIX in JAIL Summary on Commons
- Presentation on KIWIX on Commons
- MWDumper program
- Annual UN Broadband Commission Report 2013: http://www.broadbandcommission.org/Documents/bb-annualreport2013.pdf
- EduWiki Conference 2013 workshop abstract by Jamie Goodland, who works with the international children’s charity SOS Children: https://wiki.wikimedia.org.uk/wiki/EduWiki_Conference_2013/Abstracts#Workshops
Tools for adding pictures to HTML dump
For the English Wikipedia, there is a corresponding picture dump that has the full-size version of all pictures referenced in articles. (Commons ??) This dump runs about 100 gigabytes. Alternatively, the pictures can be fetched from the live Wikipedia, but this should probably be done by an approved bot or it risks being blocked as a spider. User Kelson has tools to do either of these.
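Wikimedia's image servers lay files out in hashed directories: the first one and two hex characters of the MD5 of the file name (with spaces as underscores). A sketch for building a Commons URL from a file name; if you do fetch, do it politely, with a descriptive User-Agent and rate limiting, or via the approved-bot route above.

```python
import hashlib

def commons_image_url(file_name):
    """Build the upload.wikimedia.org URL for a file hosted on Commons.
    Directory names come from the MD5 of the underscored file name."""
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("https://upload.wikimedia.org/wikipedia/commons/%s/%s/%s"
            % (digest[0], digest[:2], name))
```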
Search tools
A simple search facility can be generated from three components:
- a list of all articles, with the first 15 words of the article
- a list of words, with the article index of all articles that contain the word.
- a simple form, that allows a search of the second array, and returns the summary from the first array.
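The three pieces above amount to a summary table plus an inverted index. A sketch of how they could be built and queried:

```python
def build_index(articles):
    """articles: list of (title, text). Returns (summaries, inverted) where
    summaries[i] = (title, first-15-words excerpt) and inverted maps each
    lowercased word to the set of article indices containing it."""
    summaries, inverted = [], {}
    for i, (title, text) in enumerate(articles):
        words = text.split()
        summaries.append((title, " ".join(words[:15])))
        for word in {w.lower().strip(".,;:!?") for w in words}:
            inverted.setdefault(word, set()).add(i)
    return summaries, inverted

def search(word, summaries, inverted):
    """Return the (title, excerpt) summaries of articles containing the word."""
    return [summaries[i] for i in sorted(inverted.get(word.lower(), set()))]
```

A search form on the static pages would simply call the second function against the prebuilt arrays.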
A dedicated search engine has been created for the DVD edition of the Polish Wikipedia. Its documentation is available here.
Alternative parsers
See Alternative parsers on MediaWiki.
Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools
Here are some new ideas that may interest you, and those interested in the project of creating a Wikipedia CD/DVD, given the need to better patrol the contents, work better in teams with supervisors, and enforce copyright and national legal restrictions.
These new concepts concern ALL MediaWiki projects, not just those hosted by the Foundation, and not just Wikipedia.
See the discussion just started here:
Wikipedia on CD/DVD#Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools.
Most probably a great new project to build.
verdy_p 12:10, 16 November 2008 (UTC)