Static version tools

This is a central repository for sharing software - scripts and other tools - for use in static (offline) releases of Wikimedia projects such as Wikipedia 1.0. Several different language groups are working on similar scripts, so it makes sense for us to share the best that we have. The following needs have been identified:

  • Interactivity and interfaces : Front-ends to read and interact with different snapshot formats.
  • Reducing text : summarizing, auto-excerpting
  • Ranking text : bot-assisted reviewing/vetting/rating, metric analysis (apsp, grank, hit-popularity, edit-popularity, expertise, writing style, &c)
  • Metadata : bot-assisted annotation (audience, type, categorization)
  • Spellchecker, grammar checker
  • Copyvio checker
  • Image resizing & compression
  • Metadata extraction
    • History metadata (list of users, freshness, &c)
    • Image/media metadata
  • Index generation (for browsing)
    • Category tree generation

Some of these sections are represented below by actual scripts & other tools - please add more as you find appropriate.

Tools to do all the work[edit]

Tools for assessing & cleaning up articles[edit]

WP_1.0_bot is used on the English Wikipedia to collect assessment information on articles from WikiProjects. For example, a member of the Chemistry WikiProject will manually assess a chemistry article for its quality and importance and record the information in a talk page project template. This information is compiled by the bot, which generates output such as tables, a stats table and a log, which are all valuable for the WikiProject. A complete list of participating WikiProjects and task forces (around 1400 as of August 2008) is available at the Index and Index2, along with a global statistics table. The information compiled by this bot is then used as the basis for an automated selection for offline releases (see the selection section below). An improved version of the bot is being discussed (August 2008).
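
For illustration, here is a minimal Python sketch of the kind of talk-page parsing such a bot performs, assuming a banner of the form {{WikiProject Chemistry|class=B|importance=High}} (the template and parameter names vary by project, and the actual bot is considerably more robust):

  import re

  # Hypothetical banner format; real WikiProject templates vary.
  BANNER = re.compile(r"\{\{WikiProject (?P<project>[^|}]+)(?P<params>[^}]*)\}\}")
  PARAM = re.compile(r"\|\s*(class|importance)\s*=\s*([^|}\s]+)", re.IGNORECASE)

  def extract_assessments(talk_wikitext):
      """Return one project/class/importance record per banner found."""
      records = []
      for m in BANNER.finditer(talk_wikitext):
          params = {k.lower(): v for k, v in PARAM.findall(m.group("params"))}
          records.append({
              "project": m.group("project").strip(),
              "class": params.get("class"),
              "importance": params.get("importance"),
          })
      return records

  print(extract_assessments("{{WikiProject Chemistry|class=B|importance=High}}"))
  # -> [{'project': 'Chemistry', 'class': 'B', 'importance': 'High'}]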

The French Wikipedia also uses a similar bot, written independently; this includes some features not available in the English bot code.

Tools for selecting articles[edit]

See also http://www.wikihow.com/Import-XML-Dumps-to-Your-MediaWiki-Wiki

SelectionBot is beginning to be used on the English Wikipedia for making a selection of articles based on quality and importance. It depends on WikiProjects providing extensive metadata, via the WP_1.0_Bot (see above), but as of August 2008 such data are available on 1.4 million articles. Preliminary test code for SelectionBot is available here, and test output here, but please note that these are only at the testing stage (as of August 2008).
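
To illustrate the idea of an automated selection, here is a small Python sketch that combines the two ratings into a single score; the point values and cut-off below are invented for the example, not SelectionBot's actual weights:

  # Invented point values for illustration; SelectionBot's real scheme differs.
  QUALITY = {"FA": 500, "A": 425, "GA": 400, "B": 300, "Start": 150, "Stub": 50}
  IMPORTANCE = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}

  def score(article):
      return (QUALITY.get(article["class"], 0)
              + IMPORTANCE.get(article["importance"], 0))

  def select(articles, threshold=600):
      """Keep every article whose combined score reaches the cut-off."""
      return [a for a in articles if score(a) >= threshold]

  picked = select([
      {"title": "Oxygen", "class": "FA", "importance": "Top"},
      {"title": "Minor topic", "class": "Stub", "importance": "Low"},
  ])
  print([a["title"] for a in picked])  # ['Oxygen']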

Tools for assembling the CD[edit]

See also Manual:Using content from Wikipedia and Extension:DumpHTML. The following is a summary of w:User:Wikiwizzy/CDTools

Provided with an article list, a category list, or both, the task is to create a static HTML dump, browsable off the CD.

The raw dumps of all language Wikipedias are available as XML at Data dumps. These dumps can be manipulated using the MWDumper Java program.[1] MWDumper accepts a --filter switch that can be used to pick out only a defined selection of articles, outputting a similar but much smaller and more manageable XML dump, wpcd-trim.xml.
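
As an illustration of what this filtering step does (this is not the MWDumper code, and the namespace URI varies with the dump's schema version), the same effect can be had with a streaming parse in Python:

  import xml.etree.ElementTree as ET

  # MediaWiki export namespace; the version suffix varies between dumps.
  NS = "{http://www.mediawiki.org/xml/export-0.10/}"

  def filter_dump(src, dst, wanted):
      """Copy only the <page> elements whose titles are in `wanted`."""
      with open(dst, "w", encoding="utf-8") as out:
          out.write("<mediawiki>\n")
          for _, elem in ET.iterparse(src):
              if elem.tag == NS + "page":
                  if elem.findtext(NS + "title") in wanted:
                      out.write(ET.tostring(elem, encoding="unicode"))
                  elem.clear()  # free memory as we stream
          out.write("</mediawiki>\n")

  filter_dump("enwiki-pages-articles.xml", "wpcd-trim.xml",
              wanted={"Oxygen", "Chemistry"})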

This is a good time to remove unwanted sections, like interwiki links and External links, if desired.
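
A rough Python sketch of such trimming (the patterns are illustrative only and would need hardening for production use):

  import re

  def strip_unwanted(wikitext):
      # Drop an "External links" section up to the next level-2 heading or EOF.
      wikitext = re.sub(r"==\s*External links\s*==.*?(?=\n==[^=]|\Z)",
                        "", wikitext, flags=re.S)
      # Drop interwiki links such as [[fr:Oxygène]].
      wikitext = re.sub(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]+\]\]\n?", "", wikitext)
      return wikitext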

Ideally, we would like to create the HTML dump directly from this XML, but the need for categories and for a tool to convert MediaWiki markup to HTML means that, at present, creating a MediaWiki installation seems the best way to go.

An empty MediaWiki installation (including MySQL and Apache) can then be loaded with the article subset, giving a 'wikipedia' with only the required articles and trimmed sections. However, category links will not work yet, as they are stored in a different XML dump at Data dumps.

To load category information, the wpcd-trim.xml file is read again and all selected articles are scanned for their categories. Every category containing at least three of the selected articles is then extracted from the complete category dump and loaded into the MediaWiki installation.
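
A sketch of that scan in Python, counting [[Category:...]] links across the selected articles:

  import re
  from collections import Counter

  CATEGORY = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)

  def popular_categories(article_texts, minimum=3):
      """Return the categories used by at least `minimum` selected articles."""
      counts = Counter()
      for text in article_texts:
          counts.update({c.strip() for c in CATEGORY.findall(text)})
      return {cat for cat, n in counts.items() if n >= minimum}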

Now the dumpHTML.php script from the MediaWiki software can be run to create a static HTML dump.

The Wikipedia Offline Server has just been released publicly. It is still under heavy development, but it already allows you to browse the pages from any language HTML dump (wikipedia-*-html.7z files) on your localhost. It consists of a small Ruby script with an embedded web server, and uses 7zip to selectively extract contents. (We are working on improving 7zip to make this faster.) See the initial announcement at http://houshuang.org/blog/2007/02/16/wikipedia-offline-server-02/ (the houshuang.org site is gone as of January 2008).
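
The original is a Ruby script, but the mechanism is easy to sketch in Python: an HTTP handler that extracts one file from the compressed dump per request. The archive name and internal layout below are placeholders; "7z e ... -so" writes a single archive member to stdout.

  import subprocess
  from http.server import BaseHTTPRequestHandler, HTTPServer

  ARCHIVE = "wikipedia-en-html.7z"  # placeholder file name

  class DumpHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          # Extract just the requested file from the 7z archive to stdout.
          page = subprocess.run(["7z", "e", ARCHIVE, "-so", self.path.lstrip("/")],
                                capture_output=True)
          if page.returncode == 0 and page.stdout:
              self.send_response(200)
              self.send_header("Content-Type", "text/html; charset=utf-8")
              self.end_headers()
              self.wfile.write(page.stdout)
          else:
              self.send_error(404)

  HTTPServer(("localhost", 8000), DumpHandler).serve_forever()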

Spotting vandalised sections[edit]

There is a useful tool for listing all of the "bad words" that are often a red flag for vandalism; the Perl script is available here. The English Wikipedia also plans to use the WikiTrust approach to identify good versions of articles.
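
The script itself is in Perl; the core idea fits in a few lines of Python (the word list here is a placeholder):

  import re

  BAD_WORDS = {"badword1", "badword2"}  # placeholder list

  def flag_vandalism(text):
      """Return the bad words found in an article revision, if any."""
      words = set(re.findall(r"[a-z']+", text.lower()))
      return words & BAD_WORDS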

wiki2cd software[edit]

This software takes a list of topics and automates everything needed to create a local repository ready to distribute on CD. It was used to create the Malayalam Wikipedia selected articles CD. More details are available at http://wiki.github.com/santhoshtr/wiki2cd/

Tools for reading files offline[edit]

  • TntReader (as of August 2008) is a tool for reading zeno files. These files are used for the German offline releases and are being adopted elsewhere; they consist of compressed Wikipedia articles plus an index to access them. Also see tntzenoreader.

Kiwix[edit]

(Screenshot of Kiwix version 0.9, with screencast; Kiwix flyer and brochure: "Your Wikipedia Offline")

KIWIX - Wikipedia Offline in a Nutshell[edit]

Kiwix is an offline reader for web content, designed especially to make Wikipedia available offline. It does this by reading project content stored in the ZIM file format, a highly compressed open format with additional metadata. Kiwix also gives you the freedom to copy, modify and distribute the data.
To sum up: Kiwix allows you to store the whole of Wikipedia offline on your device, USB flash drive or DVD, and to access the content incredibly fast.
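
For a programmatic view of what such a reader does, here is a sketch using the openZIM project's python-libzim bindings (a newer tool than most of this page describes; the file name and entry path are placeholders, and the path layout depends on the ZIM file):

  from libzim.reader import Archive

  zim = Archive("wikipedia_en_all.zim")      # placeholder file name
  entry = zim.get_entry_by_path("A/Oxygen")  # path layout varies by ZIM file
  html = bytes(entry.get_item().content).decode("utf-8")
  print(entry.title, len(html))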

Why offline matters[edit]

We're featuring a quote from the UN Broadband Commission's September 2013 report, because it is the easiest, most pragmatic and straightforward way to show the importance of disseminating knowledge and information offline, complementary to all the activities that we do online: "While more and more people are coming online, over 90% of people in the world's 49 Least Developed Countries remain totally unconnected."[2]

Projects that involve Wikipedia Offline[edit]

Kiwix is mostly installed in schools that cannot afford broadband Internet access. In these cases, it is much faster to use Wikipedia offline.

Wikipedia offline in jails[edit]

Since March 2013, prisoners at Bellevue prison in Gorgier (western Switzerland) who request it can have access to Wikipedia offline, because Swiss prisoners have very restricted access to the Internet. The idea is to stimulate or support an interest in education among prisoners, the large majority of whom are serving long sentences. After a three-month pilot phase, the project proved very successful. Of the 36 prisoners at Bellevue, 18 own or rent a computer, and all of them requested that Wikipedia offline be installed on their PC.

The feedback is unanimously positive: it shows that access to Wikipedia is seen as an improvement to education and information activities in jail.

The follow-up to the project aims to use Wikipedia in the prisoners' training program: in classes, in general-knowledge contests, and even in training new Wikipedia editors. The partnership between Wikimedia CH and the prison's management aims to be durable. Wikimedia CH installed the Kiwix files and trained the prison's IT team, who can now install the software for every new prisoner who requests it. Detention centres for minors are excluded from this program in Switzerland, as they have access to the Internet and do not need Wikipedia offline.

In 2014, WMCH started to collaborate with the Swiss Institute for Education in Detention Centers to expand the coverage of Wikipedia offline in prisons all over Switzerland. As of May 2014, all prisons in the German-speaking part of Switzerland have access to Wikipedia offline, thanks to the Swiss Institute for Education in Detention Centers.

Canada, France, Belgium and Italy (a jail in Pavia, where a Kiwix server runs in a dedicated computer room, led by http://www.informaticisenzafrontiere.org) also have similar prison projects involving Wikipedia offline.

Afripédia[edit]

For information on Wikimedia France's Afripédia project, see the Afripedia page here on Meta.

Enciclopedia de Venezuela[edit]

A selection of articles about Venezuela is made accessible to pupils and students, among others on OLP devices.

Wikipedia for Schools[edit]

"At SOS Children, we wanted to bring this fantastic resource to children without internet access around the globe. So we began work on an ambitious project to get the very best content from Wikipedia into a self-contained selection which could be distributed on a CD. We checked every article for child friendliness and structured the content around the national curriculum. Today, Wikipedia for Schools is in its fourth incarnation, and the new version is ready to go - this time on USB. At EduWiki 2013, we will show you how the project has benefited students and teachers here in the UK, and in countries across the developing world. With the help of others, we have distributed copies globally, and we have had an amazing response from the people who count. In the UK, Wikipedia for Schools has been a great classroom companion for students and teachers alike.” [3]

Mesh Sayada[edit]

Mesh Sayada[4] is a collaboratively designed and built wireless network in the town of Sayada, Tunisia. The network serves as a platform for locally hosted content, such as Wikipedia offline in Arabic and French (thanks to the Kiwix software), free ebooks, and OpenStreetMap. The mesh is serviced and maintained by a local NGO, CLibre,[5] with the help of local volunteers.

User Feedback[edit]

  • "Very important and helpful source of information" (User from Bahrain)
  • "Thank you for your help! Now my school can use Wikipedia offline."' (User from Mexico)
  • "I like to browse my favourite encyclopedia even when there is no network" (User from Yemen)
  • "I have no internet in my house. KIWIX is such a help, because I need Wikipedia for my study."' (User from Cuba)

Features[edit]

Kiwix provides a range of features; here is a shortlist of the most important ones:

  • Portable: Kiwix is a portable application you don't need to install. Kiwix supports a wide range of systems and architectures.
  • User-friendly: Kiwix works like your web browser and is translated into your native language.
  • Library: Kiwix's own library allows you to see all your content at a glance.
  • Search engine: Kiwix has a title-suggestion system. This helps you to quickly get the information you need.
  • Web server: Kiwix allows you to share content on your LAN with kiwix-serve, the Kiwix HTTP server.
  • Open: Kiwix uses open formats and protocols, and produces open-source software.

Technical Specifications[edit]

  • Pure ZIM reader
  • Content and download manager
  • Case- and diacritics-insensitive full-text search engine
  • Bookmarks & Notes
  • kiwix-serve: ZIM HTTP server
  • PDF/HTML export
  • Multilingual (UI in more than 110 languages)
  • Search suggestions
  • ZIM indexing capacity
  • Support for Android / Mac OS X / Linux / Windows / Sugar
  • DVD/USB launcher for Windows (autorun)
  • Tabs

Do you want to get involved?[edit]

There are many ways to participate and to work with us to develop the Kiwix Wikipedia-offline project. The following list features topics where help would be really appreciated:

  • Translations: The Kiwix user interface is translated into more than 100 languages. We still have some more work to do here.
  • Support: Kiwix has a broad community, and we need to care for it! It's essential to maintain good communication internally and with our users; both should be able to quickly get the information and the help they need.
  • Projects: We have a lot of ideas and we try to implement the best ones. Supported by the Wikimedia Foundation, Wikimedia national chapters and a few other organizations, Kiwix is able to set up ambitious projects.
  • Development: Kiwix software development is carried out by a very small team. To continue developing Kiwix, new talented developers are welcome; mentored by an experienced team, they may work on new features or help to maintain the existing code.

Get in touch[edit]

  • kiwix.org
  • twitter.com/kiwixoffline
  • facebook.com/kiwixoffline
  • contact@kiwix.org

See also[edit]

References[edit]

  1. MWDumper program
  2. Annual UN Broadband Commission Report 2013: http://www.broadbandcommission.org/Documents/bb-annualreport2013.pdf
  3. EduWiki Conference 2013 abstracts: https://wiki.wikimedia.org.uk/wiki/EduWiki_Conference_2013/Abstracts#Workshops, by Jamie Goodland, who works with the international children's charity SOS Children
  4. Case Study: Mesh Sayada, by Ryan Gerety, Andy Gunn and Will Hawkins, Open Technology Institute
  5. Association pour la culture numérique Libre

Tools for adding pictures to HTML dump[edit]

For the English Wikipedia, there is a corresponding picture dump that contains the full-size version of every picture referenced in articles. (Commons ??) This dump runs to about 100 gigabytes. Alternatively, the pictures can be fetched from the live Wikipedia, but this should probably be done by an approved bot, or it risks being blocked as a spider (see the sketch below). User Kelson has tools to do either of these.
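
A sketch of such a polite fetcher in Python: it identifies itself and throttles requests so the crawl is not mistaken for an abusive spider (the User-Agent string and contact address are placeholders; real file URLs would come from the image lists):

  import time
  import urllib.request

  HEADERS = {"User-Agent": "WPCD-image-fetcher/0.1 (contact: example@example.org)"}

  def fetch_images(urls, delay=1.0):
      for url in urls:
          req = urllib.request.Request(url, headers=HEADERS)
          with urllib.request.urlopen(req) as resp:
              data = resp.read()
          with open(url.rsplit("/", 1)[-1], "wb") as f:  # save under file name
              f.write(data)
          time.sleep(delay)  # throttle between requests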

Search tools[edit]

The amount of information on the CD or DVD calls for some method of searching. Search is normally done via a server, usually a database search. On the CD we have neither, and the only computing environment in sight is the browser. The only generally portable browser language is JavaScript, and there is a solution called ksearch-client-side[1] that allows a JavaScript program to be built to act as the index (a sketch of the build step follows the list). It contains two arrays and a program:

  1. a list of all articles, with the first 15 words of each article
  2. a list of words, with the indices of all articles that contain each word
  3. a simple form that allows a search of the second array and returns the summary from the first array
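
The index itself can be generated offline in any language. Here is a Python sketch that emits the two arrays as a JavaScript file; the names and the tokenization are illustrative, not ksearch-client-side's actual format:

  import json

  def build_index(articles, out_path="index.js"):
      """articles maps title -> plain text; writes summary + posting arrays."""
      summaries = []   # [title, first 15 words] per article
      postings = {}    # word -> indices of the articles containing it
      for i, (title, text) in enumerate(articles.items()):
          words = text.split()
          summaries.append([title, " ".join(words[:15])])
          for word in {w.lower().strip(".,;:()") for w in words}:
              postings.setdefault(word, []).append(i)
      with open(out_path, "w", encoding="utf-8") as f:
          f.write("var summaries = %s;\n" % json.dumps(summaries))
          f.write("var postings = %s;\n" % json.dumps(postings))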

WikiMiner[edit]

A dedicated search engine was created for the DVD edition of the Polish Wikipedia. Its documentation is available here.

Alternative parsers[edit]

See Alternative parsers on MediaWiki.

Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools[edit]

Here are a number of new ideas of mine that may interest you, and anyone interested in the project of creating a Wikipedia CD/DVD, given the need to better patrol the contents, to work better in teams with supervisors, and to enforce copyright and national legal restrictions.

These new concepts concern ALL MediaWiki projects, not just those hosted by the Foundation, and not just Wikipedia.

See the discussion just started here:
Wikipedia on CD/DVD#Concepts for a dedicated "MediaWiki" client application with live update and patrolling tools.

Most probably a great new project to build.

verdy_p 12:10, 16 November 2008 (UTC)

References[edit]

  1. ksearch-client-side FAQ