WikiMiner

From Meta, a Wikimedia project coordination wiki

WikiMiner is a search engine dedicated to the DVD edition of Wikipedia. It was created in 2006 for the DVD edition of the Polish Wikipedia (235,000 articles), but can easily be localized to any other language version. So far two language files have been created: Polish and English. The program is now being tested under various operating systems, and some minor changes are being implemented. The program's original name (WikiBrowser) was changed because of a naming conflict with another Wikimedia project, WikiBrowse.

Main features of the application:

  • The program is a standalone Java application. It requires Java Runtime Environment (JRE) version 1.5 or higher. The index can be placed on the hard drive (fast search) or on the DVD.
  • It supports case-insensitive searching.
  • Boolean expressions are supported.
  • Search result entries include the page title, the number of occurrences of the searched keywords, Wikipedia categories and excerpts from the article content.
  • The index is meant to be installed on the user's hard drive, so the DVD is not accessed until the user clicks on a link to an article.
  • The resulting index for the whole Polish Wikipedia, that is 2 GB of text, takes 120 MB.
  • In minimum installation mode, only the Java Runtime Environment has to be installed on the hard drive.
  • Only part of the index is loaded into memory.
  • Searching is fast once the index has been loaded at program startup.
  • You can search in Japanese as well as in French or Polish: full Unicode is supported.
  • Grammatical suffixes can be specified; they are then stripped during indexing and searching.
  • Command-line mode, stopwords and redirects are also supported.
  • Alphabetical sorting uses a simplified UCA (Unicode Collation Algorithm), which respects the order of non-ASCII characters.
  • The index is built from HTML pages, which have to be formatted in a specific way (UTF-8 encoding, article text starting at <p> and ending at <div id='footer'>, etc.).
  • The program is released under the GNU GPL license.
  • It is independent of the operating system (tested and working under at least Windows and Linux).
  • Unlike some GNU tools such as Regain, it installs no web server on the client's hard drive, does not raise a security alert on Windows XP, and demands no special rights for any applet or application. Standard security settings are sufficient.
  • Search results are written to a temporary HTML file, and then the default HTML browser is launched (or another one, depending on the program configuration). When opening the result page, the program checks whether the Wikipedia DVD with the article base is present and, if not, shows an appropriate warning. Temporary files are removed when the program exits.
  • The program doesn't use JavaScript or Java applets (just Java).
  • The DVD is not required to perform searches.
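
The alphabetical sorting mentioned in the list above can be illustrated with the JDK's own collation support. This is only a sketch: it does not reproduce WikiMiner's simplified UCA implementation, but swaps in java.text.Collator to show what locale-aware ordering of non-ASCII titles looks like. The names TitleSort and sortTitles are made up for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class TitleSort {
    /** Returns a copy of the titles sorted with the collation rules of the given locale. */
    static List<String> sortTitles(List<String> titles, Locale locale) {
        // Stand-in for WikiMiner's own simplified UCA: the JDK collator.
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.TERTIARY);
        // Decompose accented letters so that they sort next to their base letter.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<>(titles);
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>();
        titles.add("b");
        titles.add("\u0105"); // U+0105, a with ogonek
        titles.add("a");
        System.out.println(sortTitles(titles, Locale.forLanguageTag("pl")));
    }
}
```

A plain String.compareTo would place the letter U+0105 after every ASCII letter, because it compares code units; a collator keeps it next to a.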

Building index

Section under construction

You have to prepare a set of HTML pages to be indexed. Each HTML file must be encoded in UTF-8.
In the WIKI.INI file the following options should be set up: TODO
Then execute the command:

java -jar WIKIMINER.JAR -make

The names of all files used by the program are written in upper case, to be consistent with the ISO 9660 Level 2 standard. This gives the DVD installation maximum compatibility.
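
As an illustration of the page format the indexer expects (article text starting at <p> and ending at <div id='footer'>), the following sketch isolates that part of a page. It is an assumption about how such extraction could be done, not WikiMiner's actual indexing code; ArticleBody, extract and stripTags are hypothetical names, and the tag stripper is deliberately crude.

```java
import java.util.Optional;

final class ArticleBody {
    // Markers taken from the format description; the real indexer may match them differently.
    static final String START = "<p>";
    static final String END = "<div id='footer'>";

    /** Returns the HTML from the first START up to the following END, if both are present. */
    static Optional<String> extract(String html) {
        int from = html.indexOf(START);
        if (from < 0) {
            return Optional.empty();
        }
        int to = html.indexOf(END, from);
        if (to < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(from, to));
    }

    /** Replaces every tag with a space and collapses whitespace. Not a real HTML parser. */
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Hello <b>world</b>.</p>"
                + "<div id='footer'>x</div></body></html>";
        extract(page).map(ArticleBody::stripTags).ifPresent(System.out::println);
    }
}
```

Pages that lack either marker yield an empty result here; how WikiMiner itself treats such pages is not documented in this section.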

Searching

Snapshot of the search panel in the English version of the interface
A – expression to be found
B – starts searching
C – link to DVD article #1 (for example to the main page of the project)
D – link to DVD article #2 (for example to the help page)
E – database size
F – maximum number of results on a single HTML page
G – number of the first result on the output page
H – previous page (subtracts F from G) and starts searching
J – next page (adds F to G) and starts searching

The program searches for any words in all articles in the main Wikipedia namespace (ns=0). Grammatical suffixes (like -s in English, or about 30 suffixes in Polish) can be stripped. The list of suffixes is configurable.
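
A minimal sketch of suffix stripping with a configurable suffix list follows. It assumes the longest matching suffix wins and that a minimum stem length protects very short words; neither rule is stated in this document, and SuffixStripper and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SuffixStripper {
    private final List<String> suffixes;
    private final int minStemLength; // assumption: stems shorter than this are left alone

    SuffixStripper(List<String> suffixes, int minStemLength) {
        this.suffixes = new ArrayList<>(suffixes);
        this.minStemLength = minStemLength;
    }

    /** Removes the longest configured suffix that still leaves a long enough stem. */
    String strip(String word) {
        String best = word;
        for (String suffix : suffixes) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= minStemLength && stemLength < best.length()) {
                best = word.substring(0, stemLength);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SuffixStripper english = new SuffixStripper(Arrays.asList("s", "es"), 3);
        System.out.println(english.strip("cats"));
        System.out.println(english.strip("boxes"));
    }
}
```

Applying the same function to indexed words and to query words is what makes a query for one grammatical form find the others.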

Search is case-insensitive. All non-ASCII Unicode characters that resemble Latin letters, like ą, Ü and about 500 other letters, can also be typed as their nearest ASCII equivalent. Standard transliteration of German and Dutch letters (Ü=ue, etc.) is also supported.
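
One common way to get this kind of folding in Java is Unicode decomposition followed by removal of combining marks. The sketch below uses that technique as a stand-in; WikiMiner's own table of about 500 letters is not shown in this document, and KeyFolder, fold and foldGerman are hypothetical names. Letters without a canonical decomposition (such as U+0142, l with stroke) need explicit entries, and only a few are listed here.

```java
import java.text.Normalizer;
import java.util.Locale;

final class KeyFolder {
    /** Lower-cases a word and reduces accented Latin letters to their base ASCII letter. */
    static String fold(String word) {
        String lower = word.toLowerCase(Locale.ROOT)
                .replace('\u0142', 'l')    // U+0142, l with stroke: no canonical decomposition
                .replace('\u00f8', 'o')    // U+00F8, o with stroke: no canonical decomposition
                .replace("\u00df", "ss");  // U+00DF, sharp s
        // NFD splits e.g. U+0105 into 'a' plus a combining ogonek; the marks are then dropped.
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Variant using the German-style transliteration (u with diaeresis becomes "ue", etc.). */
    static String foldGerman(String word) {
        String lower = word.toLowerCase(Locale.ROOT)
                .replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue");
        return fold(lower);
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0141\u00f3d\u017a"));   // a Polish city name
        System.out.println(foldGerman("\u00dcber"));
    }
}
```

Indexing both folded forms of a word would let a query match whichever spelling the user types; whether WikiMiner does it this way is an assumption.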

When searching for a sequence of words, the and operator is assumed by default, and the program finds all articles that contain all the given words (in any order).

The result list can be navigated using hotkeys, which is especially important for blind users. On Windows, Alt+1 jumps to the first result, Alt+2 to the second one, etc.

The operators and, or, not and parentheses can be used in a search query.

Examples:

  • George and not Bush - looks for all pages containing the word George and not containing the word Bush.
  • Betty has (cat or hamster) - looks for all pages containing the words Betty and has, and at least one of the words cat or hamster.

Alternative notations:

  • Instead of and you can type & or &&. You can also omit it.
  • Instead of or you can type | or ||.
  • Instead of not you can type ! or ~.
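
The alternative notations can be reduced to the word operators before a query is parsed. The following tokenizer is a sketch of that idea, not WikiMiner's actual parser; QueryTokens, tokenize and canonical are hypothetical names, and an omitted and is left for a later parsing step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryTokens {
    // Two-character symbols are listed first so that "&&" is not read as two "&" tokens.
    private static final Pattern TOKEN =
            Pattern.compile("&&|\\|\\||[&|!~()]|[^\\s&|!~()]+");

    /** Splits a query into words, parentheses and operators in canonical word form. */
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            tokens.add(canonical(m.group()));
        }
        return tokens;
    }

    /** Maps the symbolic notations to and, or, not; everything else is returned unchanged. */
    static String canonical(String token) {
        switch (token) {
            case "&":
            case "&&":
                return "and";
            case "|":
            case "||":
                return "or";
            case "!":
            case "~":
                return "not";
            default:
                return token;
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("George & !Bush"));
        System.out.println(tokenize("Betty has (cat || hamster)"));
    }
}
```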

To search in article titles only, use the title: keyword.
Examples:

  • title:Adam Mickiewicz looks for all pages with the word Adam in the title and the word Mickiewicz in the title or in the article body.
  • title:(Adam Mickiewicz) looks for all pages with both words, Adam and Mickiewicz, in the title (in any order).
  • title:Adam or title:Eve looks for all pages with the word Adam or the word Eve in the title.

To obtain all articles in:

  • categories with a given word in a category title and
  • their subcategories,

use the categ: keyword.
Examples:

  • categ:History returns all pages from the categories History, History of Poland, etc., and their subcategories.
  • key categ:Databases returns all pages containing the word key in a database context.
  • categ:Islam categ:Christianity returns all pages connected with both Islam and Christianity.
  • sun and not categ:astronomy returns all pages containing the word sun that are not connected to astronomy.
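
The category expansion described above (categories whose title contains the word, plus all their subcategories) amounts to a traversal of the category tree. The sketch below assumes the tree is available as a map from each category to its direct subcategories, with every category present as a key; this data layout and the names CategorySearch and expand are illustrative assumptions, not WikiMiner's index format.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class CategorySearch {
    /** Returns the categories whose title contains the word, plus all their descendants. */
    static Set<String> expand(String word, Map<String, List<String>> subcategories) {
        String needle = word.toLowerCase(Locale.ROOT);
        Deque<String> queue = new ArrayDeque<>();
        for (String category : subcategories.keySet()) {
            if (containsWord(category, needle)) {
                queue.add(category);
            }
        }
        Set<String> seen = new LinkedHashSet<>();
        while (!queue.isEmpty()) {
            String category = queue.poll();
            // The seen set also guards against cycles in the category graph.
            if (seen.add(category)) {
                queue.addAll(subcategories.getOrDefault(category, Collections.<String>emptyList()));
            }
        }
        return seen;
    }

    /** Case-insensitive whole-word match inside a category title. */
    private static boolean containsWord(String categoryTitle, String needle) {
        for (String w : categoryTitle.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (w.equals(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

The pages returned by categ: would then be those that belong to at least one category in the expanded set.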

Operator precedence

Unless overridden by parentheses, operators are evaluated in the following order (from first to last):

  1. title:, categ:
  2. not
  3. and
  4. or
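
The precedence order above maps naturally onto a recursive-descent evaluator, with one method per precedence level. The following is a sketch under stated assumptions, not WikiMiner's implementation: it checks a query against a single page given as title text and body text, expects word operators separated by whitespace, handles title: but leaves out categ: for brevity, and uses the hypothetical name QueryEval.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class QueryEval {
    private final List<String> tokens;
    private final Set<String> title;
    private final Set<String> body;
    private int pos = 0;
    private boolean titleOnly = false;

    private QueryEval(List<String> tokens, Set<String> title, Set<String> body) {
        this.tokens = tokens;
        this.title = title;
        this.body = body;
    }

    /** Returns true if a page with the given title and body satisfies the query. */
    static boolean matches(String query, String titleText, String bodyText) {
        String spaced = query.toLowerCase(Locale.ROOT)
                .replace("(", " ( ").replace(")", " ) ").replace("title:", " title: ");
        List<String> tokens = Arrays.asList(spaced.trim().split("\\s+"));
        QueryEval eval = new QueryEval(tokens, words(titleText), words(bodyText));
        boolean result = eval.parseOr();
        if (eval.pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + eval.peek());
        }
        return result;
    }

    private static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(
                text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")));
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    private String next() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unexpected end of query");
        }
        return tokens.get(pos++);
    }

    // Level 4 (evaluated last): or
    private boolean parseOr() {
        boolean value = parseAnd();
        while ("or".equals(peek())) {
            next();
            boolean right = parseAnd(); // always parse the operand, even if value is already true
            value = value || right;
        }
        return value;
    }

    // Level 3: and, written explicitly or implied by two adjacent terms
    private boolean parseAnd() {
        boolean value = parseNot();
        while (peek() != null && !peek().equals("or") && !peek().equals(")")) {
            if (peek().equals("and")) {
                next();
            }
            boolean right = parseNot();
            value = value && right;
        }
        return value;
    }

    // Level 2: not
    private boolean parseNot() {
        if ("not".equals(peek())) {
            next();
            return !parseNot();
        }
        return parsePrimary();
    }

    // Level 1 (evaluated first): title: applies to the next word or parenthesized group
    private boolean parsePrimary() {
        String token = next();
        if (token.equals("title:")) {
            boolean saved = titleOnly;
            titleOnly = true;
            boolean value = parsePrimary();
            titleOnly = saved;
            return value;
        }
        if (token.equals("(")) {
            boolean value = parseOr();
            if (!")".equals(next())) {
                throw new IllegalArgumentException("missing closing parenthesis");
            }
            return value;
        }
        return title.contains(token) || (!titleOnly && body.contains(token));
    }
}
```

With this grammar, a or b and not c is read as a or (b and (not c)), and title:Adam Mickiewicz restricts only Adam to the title, matching the examples in this section.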

These keywords can also be translated into other languages with no programming required. For example, in the Polish Wikipedia we were able to use kateg, tytuł, i, lub and nie alongside categ, title, and, or and not.

Stopwords

The program removes from the query:

  • words that are too short (the minimum keyword length can be set when the index is created),
  • words that are too frequent (for example the).

The user can view the list of stopwords using the -stopwords command-line option.
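
The two stopword rules can be sketched as a simple filter over the words of a query. This is an illustration only: StopwordFilter is a hypothetical name, and the minimum length and the set of frequent words are passed in by the caller, whereas WikiMiner fixes them when the index is created.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordFilter {
    private final int minLength;
    private final Set<String> frequentWords = new HashSet<>();

    StopwordFilter(int minLength, Set<String> frequentWords) {
        this.minLength = minLength;
        for (String word : frequentWords) {
            this.frequentWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the query words that are neither too short nor too frequent. */
    List<String> filter(List<String> queryWords) {
        List<String> kept = new ArrayList<>();
        for (String word : queryWords) {
            String key = word.toLowerCase(Locale.ROOT);
            boolean tooShort = key.length() < minLength;
            boolean tooFrequent = frequentWords.contains(key);
            if (!tooShort && !tooFrequent) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Operators such as and, or and not would have to be recognized before this step so that they are not discarded as ordinary words.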

Command line options

The program can also be run from the command line, which allows the Wikipedia search to be used in scripts.

Command line options:

Option            Description
-ini file.ini     use a different configuration file instead of WIKI.INI
-lang file[.lng]  use a different language file. So far EN and PL are supported, but you can easily create other language files.
query             search for the given query
-out file         write the result to the given file (applies to command-line search only)
-format txt       create text output instead of an HTML file
-format wiki      create output in Wikipedia format (links in the form [[title]])
-start n          start the list from result #n
-max n            limit the number of results on the list to n
-stopwords        show the list of stopwords
-asort            sort results in alphabetical order of titles
-verbose          show debug information

Example:
Create a text list for the query John III Sobieski, sorted by title:

java -jar WIKIMINER.JAR John III Sobieski -format txt -out john.txt -asort
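
Since the command-line mode is meant for scripting, the same call can be launched from another Java program. The sketch below builds the command shown above and runs it with ProcessBuilder. It assumes that java is on the PATH and that WIKIMINER.JAR sits in the given working directory; the class name WikiMinerCli is made up, and the meaning of the exit code is not documented here.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WikiMinerCli {
    /** Builds the command line for a title-sorted text search written to outFile. */
    static List<String> command(String query, String outFile) {
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-jar", "WIKIMINER.JAR"));
        // Each query word is passed as its own argument, as in the example above.
        cmd.addAll(Arrays.asList(query.trim().split("\\s+")));
        cmd.addAll(Arrays.asList("-format", "txt", "-out", outFile, "-asort"));
        return cmd;
    }

    /** Runs the search in workingDir and returns the exit code of the process. */
    static int run(File workingDir, String query, String outFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(query, outFile))
                .directory(workingDir)
                .inheritIO() // show the program's console output, if any
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", command("John III Sobieski", "john.txt")));
    }
}
```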

Configuration file: WIKI.INI

The program uses the configuration file WIKI.INI. It should be encoded in UTF-8.

Option                   Description
root=path                path to the main directory with HTML pages. If more than one path is specified (separated by semicolons), the program will search all of them.
browser=path             HTML browser used to view result pages
lang=file[.lng]          path to the language file. So far EN and PL are supported.
deleteTemp=yes/no        delete temporary HTML files in GUI mode. In command-line mode the created files are not removed.
index=path               path to the WIKIINDEX.DAT file
map=path                 path to the MAP.DAT file
strings=path             path to the STRINGS.DAT file
charorder=path           path to the CHAR_ORDER.DAT file
css=path                 path to the SEARCH.CSS file
logo=path                path to the file with the Wikipedia logo
icon=path                path to the PNG file with the Wikipedia icon
front=path               path to the front HTML page (link C on the snapshot above)
help=path                path to the help HTML page (link D on the snapshot above)
checkDVD=yes/no          whether the program should check if the DVD is in the drive
alwaysOnTop=yes/no       if set to yes, the search window will always be visible
WWWLinks=yes/no          if set to yes, additional links to online Wikipedia articles will appear in search results
maxLinksOnPage=number    maximum number of links on one result page
maxCategInResult=number  maximum number of categories presented in a search result item

Paths to files in WIKI.INI can take one of the following forms:

  • an absolute path, for example c:\wikidvd\map.dat on Windows
  • a path relative to the directory containing the program's .JAR file, starting with {{JARPATH}}, for example
    map={{JARPATH}}MAP.DAT
  • a path relative to the database root directory on the DVD, as specified in the root option. It should start with {{ROOT}}, for example
    index={{ROOT}}WIKIINDEX.DAT
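
The path forms above can be sketched as a small parser for key=value lines plus a placeholder substitution. This is an illustration, not WikiMiner's configuration code; IniPaths and its methods are hypothetical names. Because the examples place the file name directly after the placeholder, the sketch assumes both directories are supplied with a trailing separator, and it leaves the choice of root (when several are configured) to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IniPaths {
    /** Parses key=value lines; lines without '=' are ignored. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                options.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return options;
    }

    /** Splits the root option into its semicolon-separated paths. */
    static List<String> roots(String rootOption) {
        List<String> roots = new ArrayList<>();
        for (String part : rootOption.split(";")) {
            if (!part.trim().isEmpty()) {
                roots.add(part.trim());
            }
        }
        return roots;
    }

    /** Substitutes the two placeholders; absolute paths pass through unchanged. */
    static String resolve(String value, String jarDir, String root) {
        return value.replace("{{JARPATH}}", jarDir).replace("{{ROOT}}", root);
    }
}
```

A hand-written parser is used instead of java.util.Properties because Properties treats the backslash as an escape character, which would mangle Windows paths such as c:\wikidvd\map.dat.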

Copyleft and author

To launch the HTML browser and show the search results, the program uses a modified BrowserLauncher class. Its license notice:

This code is Copyright 1999-2001 by Eric Albert (ejalbert at cs.stanford.edu) and may be
redistributed or modified in any form without restrictions as long as the portion of this
comment from this paragraph through the end of the comment is not removed.  The author
requests that he be notified of any application, applet, or other binary that makes use of
this code, but that's more out of curiosity than anything and is not required.  This software
includes no warranty.  The author is not repsonsible for any loss of data or functionality
or any adverse or unexpected effects of using this software.

Credits:
Steven Spencer, JavaWorld magazine (Java Tip 66)
Thanks also to Ron B. Yeh, Eric Shapiro, Ben Engber, Paul Teitlebaum, Andrea Cantatore,
Larry Barowski, Trevor Bedzek, Frank Miedrich, and Ron Rabakukk

@author Eric Albert (ejalbert at cs.stanford.edu)
@version 1.4b1 (Released June 20, 2001)


The Java sources of the program are included in its jar file. You can open it with, for example, WinRAR.

Contact the author: pl:User:Olaf (Olaf Matyja).

Program download: [1] (This is part of the DVD edition of the Polish Wikipedia. Only WikiMiner and its index are included. You can search, but links point to the locations where the files are expected on the DVD.)