Installing Lucene search
This tutorial covers an obsolete version of the Lucene-based search engine for MediaWiki. Please refer to mw:Extension:lucene-search for up-to-date information.
Lucene is a library that can be used to construct full-text search engines. One such search engine was written by Kate Turner in about 2005 to improve the search functionality of MediaWiki-based wikis. This original search engine was written in Java. Later, it was ported to C# by Brion Vibber, who conducted a number of experiments to compare the performance of different Java/C# implementations. It appears that the C# engine is the one used to search the English Wikipedia. This page explains how to obtain and install this engine on another MediaWiki-based wiki.
The overall operation of the system is as follows:
- a user submits a search query to the wiki
- the wiki formats and forwards the search request to the search engine
- the search engine consults a local index to find relevant pages
- the search engine sends the results to the wiki
- the wiki presents the results to the user
To install the system we need to:
- obtain and install the search engine on a server
- create a local index based on the data stored in a wiki
- install an extension to the wiki that knows how to send/receive data to the search engine
The tools that are needed for the installation are:
- The Subversion version control system
- A way to execute C# programs (I used mono)
- A way to compile C# programs (I used mcs)
Debian and its derivatives ship prebuilt packages for these tools (for example, mono-mcs for the C# compiler). Other distributions likely have equivalent packages.
NOTE: Some old versions of mono-mcs, such as mono-mcs 1.0.1 from Ubuntu 4.10, fail to compile the search engine. However, mono-mcs 1.1.3 from Ubuntu 6.06 is able to compile it.
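Before going further, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of the original procedure; the function name check_tools is hypothetical, and the command names svn, mono and mcs are the ones that the Debian packages usually install.

```shell
# Report which of the given commands are missing from PATH.
# Prints one line per missing tool and returns 0 only if all are present.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running check_tools svn mono mcs prints nothing when all three tools are installed.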
Installing the Search Engine
The first step to installing the search engine is to obtain a copy of the source code. The source code resides in the MediaWiki SVN repository and can be obtained like this:
- This link is broken.... Where can I find the correct link?
This will create a directory called mwsearch in the current directory.
Obtaining Extra Libraries
The search engine relies on a number of external libraries. Fortunately, most of those are already in the Subversion repository (take a look at mwsearch/libs/README.txt).
There are two libraries that are missing. The first is mwdumper.jar:
wget http://download.wikimedia.org/tools/mwdumper.jar
cp mwdumper.jar mwsearch/libs/
The other missing library is the C# port of Lucene, which is called Lucene.Net. Unfortunately, Lucene.Net is not easy to compile with mono (at least it was not for me, but I am no expert).
Fortunately, a stable compiled version is available from the Lucene.Net download page. At the time of writing, the latest version was 2.0-004-11Mar07. Download the latest binary archive and then:
unzip Incubating-Apache-Lucene.Net-2.0-004-11Mar07.bin.zip
cp Incubating-Apache-Lucene.Net-2.0-004-11Mar07.bin/src/Lucene.Net/bin/Release/Lucene.Net.dll mwsearch/libs/
NOTE: Someone reported that they could not compile with Lucene.Net 2.0, but could compile with Lucene.Net 1.4.3.
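A quick check that both extra libraries ended up in the right place can save a failed build. This is a small sketch under the assumption that the libraries live in mwsearch/libs, as in the commands above; the function name check_libs is hypothetical.

```shell
# Check that the two extra libraries are present in the libs directory.
# The directory defaults to mwsearch/libs, the location used in this tutorial.
check_libs() {
    libs_dir=${1:-mwsearch/libs}
    status=0
    for f in mwdumper.jar Lucene.Net.dll; do
        if [ ! -f "$libs_dir/$f" ]; then
            echo "missing: $libs_dir/$f"
            status=1
        fi
    done
    return "$status"
}
```

Running check_libs from the directory that contains mwsearch prints nothing when both files are in place.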
Now we are more or less ready to compile mwsearch. Before starting the compilation take a look at the Makefile and adjust any of the installation paths/tools if necessary.
cd mwsearch
make
make install
How to do this on Windows? --126.96.36.199 12:20, 27 December 2007 (UTC)
- Try install Linux... 188.8.131.52 14:04, 11 March 2008 (UTC)
Does this procedure work as-is on Windows?
Having compiled mwsearch, we now need to configure it and create an index for it to search. The Subversion repository contains a sample configuration file called mwsearch.conf.example. Modify this file to match your needs. The important fields are the name of the database that is to be searched, the location where the index files are placed, and the port on which the search engine listens for incoming connections. This file should be placed in /etc. (Note: other locations may work as well, but I have not confirmed this.)
- Theoretically, yes (there's a parameter for that), but I got a "not found" error when I tried. DarkoNeko 12:46, 3 July 2008 (UTC)
Building a Local Index
The search engine does not search the wiki database directly. Instead, it has a local index of the information in the database which is organized in a fashion that is suitable for searching.
To create the initial index we first need an XML dump of the data that is stored in a wiki. MediaWiki comes with tools to dump the data in the database. These tools are in the maintenance subdirectory of a wiki installation:
cd mywiki/maintenance
php dumpBackup.php --current --quiet > dump_mywiki_date.xml
This command must be able to access the wiki database. Also note that the dump file may get quite large, depending on the amount of information that is present in the wiki.
Now we can ask the search engine to index the data:
MWSearchTool --import=dump_mywiki_date.xml my_wikidb
The value of the import argument is the name of the file containing the dumped wiki data, while the second argument is the name of the wiki database. Recall that in mwsearch.conf we specified the location of the index: some files should have appeared in this location.
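The dump and import steps can be combined into one small script. The sketch below only wraps the two commands shown above; the function names dump_name and rebuild_index are hypothetical, it assumes that php and MWSearchTool are on the PATH, and it writes the dump into the directory from which it is called.

```shell
# Build the dump file name used in this tutorial: dump_<wiki>_<date>.xml
dump_name() {
    printf 'dump_%s_%s.xml\n' "$1" "$2"
}

# Dump the current revisions of a wiki and import them into the search index.
#   $1  path to the wiki installation (for example: mywiki)
#   $2  short wiki name used in the dump file name
#   $3  name of the wiki database, as configured in mwsearch.conf
rebuild_index() {
    wiki_dir=$1
    wiki=$2
    db=$3
    dump=$(dump_name "$wiki" "$(date +%Y%m%d)")
    # Run the dump from the maintenance directory, as in the text above.
    ( cd "$wiki_dir/maintenance" && php dumpBackup.php --current --quiet ) > "$dump" || return 1
    MWSearchTool --import="$dump" "$db"
}
```

With the names used in this tutorial, the call would be rebuild_index mywiki mywiki my_wikidb.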
Having done all this, we are ready to start the search engine:
We can test if everything works correctly like this:
telnet localhost 8123
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET /search/my_wikidb/test
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: Close
5
1 0 Test%26Test
0.450461 0 Test
0.450461 0 Test1
0.04926917 14 Cat
0.04223072 0 Main_Page
Connection closed by foreign host.
The exact output of the previous example would depend on what data is present in the database.
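The same test can be scripted. Judging from the session above, the response body appears to consist of a hit count on the first line, followed by one line per hit containing a score, a namespace number and a URL-encoded page title. The sketch below relies on that reading, which is an interpretation of the sample output rather than a documented format. The function names and the LUCENE_HOST and LUCENE_PORT variables are hypothetical, and it is an assumption that the daemon accepts the full HTTP request that curl sends.

```shell
# Convert a response body read from stdin into tab-separated
# "namespace<TAB>title<TAB>score" lines. Titles stay URL-encoded.
parse_results() {
    awk 'NR == 1 { next }   # skip the first line (the hit count)
         NF >= 3 { printf "%s\t%s\t%s\n", $2, $3, $1 }'
}

# Query the search daemon for a term in the given wiki database.
#   $1  wiki database name (for example: my_wikidb)
#   $2  search term, already URL-encoded
search() {
    curl -s "http://${LUCENE_HOST:-localhost}:${LUCENE_PORT:-8123}/search/$1/$2" | parse_results
}
```

With the daemon from this tutorial running, search my_wikidb test would list the same hits as the telnet session, one per line.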
Installing the Wiki Extensions
Finally, we need to install a MediaWiki extension that replaces the ordinary search functionality with code that contacts the search engine that we just installed.
Extension:LuceneSearch code is available from the MediaWiki repository:
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/LuceneSearch
cp LuceneSearch/* mywiki/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/MWSearch
cp MWSearch/* mywiki/extensions
In addition, unless you already have it, you need to get the file ExtensionFunctions.php, which is in http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions. Place this file in the root of your wiki installation:
cp ExtensionFunctions.php mywiki/
All that is left to do is to modify the LocalSettings.php file to enable the extension:
$wgLucenePort = 8123;
$wgLuceneHost = "localhost"; # or where the search engine lives
# To load-balance across multiple servers:
# $wgLuceneHost = array( "192.168.0.1", "192.168.0.2" );
require_once("$IP/extensions/LuceneSearch.php");
$wgDisableInternalSearch=true;
Now searching should work.
The last thing that is needed is to see how to (incrementally) update the search engine index.
What about a function that reads out the line of the results and creates a link directly to that line? --184.108.40.206 09:04, 21 May 2007 (UTC)