Installing Lucene search

From Meta, a Wikimedia project coordination wiki


Obsolete

This tutorial describes an obsolete version of the MediaWiki Lucene-based search engine. Please refer to mw:Extension:lucene-search for up-to-date information.

Introduction

Lucene is a library for building full-text search engines. One such search engine was written by Kate Turner around 2005 to improve the search functionality of MediaWiki-based wikis. The original search engine was written in Java. It was later ported to C# by Brion Vibber, who ran a number of experiments to compare the performance of different Java and C# implementations. The C# engine appears to be the one used to search the English Wikipedia, and this page explains how to obtain and install it on another MediaWiki-based wiki.

Components

The overall operation of the system is as follows:

  1. a user submits a search query to the wiki
  2. the wiki formats and forwards the search request to the search engine
  3. the search engine consults a local index to find relevant pages
  4. the search engine sends the results to the wiki
  5. the wiki presents the results to the user

To install the system we need to:

  1. obtain and install the search engine on a server
  2. create a local index based on the data stored in a wiki
  3. install an extension to the wiki that knows how to send/receive data to the search engine

The tools needed for the installation are:

  1. The Subversion version control system
  2. A way to execute C# programs (I used mono)
  3. A way to compile C# programs (I used mcs)

Premade packages for these tools exist for Debian and its derivatives (the C# compiler is in the mono-mcs package). Other distributions likely provide them as well.

NOTE: Some old versions of mono-mcs, such as mono-mcs 1.0.1 from Ubuntu 4.10, fail to compile the search engine. However, mono-mcs 1.1.3 from Ubuntu 6.06 is able to compile it.


Installing the Search Engine

The first step to installing the search engine is to obtain a copy of the source code. The source code resides in the MediaWiki SVN repository and can be obtained like this:

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwsearch
NOTE: A reader reports that this link is broken. No replacement location has been given so far.

When the checkout succeeds, it creates a directory called mwsearch in the current directory.

Obtaining Extra Libraries

The search engine relies on a number of external libraries. Fortunately, most of those are already in the Subversion repository (take a look at mwsearch/libs/README.txt).

Two libraries are missing. The first is mwdumper.jar, which can be downloaded and copied into place like this:

wget http://download.wikimedia.org/tools/mwdumper.jar
cp mwdumper.jar mwsearch/libs/


The other missing library is the C# port of Lucene, which is called Lucene.Net. Unfortunately, Lucene.Net is not easy to compile with mono (at least it was not for me, but I am no expert).

NOTE: The download version described below appears to be incorrect. You need Lucene.Net version 1.4.3.
Source: [1]
Docs: [2]
Binaries: [3] (this is a slow link, but keep trying)

Fortunately, a stable compiled version is available from the Lucene.Net download page. At the time of writing, the latest version was 2.0-004-11Mar07. Download the latest binary archive and then:

unzip Incubating-Apache-Lucene.Net-2.0-004-11Mar07.bin.zip
cp Incubating-Apache-Lucene.Net-2.0-004-11Mar07.bin/src/Lucene.Net/bin/Release/Lucene.Net.dll mwsearch/libs/

NOTE: Someone reported that they could not compile with Lucene.Net 2.0, but could compile with Lucene.Net 1.4.3.

Compilation

Now we are more or less ready to compile mwsearch. Before starting the compilation take a look at the Makefile and adjust any of the installation paths/tools if necessary.

cd mwsearch
make
make install

How to do this on Windows? --89.175.73.253 12:20, 27 December 2007 (UTC)

Try installing Linux... 151.49.79.122 14:04, 11 March 2008 (UTC)
Very funny ... :-) --89.175.73.253 16:30, 19 March 2008 (UTC)
I just compiled it in a virtual machine running Ubuntu. It had several dependencies, but after resolving them, it works (for Windows ;-) ) --193.27.220.82 12:15, 28 May 2008 (UTC)

Does this procedure work as is on Windows?

Configuration

Having compiled mwsearch, we now need to configure it and create some indexes for it to search. The Subversion repository contains a sample configuration file called mwsearch.conf.example. Modify this file to match your needs. The important fields are the name of the database to be searched, the location where the index files are to be placed, and the port on which the search engine should listen for incoming connections. Place the file in /etc (the later steps refer to it as mwsearch.conf). Other locations may work as well, but this is unconfirmed.

Theoretically, yes (there's a parameter for that), but I got a "not found" error when I tried. DarkoNeko 12:46, 3 July 2008 (UTC)

Building a Local Index

The search engine does not search the wiki database directly. Instead, it has a local index of the information in the database which is organized in a fashion that is suitable for searching.

To create the initial index we first need an XML dump of the data that is stored in a wiki. MediaWiki comes with tools to dump the data in the database. These tools are in the maintenance subdirectory of a wiki installation:

cd mywiki/maintenance
php dumpBackup.php --current --quiet > dump_mywiki_date.xml

This command needs access to the wiki database, so run it where the wiki's database settings apply. Also note that the dump file may get quite large, depending on the amount of information in the wiki.

Now we can ask the search engine to index the data:

MWSearchTool --import=dump_mywiki_date.xml my_wikidb

The value of the import argument is the name of the file containing the dumped wiki data, while the second argument is the name of the wiki database. Recall that in mwsearch.conf we specified the location of the index: some files should have appeared in this location.
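The dump and import steps above can be wrapped in a small script, for example to rebuild the index from a scheduled job. The following Python sketch is an illustration only: the function names are made up for this example, and it assumes that php and MWSearchTool can be found on the PATH. Whether re-running the import replaces or adds to an existing index is not covered on this page, so check that before relying on it.

```python
import subprocess


def dump_command():
    """Command line for dumping the current page revisions (run in maintenance/)."""
    return ["php", "dumpBackup.php", "--current", "--quiet"]


def import_command(dump_file, database):
    """Command line for indexing a dump file into the given wiki database name."""
    return ["MWSearchTool", f"--import={dump_file}", database]


def rebuild_index(maintenance_dir, dump_file, database):
    """Dump the wiki to dump_file, then ask the search engine to index it.

    maintenance_dir is the wiki's maintenance directory. dump_file is opened
    by this script, so a relative path is resolved against the directory the
    script is started from, not against maintenance_dir.
    """
    with open(dump_file, "wb") as out:
        # Equivalent to: cd maintenance_dir && php dumpBackup.php ... > dump_file
        subprocess.run(dump_command(), cwd=maintenance_dir, stdout=out, check=True)
    # check=True raises CalledProcessError if either tool exits with an error.
    subprocess.run(import_command(dump_file, database), check=True)
```

For example, rebuild_index("mywiki/maintenance", "dump_mywiki_date.xml", "my_wikidb") repeats the two commands shown above.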

Having done all this, we are ready to start the search engine:

MWDaemon

We can test if everything works correctly like this:

telnet localhost 8123
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET /search/my_wikidb/test

HTTP/1.1 200 OK
Content-Type: text/plain
Connection: Close

5
1 0 Test%26Test
0.450461 0 Test
0.450461 0 Test1
0.04926917 14 Cat
0.04223072 0 Main_Page
Connection closed by foreign host.

The exact output of the previous example depends on what data is present in the database.
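The same check can be scripted. The Python sketch below builds the URL used in the telnet session and parses the response body. The response layout is an assumption inferred from the sample output only: the first line is taken to be the number of hits, and each following line is read as a score, a namespace number, and a URL-encoded page title. The function names are hypothetical, and it has not been verified that the daemon accepts the full HTTP request that urllib sends.

```python
from urllib.parse import quote, unquote
from urllib.request import urlopen


def build_search_url(host, port, database, query):
    """Build the /search/<database>/<query> URL used in the telnet example."""
    return f"http://{host}:{port}/search/{quote(database, safe='')}/{quote(query, safe='')}"


def parse_search_response(body):
    """Parse a response body into (total, hits).

    Assumed layout (inferred from the sample output, not from a specification):
    the first non-empty line is the hit count, and every later line is
    "<score> <namespace number> <URL-encoded title>".
    """
    lines = [line for line in body.splitlines() if line.strip()]
    if not lines:
        return 0, []
    total = int(lines[0])
    hits = []
    for line in lines[1:]:
        score, namespace, title = line.split(None, 2)
        hits.append((float(score), int(namespace), unquote(title)))
    return total, hits


def search(host, port, database, query, timeout=5):
    """Query a running search daemon and return the parsed result."""
    url = build_search_url(host, port, database, query)
    with urlopen(url, timeout=timeout) as response:
        return parse_search_response(response.read().decode("utf-8"))
```

With the daemon from the example running, search("localhost", 8123, "my_wikidb", "test") would send the same request as the telnet session.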

Installing the Wiki Extensions

Finally, we need to install a MediaWiki extension that replaces the ordinary search functionality with code that contacts the search engine that we just installed.

MediaWiki can use mw:Extension:LuceneSearch (before MW 1.13) or mw:Extension:MWSearch (MW 1.13 and later) to fetch results from this search engine.

The Extension:LuceneSearch code is available from the MediaWiki repository:

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/LuceneSearch
cp LuceneSearch/* mywiki/extensions

The Extension:MWSearch code is available from the same repository:

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/MWSearch
cp MWSearch/* mywiki/extensions

In addition, unless you already have it, you need the file ExtensionFunctions.php, which is in http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions. Place this file in the root of your wiki installation:

cp ExtensionFunctions.php mywiki/


All that is left to do is to modify the LocalSettings.php file to enable the extension:

$wgLucenePort = 8123;
$wgLuceneHost = "localhost";  # or where the search engine lives
# To load-balance across multiple servers:
#  $wgLuceneHost = array( "192.168.0.1", "192.168.0.2" );

require_once("$IP/extensions/LuceneSearch.php");
$wgDisableInternalSearch=true;

Now searching should work.

TODO

Still to be documented: how to update the search engine index incrementally.

See http://www.mediawiki.org/wiki/Extension:LuceneSearch#Incremental_updates

What about a function that reads out the line of the results and creates a link directly to that line? --213.214.18.64 09:04, 21 May 2007 (UTC)