User talk:Brooke Vibber/MWDaemon


Feature requests

  1. Can we have it search numbers?
  2. Can we make word-proximity 'worth' more?
  3. Can we make the search terms "AND" by default?

Can you point to the lines in the .cs files that I could change? :) Or could you make these options?

Thanks!

Spiffy

GCJ seems to be the way to go. Though there's a C interface for Lucene indexes written... :)

Preferences

I am assuming that this is what is giving me the wonderful search results on :en: right now, in which case might I request that it take account of my preferences when setting the number of results returned? It seems to pick up my preferred namespaces OK, but I'm only getting 10 results at a time: however if I manually mangle it to give me more, there is no discernable slow-down. HTH HAND --Phil | Talk 16:40, 11 Apr 2005 (UTC)

That should be fixed now. --brion 12:53, 13 Apr 2005 (UTC)

Sorting with multiple search terms

I searched for Esther Friesner and got a lovely bucket of results. Unfortunately the first result which actually had both of my specified search terms didn't appear until more than half-way down the first page (that's at 50 per page)…that's actually at #33 out of 726. Is there a way to tell it "if I specify multiple search terms, I want to see the results with more of them in towards the top of the list"? --Phil | Talk 12:46, 14 Apr 2005 (UTC)

That's the way Lucene usually works when sorting by score.
OK, so how do I make it work some other way, so I get the pages which include more of my search terms towards the top? --Phil | Talk 16:33, 22 Apr 2005 (UTC)
When I get a chance I'll probably change it to require all terms by default, as this fits better with typical search engine style and the idea of narrowing a search by specifying more terms. --brion 22:33, 22 Apr 2005 (UTC)
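
Until the default changes, note that Lucene's query syntax treats a leading + as "required" and a leading - as "prohibited", so a front end could prefix each bare term itself. A minimal sketch, assuming plain whitespace-separated terms; the class and method names are hypothetical and not part of MWDaemon, and phrases, ranges and boolean keywords are not handled:

```java
/** Hypothetical helper: marks every bare term in a query as required. */
final class RequireAllTerms {
    static String rewrite(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty query splits into a single empty string
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            char first = term.charAt(0);
            // Leave terms alone when the user already wrote + or - in front.
            if (first != '+' && first != '-') {
                out.append('+');
            }
            out.append(term);
        }
        return out.toString();
    }
}
```

With this, RequireAllTerms.rewrite("Esther Friesner") gives "+Esther +Friesner", which only matches pages containing both words.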

More complex queries

It seems to have a few problems with more complex queries, for example: pope -vatican +city -[observatory TO research]

The problem here is the [observatory TO research]: it produces a 'too many terms' error. A much smaller range, like [observatory TO obtuse] works. --brion 02:59, 20 Apr 2005 (UTC)
Okay, perhaps increase the number of BooleanClauses?

Mono performance.

Hello,

In Mono 1.1.6 we used a very slow mechanism to do IO: to implement the .NET semantics we required a helper process and as a result many IO based operations would force a round-trip to this helper process (that is why you see two copies of mono running when you run a single application).

Could you try the same test with Mono 1.1.7? The results will likely be different. If you ran Mono in its default mode, it uses only the default code generation flags; you might want to try one or more of the more advanced optimizations (at the cost of startup speed) using the -O=xxx option (use --list-opts for a full list).

Miguel.

1.1.7 speeds up indexing by 25% in the tests I did a few days ago. I haven't checked searching performance yet. --brion 23:09, 12 May 2005 (UTC)
So I assume you got 90 pages per second on that page? We would love to take a look at the code you are using to benchmark, to see if we can improve it. One thing to try might be to run with mono --profile indexer.exe to find out which are the worst memory and CPU offenders --miguel 14 May 2005.
Here's output from --profile on the ru.wikipedia.org test build ("as seen in graphs"): http://leuksman.com/misc/MWUpdater-profiling-run.txt.gz
I've put some vague directions on obtaining/running the code below. --brion 04:40, 15 May 2005 (UTC)

Making gcj perform even better.

Try enabling platform specific optimizations with gcj. For instance, -march=pentium4. It's likely there are other small tweaks worth trying - but that's an obvious easy one.

Anthony

-march=athlon-xp didn't have any noticeable effect on indexing speed. On the other hand it looks like most of the CPU time is spent in the class library and VM internal functions, so maybe I should recompile GCC/GCJ/libgcj with extra optimizations... --brion 09:56, 13 May 2005 (UTC)
That doesn't seem to help either. Actually it seems to slow it down slightly. :)

Doug Cutting e-mailed me a couple weeks ago with some hints for the GCJ build support in the Lucene development branch; I'll fiddle with that some more too. --brion 19:35, 13 May 2005 (UTC)

gcj indexing.

I'd like to look into the indexing performance of gcj. How can I get a hold of any data files required? Also, does this require setting up a db as well? - green

I suspect that the poor performance has more to do with the regexp library code we use from GNU Classpath (gnu.regexp). See these benchmarks: http://tusker.org/regex/regex_benchmark.html . Sun's library code is roughly 6 times faster than gnu.regexp in this benchmark. It also points to an even faster implementation here: http://www.brics.dk/~amoeller/automaton/ . We should look at replacing the gnu.regexp code in libgcj. In the meantime, it may be worth recoding the indexing code to use this super-fast regexp implementation. - green

This line of code in SearchState.java is what's really killing gcj's indexing performance:

               text = text.replaceAll("\\{\\|(.*?)\\|\\}", "")
                       .replaceAll("\\[\\[[A-Za-z_-]+:([^|]+?)\\]\\]", "")
                       .replaceAll("\\[\\[([^|]+?)\\]\\]", "$1")
                       .replaceAll("\\[\\[([^|]+\\|)(.*?)\\]\\]", "$2")
                       .replaceAll("(^|\n):*[^'].*\n", "")
                       .replaceAll("^----.*", "")
                       .replaceAll("", "")
                       .replaceAll("(|</?[bB]>)", "")
                       .replaceAll("", "")
                       .replaceAll("</?[uU]>", "");

The GNU Classpath project (which virtually all free VMs use) has a slow regexp implementation. Sometime today or tomorrow I will try recoding this to use a fast free 3rd party regexp package and to avoid creating so many temporary garbage strings (if possible). - green
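
As a rough illustration of one cheap thing to try first: String.replaceAll(regex, repl) is specified to behave like Pattern.compile(regex).matcher(str).replaceAll(repl), so the chain above recompiles every pattern for every page. The sketch below compiles each pattern once and reuses it. It assumes java.util.regex is available, uses only the patterns that survive intact in the snippet above, and the class name MarkupStripper is hypothetical. It does not make the regex engine itself any faster, and it still creates one intermediate string per step.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of lucene-search: strips some wiki markup. */
final class MarkupStripper {
    // Compiled once when the class is loaded, instead of once per page.
    private static final Pattern TABLE =
        Pattern.compile("\\{\\|(.*?)\\|\\}");
    private static final Pattern NAMESPACED_LINK =
        Pattern.compile("\\[\\[[A-Za-z_-]+:([^|]+?)\\]\\]");
    private static final Pattern PLAIN_LINK =
        Pattern.compile("\\[\\[([^|]+?)\\]\\]");
    private static final Pattern PIPED_LINK =
        Pattern.compile("\\[\\[([^|]+\\|)(.*?)\\]\\]");
    private static final Pattern UNDERLINE =
        Pattern.compile("</?[uU]>");

    static String strip(String text) {
        // Same order as the original chain, so the results should match.
        text = TABLE.matcher(text).replaceAll("");
        text = NAMESPACED_LINK.matcher(text).replaceAll("");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = PIPED_LINK.matcher(text).replaceAll("$2");
        return UNDERLINE.matcher(text).replaceAll("");
    }
}
```

For example, MarkupStripper.strip("See [[Main Page]] and [[Foo|bar]] <u>now</u>") returns "See Main Page and bar now".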


Compile your own (GCJ)

To build and test...

Check out:

 # http://cvs.sourceforge.net/viewcvs.py/wikipedia/lucene-search/
 cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/wikipedia login
 cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/wikipedia co -P lucene-search

Get the Apache Lucene 1.4.3 jar

Get the MySQL connector/J jar

Have a suitable GCJ 4.0 or higher installed and go:

 export GCJ=gcj-4.0 # if necessary
 export CFLAGS="-O99999 -mspeed=reallyfast" # or whatever ;)
 make

You can get database dumps from http://download.wikimedia.org/. You'll need the 'cur' table from some wiki to test with; you should be able to just import it into a fresh empty database in MySQL. (Warning: to import some dumps you may need to set MySQL's max_allowed_packet to 16M and restart the server daemon.)

Copy mwsearch.conf.example to mwsearch.conf and make changes as necessary.

To use a language analyzer other than the default English one, add a mwsearch.suffix option, and name your database something like <languagecode><suffix> (e.g. ruwiki, entest, zhdatabase for suffixes wiki, test, or database). Supported languages currently are en, de, ru, and eo (English, German, and Russian analyzers are bundled with Lucene, and I wrote a hacky test one for Esperanto).
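
To make the naming convention concrete, here is a minimal sketch of how a language code could be recovered from a database name and the configured suffix. The class and method names are hypothetical and this is not the actual MWSearch code, just an illustration of the <languagecode><suffix> rule:

```java
/** Hypothetical illustration of the <languagecode><suffix> naming rule. */
final class DbNameSketch {
    /**
     * Returns the part of dbName before the suffix, or null when dbName
     * does not end with the suffix or nothing precedes it.
     */
    static String languageCode(String dbName, String suffix) {
        if (suffix.isEmpty()
                || !dbName.endsWith(suffix)
                || dbName.length() == suffix.length()) {
            return null;
        }
        return dbName.substring(0, dbName.length() - suffix.length());
    }
}
```

With suffix wiki, the name ruwiki yields ru; a name that does not end in the suffix yields null, in which case a caller would presumably fall back to the default English analyzer.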

To build indexes:

 ./MWSearch -rebuild databasename

If you omit the name, it will rebuild indexes for all defined databases. Running a rebuild when there's an existing index will wipe it out and start from scratch.

Compile your own (Mono)

Check out:

 # http://cvs.sourceforge.net/viewcvs.py/wikipedia/mwsearch/
 cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/wikipedia login
 cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/wikipedia co -P mwsearch

Fetch library dependencies and drop the .dll's in the libs directory:

Or get them all in a single tarball from:

  • [1] and unpack on the checkout directory.

Build:

 make

('make install' will install to /usr/local, or you could run it locally)

The C# code's more experimental and wiggier. :) It expects an /etc/mwsearch.conf file by default. Format's a little different from the java version, I'll add an example... (/mwsearch.conf.example if it hasn't shown up in anon cvs yet)

To run the updater to build/rebuild a db:

 cd build
 mono MWUpdater.exe --rebuild databasename

There's a MonoDevelop project file in there also.

Warning: the German language analyzer in dotLucene is broken, but the others seem to work. I've a patch for this: http://leuksman.com/pages/lucene-bug


Odeum as an alternative indexer?

Odeum is the indexing library used in Estraier, bogofilter, Gonzui, and everything which uses QDBM. It can be found here: http://qdbm.sourceforge.net/

There is a recent benchmark that seems to show that QDBM/Odeum performance is better than Lucene on HotSpot: http://zedshaw.com/projects/ruby_odeum/odeum_lucene_part2.html

QDBM is free software, runs on almost everything, and is plain ole' C code. The software has been around for a long time and is definitely stable and without memory/performance problems.

It is also actively maintained (latest release is 27 May 2005).


Labelling

I think it would be clearer to use the term "Sun JVM" instead of just "Java" in the charts, e.g.

 Java trial 1: Average time per request: 00:00:00.0587265
 GCJ  trial 1: Average time per request: 00:00:00.1769577
 Mono trial 1: Average time per request: 00:00:00.1981337

should be

 Sun JVM trial 1: Average time per request: 00:00:00.0587265
 GCJ  trial 1: Average time per request: 00:00:00.1769577
 Mono trial 1: Average time per request: 00:00:00.1981337

If it's too long, maybe you can use just "JVM", as "Sun JVM" is the only JVM here (but it's not the only "Java").

Using Lucene in my local MediaWiki install?

So how can I use Lucene in my local MediaWiki? Is there a how-to somewhere? I have installed tons of local notes, memos, information, etc., but I find the MySQL search really frustrating, mainly because of the lack of * and ? search. Thanks --Khalid hassani 22:56, 13 February 2006 (UTC)

Well, I am replying to myself: after some Googling I found this: http://cvs.sourceforge.net/viewcvs.py/wikipedia/lucene-search/README.txt?rev=1.7&view=auto I will try it. Are there any more recent docs? Thanks --Khalid hassani 23:04, 13 February 2006 (UTC)

Possible problem with DotLucene

Although GCJ with Lucene won, I would still like to point out a small bug in DotLucene. As far as I know, DotLucene has a problem in its TF*IDF algorithm. The IDF calculation seems to be incorrect, so DotLucene may incorrectly rank shorter documents higher. I'm not sure whether the DotLucene team has fixed it. --B6s 14:35, 9 May 2006 (UTC)

Any news?

Any news on this, or is it an abandoned idea? --Kingboyk 20:45, 15 November 2006 (UTC)