In late December, Kate wrote a spiffy Lucene-based search indexer and daemon. It was taken temporarily offline due to a problem with hanging connections, and with an unresolved dispute over whether Sun's Java VM was sufficiently 'free', it didn't get returned to service at the time.
I fiddled with it a bit in January to test with GCJ and other free Java VMs, then set it aside while working on other things. In the last few days I've picked it back up again, making an experimental port to C# (using dotlucene) as a learning exercise and to provide a comparative benchmark of the Mono virtual machine.
A crude, simple benchmark: a single-word fulltext search for "pope", run 100 times sequentially. This is a fairly popular word, returning 6890 results from the test database used (English Wikipedia, 2005-03-09 dump). The time reported is the average client-side time per request over the second set of 100 runs.
- GCJ 4.0 -O2: 588.3437 ms (fastest)
- Sun Java 1.5 -server: 636.9209 ms
- Sun Java 1.5: 695.5374 ms
- GCJ 4.0: 707.2302 ms
- Mono 1.1.6: 894.4488 ms (slowest)
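The client side of the benchmark is nothing fancy; a minimal sketch of its shape, where search() is a hypothetical stand-in for the real round trip to the daemon (this is illustrative, not the actual benchmark code):

```java
// Sketch of the client benchmark loop: two passes of 100 sequential
// requests, reporting the mean time of the second pass only (the first
// pass serves as warm-up). search() is a hypothetical stub for the
// real request to the search daemon.
class SearchBench {
    static final int RUNS = 100;

    static void search(String term) {
        // real code: connect to the daemon, send the query, read results
    }

    // Mean per-request time in milliseconds.
    static double averageMs(long[] nanos) {
        long total = 0;
        for (long n : nanos) {
            total += n;
        }
        return total / (double) nanos.length / 1000000.0;
    }

    public static void main(String[] args) {
        long[] times = new long[RUNS];
        for (int pass = 0; pass < 2; pass++) {   // pass 0 is warm-up
            for (int i = 0; i < RUNS; i++) {
                long start = System.nanoTime();
                search("pope");
                times[i] = System.nanoTime() - start;
            }
        }
        System.out.printf("Average time per request: %.4f ms%n", averageMs(times));
    }
}
```

Averaging only the second pass keeps VM warm-up and cold caches out of the numbers.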
Each version of the daemon read from the same index set, which I had generated with the Mono-based version. (dotlucene 1.4 is index-compatible with the Java Lucene 1.4.)
Machine is an Athlon XP 2400+ with 512 MB RAM, running Ubuntu Linux (Hoary Hedgehog). The Java version is using Lucene 1.4.3; the C# version is using dotLucene 1.4.3. Sun's 1.5 JDK is installed from binary package, Mono 1.1.6 is compiled from source, and GCJ is compiled from CVS pulled on 2005-04-07.
At the moment my experimental C# port is missing the title prefix matcher (which uses BerkeleyDB on the Java version) and is single-threaded as I haven't finished redoing the threading bits, but it does run the index builder and return search and title results just fine.
I also expanded its logging (with log4net) and rearranged some pieces internally. If I move back to the Java version I may copy the logging bits over, using log4j. (I've used java.util.logging instead of log4j or Jakarta Commons logging as GCJ throws fits trying to compile those. Sigh...)
Performance on Mono 1.1.6 seems to be about 25% slower than Sun's 1.5 JVM at default settings, and a bit further behind Sun's server VM. This would be acceptable, if not great. Mono's in very active development, with strong corporate support, and should continue to improve.
The indexer shows a similar performance gap: about 72 articles/sec vs 100 articles/sec under Sun's (client) JVM.
GCJ doesn't get along with BerkeleyDB Java Edition at all well. It used to blow up with a bytecode validation error, which apparently has been fixed, but now segfaults on something else. In the past when I'd gotten it to compile, it would still die at runtime with threading problems... To get this test running, I commented out all the BerkeleyDB bits (used only for the title prefix match, so not exercised in this test).
Performance is better than I expected; about on par with Sun's JVM on default settings and actually a fair bit faster than it on -O2. If it can manage to be reasonably stable, it might be worthwhile to stick with Java and target GCJ after all.
I haven't tested the indexer on GCJ yet.
Note that GCJ 3.3 and 3.4 won't compile other parts of the software either; building 4.0 from CVS is required. Another GCJ-related possibility is PyLucene, which wraps a GCJ-compiled Lucene as a library for Python.
There's a Java compatibility layer for the CLR which can run under Mono, called IKVM. Out of curiosity I tried it briefly, but it's not working. The daemon spawns threads, which accept connections, and then just hang. No real need to investigate this unless someone is bored and wants to send bug reports/fixes to IKVM.
Rough speeds running the index builder (MWSearch class):
- Sun Java 1.5 -server: 120 pages/sec (fastest)
- Sun Java 1.5: 100 pages/sec
- Mono 1.1.6: 72 pages/sec
- GCJ 4.0 -O2: 33 pages/sec (slowest)
GCJ's doing much more poorly at this. Still should be fast enough for realtime indexing, but it's less encouraging. Will do further testing on the daemon before giving up on C#/Mono.
Multiple search terms
Doing the same search 100 times in a row isn't very "genuine", so I pulled 949 actual search terms for en.wikipedia.org from squid logs for further testing.
Since some common terms match a *huge* number of pages, I've capped the number of results actually returned to 1000.
Otherwise same as the Pope test above; sequential requests for the first 100 terms in the search set. Average client time in milliseconds for all requests over the second run:
- Sun Java 1.5 -server: 280.8036 ms (fastest)
- GCJ 4.0 -O2: 313.9672 ms
- Mono 1.1.6: 363.5854 ms (slowest)
GCJ's slight lead over Sun has dropped in this test, but it's still pretty speedy. Mono's still trailing but workable.
I modified the benchmark tool to run 10 threads in parallel, for a total of 1000 requests (100 per thread). Since I haven't yet fixed the threading in the C# version, I have only tested the Java version so far:
- Sun Java 1.5 -server: 2243.7663 ms (faster)
- GCJ 4.0 -O2: 2323.3297 ms (slower)
Understandably, wall-clock time for a response is several times greater with several simultaneous threads sharing time on a uniprocessor machine. Pleasingly, GCJ didn't fall down, and again comes in quite close to Sun's server VM.
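The parallel harness is just the sequential loop fanned out over threads; schematically it looks like this (search() is again a hypothetical stub, not the real benchmark code):

```java
// Sketch of the 10-thread benchmark: each thread issues 100 sequential
// requests and adds its timings to a shared total; main() joins the
// threads and reports the overall mean across all 1000 requests.
// search() is a stub for the real daemon round trip.
import java.util.concurrent.atomic.AtomicLong;

class ParallelBench {
    static final int THREADS = 10;
    static final int RUNS = 100;

    static void search(String term) { /* round trip to the daemon */ }

    public static void main(String[] args) throws InterruptedException {
        final AtomicLong totalNanos = new AtomicLong();
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < RUNS; i++) {
                        long start = System.nanoTime();
                        search("pope");
                        totalNanos.addAndGet(System.nanoTime() - start);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.printf("Average: %.4f ms over %d requests%n",
                totalNanos.get() / 1000000.0 / (THREADS * RUNS), THREADS * RUNS);
    }
}
```

Since the threads share one CPU, the per-request wall-clock average naturally comes out several times higher than in the sequential runs.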
The above benchmarks were done with the fulltext search. There's also a title match, though, which uses a fuzzy match mode that turns out to be even slower than the regular fulltext matches. Using the same search term set as above, running 100 sequential requests:
- Sun Java 1.5 -server: 1551.5211 ms (fastest)
- GCJ 4.0 -O2: 3182.1492 ms
- Mono 1.1.6: 3637.9899 ms (slowest)
There's a wider gap here between the Sun JVM and the competitors; GCJ takes twice as long to return these results, and Mono's a bit slower still.
Graph of handled requests per second during testing on en.wikipedia.org, 10 April 2005.
Memory usage over similar time period... possible memory leak
- I haven't been able to reproduce the memory leak on my test box, so it might be related to running on the old 2.4 kernel. I tried to upgrade Vincent to 2.6, but it didn't come back up after reboot. I'm poking on isidore (FreeBSD) again. --brion 07:12, 12 Apr 2005 (UTC)
Prefix matcher work
Since BerkeleyDB's giving me trouble under Java (even under real Sun java), I'm experimenting with SQLite as an alternative backend for the title prefix matcher. More to come...
Top part of the flat profile from gprof on a benchmark run against the daemon, hacked to run a single thread:
  %   cumulative    self                self    total
 time    seconds   seconds      calls  s/call  s/call  name
14.22      26.24     26.24    2389798    0.00    0.00  org::apache::lucene::search::FuzzyTermEnum::editDistance(java::lang::String*, java::lang::String*, int, int)
 8.60      42.09     15.86  255126290    0.00    0.00  org::apache::lucene::store::InputStream::readByte()
 8.06      56.96     14.87  181850817    0.00    0.00  org::apache::lucene::store::InputStream::readVInt()
 5.53      67.17     10.21   37297712    0.00    0.00  org::apache::lucene::search::BooleanScorer::next()
 4.82      76.07      8.90   53121900    0.00    0.00  org::apache::lucene::search::TermScorer::score()
 4.66      84.66      8.60   55442898    0.00    0.00  org::apache::lucene::search::BooleanScorer$Collector::collect(int, float)
 3.09      90.37      5.71    1722881    0.00    0.00  org::apache::lucene::index::SegmentTermDocs::read(JArray<int>*, JArray<int>*)
 3.03      95.95      5.58   59860063    0.00    0.00  org::apache::lucene::index::Term::compareTo(org::apache::lucene::index::Term*)
 2.87     101.24      5.29   37296910    0.00    0.00  org::apache::lucene::search::IndexSearcher$1::collect(int, float)
 2.81     106.42      5.19  128321303    0.00    0.00  org::apache::lucene::search::FuzzyTermEnum::min(int, int, int)
 2.78     111.54      5.12   44258202    0.00    0.00  org::apache::lucene::index::SegmentMergeQueue::lessThan(java::lang::Object*, java::lang::Object*)
 2.55     116.24      4.70   12039282    0.00    0.00  org::apache::lucene::store::InputStream::readChars(JArray<wchar_t>*, int, int)
 2.41     120.69      4.45    5998213    0.00    0.00  org::apache::lucene::util::PriorityQueue::downHeap()
 2.26     124.86      4.18   11963032    0.00    0.00  org::apache::lucene::index::SegmentTermEnum::readTerm()
 2.26     129.03      4.17  107690772    0.00    0.00  org::apache::lucene::search::TermScorer::doc()
 2.16     133.02      3.99   53123768    0.00    0.00  org::apache::lucene::search::TermScorer::next()
 1.87     136.46      3.45   11963212    0.00    0.00  org::apache::lucene::index::SegmentTermEnum::next()
 1.61     139.43      2.97   53225517    0.00    0.00  org::apache::lucene::search::Similarity::decodeNorm(char)
 1.59     142.36      2.93   39514291    0.00    0.00  org::apache::lucene::search::BooleanScorer::score()
 1.19     144.55      2.19    5803363    0.00    0.00  org::apache::lucene::util::PriorityQueue::upHeap()
 1.05     146.49      1.94   23955287    0.00    0.00  org::apache::lucene::store::InputStream::readVLong()
 1.01     148.36      1.87        488    0.00    0.20  org::apache::lucene::search::Scorer::score(org::apache::lucene::search::HitCollector*)
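The editDistance() at the top of that profile is the classic dynamic-programming Levenshtein distance, which FuzzyTermEnum computes against every candidate term; that's why fuzzy matching dominates the runtime. A minimal sketch of the algorithm (not Lucene's actual code):

```java
// Levenshtein edit distance by dynamic programming: d[i][j] is the
// minimum number of single-character inserts, deletes, and
// substitutions to turn the first i chars of a into the first j of b.
class EditDistance {
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // delete
                                            d[i][j - 1] + 1),    // insert
                                   d[i - 1][j - 1] + cost);      // substitute
            }
        }
        return d[a.length()][b.length()];
    }
}
```

It's O(length of a × length of b) per term pair, run a couple million times per benchmark pass per the call counts above, so the cost adds up quickly.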
Profiling: index building
Index building is much slower under GCJ than under Sun's JDK.
Building an index for a ru.wikipedia.org dump on my Athlon box takes 118-120 seconds on Sun server JVM 1.5, versus 476-478 seconds compiled with GCJ 4.0 at -O3. (-march=athlon-xp makes no significant difference, at least not on indexing.)
I'm currently building the indexes in a RAMDirectory and merging that to disk every 10000 entries (so three merges are done on a database with 27807 indexable pages.)
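The build loop is shaped roughly like this sketch, with the Lucene RAMDirectory/IndexWriter calls replaced by stubs (the batching logic is the point here, not the real API usage):

```java
import java.util.ArrayList;
import java.util.List;

// Shape of the batched index build: pages accumulate in an in-memory
// index (RAMDirectory in the real code) and get merged to the on-disk
// index every BATCH pages, plus once more at the end for the leftover
// partial batch. The actual merge/add calls are stubbed out.
class BatchedIndexBuilder {
    static final int BATCH = 10000;
    private List<String> ramBuffer = new ArrayList<String>();
    int merges = 0;

    void addPage(String text) {
        ramBuffer.add(text);               // really: ramWriter.addDocument(...)
        if (ramBuffer.size() >= BATCH) flush();
    }

    void finish() {
        if (!ramBuffer.isEmpty()) flush(); // final partial batch
    }

    private void flush() {
        // really: merge the RAMDirectory into the on-disk index, then
        // start a fresh in-memory index
        ramBuffer.clear();
        merges++;
    }
}
```

Run over 27807 pages this does two full-batch merges plus a final partial one, i.e. the three merges mentioned above.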
Poking at it with oprofile, it looks like some big chunks are taken up by garbage collection and other JVM internal bits:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name         symbol name
54554    10.6732  no-vmlinux       (no symbols)
38660     7.5636  libc-2.3.2.so    (no symbols)
31226     6.1092  libgcj.so.6.0.0  __i686.get_pc_thunk.bx
27217     5.3248  libgcj.so.6.0.0  GC_mark_from
26677     5.2192  libgcj.so.6.0.0  _Jv_MonitorExit
24520     4.7972  libgcj.so.6.0.0  _Jv_MonitorEnter
23409     4.5798  libgcj.so.6.0.0  java::lang::Object::clone()
19136     3.7438  libgcj.so.6.0.0  _Jv_LookupInterfaceMethodIdx
15575     3.0472  libgcj.so.6.0.0  gnu::regexp::CharIndexedString::charAt(int)
14114     2.7613  libgcj.so.6.0.0  _Jv_IsAssignableFrom(java::lang::Class*, java::lang::Class*)
13375     2.6167  libgcj.so.6.0.0  _Jv_NewPrimArray
11935     2.3350  libgcj.so.6.0.0  java::lang::String::charAt(int)
11558     2.2613  libgcj.so.6.0.0  gnu::regexp::RE::getMatchImpl(gnu::regexp::CharIndexed*, int, int, java::lang::StringBuffer*)
10327     2.0204  libgcj.so.6.0.0  GC_local_gcj_malloc
 9352     1.8297  libgcj.so.6.0.0  GC_local_malloc_atomic
 8514     1.6657  libgcj.so.6.0.0  _Jv_CheckCast
 8458     1.6548  libgcj.so.6.0.0  gnu::regexp::RETokenChar::match(gnu::regexp::CharIndexed*, gnu::regexp::REMatch*)
 7857     1.5372  libgcj.so.6.0.0  anonymous symbol from section .plt
 6449     1.2617  libgcj.so.6.0.0  gnu::regexp::REMatch::clear(int)
 5801     1.1349  libgcj.so.6.0.0  gnu::regexp::RE::match(gnu::regexp::CharIndexed*, gnu::regexp::REMatch*)
Index building comparison
A couple of interesting notes: the Mono and GCJ graphs are very similar; the Mono version just runs a lot faster. :)
The Sun Java graph is very different at the beginning, though. This is using the -server JVM, which is alleged to take longer when starting up, then go superduper fast. Once it gets past the first couple of 10000-page merge cycles, though, it starts to look like the others too (just faster). I'll try running a larger database and see if this holds.
I've fiddled a bit with the optimization options on the Mono runtime with little success. Adding -O=all makes this run a couple seconds slower. -O=all apparently includes all options except precomp and shared. Adding precomp breaks the program completely, and it doesn't run. (Details later.) Shared lets it work, but about 30% slower than without it.
This database only runs a few minutes, so it may be worthwhile letting the optimizations go on a longer run, such as the German wiki.
Looks like mono -O=all makes no difference in the long term either. :(
I also got some swapping on this machine (512MB memory) during the GCJ run, and the Sun VM crapped out claiming the heap was too large near the end.
It seems we've started some kind of bidding war between Mono and GCJ developers. ;)
Anthony Green has pointed out that Classpath's regular expressions package is pretty slow and suggested some improvements which speed up index building on GCJ immensely. Using the regexps more efficiently on the wiki markup stripping should speed up index building on Mono too, which I'll try out when I have a chance.
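Using regexps more efficiently mostly means compiling patterns once and reusing them instead of recompiling per page (String.replaceAll() recompiles its pattern on every call). A hypothetical markup-stripping example in Java; the pattern here is a simplified stand-in, not the daemon's real stripping rules:

```java
import java.util.regex.Pattern;

// Precompile the markup-stripping pattern once and reuse it for every
// page, rather than paying the compile cost per call. This example
// pattern just reduces [[link]] and [[link|label]] to the visible
// text; the real stripping rules are more involved.
class MarkupStrip {
    private static final Pattern LINK =
        Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");

    static String strip(String wikitext) {
        return LINK.matcher(wikitext).replaceAll("$1");
    }
}
```

The same reuse-the-compiled-object idea applies on the Mono side with System.Text.RegularExpressions, though as noted below its patterns are interpreted rather than precompiled by default.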
Ben Maurer's been taking a look at our profiling output from the C#/Mono version of the search daemon, and suggested backporting this change from Lucene SVN which reduces the amount of allocating & deallocating of little objects. It looks like it helps a bit, but at that point I got sidetracked on what appears to be a compiler bug...
Some search queries return fewer or no hits when I compile Lucene.Net.dll with Mono's C# compiler, while the same source code compiled with Microsoft's compiler produces the same hits as the Java version. (I'm only running on Mono; I haven't tested the .NET VM and am not really interested in doing so... we have no Windows boxes in our server cluster and are not likely to start!) For the truly masochistic, I've put together a test case with a small search index and a quickie proggie to check the number of hits.
(Don't feel bad, Mono guys: I also came across an embarrassing Classpath bug in the java.net.URI class while fiddling with this stuff. There are bugs enough to go around!)
On search performance: the difference between Mono and GCJ seems narrower for 'plain' searches than it is for fuzzy matches. We use fuzzy matches for a 'spelling suggestion' when no hits are returned; although this is relatively rare, it ends up eating a disproportionate amount of runtime because fuzzy matches are so much slower than regular ones. Ben's also suggested switching to some more efficient special-purpose index for this, perhaps using an existing spell-checking library as a backend. Might be worth it.
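For context, the suggestion feature is just a fallback path, roughly like this sketch (plainSearch()/fuzzySearch() are hypothetical stubs for the daemon's code); the slow fuzzy scan runs only on zero-hit queries, but each one pays the full cost:

```java
import java.util.Collections;
import java.util.List;

// Control flow of the spelling-suggestion fallback: the slow fuzzy
// search only runs when the plain query returns no hits. Both search
// methods are stubs; here plainSearch() pretends nothing matched and
// fuzzySearch() pretends it found a suggestion.
class SuggestFallback {
    static List<String> plainSearch(String term) {
        return Collections.<String>emptyList();    // pretend: no hits
    }

    static List<String> fuzzySearch(String term) {
        return Collections.singletonList("pope");  // pretend: a suggestion
    }

    static List<String> searchWithSuggestion(String term) {
        List<String> hits = plainSearch(term);
        if (hits.isEmpty()) {
            return fuzzySearch(term);   // slow path: FuzzyTermEnum scan
        }
        return hits;
    }
}
```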
Much-summarized results of quickie comparison tests with the currentish Java and C# code (1000 sequential hits from the benchmark):
- Java trial 1: Average time per request: 00:00:00.0587265
- GCJ trial 1: Average time per request: 00:00:00.1769577
- Mono trial 1: Average time per request: 00:00:00.1981337
With fuzzy searches disabled:
- Java trial 2: Average time per request: 00:00:00.0502676
- GCJ trial 2: Average time per request: 00:00:00.1448828
- Mono trial 2: Average time per request: 00:00:00.1327008
Updated index build comparison
Here's the ru.wikipedia.org index building test done with Anthony's patch using jregex for the wiki markup stripping. Compare against earlier #Index_building_comparison.
This gives GCJ a bit of a boost over Mono (where I haven't yet spent much time tweaking the regex usage; preallocating Regex objects there doesn't help, and I'm told they're interpreted, not precompiled). It's still running a bit slower than Sun's server VM in 1.5, but no longer embarrassingly so. :)
(Note that Sun's VM takes a little while to 'warm up'; the first few items in the database are also likely to be non-representative, and the average times during the first 10k pages don't include any time spent merging updates to disk.)
Updated live server figures
Extracted from daemon logs:
The big reductions in service hit time are IIRC from adding a couple more servers alongside the first.
I've got the threading fixed up in the daemon on the C# version, and I'm now experimentally running the C#/Mono daemon on one of the three search boxes. Here are some comparison graphs with a per-server breakdown for a few hours before (all Java/GCJ) and after (with the first one on C#/Mono and Maurus and Avicenna still on Java/GCJ):
I apologize for any incoherence in the graph labeling. ;)
The hits per second figures are pretty solid between the three servers because search requests are dispatched among the active servers randomly, so as long as they're not horribly broken they will tend to be about the same.
Service time is the elapsed milliseconds between starting the Lucene search and getting a result set back, and does not include traversing the returned set and outputting to the network, or network setup and protocol overhead. The various machines are slightly different, so small differences in times are not surprising. On these two graphs Vincent (the server I'm fiddling with) shows up generally between the other two in service time. The times seem broadly consistent from the Java to the Mono version, certainly within acceptable limits.
The one interesting difference is that I don't see any sign of the memory leak I had with the GCJ version (see above somewhere) on the Mono version... I've had to regularly restart the GCJ-based server to keep memory usage from growing indefinitely and sending the machine into swapping. This may be due to different GC handling in the runtimes, or just to the term index enumerator backport that Ben recommended to reduce the amount of unnecessary allocations. I'll have to test some more on this and see if Lucene 1.9rc behaves better on GCJ.
I'm replacing the command-line bulk updater with an updater daemon; MediaWiki sends update and delete requests to it over XML-RPC, and it churns away in the background.
There are a few key reasons for this:
- Under MediaWiki 1.5 the database storage is more complicated; current article text might be compressed or stored via an indirection object in another table. It's easier to have the PHP scripts send text to a daemon than to reimplement all that.
- Periodic runs through the bulk database kinda suck. Since Lucene makes us do wacky things with deletions and mass merges anyway, it's not much of a benefit over queuing updates.
- Handling deletions after-the-fact is tricky. Running from an extension we can see them happen in the deletion hook.
The daemon can be told to close out the indexes and flush all updates; it'll continue to queue updates and start applying them again when it's told to start up again.
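That close-and-keep-queueing behavior amounts to a pausable queue; schematically something like this (the string queue and the apply step are simplified stand-ins for the real index updates):

```java
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the updater daemon's core: update requests queue up on one
// side (fed by the XML-RPC handlers in the real code) and a background
// worker applies them. While paused, the queue keeps accepting work
// but nothing is applied until resume() is called.
class UpdateDaemon {
    private final LinkedBlockingQueue<String> queue =
        new LinkedBlockingQueue<String>();
    private volatile boolean paused = false;
    int applied = 0;

    void enqueue(String update) { queue.add(update); }

    void pause()  { paused = true;  /* really: flush and close indexes */ }
    void resume() { paused = false; }

    // Apply whatever is currently queued; the real worker loops forever.
    void drain() {
        String update;
        while (!paused && (update = queue.poll()) != null) {
            applied++;              // really: IndexWriter add/delete
        }
    }
}
```

The pause/resume pair is what lets the indexes be closed out and copied off while MediaWiki keeps sending updates.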
A couple things it does _not_ do yet:
- Updated indexes are not automatically copied to readers. An rsync or something will have to get run.
- The update daemon could probably just be folded into the search daemon, no need for separate ones. Maybe.
I did run into a Mono remoting bug when setting this up. I'm told the fix has been committed to SVN, but to run this successfully on 1.1.8 you'll need to apply the patch on that bug.
At the moment I'm running the daemon and the update hooks are sending updates to it. Looks like it may just work. Will do a bulk-update later today/tomorrow to merge old pages, then copy the indexes out.
-- brion 7 July 2005 13:37 (UTC)