Use another search engine

From Meta, a Wikimedia project coordination wiki
(Redirected from Search Engine)

Because our use of MySQL's search engine is kind of hackish, and there have been performance problems with it, it has sometimes been suggested that we use an alternative search engine. Some possibilities include:

Google

A few times in the past, the internal search has been disabled entirely and we've pointed people at the wonder that is Google (http://google.com/).

Pro:

  • It can do full-text searches of pages on a given domain (i.e., ours)
  • It's fast
  • It doesn't use any of our server power (apart from occasionally spidering the site, which is fairly well behaved and which it does anyway)
  • Handling of non-ASCII characters mostly works

Con:

  • We have no control over its workings
  • The index only updates monthly or so
  • Won't distinguish namespaces; articles may not have priority over, e.g., talk pages
  • Searches web pages, not wiki pages; tends to put a lot of interface gunk into the summaries
  • Takes users out of our interface
  • Search results are censored (this doesn't only apply to Google)
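
Pointing users at Google for a domain-restricted search comes down to building a query URL. A minimal sketch, assuming Google's `site:` operator and using `meta.wikimedia.org` purely as an example domain (the function name is made up for illustration):

```python
from urllib.parse import urlencode


def site_search_url(terms: str, domain: str) -> str:
    """Build a Google query URL limited to one domain via the site: operator."""
    query = f"{terms} site:{domain}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Example domain only; substitute the wiki's own host name.
print(site_search_url("ugly mysql hacks", "meta.wikimedia.org"))
# https://www.google.com/search?q=ugly+mysql+hacks+site%3Ameta.wikimedia.org
```

Note that this only redirects the user; every drawback in the list above (stale index, no namespaces, leaving our interface) still applies.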

ht://Dig

ht://Dig is a web indexing and searching system for a website or set of websites.

Pro:

  • Proven technology
  • Can be configured to search per namespace
  • Open source (GPL)
  • Results can be presented within our user interface

Con:

  • It needs to periodically spider and index the entire site(s). This puts a load on the servers and the index will lag behind editing work.
  • No UTF-8 support yet.
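
Per-namespace searching, and giving articles priority over talk pages, can be approximated by looking at the title prefix. This is only a rough sketch and says nothing about how ht://Dig itself is configured; the namespace set and both function names are hypothetical.

```python
# Assumption: an illustrative set of namespace prefixes; a real wiki defines its own.
NAMESPACES = {"Talk", "User", "User talk", "Help"}


def split_title(title):
    """Return (namespace, page name); "" stands for the main (article) namespace."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in NAMESPACES:
        return prefix, rest
    return "", title


def order_results(titles, wanted=None):
    """Keep only the `wanted` namespaces (if given) and list articles first."""
    rows = [(split_title(t)[0], t) for t in titles]
    if wanted is not None:
        rows = [r for r in rows if r[0] in wanted]
    # Sort key: main namespace first, then alphabetically by full title.
    return [t for ns, t in sorted(rows, key=lambda r: (r[0] != "", r[1]))]


hits = ["Talk:Search", "Search", "User:archivist", "ht://Dig"]
print(order_results(hits))
# ['Search', 'ht://Dig', 'Talk:Search', 'User:archivist']
print(order_results(hits, wanted={""}))
# ['Search', 'ht://Dig']
```

A title such as "ht://Dig" contains a colon but no known namespace prefix, so it stays in the main namespace.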

Jakarta Lucene

"Jakarta Lucene is a high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform."

Pro:

  • Since we would run it and it's open source, we can tweak the indexing scheme to our liking and present results within our user interface
  • ?

Con:

  • Need to run a Java VM, which may eat up extra memory
  • ?

Sphider

"Sphider a lightweight search engine in PHP"

Pro:

  • Free
  • Simple install
  • Fast full-text indexing with Google-like results
  • Possibility to exclude common words from being indexed

Con:

  • Uses some CPU time
  • Uses MySQL to store indexes

OmniFind Yahoo Edition

http://omnifind.ibm.yahoo.net/

Pro:

  • Free
  • Simple install
  • Full site indexing
  • Attachment/file indexing
  • Fully customizable search interface
  • Fast full-text results, with caching available

Con:

  • Hostile to shared hosts: CPU-intensive
  • Based on WebSphere: memory-intensive
  • Requires tweaking to not recursively index the same pages over and over

?

moved from Search Engine:

Search engine

---Ideas---

The MySQL full text search is often switched off due to performance issues.

In its default configuration it also uses a stopword list, which can make for poor search results: for example, try to find R Smith amongst the Smiths.
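
The R Smith problem can be reproduced with a few lines of filtering. A minimal sketch, not MySQL's actual code: the stopword set is a tiny made-up sample, and the minimum word length of 4 is an assumption for illustration (a short-word cutoff like this has the same effect as a stop list on single initials).

```python
import re

# Assumptions for illustration only: a tiny sample stopword list and a
# minimum indexed word length of 4. Neither is taken from the page above.
STOPWORDS = {"the", "and", "with", "from", "that"}
MIN_WORD_LEN = 4


def indexable_terms(text: str) -> list[str]:
    """Return the words that survive stopword and short-word filtering."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) >= MIN_WORD_LEN and w not in STOPWORDS]


print(indexable_terms("R Smith"))  # ['smith']
print(indexable_terms("J Smith"))  # ['smith']
```

Once the initial is dropped, both queries reduce to the same single term, so the search cannot tell one Smith from another.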


---Possible solutions---

  • The search engine should run on a separate database server so that it does not impact the main servers (load spreading).
  • Create the search index ourselves (optimising the search terms and data tables).
  • The search index can be updated at edit save time (slows responsiveness).
  • Alternatively, the index can be updated in the background (delays the index, but keeps saves responsive and gives the casual browser the best performance).
  • Another available time slot is when a user spell-checks an article (medium impact).
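
The trade-off between indexing at save time and indexing in the background can be sketched with a toy inverted index. Everything here is hypothetical (class and method names included): it runs in memory within one process, whereas the proposal above is a separate server.

```python
import queue
import re
import threading
from collections import defaultdict


class SearchIndex:
    """Toy in-memory inverted index: word -> set of page titles."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._lock = threading.Lock()
        self._pending = queue.Queue()

    def index_now(self, title, text):
        """Update at save time: the edit waits until indexing finishes."""
        words = set(re.findall(r"\w+", text.lower()))
        with self._lock:
            # Drop the page's old postings, then add the new ones.
            for titles in self._postings.values():
                titles.discard(title)
            for word in words:
                self._postings[word].add(title)

    def index_later(self, title, text):
        """Queue the update: the save returns at once, the index lags behind."""
        self._pending.put((title, text))

    def run_worker(self):
        """Background loop that drains the queue; stops on a None job."""
        while True:
            job = self._pending.get()
            if job is None:
                break
            self.index_now(*job)

    def stop(self):
        """Ask the worker to exit after the jobs already queued."""
        self._pending.put(None)

    def search(self, word):
        with self._lock:
            return sorted(self._postings.get(word.lower(), ()))


index = SearchIndex()
worker = threading.Thread(target=index.run_worker)
worker.start()
index.index_now("Main Page", "Welcome to the wiki")      # blocks the save
index.index_later("Sandbox", "Test edits for the wiki")  # returns immediately
index.stop()
worker.join()
print(index.search("wiki"))  # ['Main Page', 'Sandbox']
```

Until the worker has processed a queued page, searches simply miss the latest edit, which is the "delays the index" cost noted in the list.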

A search engine written from scratch could be optimised for our requirements and could therefore be faster.

See also User:archivist

None of the lists of full-text search options that I have seen mentions Windows Desktop Search. Admittedly it is Windows-based, but it has a .NET interface as well as a COM interface, so it is definitely something that could be looked into. Other advantages:

  • It has a GUI for administration, and a lot of parameters can also be changed from the console.
  • It can throttle its activity up or down as needed.
  • It is fully Unicode-compatible.
  • It supports various file formats, including HTML, RTF, and plain text, and has add-ons for file types such as PDF.
  • It provides an interface for writing a filter for your own file type.
  • It supports 23 languages.

It sounds like a good option to me.