Wikimedia Blog/Drafts/Wikimedia moving to Elasticsearch
This was a draft for a blog post that has since been published at https://blog.wikimedia.org/2014/01/06/wikimedia-moving-to-elasticsearch/
Title
Wikimedia moving to Elasticsearch
Body
We're in the process of rolling out new search infrastructure to all Wikimedia wikis, so it's a good time to explain what's coming in the very immediate future, why we're changing it, and how you can get involved.
First, a bit of background. All Wikimedia sites have used a home-grown search system based on Apache Lucene since 2005 or 2006. Written primarily by volunteer Robert Stojnić, it is called lucene-search-2. It is a fantastic search engine that has powered the sites and scaled well for the past eight years or so. Early in 2013, however, it became a source of significant operational problems. In the short term we were able to patch the most glaring issues, but with Robert no longer around and the system showing its age, it became increasingly apparent that a replacement was needed.
We're very happy with Lucene, but we wanted to get out of the business of maintaining a special-purpose open-source search system when two very good general-purpose open-source search systems are available: Solr and Elasticsearch. Both are built on Lucene and scale horizontally for both data and query volume. After experimenting with both and implementing basic MediaWiki integration for each, we settled on Elasticsearch for the following reasons:
- Elasticsearch's reference manual and contribution documentation promised an easy start and a pleasant time getting changes upstream when we need to.
- Elasticsearch's highly expressive search API lets us search any way we need to and gives us confidence that we can expand on it. It also makes it easy to write expressive ad-hoc queries when we need them.
- Elasticsearch's index maintenance API lets us maintain the index right from our MediaWiki extension, so it's easier for us to deploy and test, and should be easier for MediaWiki users outside Wikimedia to use. At the time of the choice, Solr's schema API was read-only.
- Rack awareness, automatic shard rebalancing, statistics exposed over HTTP, preference for JSON and YML over XML, and first-party Debian packages were also nice.
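To give a flavor of the expressive search API mentioned above, here is a minimal sketch of an Elasticsearch Query DSL request body built in Python. The field names ("title", "text"), the boost value, and the index name in the comment are illustrative assumptions, not CirrusSearch's actual mapping.

```python
import json

# A hedged sketch of an Elasticsearch Query DSL body: a full-text
# search across two fields, weighting title matches more heavily.
# Field names and the boost factor are assumptions for illustration.
query = {
    "query": {
        "multi_match": {
            "query": "winged cats",
            "fields": ["title^3", "text"],  # "^3" boosts title matches 3x
        }
    },
    "size": 10,  # return at most ten hits
}

# The body is plain JSON sent over HTTP, e.g. POST /some_index/_search.
body = json.dumps(query)
print(body)
```

Because every query is just a JSON document like this, trying out a new ranking idea is a matter of editing a dictionary rather than writing Java against Lucene directly.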
To provide the integration to MediaWiki, we've written a new extension called CirrusSearch, which we've designed to be mostly backwards-compatible with the current search, with the following exceptions:
- Templates are expanded before indexing, so text that comes from templates is searchable, but the wikitext inside template invocations no longer is.
- Page updates are reflected in search results pretty quickly after they are made, usually within seconds for single page edits.
- Wiki communities can mark some pages as higher or lower quality and it will be reflected in the search results.
- A few new "expert" options have been added (intitle: is negate-able, prefer-recent:, etc.).
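As a sketch of how the new "expert" options above might be used, the snippet below builds a MediaWiki search API request whose query string combines a negated intitle: clause with prefer-recent:. The wiki URL is just an example, and the exact keyword semantics are an assumption here; the authoritative reference is the CirrusSearch documentation on mediawiki.org.

```python
from urllib.parse import urlencode

# A hedged sketch: a MediaWiki API search request using the new
# "expert" keywords. "-intitle:" (negated) and a bare "prefer-recent:"
# are assumptions based on the feature list; check the on-wiki docs
# for exact semantics.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "-intitle:disambiguation prefer-recent: winged cats",
    "format": "json",
}

# Example endpoint; any MediaWiki wiki running CirrusSearch would do.
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```

The keywords travel inside the ordinary search string, so they work anywhere regular search does, including the on-wiki search box.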
We've documented all of these features and more on mediawiki.org. The page is released into the public domain, so people should feel free to copy it to their wikis as a basis for documentation.
We plan for this replacement search to be a Beta Feature for all wikis by the end of February and the primary search in March or April. See our ever-evolving timeline for ever-evolving specifics.
We've got a lot of exciting things on the horizon now that we've got a modern and stable search for Wikimedia. We're talking Wikidata, Commons metadata, faceting, real cross-wiki searching, etc. Please get involved by filing bugs, talking to us on the project page, or by finding us on IRC and pinging us there. On IRC, you can find us as ^d and manybubbles.
Chad Horohoe and Nik Everett, Wikimedia Foundation