Ugly MySQL hacks
MySQL doesn't do everything we would like.
Text sort order seems to be per-server, and we need at least per table. Additionally, we've got fields set to binary so who knows... and UTF-8 isn't supported in 4.0.
It's planned to add a separate sort key field to tables that need it, generating a munged key that can be indexed and sorted on (according to more or less standard algorithms, see some Unicode consortium report). This hasn't been done yet, though.
Check out MySQL 4.1 there UTF-8 and sorting per colum is supported.
We end up munging all the search fields to get usable results:
- underscores replaced with spaces for title
- (wouldn't it be easier if we only used the underscores for urls, and the spaces everywhere else? sigh...)
- HTML and other markup stripped from text
- partial words duplicated: [[super]]califragilistic -> super supercalifragilistic
Additionally, some temporary performance hacks for the English wikipedia include duplicating the cur_namespace and cur_is_redirect fields into the searchindex table, avoiding a need to join to the cur table (and letting that standalone table sit on another server).
It's been sometimes suggested that we use another search engine.
UTF-8 is not supported in MySQL through 4.0.x. To use UTF-8 text in the fulltext search engine, we have to do a bunch of additional crap on text that's about to be inserted into the index or used to match:
- fold case
- replace all high bytes with hex sequences that will be thought of as part of words
- additionally for Japanese and Chinese we have to insert spaces to try to fake interword spacing.
With MySQL 5 one may alleviate many charset-related problems by creating the tables with 'TYPE=InnoDB, DEFAULT CHARSET=utf8 COLLATE utf8_general_ci;'