User:CKoerner (WMF)/New Japanese language analyzer

Help needed with new search analyzer

Hello,

The Search team at the Wikimedia Foundation is looking for Japanese speakers who can help us understand how a new Japanese language analyzer affects search results.

The purpose of the analyzer is to break text up into words, and to index related forms of words together, so that searching for one finds the rest.

Currently, if you search Japanese Wikipedia you get matches on "bigrams", which are sequences of two characters. For example, the phrase "ガラティア語" is currently broken into bigrams as "ガラ", "ラテ", "ティ", "ィア", and "ア語". The new analyzer divides it into just two words, "ガラティア" and "語".
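To make the bigram splitting concrete, here is a minimal sketch in Python. It only illustrates the idea of overlapping two-character sequences. The function name `bigrams` is made up for this example, and this is not the code the search engine actually runs.

```python
def bigrams(text):
    """Return every overlapping two-character sequence in text.

    Hypothetical helper for illustration only; the real search
    index is built by the search engine's own analyzer.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The example phrase from above, split into bigrams.
print(bigrams("ガラティア語"))
```

Because every neighbouring pair of characters is indexed, a query can match a pair that straddles a word boundary, such as "ア語". Splitting the text into words is meant to avoid matches of that kind.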

With the new analyzer, you would also get matches on other forms of a word. For example, "押さえ込ま", "押さえ込み", "押さえ込む", and "押さえ込ん" will all match each other.

Of course, it isn't perfect. It doesn't always divide words correctly, it misses some matches, and it makes some matches it shouldn't. The hope is that the overall effect is much more positive than negative.

In WMF Labs we have a copy of the Japanese Wikipedia index. You can search, and it shows snippets of results, but the full articles are not there.

It would be great if you could try it out. Please run some queries in Labs and see what you think, or, if you'd like, run them both in Labs and on the regular Japanese Wikipedia to compare.

Any thoughts—including concerns and complaints, of course—would be much appreciated!