Community Tech/Numerical sorting in categories/Notes

From Meta, a Wikimedia project coordination wiki

Prelim analysis meeting, Dec 15th[edit]

This would need to be implemented on top of the existing function? What kind of database changes would be made?

Probably no DB changes, we can do sorting at the last minute before output. But pagination is a problem, could be gaps/multiples.

Parsing the string, extracting the number -- if the string has more than one number?

Part of this will be to figure out all the important edge cases. All sorting problems have been figured out in computer science.

Also: Differences in numbers cross-language? 2.000 vs 2,000 -- also RTL and non-Arabic numerals.

Why has this been hanging around since 2006? Discussion on Phab thread talks about DB schema -- adding a sortkey column in category links column. That uses the manual (DEFAULTSORT) column.

We need to change how that's generated? Would be easier than changing DB schema. That would require re-sorting -- maintenance scripts to regenerate fields and tables. Not something you do lightly, but possible. Or we could just process as pages are resaved.

There is a super-hacky way -- saving an article, if there's not a manually created sort, generate one that pads the numbers. Not a good idea.

Message from Bartosz, Dec 22[edit]

I worked on category collations some time ago as a volunteer (before I got hired by the WMF), specifically phab:T45799, in order to get the categories to sort right on my home wiki – Polish Wikipedia :D.

MediaWiki already supports different methods of ordering category entries (category collations) and I'm pretty sure implementing this would not require any database changes or major refactorings. The primary purpose of this system is to make it possible to correctly order categories on wikis in various languages that use special characters, but it could just as well be used for this. See mw:Manual:Collation.php (heh, no documentation)

(I don't think this existed back in 2006, which explains the references to database schema changes in the tasks. The changes already happened years ago :) Maintenance script to rebuild the ordering exists too, called updateCollation.php.)

For example, Polish has a letter 'Ą' ('ą') that should be sorted after 'A', before 'B'. In the default collation, all such special characters are sorted at the end (after 'Z'). Note that there's no universal ordering that works for all languages. For example, the same letter 'Ä' should be sorted after 'A' in German, but after 'Z' in Swedish. See Swedish alphabet and German alphabet.

(Aside: MediaWiki doesn't support having more than one collation on a single wiki at the same time. This is also a long-wanted feature, e.g. for Wiktionaries (to have different sorting in categories listing words in different languages) or Chinese-language projects (which has multiple possible collations, all equally correct, but used for different purposes). See Chinese characters#Indexing.

This is implemented by having an algorithm that converts the page title (or the 'defaultsort', if any is specified) to a "sortkey", such that if page A should be displayed before page B, the sortkey for A will compare smaller than B (sortkeys are just binary strings, and can be compared efficiently in database queries). These sortkeys are stored in the database (so no, it's not possible to do the sorting just before displaying the page; as you note, it wouldn't work with pagination).

MediaWiki supports a few collations, in particular 'identity' (where the sortkey is just the page title [As encoded in UTF-8]), 'uppercase' (where the sortkey is an uppercase version of page title ;) effectively making category sorting case-insensitive – this is the default) and several 'uca' collations, which use the ICU library to implement the Unicode Collation Algorithm (which mostly works for most languages) and adds "tailorings" to tweak it for different languages (since like I mentioned, there's no universal ordering that works for all languages). See mw:Manual:$wgCategoryCollation.

To implement numeric ordering, you'd just have to implement a new collation. So you're left with just solving the innumerable problems of actually ordering them, some of which you've already been pointed out ;) And it's true that this has already been figured out – the same ICU library we're using to make sortkeys for 'uca' collations apparently has an option to do this. However, it's a C library, and the PHP bindings don't make it possible to turn that option on. The ICU library does expose the numeric sorting order, via $myCollatorObject->setAttribute( Collator::NUMERIC_COLLATION, Collator::ON ); (You can test it at [1]). You would still need a different collation name, so they are named differently in the database.

So I guess you have two options:

a) Implement some logic for this in our own PHP code, as a new collation. This sadly will not work for wikis using the 'uca' collations (as of now, 86 wikis in 26 languages). To make it work with 'uca' collations, you could try to just zero-pad any numbers in the page title (you mentioned this, and it's not actually that bad of an idea) before building the sortkey with ICU, but this limits the available length of the sortkey. (Imagine a series of articles like "List of books released in 1999 in Poland", for each year and each country – if the year was padded with too many zeroes, the country's name would not be included in the sortkey and ignored when sorting.)

b) Expose the numeric sorting option of ICU in the PHP bindings, submit that as a patch for PHP (to be exact, the intl extension -- book.intl.php), and get it deployed on our cluster! :D This would actually be really wonderful and probably not that difficult to actually implement. But it would either require waiting until we upgrade to whichever PHP/HHVM version would include your patch (could take years), or backporting it to our version (ops people could be difficult to convince this is a good idea). It wouldn't work for wikis *not* using 'uca' collations, but on the other hand, every wiki should be using those. ;) Use the ICU numeric option.

There's one more issue to consider: category pages are divided by headings for each letter. Right now each number gets its own heading too, which would no longer work with this implemented (since the order would be wrong). You could just put them all under a heading like "Numeric", but maybe there's something smarter that can be done.

Dev Summit notes, Jan 5[edit]

Talked to Brian (Bawolff). We could use ICU sorting. 15 wikis are using it (en.wp not one of them). The conversion script is inefficient, especially for big wikis, so there isn't much appetite. Polish is the largest so far.

Find out why people don't want to use ICU -- Wiktionary doesn't like something about it. En.wp might object to breaking Greek letters or stars -- we could use a bot to fix this?

Brian says that the conversion script has an inefficient query that might have to get fixed. It's possibly fixable, but nobody's looked into it. On large wikis, it'll take two to three days to run, and during that time, category listings might be unstable. Once it's finished running, it should be okay. Talk to Ops about how to handle it.

TCB meeting, Jan 13[edit]

This is also on TCB's radar -- they have a top wish about sorting umlauts correctly. (TCB notes, in German.) Our work with the ICU library should also fix umlauts. We'll be working together on this.

Meeting, Feb 19[edit]

Core question right now: What's wrong with the conversion script to implement the ICU sorting?

NL.wp used the script in early February, and it took 3 hours for a wiki with 10 million categorylinks entries. EN.wp has about 100 million rows, and this will probably be non-linear -- it won't take just 10x as long.

One possibility is to add another index to the database table; it's something that we have to talk to Ops about. Jaime (jcrespo) is open to a new index that improves performance, but we have to do more testing to see if it's feasible. We could run the changes on a duplicate table and then swap it in for the old one.

We'll test the updated conversation script on a beta site, although those are small and won't give us scale. One option is to do a dry run where we get a log of actions that would be done (without actually making the changes).

There's ongoing tech conversation on T58041.

Talk page, Feb 29[edit]

MatmaRex, re umlauts: "This should be entirely possible today (without any further development work), by setting $wgCategoryCollation to uca-de."