Talk:Community Tech/Numerical sorting in categories

From Meta, a Wikimedia project coordination wiki

Sorting umlauts correctly

@DannyH (WMF): Regarding:

There is also a request from the German community to sort umlauts correctly -- it was one of the top wishes on TCB's wishlist survey. (survey entry)

This should be entirely possible today (without any further development work), by setting $wgCategoryCollation to uca-de. The only problem could be the updateCollation script taking too long, but dewiki categorylinks table has only ~10M rows too. Matma Rex (talk) 22:45, 28 February 2016 (UTC)

Oh, thanks for the info. Brian and Jaime are currently working on speeding up the updateCollation script (or talking about it, at least). I've added your note to our investigation ticket (T120854), so we'll have it when the script is fast enough to run. Thanks! -- DannyH (WMF) (talk) 22:27, 29 February 2016 (UTC)
@Matma Rex: I created a new ticket for this: task T128806. Should be easy enough to implement once task T58041 is resolved, which should be fairly soon. If T58041 gets stalled, we can probably just switch it anyway (as you suggest), but I would prefer to wait for T58041. Kaldari (talk) 23:11, 3 March 2016 (UTC)

Auto-sorting numbers as page titles

User:Wikid77 here. As a long-term editor on English Wikipedia, it would be nice to have automatic numerical sorting of numbers as page titles, in the midst of alphanumeric titles sorted alphabetically. This split, as a 2-minded sort, could work in a category by introducing number-groups before the typical first-character groups. For numeric title groups, all 1-digit numbers would fall under group "1s" and 2-digit under "10s" but 3-digit would split under "100s" or "200s" or "300s" (etc.) as groups up to 100 each, then 4-digit years as 1000s or 1100s or 1200s etc. For example, consider the mixed list of titles: 9, BBC News, #Hasher, 14, 101, 141, 19, 123/ABC, 1-methylpropyl, 282, 2 Fast 2 See, 2's complement, 23, 3, 8, 7, BBB, 66, 1492, 1497, 1922, 292, 1984, 84, 184, 93, 1993. The auto-sorting with numeric titles could split a category as:

  • 1s   - 3, 7, 8, 9
  • 10s - 14, 19, 23, 66, 84, 93
  • 100s - 101, 141, 184
  • 200s - 282, 292
  • 1400s - 1492, 1497
  • 1900s - 1922, 1984, 1993
  • # - #Hasher
  • 1 - 123/ABC, 1-methylpropyl
  • 2 - 2 Fast 2 See, 2's complement
  • B - BBB, BBC News

The result would put all integer numbers into numeric order (no longer "101, 14, 140, 184, 19, 201, 22" etc.). Beyond 9999, numeric titles could be treated as alpha, such as "32767" sorting under group "3" (or not?), but page "3276" would sort under the numeric group 3200s. I think that division would be ideal, even if it took a year to approve and implement. Currently, modern year titles are sorted under "1" for 1900s or under "2" for 2000s, which can seem strange to newcomers. -Wikid77 (talk) 20:32/20:39, 2 March 2016 (UTC)
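
The grouping proposed above can be sketched roughly as follows. This is only an illustration of the idea, not how MediaWiki groups category members. The names `heading_for`, `sort_key` and `group_titles` are made up for the example, and the non-numeric titles are ordered by plain code point rather than by a real collation, so the order inside the "1" group may differ from the list above.

```python
import re

# Titles made only of 1 to 4 ASCII digits count as "numeric" titles here.
NUMERIC_TITLE = re.compile(r"[0-9]{1,4}")


def heading_for(title):
    """Return the group heading for a page title (hypothetical helper)."""
    if NUMERIC_TITLE.fullmatch(title):
        n = int(title)
        if n < 10:
            return "1s"
        if n < 100:
            return "10s"
        # 3- and 4-digit numbers are grouped by hundreds: 100s, 200s, 1400s ...
        return f"{n // 100 * 100}s"
    # Everything else falls under its first character, as today.
    return title[0].upper()


def sort_key(title):
    """Numeric titles first, by value; all other titles after, by code point."""
    if NUMERIC_TITLE.fullmatch(title):
        return (0, int(title), "")
    return (1, 0, title)


def group_titles(titles):
    """Return a dict mapping heading -> titles, in display order."""
    groups = {}
    for title in sorted(titles, key=sort_key):
        groups.setdefault(heading_for(title), []).append(title)
    return groups


if __name__ == "__main__":
    sample = ["9", "BBC News", "#Hasher", "14", "101", "1492", "BBB", "3"]
    for heading, members in group_titles(sample).items():
        print(heading, "-", ", ".join(members))
```

With this rule, "32767" has five digits and falls under "3", while "3276" falls under "3200s", matching the division suggested above.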

@Wikid77: You can test out the library that we're planning to use for numerical sorting here: ICU collation demo. Under Settings on that page, set numeric to "on", and then you can test out the sorting with lists of page titles in the open text box. That will fix the "101, 14, 140, 184, 19, 201, 22" problem. The conversion script to make that happen is currently being reviewed to make sure it won't break anything, and when that's done, we'll be able to fix the sorting.
I'm not sure about the headings -- it didn't occur to me to ask, so I'm glad that you brought it up. I'm sure it won't do "1", "2", "3" as headings, because that would be pointless -- only one item under each heading. :) I'll see if I can find out what the headings will be. -- DannyH (WMF) (talk) 21:03, 2 March 2016 (UTC)

How to sort out effectively

11 will now come after 2, and 99 will come before 101. This is a welcome change to software which for ages has ordered pages according to the first digit only. But at times there might be a sorting error when different numbering formats are used. 1001, 1 001 and 1,001 are the same number with different punctuation; how will they appear on a list with 999, 1010 and 1 010? To put it simply, how complex will the algorithm have to be to correctly identify the size of numbers and sort them correctly? 49.148.66.44 11:12, 10 May 2016 (UTC)

I think that we should go with the simple method of raw numbers, no splitters; this would be simplest to encode, and would ensure that sequences of numbers could be used without the software merging them together into a single number. עוד מישהו Od Mishehu 17:35, 10 May 2016 (UTC)
It appears that the algorithm only treats sequences of digits specially, and breaks the sequence on any space or punctuation. Quoting the Unicode standard [1]: "any sequence of Decimal Digits (…) is sorted at a primary level with its numeric value". You can verify this yourself with the demo linked above (https://ssl.icu-project.org/icu-bin/collation.html; set "numeric" to "on" in top-right corner, input texts to order on the left): "1 001" is sorted after "1" but before "2". Matma Rex (talk) 18:46, 10 May 2016 (UTC)
Is there an example of a category where this would be a problem? I'd be surprised if there's an actual category on a wiki that includes different numbering styles -- both 1 001 and 1,002. -- DannyH (WMF) (talk) 19:04, 10 May 2016 (UTC)
Actually there is, for example, the sorting of US highways, where three-digit highway numbers are considered spur routes of the main highways; there may be similar numbering systems in other countries. But I guess that in such cases we already use DEFAULTSORT or other sorting values anyway, so I think we can ignore the issue. --Matthiasb (talk) 23:15, 28 June 2016 (UTC)
PS: On second thought, though I am no expert in chemistry, aren't numbers used there as well in a way other than just integer "counting"? --Matthiasb (talk) 23:19, 28 June 2016 (UTC)
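
The digit-sequence behaviour quoted from the Unicode standard in this thread can be approximated in a few lines. The sketch below is not ICU and does not reproduce its full ordering; it only illustrates that each run of digits is compared by value and that a space or comma ends the run. The function name `numeric_key` is invented for the example, and placing digit runs ahead of all other text is a simplifying assumption.

```python
import re


def numeric_key(title):
    """Approximate sort key: digit runs compare by value, other text as text."""
    # re.split with a capturing group keeps the digit runs at odd indices.
    parts = re.split(r"([0-9]+)", title)
    key = []
    for index, part in enumerate(parts):
        if not part:
            continue  # skip empty strings produced at the edges
        if index % 2 == 1:
            key.append((0, int(part)))        # a run of digits, by value
        else:
            key.append((1, part.casefold()))  # anything else, as text
    return key


if __name__ == "__main__":
    titles = ["1001", "1 001", "1,001", "999", "1010", "1 010"]
    print(sorted(titles, key=numeric_key))
```

Under this approximation "1 001" is read as the number 1, a space, then the number 1, so it lands after "1" and before "2", and well before "999", which is the behaviour described for the demo.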

Some thoughts from nowiki

If we want to list categories containing birth years at nowiki we would order them either in ascending order

or in descending order

Or we could do it like this in ascending order

or in descending order

The previous are all valid ways to sort years, even if we don't use all of them at nowiki. It is possible to solve the problem by having recognized autosequences, but then the "f.kr."/"e.kr." is in the wrong position. Another way to do this is by using "f.kr."/"e.kr." as an ascending/descending marker for a single autosequence.

This is according to Alfabetisering/sortering : regler for norske bibliotek, Oslo : Norsk bibliotekforening, 1985, ISBN 8299093252, p. 23 [2]

There are other sources too on this, if anyone will argue over its slim size (It is only 26 pages). — Jeblad 23:41, 10 May 2016 (UTC)

Something from @Cwek: on zhwiki

--zh:Wikipedia:互助客栈/技术#Tech News: 2016-39

lit. Regarding this, would it be possible to just make a Magic Word to control it, instead of making a request at Phabricator? --Liuxinyu970226 (talk) 00:06, 2 October 2016 (UTC)

Cwek and Liuxinyu970226 -- I'm sorry, I'm not sure what you mean. This change would affect the entire zhwiki. You can use the Magic Word DEFAULTSORT to change the sorting on a specific page, but it won't fix every numerical sorting problem on the wiki. I don't know if I understood the question correctly -- could you say more about what you mean? -- DannyH (WMF) (talk) 18:42, 7 October 2016 (UTC)

DannyH (WMF): I mean that the sorting style could be set for a specific category only, not for the whole site. So we would assign numerical sorting to some categories, while the others use the default sorting from the site configuration. --Cwek (talk) 00:45, 8 October 2016 (UTC)
That is, use a Magic Word to control the sorting style on the category page. When the category page is edited to change the sorting style, a sorting job would be put on the background queue to re-sort the category items. --Cwek (talk) 00:45, 8 October 2016 (UTC)
@Cwek: What you are asking for (I think) is phabricator:T30397, which has not been implemented (and will not be any time soon). Allowing the same page to be sorted differently in different categories would require a major re-architecting of how category sorting works. I'm sorry we can't support that currently. Ryan Kaldari (WMF) (talk) 01:11, 12 October 2016 (UTC)

Help page

Could a proper help page be written that describes what gets sorted and how? How does the numeric sorting interact with ordinary lexical sorting? And how are multiple numbers handled? Can sorting be tailored to specific categories? — Jeblad 17:57, 14 October 2016 (UTC)