Community Wishlist Survey 2020/Wiktionary/Multiple collations per site

Multiple collations per site

  • Problem: Wiktionary projects very commonly display entries for multiple languages on the same page, but only one collation can be used on a given Wikimedia project. If a site uses a language-aware collation, e.g. uca-default, which is an English- and Portuguese-friendly collation, then all categories for, say, Swedish words will sort words starting with Å under A, because English treats Å as the letter A with a diacritic, whereas in Swedish it is a separate letter sorted near the end of the alphabet. Category headers are therefore incorrect for many languages under the current setup on Wiktionary projects.
    Currently, one way around the problem is to use the default MediaWiki collation (namely uppercase). Under that collation, however, every letter with a diacritic (Å, É, etc.) gets its own first-letter header in categories, so all English, French, etc. entries with a diacritic in the title need an explicit sort key to bypass this behavior (and thus sort Å under A for, e.g., English). The result is a huge number of sort keys in pages, which makes Wiktionary projects less readable and editable for newcomers.
  • Who would benefit: users of Wiktionary categories, and new editors to all Wiktionary projects
  • Proposed solution: allow multiple collations per site, so that a collation can be specified per category: uca-sv should be used for Swedish-related categories, uca-es for Spanish categories, uca-default for English (and similar), etc.
  • More comments: Liangent and Bawolff have worked on this in the past, but feasibility also seems to depend on sysadmins (because of increased system load).
  • Phabricator tickets: phab:T30397
  • Proposer: Automatik (talk) 21:58, 23 October 2019 (UTC)
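
The difference between one site-wide collation and a per-category collation can be sketched as follows. This is only an illustration: the two collation functions are toy stand-ins (diacritic folding for an English-like order, a hard-coded alphabet for a Swedish-like order) and not the real uca-default or uca-sv tailorings, and the category names and the `CATEGORY_COLLATION` table are hypothetical. Titles are assumed to be non-empty.

```python
import unicodedata

# Toy stand-ins for real collations; a real system would rely on ICU/UCA tailorings.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {letter: rank for rank, letter in enumerate(SWEDISH_ALPHABET)}


def fold_diacritics(text):
    """Lowercase the text and drop combining marks, so 'Å' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def english_like(title):
    """Return (sort key, header) with diacritics ignored: Å files under A."""
    folded = fold_diacritics(title)
    return (folded, title), folded[0].upper()


def swedish_like(title):
    """Return (sort key, header) with å, ä, ö as separate letters after z."""
    text = unicodedata.normalize("NFC", title.lower())
    key = [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in text]
    return key, text[0].upper()


# Hypothetical per-category setting: exactly one collation per category.
CATEGORY_COLLATION = {
    "Category:English nouns": english_like,
    "Category:Swedish nouns": swedish_like,
}


def list_category(category, titles):
    """Return (header, title) pairs in the order the category would show them."""
    collate = CATEGORY_COLLATION[category]
    ordered = sorted(titles, key=lambda title: collate(title)[0])
    return [(collate(title)[1], title) for title in ordered]
```

With the titles apa, zebra, år, ärm and öl, the English-like category lists år and ärm under the header A and öl under O, while the Swedish-like category places them after zebra under their own headers Å, Ä and Ö.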

Discussion

  • This proposal is a rerun of the 2019 proposal, and is still topical. Automatik (talk) 21:58, 23 October 2019 (UTC)
  • It's not up to me to decide (so this is not official in any way, shape, or form) but, in my opinion, I don't think there are scalability concerns with allowing collations to be set on a per-category basis, provided any individual category has only one collation (e.g. there is a magic word to say that this category is French or German or whatever. You can only specify one; you don't, for example, have a drop-down where you can view a category with different collations on the fly (as is wanted in zh)). Bawolff (talk) 04:24, 24 October 2019 (UTC)
  • This should be merged with Community Wishlist Survey 2020/Wiktionary/Context-dependent sort key. Urhixidur (talk) 14:13, 25 October 2019 (UTC)
  • This feature is sorely needed. Currently in the English Wiktionary we use a sort_key value for each language in our language data modules that describes how to generate sortkeys from page titles, and is used by the makeSortKey method of our Language objects. The sortkeys are generated inside many different templates, and are used in category links, and to sort lists of links to entries (for instance in Template:col3). The generated sortkeys are not always able to make categories sort correctly, as described in the proposal.

    An extension of this proposal would be allowing definition of custom collations. Some languages probably do not have a collation system (not sure of the correct terminology) available, such as Egyptian (which in Wiktionary mostly uses a transliteration system rather than hieroglyphs). The desired sort order for the Egyptian transliterations (ꜣ j y ꜥ w b p f m n r h ḥ ḫ ẖ z s š q k g t ṯ d ḏ) is so different from the order of code point values (b d f g h j k m n p q r s t w y z š ḏ ḥ ḫ ṯ ẖ ꜣ ꜥ), which is presumably used in Category:Egyptian lemmas, that a custom sortkey cannot work. We can sort lists of links by generating a sortkey for each link with a module (Module:egy-utilities), but the Egyptian module cannot be used in categories because the sortkeys would put nonsensical code points in the category headers. (The sortkey-generating function works by replacing the characters in the transliteration with arbitrary code points that have the correct sort order.) So getting Egyptian categories to sort correctly requires a custom collation system.

    Another idea would be to make collation available in a Lua library for Scribunto. At minimum what would be required is a function to compare two strings using a collation and yield values indicating "greater than", "less than", or "equal" (like strcmp in C), which could be adapted for use by table.sort (which requires a function returning a truthy value that indicates whether argument 1 is less than argument 2). Then we could sort lists of links using the same collations used in categories whenever possible, rather than using module-generated sortkeys. This might not depend on the implementation of the "multiple collations" proposal, but if custom collations were implemented, ideally they would be available in the Lua library.

    I haven't submitted the first idea as a separate proposal because it depends on the "multiple collations" proposal, but perhaps I should submit the second one. Erutuon (talk) 20:53, 7 November 2019 (UTC)
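
The two ideas in the comment above, a custom collation for Egyptian transliteration and a strcmp-style comparison function adapted for sorting, can be sketched together. This is a minimal illustration in Python rather than Scribunto Lua, and it is not how Module:egy-utilities works: the function names are hypothetical, the letter order is copied from the comment, and characters outside that order are assumed to sort after all known letters by code point.

```python
import unicodedata
from functools import cmp_to_key

# Desired order of the Egyptian transliteration letters, as listed in the comment.
EGYPTIAN_ORDER = [
    unicodedata.normalize("NFC", letter)
    for letter in "ꜣ j y ꜥ w b p f m n r h ḥ ḫ ẖ z s š q k g t ṯ d ḏ".split()
]
RANK = {letter: rank for rank, letter in enumerate(EGYPTIAN_ORDER)}


def ranks(text):
    """Map each character to a rank in the custom order.

    Assumption: unknown characters sort after all known letters, by code point.
    """
    text = unicodedata.normalize("NFC", text)
    return [(RANK[ch], 0) if ch in RANK else (len(RANK), ord(ch)) for ch in text]


def compare(a, b):
    """Three-way comparison like C's strcmp: negative, zero, or positive."""
    key_a, key_b = ranks(a), ranks(b)
    return (key_a > key_b) - (key_a < key_b)


def is_less(a, b):
    """Boolean form, the shape a Lua table.sort comparator would need."""
    return compare(a, b) < 0


def sort_titles(titles):
    """Sort titles with the custom collation instead of code point order."""
    return sorted(titles, key=cmp_to_key(compare))
```

For the sample strings nfr, ḥr, pr and ꜣḫt, `sort_titles` returns ꜣḫt, pr, nfr, ḥr, whereas plain code point order would put ꜣḫt last. Because the comparison works on ranks rather than on substituted characters, the first letter of each title stays available for a category header, which is the property the sortkey-substitution approach lacks.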

Voting