Wiki language ISO 639-1 → BCP 47 proposal

This is a proposal to gradually shift away from the limited ISO 639-1 language codes to the ISO 639-3 standard for all wikis where applicable. The majority of wikis (a large number of which are smaller wikis) already use ISO 639-3 codes because no ISO 639-1 code exists for them. A significant number of languages (particularly dialects of languages) are not represented in ISO 639-1. We should stick to a single standard for the sake of consistency.

This would also resolve situations where a "main" wiki sits at the code of a macrolanguage while a dialect or variant of that language has spun off into a full-fledged wiki of its own. You end up with the main wiki representing only one dialect/variant. The ISO code for no.wikipedia, for instance, actually refers to both 'nno' (Nynorsk) and 'nob' (Bokmål). If a wiki represents only one dialect of a macrolanguage and all other dialects have their own wikis, that remaining dialect shouldn't be represented by the macrolanguage code.

All existing ISO 639-1 codes would redirect to ISO 639-3 codes. So for example links to en.wikipedia would go to eng.wikipedia. This is to ensure backwards compatibility.
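
As a rough illustration of the redirect idea, here is a minimal Python sketch. The table `ISO639_1_TO_3` and the function `redirect_host` are hypothetical names made up for this example; they are not part of MediaWiki or of any Wikimedia configuration, and the table holds only a few well-known code pairs rather than the full list.

```python
# Hypothetical, deliberately tiny mapping of ISO 639-1 codes to ISO 639-3 codes.
ISO639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra"}


def redirect_host(host: str) -> str:
    """Return the host a request would be redirected to under the proposal.

    Hosts whose first label is not a known two-letter code are returned
    unchanged, so wikis already on three-letter codes are unaffected.
    """
    subdomain, sep, rest = host.partition(".")
    target = ISO639_1_TO_3.get(subdomain.lower())
    if not sep or target is None:
        return host
    return f"{target}.{rest}"


print(redirect_host("en.wikipedia.org"))
print(redirect_host("bar.wikipedia.org"))
```

With this table, a two-letter host such as en.wikipedia.org is rewritten to its three-letter form, while a host already on a three-letter code passes through untouched.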

I don't think that it would be a good idea to move big language wikis like enwiki to a new name. Remember, we are one of the most visited sites in the world and such a major change would need a major discussion. -Barras 15:56, 21 August 2011 (UTC)

Indeed. The change wouldn't influence functionality. If you were to type en.wikipedia.org or eng.wikipedia.org you would end up on the same main page. The idea of this proposal is to start that very discussion; however, I do not feel the proposal is "ready" to be advertised on individual village pumps yet. I certainly want to hear what the foundation thinks, for instance. -- とある白い猫 ^chi? 16:02, 21 August 2011 (UTC)

See bug 14010 for related discussion.
-- Danny B. 19:13, 28 August 2011 (UTC)

I'm just wondering what the practical benefit of consistency is. Ok, so we have the (Standard) German Wikipedia at the ISO 639-1 code for German in general (de), when in 639-3 there is a specific code for Standard German (deu). Then we have various Wikipedias such as "bar", "ksh" etc which are technically subsumed under "de" in ISO 639-1, but have their own codes in ISO 639-3. But so what? How does it impact the German Wikipedia or Wikimedia in general if the code is "deu" instead of "de"? This seems a little bit like consistency for consistency's sake. --Terfili 17:53, 30 August 2011 (UTC)

Consistency is important. That is the point of standards. It is kind of weird when most wikis do not even use the ISO 639-1 standard and that is maintained for the sake of not having a standard/being different. Is there any benefit from keeping the 2 letter ISO codes instead of 3? You cannot argue that all languages are equal if they do not even comply with the same standard. This is particularly important with the macro language - dialect/sub language relationship.

Are there any technical benefits of this? No, there aren't.

-- とある白い猫 ^chi? 23:11, 2 September 2011 (UTC)

Possible problem

“

RFC 4646 para 2.2.1:

“

Note: For languages that have both an ISO 639-1 two-character code and an ISO 639-2 three-character code, only the ISO 639-1 two-character code is defined in the IANA registry.

”

This indicates pretty clearly to me that we should *not* use the three-letter codes for those languages for which two-letter codes are registered, since they wouldn't be considered valid RFC 4646 language codes for web usage. If we're not using them, and there's no past usage to drive traffic, there's little or no reason to go adding redirects.

Resolving as INVALID, as request was made on the incorrect basis that we use "a mix of ISO 639-1 and ISO 639-2/3". Rather than the implied unordered mix (in which case trying to redirect everything to everything else could make a lot of sense), we have always attempted to simply abide by RFC 3066 (superseded by RFC 4646), which is quite clear about the method in which it draws from those sources. Since there is no ambiguity, there's no need to provide alternates.

--Brion VIBBER @ bug 14010

”

“

However, in the newer RFC 5645 and RFC 5646, all individual languages and macro-languages of ISO 639-3 were finally registered as (primary) language subtags, with a new language matching algorithm that allows a resource whose localization is missing in an individual language to be looked for in its macro-language, whose code is now present in the IANA database along with other classification information coming from ISO 639-3 (and also ISO 639-5 for language families).

From IETF language tag article

”

So I do not believe this is a problem anymore. Also, this proposal doesn't deal with metadata (where language tags matter), only with the 2/3-letter abbreviations used for wikis. -- とある白い猫 ^chi? 21:10, 28 August 2011 (UTC)

So, both RFC 3066 and RFC 4646 are obsoleted now (so it wouldn't make much sense to keep following them) and the IANA registry already has the 3-letter codes of ISO 639-3 defined (so the problem Brion pointed to above is no longer an issue). I'd love to see WMF's/Brion's comments on this. If no other impediments are presented, I'd fully support this proposal. --Waldir 00:00, 8 September 2011 (UTC)

Using the two-letter code when both two- and three-letter versions exist has been the standard for years. It would be inconsistent not to follow that practice. – Nikerabbit 05:49, 8 September 2011 (UTC)

So I hear from Tim. I am hence proposing the BCP 47 standard instead (see w:IETF language tag), which seems to be what the IETF is using for interoperability purposes. Likewise, we should use it for the same interoperability purposes. -- とある白い猫 ^chi? 06:08, 8 September 2011 (UTC)

The latest BCP 47, i.e. RFC 5646/4647, still requires ISO 639-1 to be used in preference to ISO 639-3 where a language code is available in both. BCP 47 provides strong stability guarantees, which are intended to avoid the need for mass data migration, such as moving en.wikipedia.org to eng.wikipedia.org. See section 3.4 of RFC 5646. That is one of the reasons why BCP 47 is an appropriate standard to use for Wikimedia subdomains, an application where instability would cause a great deal of disruption. -- Tim Starling 08:06, 8 September 2011 (UTC)
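
The preference rule described above (use the two-letter code wherever a language has both) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ISO639_3_TO_1` and `preferred_subtag` are hypothetical names, and the table is a hand-picked sample, not the IANA language subtag registry.

```python
# Hypothetical sample of languages that have both a three-letter and a
# two-letter code; a real implementation would consult the IANA registry.
ISO639_3_TO_1 = {"eng": "en", "deu": "de", "fra": "fr"}


def preferred_subtag(code: str) -> str:
    """Return the two-letter code when one exists, else the code unchanged."""
    code = code.lower()
    return ISO639_3_TO_1.get(code, code)


print(preferred_subtag("eng"))
print(preferred_subtag("ksh"))
```

Under this rule "eng" collapses to "en", while a code such as "ksh", which has no two-letter equivalent, is kept as it is. That matches the existing subdomain practice and would leave current wikis where they are.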

Then should we use the BCP 47 standard, which would require few changes (if any) and cause minimal to no disruption? -- とある白い猫 ^chi? 21:00, 8 September 2011 (UTC)

ISO 639-6

The ISO 639-6 standard is currently being developed. ISO 639-3 and its base, Ethnologue, though the best that we currently have, nonetheless contain a lot of errors, omissions, and various glitches, so with ISO 639-6 an extended list of (all) languages, dialects, and language varieties is being worked on. It is going to give standard recognition to some language varieties that we already support in MediaWiki's localisation, such as the ones we code as *-formal at the moment. It differentiates between written and spoken language varieties in many instances. It can also be expected to add several thousand dialectal varieties missing from ISO 639-3 today. It is known that it will use four-letter abbreviations once it is out, since three-letter abbreviations, with the existing ones kept, would obviously not provide a sufficient number of codes. We can expect ISO 639-6 to be integrated into BCP 47 in the usual way, by simply adding to the already existing code sets. --Purodha Blissenbach 06:04, 20 September 2011 (UTC)