Talk:Proposed Wikipedia policy on foreign characters

From Meta, a Wikimedia project coordination wiki

2002-02-06: If article titles can only contain a subset of ASCII, it would be nice to be able to change what is displayed at the top of the page, by some kind of #TITLE line or a separate field on the edit page, so that the page on El Nino can actually say El Niño in the header.

What's wrong with titling an article, e.g. El Niño and making e.g. El Nino a #redirect to it? Latin-1 titles need to work for the French, German, etc. Wikipedias, so there's no technical reason why they won't work. (Well, they're broken right *now* but that's a bug due to certain URLs not being URL-escaped, and should be fixed shortly.) Certainly there's no reason to limit ourselves to ASCII when the customary spelling of a name or word *in English* uses non-ASCII Latin-1 characters, and the typing/searching argument is easily taken care of by redirects, just like other common misspellings. Brion VIBBER 2002/02/06
The primary reason for limiting titles to ASCII is that most English-speaking Internet users don't know how to enter non-ASCII characters on their keyboards, and so will have difficulty searching for articles, guessing URLs, and creating new links. Those of us who do know how can always edit the article text to be nicer, but keeping the titles plain ASCII is a big win for usability. Redirects help the search problem a bit, but only if authors remember to create them. It is critically important that creating a useful new article be easy.
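
To make the two positions above concrete, here is a minimal Python sketch (not the wiki software's actual code). It assumes titles are sent as Latin-1 bytes and percent-encoded in URLs, and that an ASCII redirect title can be derived by dropping accent marks. The function names `escape_title_latin1` and `ascii_fallback` are hypothetical.

```python
import unicodedata
from urllib.parse import quote, unquote


def escape_title_latin1(title: str) -> str:
    """Percent-encode a title for use in a URL, using Latin-1 bytes.

    Assumption: the server expects Latin-1 rather than UTF-8 in URLs.
    """
    return quote(title, safe="", encoding="latin-1")


def ascii_fallback(title: str) -> str:
    """Drop combining accent marks to get a title that is easy to type.

    This only handles letters that decompose into base letter + mark
    (such as the n-tilde in the example); it is not a full transliteration.
    """
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


title = "El Ni\u00f1o"
escaped = escape_title_latin1(title)      # safe to place in a link
redirect_from = ascii_fallback(title)     # candidate #redirect source
round_trip = unquote(escaped, encoding="latin-1")
```

A redirect page could then be created automatically whenever `ascii_fallback(title)` differs from `title`, which would address the concern that authors forget to add redirects by hand.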

(I happen to personally believe that Unicode isn't being given enough of a chance and that there's nothing wrong with storing the header as UTF-8, converted from Latin-1 and entities by something like recode_string('h..utf8', $title).)
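
The conversion suggested above can be sketched roughly as follows. This is a Python illustration of the idea, not the PHP `recode_string` call itself; the function name `title_to_utf8` is hypothetical, and the sketch assumes the stored title is Latin-1 bytes that may contain HTML character references.

```python
import html


def title_to_utf8(raw: bytes) -> bytes:
    """Convert a Latin-1 title with optional HTML entities to UTF-8 bytes."""
    text = raw.decode("latin-1")     # every byte maps to one code point
    text = html.unescape(text)       # resolve &ntilde;, &#26085;, etc.
    return text.encode("utf-8")


# The same title, once as raw Latin-1 and once with a named entity.
from_bytes = title_to_utf8(b"El Ni\xf1o")
from_entity = title_to_utf8(b"El Ni&ntilde;o")
```

Both inputs end up as the same UTF-8 byte string, so a header stored this way would no longer depend on how the title was originally typed.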

This is certainly doable, if so desired. Or, we could just convert the whole Wikipedia to UTF-8. :) Brion VIBBER

UCS-2 is an encoding of ISO/IEC 10646, not a character set. The equivalent in Unicode is UTF-16. UTF-16 can encode all Unicode characters; those outside the Basic Multilingual Plane simply need to be encoded as surrogate pairs. Do you want to say that only characters from the BMP should be used, or do you actually mean any Unicode 3.1 (soon 3.2) characters?

--Carey Evans
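
As an illustration of the surrogate-pair point above, here is a small Python sketch of the standard UTF-16 arithmetic for a code point outside the BMP. The function name `to_surrogate_pair` is hypothetical; the example character U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient non-BMP code point.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is in the BMP or out of Unicode range")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001D11E".encode("utf-16-be")
```

UCS-2 has no such mechanism, which is why it stops at the 16-bit plane while UTF-16 does not.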


Yes, what I mean to say is that the underlying character set of Wikipedia, independent of how it is encoded or served, is 16-bit ISO 10646. Strictly speaking, "UCS-2" is an encoding, but it can only encode the 16-bit plane, not the 32-bit stuff. At any rate, this is just to conform to HTML 4.01. I'll try to clarify that.

Yes, we need something like "#TITLE", and we've talked about that before a lot. Something like that probably will happen, and if it does, there will no longer be any compelling reason to avoid special characters in that display-title. But I do still think article titles should generally be English terms. Lee Daniel Crocker


We should use bignums for character codes (^_^)

Actually, we should just allow diacriticals to be used in a header. What if there ARE two names which are alike but for accent marks? How likely is that? I think we will start getting some of those soon... Juuitchan

We do allow diacritics in titles. However, the English wiki is still limited to what you can squish into Latin-1. --user:Brion VIBBER
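
A quick way to see what "squishes into Latin-1" is to try encoding the title. The following Python sketch is only an illustration under that assumption, not how the wiki software checks titles; `fits_latin1` is a hypothetical name.

```python
import unicodedata


def fits_latin1(title: str) -> bool:
    """Return True if every character of the title exists in Latin-1."""
    # Compose first, so "n" + combining tilde is treated as one n-tilde.
    composed = unicodedata.normalize("NFC", title)
    try:
        composed.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


spanish = fits_latin1("El Ni\u00f1o")            # n-tilde is in Latin-1
polish = fits_latin1("\u0141\u00f3d\u017a")      # L-stroke and z-acute are not
```

Titles that fail this check would still need an entity, a #TITLE-style display override, or a move to UTF-8, as discussed earlier on this page.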