Proposed Wikipedia policy on foreign characters
The last substantial edit to this page was made on or before 5 February 2002, so this page contains numerous inaccuracies.
Over the last year, we have had many discussions about the various ways foreign characters are used, and I think we reached consensus on many of the issues, but those discussions are a bit scattered. I think a simple, organized, documented policy on these issues is important to help all the newcomers, and also to serve as a spec for fixing the new software to properly support and enforce those choices.
The following applies only to the "primary" English language Wikipedia. Foreign Wikipedias will probably use different character encodings and allow foreign characters in titles.
Article titles will be composed of a string of characters from among the set of upper- and lowercase US w:ASCII (7 bit) letters, digits, hyphen, comma, period, left and right parentheses, apostrophe, forward slash, and space. Article titles will be converted to "canonical" form by the software, which will consist of collapsing strings of spaces into a single space, removing any spaces at the beginning or end of a title, and capitalizing the first letter of the title and any letter that appears after a space, hyphen, comma, period, parenthesis, or forward slash (but not after an apostrophe). The software should prevent the creation of articles whose titles contain illegal characters.
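The canonicalization rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Wikipedia software: the function name `canonical_title` is hypothetical, letters that do not follow a boundary character are left unchanged (the rule above does not say whether they should be lowercased), and rejecting an empty title is an added assumption.

```python
import re

# Characters permitted in a title: ASCII letters, digits, hyphen, comma,
# period, parentheses, apostrophe, forward slash, and space.
ALLOWED = re.compile(r"[A-Za-z0-9,.()'/ -]*")

# A letter following one of these is capitalized. The apostrophe is
# deliberately absent, as the rule requires.
BOUNDARY = set(" -,.()/")


def canonical_title(raw):
    """Return the canonical form of a title, or raise ValueError."""
    if not ALLOWED.fullmatch(raw):
        raise ValueError("title contains illegal characters: %r" % raw)
    # Collapse runs of spaces, then trim leading and trailing spaces.
    title = re.sub(r" +", " ", raw).strip(" ")
    if not title:
        # Assumption: an empty title is rejected (not stated in the policy).
        raise ValueError("title is empty")
    out = []
    capitalize_next = True  # the first character is always capitalized
    for ch in title:
        out.append(ch.upper() if capitalize_next else ch)
        capitalize_next = ch in BOUNDARY
    return "".join(out)
```

For example, `canonical_title("  el   nino ")` returns `"El Nino"`, and `canonical_title("o'brien (disambiguation)")` returns `"O'brien (Disambiguation)"`, since the letter after the apostrophe is left alone. A title such as "El Niño" raises `ValueError` because "ñ" is outside the permitted set.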
Article titles should be English terms. Foreign names and words that have Anglicized forms in common use (for example, "Taoism", "Venice") should use that form to facilitate cross-referencing other works that are likely to use it. Otherwise, the title should use a simplified transliteration of the native term. The Anglicized title is a convenience for searching and indexing, and does not necessarily have to be a usable English term if none exists that matches the above rules (for example, an article about "El Niño" can be titled "El Nino" even though the tilde is universally included even in English news reports).
Spanish and French terms can simply omit diacriticals in the title. German terms can translate "ö" to "oe", "ß" to "ss", etc. Chinese terms can use Pinyin without tonal markings.
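A rough sketch of these transliterations in Python follows. The helper names are hypothetical. The policy text gives only "ö" and "ß" explicitly; the "ä" and "ü" entries are the usual German convention and are an assumption here, standing in for the "etc." above. Stripping combining marks handles Spanish and French diacriticals and also removes Pinyin tone marks.

```python
import unicodedata

# "ö" -> "oe" and "ß" -> "ss" come from the policy text; the other entries
# are the customary German equivalents (an assumption).
GERMAN = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def strip_diacritics(text):
    """Drop accents and tone marks, keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def german_to_ascii(text):
    """Apply the German digraph substitutions."""
    # Normalize to NFC first so umlauts are single precomposed characters.
    text = unicodedata.normalize("NFC", text)
    return "".join(GERMAN.get(ch, ch) for ch in text)
```

With these helpers, `strip_diacritics("El Niño")` gives `"El Nino"`, `german_to_ascii("Gödel")` gives `"Goedel"`, and `strip_diacritics("Běijīng")` gives `"Beijing"`. The German mapping has to be chosen explicitly per term, since simply stripping the umlaut would yield "Godel" instead.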
The first paragraph of an article with an Anglicized title (and preferably the first sentence) should spell out the proper foreign name, using foreign characters if necessary, and should also include any other transliterations or alternate names in common use (perhaps parenthetically). This is also where the full names of persons commonly known by nicknames or pseudonyms should appear (the article should be titled by the most commonly known form). Any other names by which something might be searched for should also be mentioned; for example, historical names of cities that have changed political affiliation.
Article content is edited, stored, and served in the w:ISO 8859-1 encoding. Characters within that set may be either encoded directly or as w:HTML named entity references. The majority of text in most articles should use the subset of ISO characters recommended on the w:Wiki special characters page that will work in almost all systems. The bulk of an article about a foreign subject should use the name of that subject as it would most commonly appear in contemporary English-language texts about the subject, which will usually--but not always--be the same as the article title (for example, English texts usually do include most simple European diacriticals, so the "Kurt Goedel" article can use "Gödel" in the body of the text).
Characters outside this range may be encoded as HTML numeric entity references from the ISO 10646 (w:Unicode) character set. The content of an article should never depend upon such characters being properly rendered, but should use them only for additional information. For example, articles on Chinese people and places may include the Han ideographs purely for additional interest, but should use the Anglicized name throughout the article itself.
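As an illustration of the encoding rule, Python's standard codecs can produce this form directly: encoding to ISO 8859-1 with the `xmlcharrefreplace` error handler keeps characters inside the set as they are and turns everything else into a decimal numeric reference. The function names below are hypothetical, and this is a sketch of the idea rather than what the Wikipedia software does.

```python
import html


def to_latin1_with_entities(text):
    """Keep ISO 8859-1 characters; replace all others with &#NNNN; references."""
    encoded = text.encode("latin-1", errors="xmlcharrefreplace")
    return encoded.decode("latin-1")


def from_entities(text):
    """Resolve numeric (and named) HTML references back to characters."""
    return html.unescape(text)
```

For example, `to_latin1_with_entities("Gödel 北京")` returns `"Gödel &#21271;&#20140;"`: the "ö" is inside ISO 8859-1 and is stored directly, while the two Han ideographs become numeric references. Passing that result to `from_entities` restores the original string.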
Math symbols defined as named HTML entities in the HTML 4.01 specification can be used in math articles if needed. If it is possible to avoid using them or to use ASCII substitutes without overly burdening the article with awkward text, then it is preferred to do so. For example, an article that only briefly mentions a transfinite such as Aleph-null should just spell that out; but an article on transfinites themselves that has multiple equations might reasonably require the Aleph symbol.
The following symbols and punctuation marks are recognized by a majority of web browsers; all except copy and reg are outside the ISO set. They are allowed but not encouraged, and may be included as HTML entity references with the following names: copy, reg, bull, euro, lsquo, rsquo, ldquo, rdquo, mdash, ndash.
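For reference, the entity names listed above can be looked up in Python's standard `html.entities` table. The sketch below (the names `ALLOWED_NAMES` and `entity_table` are hypothetical) returns, for each name, the reference as it would be typed, the character it stands for, and whether that character's code point falls within the 0 to 255 range of ISO 8859-1.

```python
from html.entities import name2codepoint

# The entity names permitted by the policy text.
ALLOWED_NAMES = [
    "copy", "reg", "bull", "euro",
    "lsquo", "rsquo", "ldquo", "rdquo",
    "mdash", "ndash",
]


def entity_table(names=ALLOWED_NAMES):
    """Return (reference, character, in_iso_8859_1) for each entity name."""
    rows = []
    for name in names:
        codepoint = name2codepoint[name]
        rows.append(("&%s;" % name, chr(codepoint), codepoint <= 0xFF))
    return rows
```

Printing the rows of `entity_table()` gives a quick check of which references an editor could replace with a directly stored character and which must remain entity references.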
Other "specialty" Unicode symbols that help make an article clearer may be used, but only sparingly and in such a way that the article is still understandable to readers whose system cannot render them.