Help talk:Special characters
- 1 Automatic changing of "--" into "&ndash" etc.
- 2 Page out of date
- 3 Preview and copy and paste
- 4 font-family class in CSS
- 5 disputed
- 6 English wiki
- 7 Help?
- 8 Something important is missing...
- 9 Moved from article
- 10 "The workaround" accessible by choice?
- 11 Where did the special characters go?
- 12 sorting chars with diacritics
- 13 Indexing Unicode
- 14 Update of English Wikipedia page
- 15 Problem with IPA character order
- 17 Maybe we could also provide a see also link to font packs for different languages here.
- 18 formatting question
Automatic changing of "--" into "&ndash" etc.
What about an automatic way to replace -- with – and --- with —, and "text" with «text» (depending on the Wikipedia language version; in German, for example, „text“ would be better), just as ---- is already replaced by a horizontal rule? It would make editing much better! And if someone doesn't want this change, the automatic conversion could be suppressed with the nowiki pseudo-HTML element. 184.108.40.206 14:06, 19 Nov 2003 (UTC)
- There's been a lot of talk about this and no decision on which syntax is preferable.
- Now that we have unicode support, we can just insert the actual unicode character instead of creating a wikicode or adding the html entity.
- I have written a script that converts a lot of simple cases into the unicode equivalents when you click a tab. See w:User:Omegatron#Regular expressions. Any suggestions are welcome. — Omegatron 16:41, 4 November 2005 (UTC)
Page out of date
This page appears to me to be horribly out of date. Unicode-aware webbrowsers are available for all platforms now (even obscure platforms like the PlayStation or old MacOSs), and even MSIE has adequate support for Unicode. Use of Unicode on webpages (and thus numerical entities in decimal or hex) should be recommended, and not warned against. 220.127.116.11 01:09, 12 Jan 2004 (UTC)
Preview and copy and paste
One of the problems with HTML characters is that they make the markup much harder to read. A lot of people complain about this. One possible solution is to simply enter the markup as HTML entities, preview it, and copy and paste it back into the markup. For example:
I type this into the markup:
45 &minus; 342&times;10 = &plusmn;&infin;
preview it, copy the text from the article itself, and paste it back into the edit box:
45 − 342×10 = ±∞
Both look the same, but the markup looks much cleaner and easier to read in the second case (edit this section to see), which a lot of people whine about. Does this only work for my browser (firefox 0.8)? Or can any modern browser handle "advanced characters" in the text box? In other words, if someone else does this, and then tries to edit the text in an old browser, would it turn the special characters into blank boxes before saving?
On second thought... That foreign character screwed up the edit box when I tried to view it after saving. I tried deleting the character and it moved my cursor to the middle of the next line and strange things like that. Maybe not such a good idea... - Omegatron
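The preview-and-paste step described above can also be approximated outside the browser. The following is a minimal sketch, assuming Python's standard `html` module is acceptable for the job; the function name `entities_to_literal` is made up for illustration. Note that `html.unescape` also converts markup-sensitive references such as `&lt;` and `&amp;`, which could change the meaning of wikitext, so it should be applied with care.

```python
import html


def entities_to_literal(markup: str) -> str:
    """Replace HTML character references with the characters they name."""
    return html.unescape(markup)


# The example from the comment above, typed with entities.
entity_markup = "45 &minus; 342&times;10 = &plusmn;&infin;"
literal_markup = entities_to_literal(entity_markup)
print(literal_markup)
```

Whether the literal characters survive a round trip through an old browser's edit box is a separate question that this sketch does not address.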
font-family class in CSS
Hey, so I've found that in my IE the only way to get pages like the International Phonetic Alphabet to display right is to instruct the browser to use a unicode font explicitly using CSS. Inserting an attribute like 'style="font-family: Arial Unicode MS, Lucida Sans Unicode"' usually helps get things working. It would be nice if this style marking were available as some kind of wiki shorthand, or at least if a relevant CSS class were defined in the standard wiki stylesheets. --Chinasaur 22:22, 1 Jul 2004 (UTC)
- On English WP we've developed w:Template:IPA (documentation on the talk page). This applies a <span> with an appropriate font-family declaration which displays in MSIE/Windows only. Incidentally, it also applies class="IPA" to the text, so you can augment or override the display in your own style sheet.
- It would be great to see this font declaration put into Wikipedia's MSIE-specific style sheet, instead of relying on a CSS hack (actually a well-documented and standards-compliant workaround). —w:User:Mzajac 05:38, 19 Jan 2005 (UTC)
disputed
The Symbol font has been on MacOS forever. lysdexia 20:01, 6 Nov 2004 (UTC)
I have updated the discussion on Symbol fonts which indeed was incorrect. There was a difficulty with Windows users using a Symbol font and having the results display properly on non-Windows systems, but this was because of character translation issues between Windows and Macintosh (and also because Un*x machines could not be expected to have a Symbol font installed). It was not because the Macintosh lacked a compatible Symbol font. Similarly, use of characters above 7F on a Mac web page would produce incorrect results on a Windows machine. Browsers would match characters on the target machine as though the character sets were regular MacRoman and Windows Latin-1, rather than passing the characters through transparently. A standard Symbol font can be used on Mozilla browsers, but requires hacking the code. See . Jallan 17:24, 12 Dec 2004 (UTC)
English wiki
Does anyone know when the English wiki will upgrade to Unicode? It seems to me that the first wiki out there is also the last one to upgrade... Halibutt
- The current UTF-8 upgrade procedure we have is extremely disruptive and requires a large amount of intermediate disk space and preparatory work. The English wikipedia is by far the biggest we've got, and the space and time requirements make it impractical to use that conversion method on it. So, it's waiting until we have a chance to get a quicker, cleaner upgrade method ready and tested. --brion 22:01, 11 Dec 2004 (UTC)
- From what I have heard from speaking with devs, it will be converted as part of the upgrade to 1.5.
- The main issue has been adapting MediaWiki so we can convert without dumping and converting all the bulk text, which would mean days of downtime (instead, existing text will be converted on load and new/modified text will be saved in UTF-8). Plugwash 01:30, 29 May 2005 (UTC)
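The "convert on load, save as UTF-8" idea can be sketched roughly as follows. This is an illustration only, not MediaWiki's actual code: the `StoredText` record, its `is_utf8` flag, and the choice of ISO-8859-1 as the legacy encoding are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StoredText:
    raw: bytes
    is_utf8: bool  # hypothetical per-row flag marking already-converted text


def load_text(row: StoredText, legacy_encoding: str = "latin-1") -> str:
    """Decode a stored row, converting legacy text on the fly."""
    if row.is_utf8:
        return row.raw.decode("utf-8")
    return row.raw.decode(legacy_encoding)


def save_text(text: str) -> StoredText:
    """New or modified text is always written as UTF-8."""
    return StoredText(raw=text.encode("utf-8"), is_utf8=True)


old_row = StoredText(raw=b"caf\xe9", is_utf8=False)  # legacy bytes, untouched on disk
text = load_text(old_row)                            # converted only when read
new_row = save_text(text)                            # rewritten as UTF-8 on next save
```

The point of such a design is that no bulk conversion pass is needed: rows migrate one at a time as they are edited.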
Help?
Does anyone know how to type ♡, ☆, and ♪? (I was able to copy-and-paste them from somewhere to ask this question, but that doesn't always work -- for example, in Notepad when I'm editing HTML.) --Ketsuban
- You CAN paste them in Notepad, at least in recent versions of it. However, this is only of use if you are saving the text in an encoding that supports those characters. To use those characters in HTML documents that are not in such an encoding, you will have to use the &#<codepoint in decimal>; or &#x<codepoint in hex>; notations. Plugwash 00:51, 29 May 2005 (UTC)
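For anyone who wants to generate those two notations rather than look the code points up by hand, here is a small sketch. The helper names `to_decimal_ref` and `to_hex_ref` are invented for this example.

```python
def to_decimal_ref(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


# The three characters asked about: white heart, white star, eighth note.
for ch in "\u2661\u2606\u266a":
    print(ch, to_decimal_ref(ch), to_hex_ref(ch))
```

Either form can be typed into a plain ASCII HTML file regardless of the encoding the file is saved in.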
Something important is missing...
"To find out which character set applies in a project, copy e.g. ☉ (this should be a circle with a dot inside)"
What if it isn't? In my copy of Mozilla Firefox (both in the edit box and on the article page) it's a circle with nothing inside.
Moved from article
Text moved out of the article pending major restructuring and rewriting to reflect the current situation. Some of it may go back later.
The following extended ASCII characters are safe for use in all Wiki pages. The table below shows the character itself, lists the code for each character in hexadecimal and decimal, shows the HTML entity name, and gives the common name of the character.
|Character||Hex||Decimal||Entity||Common name|
| ||00A0||0160||&nbsp;||no-break space|
|¤||00A4||0164||&curren;||intl. currency sign|
|«||00AB||0171||&laquo;||left double-angle quote|
|®||00AE||0174||&reg;||registered trademark sign|
|¶||00B6||0182||&para;||pilcrow (paragraph) sign|
|·||00B7||0183||&middot;||middle dot (Georgian comma)|
|»||00BB||0187||&raquo;||right double-angle quote|
|ß||00DF||0223||&szlig;||sharp s (ess-zed)|
These characters are a subset of the most common extended ASCII character set in use on the Internet, ISO 8859-1. MediaWiki pages are identified by the server as containing ISO-8859-1 text. The characters above are a subset selected to improve compatibility with other machines.
For example, the Apple Macintosh is in common use on the Internet, is not limited to any specific language, and its native character set (which is not ISO-8859-1) contains many of the common international characters. Many Macintosh browsers will correctly translate ISO text into the native character set, as long as the characters used are available. So the table above is the subset of ISO-8859-1 characters that are also available on the native Macintosh character set. (This is the situation up through Mac OS 9.x, at any rate; Mac OS X appears to use Unicode as its native encoding.)
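The "subset available in both sets" idea can be checked mechanically. The sketch below assumes Python's built-in `latin-1` and `mac_roman` codecs are a fair stand-in for the two character sets; the function name is made up. The result may differ slightly from the table above, because the codec reflects one particular revision of the Macintosh Roman table.

```python
def shared_high_characters() -> list[str]:
    """List ISO-8859-1 characters 0xA0-0xFF that Mac Roman can also encode."""
    shared = []
    for byte in range(0xA0, 0x100):
        ch = bytes([byte]).decode("latin-1")
        try:
            ch.encode("mac_roman")
        except UnicodeEncodeError:
            continue  # no Mac Roman equivalent, so not in the common subset
        shared.append(ch)
    return shared


shared = shared_high_characters()
print("".join(shared))
```

Characters such as the multiply sign or thorn drop out of the list, which matches the description of what is missing from the safe set.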
Microsoft Windows standard code page 1252 set is a superset of ISO-8859-1, so these characters will be readable as is on Windows machines. The most common Latin character sets other than ISO-8859-1 are MS-DOS (pre-Windows) code page 437, Macintosh Roman, and other ISO sets such as ISO-8859-2. The number of pre-Windows MS-DOS machines with web browsers is small and they are often dedicated-purpose machines that wouldn't be using MediaWiki anyway, so it is reasonably safe to sacrifice compatibility with them for the sake of needed foreign characters. Other ISO sets are generally intended to be read by other browsers using those same sets in the same country, and so those pages should use a language-specific set.
These characters can be entered either as HTML named character entity references such as &agrave;, directly from foreign keyboards, or with whatever facilities are available to the Wiki author for entering these characters. For example, Wiki authors using Windows machines can enter these by holding down the Alt key while typing the 4-digit decimal code of the character on the numeric pad of the keyboard. It is important that all 4 digits (including the leading 0) be typed; typing a 3-digit code will enter characters from the obsolete code page 437. Wiki authors using Macintosh machines should take care to either use special facilities to enter these in ISO-8859-1 format rather than with the native character set, or else use HTML named character entity references. Note that some Windows users may have trouble with versions of Microsoft Internet Explorer that use "Alt-Left-Arrow" and "Alt-Right-Arrow" for page movement. These will interfere with entering codes that contain the digits 4 and 6. Use HTML named character entity references in this case.
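The difference between the 4-digit and 3-digit Alt codes can be illustrated by decoding the same number under the two code pages named above. This is a sketch only: the function `alt_code_character` is hypothetical, it assumes code page 437 for the 3-digit form as the text states (the actual OEM code page depends on the Windows locale), and it does not model the special glyphs Windows produces for codes below 32.

```python
def alt_code_character(typed: str) -> str:
    """Return the character a typed Alt code would produce under the stated assumptions."""
    value = int(typed)
    if not 0 < value < 256:
        raise ValueError("only single-byte Alt codes are modelled here")
    # A leading zero selects the Windows code page; otherwise the old DOS one.
    codec = "cp1252" if typed.startswith("0") else "cp437"
    return bytes([value]).decode(codec)


print(alt_code_character("0233"))  # Windows code page 1252
print(alt_code_character("233"))   # code page 437
```

Under these assumptions the same digits give a Latin letter in one case and a Greek letter in the other, which is why the leading zero matters.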
The characters from the table above can be used directly as 8-bit characters in all Wiki pages, and are sufficient for all pages primarily in English, Spanish, French, German, and languages that require no more special characters than those (such as Catalan). These are also generally safe to use in titles, except for a few characters like double quotes, less than and greater than, and a few others.
Note especially what is missing here from the full ISO-8859-1 set: the broken bar (0166=&brvbar; [¦]¹), soft hyphen (0173=&shy;¹), superscript digits (0179=&sup3; [³]¹), vulgar fractions (0190=&frac34; [¾]¹), Old English (and Icelandic and Old Norse language) eth and thorn (0254=&thorn; [þ]¹), and multiply sign (0215=&times; [×]¹). These should be considered unsafe (and adequate substitutes are available for most of them).
Special care should be taken with characters that do exist in the native character set of popular machines but not in the above set. These are not safe, even though they may display correctly to you when you use them. Characters from Windows code page 1252 not in ISO-8859-1 include the euro sign (&euro; [€]¹), dagger and double dagger (&Dagger; [‡]¹), bullet (&bull; [•]¹), trade mark sign (&trade; [™]¹), typeset-style punctuation (see below), per mille sign (&permil; [‰]¹), some Eastern European caron-accented letters, and the oe/OE ligatures. Characters from the Macintosh Roman set not in ISO-8859-1 include dagger and double dagger, bullet, trade mark sign, a few math symbols such as infinity (&infin; [∞]¹) and not equal (&ne; [≠]¹), a few commonly-used Greek letters such as pi (&pi; [π]¹), ligatures like oe/OE and fl, typeset-style punctuation, per mille sign, and lone accents such as the breve, ogonek, and caron.
HTML 4.0 defines named character entities for some Latin characters not in ISO-8859-1 that are used by popular languages, such as OE ligature (&oelig; [œ]¹), uppercase Y with diaeresis (&Yuml; [Ÿ]¹), and some Eastern European accented characters like &scaron; [š]¹. These are also unsafe, though if they are entered as HTML named character entity references, they may display on some machines.
In short, don't assume that it is safe to use a special character just because it looks correct on your machine. Use the ones from the table above, and read and understand how to use others shown below.
- ¹ sample in square brackets to see if they work on your configuration
Possibly usable non-ISO characters
Some characters not listed as safe above may still be usable when entered as named HTML character entity references, because web browsers will recognize them and render them correctly, perhaps by switching to alternate fonts as needed. All of these should be considered less safe to use than those above, but only in the sense that they may not display properly, though in the form of HTML character entity references they are unambiguous, and preserve data integrity.
For many of these, adequate substitutes and workarounds are available, and should be used when the value of making the text available to users of older computers and software exceeds the value of good presentation to those with newer software (in the judgment of the author or editor).
Absent from the ISO-8859-1 character set, but commonly used and present in both Macintosh Roman and Windows code page 1252 character sets, are proper English quotation marks and dashes. These can be entered as character entity references, and should appear correctly on most machines running recent software. Even on ISO-based machines such as Unix/X, browsers should be able to interpret these references and make appropriate substitutes using plain ASCII straight quotes and hyphens. (Mozilla does this correctly, for example.) These references were not present in older versions of HTML, so may not be recognized by older software. Since using these characters maintains data integrity even on those machines that may not display them correctly, it should be considered safe to use these unless proper display on old software is critical. German "low-9" quotation marks are a similar case, but are less commonly translated by browsing software, and so are not quite as safe. The table below shows these characters next to a capital letter "O" for better visibility:
|‘O||&lsquo;||left single quote||—O||&mdash;||em dash|
|’O||&rsquo;||right single quote||–O||&ndash;||en dash|
|“O||&ldquo;||left double quote||‚O||&sbquo;||single low-9 quote|
|”O||&rdquo;||right double quote||„O||&bdquo;||double low-9 quote|
Many web sites targeted for a Windows-using audience use code page 1252 references for these characters: for example, using &#151; for the em dash. This is not a recommended practice. To ensure future data integrity and maximum compatibility, recode these as named references such as &mdash;. If you really want to use a number, you can use &#8212;.
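Recoding such references can be automated. The sketch below rewrites decimal references in the code page 1252 range 128 to 159 (for example, 151 for the em dash) to the corresponding Unicode code point (8212 for the em dash). It is a minimal illustration under the assumption that Python's `cp1252` codec matches what those web sites intended; the function name is invented, and hexadecimal references are not handled.

```python
import re


def fix_cp1252_references(markup: str) -> str:
    """Rewrite decimal references 128-159 from cp1252 positions to Unicode code points."""

    def replace(match: re.Match) -> str:
        value = int(match.group(1))
        if 0x80 <= value <= 0x9F:
            try:
                ch = bytes([value]).decode("cp1252")
            except UnicodeDecodeError:
                return match.group(0)  # position undefined in cp1252: leave as is
            return f"&#{ord(ch)};"
        return match.group(0)  # already a proper Unicode reference

    return re.sub(r"&#(\d+);", replace, markup)


print(fix_cp1252_references("wait&#151;what"))
```

References outside that range, such as 233 for e with acute accent, are left untouched because they already agree with Unicode.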
Be aware that if you edit text in a separate word processor or other program to cut and paste into your browser, and it "automatically" converts quotes to the left and right "smart quotes" for you, you may unknowingly mangle markup, either your own or already existing, by replacing the standard quotes in HTML tags & properties with the smart quotes, which will cause the tags to fail in various ways. Furthermore, some people consider the extra encoding of smart quotes, fancy "’" apostrophes used in possessives and contractions, etc., to be a waste of bytes that could be put to better use, and will replace them with the standard single characters at will.
Set your wordprocessor options such as Auto Edit and Auto Correction such that undesired replacements do not occur.
Greek letters and math symbols
Compare &nabla; and <math>\nabla</math>, giving ∇ and the rendered formula, respectively. Depending on preferences, the second may be the same as the first (HTML rendering), or an image. The HTML symbol depends on the font size and type, whereas the image has a fixed size in terms of pixels. The color of symbol and background in the first case are those of text in general, according to the settings, and for the image they are black on white.
- Note: much of the text below regarding mathematical symbols is obsolete now that MediaWiki supports embedded TeX within pages. Non-trivial mathematical equations are probably best notated in TeX using the MediaWiki math tags. See the page MediaWiki User's Guide: Editing mathematical formulae for more on this.
Web standards for writing about mathematics are very recent (In fact MathML 2.0 was just released in February of 2001.), so many browsers made before these standards were in place try to compensate by at least allowing characters commonly used in mathematics, including most of the Greek alphabet. These are necessarily entered as character entity references. Browsers might render these by switching to a "Symbol" font or something similar.
Upper- and lowercase Greek letters simply use their full names for character entities. These should, of course, only be used for occasional Greek letters in primarily-Latin text. (Large quantities of Greek-language text should be written using an editor with native UTF-8 Unicode support to facilitate editing and reduce page bloat). Here are a few samples:
|ς||&sigmaf; (final sigma, lowercase only)|
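To see which character a given Greek entity name stands for, the name table shipped with Python's standard library can be consulted. This is only a lookup sketch; the helper `greek_entity` is a made-up name, and the table used is Python's `html.entities.name2codepoint`, which covers the HTML 4 entity names.

```python
from html.entities import name2codepoint


def greek_entity(name: str) -> str:
    """Return the character for an HTML 4 entity name such as 'alpha' or 'Omega'."""
    return chr(name2codepoint[name])


# Entity names are case-sensitive: 'omega' and 'Omega' are different letters.
for name in ("alpha", "omega", "Omega", "sigmaf"):
    print(f"&{name};", greek_entity(name))
```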
Other common math symbols
It was once customary to use the Adobe Symbol character set to render Greek letters and mathematical symbols. Both Macintosh and Windows operating systems provided a Symbol font using this set; a compatible Symbol font was included in most laser printers along with external TrueType or PostScript versions for computer use; and public domain TrueType and PostScript symbol fonts using this set were easily found. However, in web use, characters greater than hex 7F often did not transfer consistently between operating systems.
However, all of these characters were included in Unicode from the beginning and all are now firmly part of Unicode. Also many browsers no longer support separate Symbol fonts as their encoding methods break HTML rules. Accordingly use of the Symbol character set is strongly discouraged. Some products such as TtH still use a special hacked Symbol font to render equations which can be viewed on such browsers as do not support a normal Symbol font, but you should be aware that if you create text requiring such a font, you are restricting your audience to users who also have this font. (Whether or not that's acceptable is a judgement you will have to make as an author.)
Other common symbols
Some characters such as the bullet, Euro currency sign, and trade mark sign are special cases. They are likely to be understood and rendered in some way by many browsers. Because they are important for international trade, many computers specifically add them to fonts at some non-standard location and render them when requested, or else render them in special ways that don't require them to be present in a font. See below for how your browser renders these:
|€||&euro;||euro currency sign|
|™||&trade;||trade mark sign|
Other somewhat less commonly used symbols include these:
|†||&dagger;||dagger||♠||&spades;||black spade suit|
|‡||&Dagger;||double dagger||♣||&clubs;||black club suit|
|◊||&loz;||lozenge||♥ or ♥||&hearts; (see below)||red heart suit|
|←||&larr;||leftward arrow||♦ or ♦||&diams; (see below)||red diamond suit|
|↑||&uarr;||upward arrow||‹||&lsaquo;||single left-pointing angle quote|
|→||&rarr;||rightward arrow||›||&rsaquo;||single right-pointing angle quote|
|↓||&darr;||downward arrow||‰||&permil;||per mille sign|
These should be considered unsafe to use except perhaps on pages intended for a specific audience likely to have very up-to-date software on popular machines. Even then, in some cases, IE 6.0 does not show the diamond symbol above. The regular diamond ♦ displays in IE 5 but not 6. The alternative code for the red diamond ♦, which works in IE 6 but not 5, is <font face="sans-serif" color="red">♦</font>.
The official character set of HTML 4.01 is the ISO 10646 Universal Character Set, which is equivalent to the character set defined by Unicode. Many browsers, though, are only capable of displaying a small subset of the full UCS repertoire.
Numeric character entity references are the only way to enter these characters into a Wiki page at present.
There are two ways:
- decimal, e.g. &#1049;, giving Й on your browser
- hexadecimal, in this case &#x419;
These should be the same. However, decimal encoding will increase the number of browsers on which they will work.  shows for all possible values whether they work and how they look in your browser, using decimal code.
For example, the codes &#1049; &#1511; &#1605; display on your browser as Й, ק, and م, which ideally look like the Cyrillic letter "Short I", the Hebrew letter "Qof", and the Arabic letter "Meem", respectively. It is unlikely that your computer has all of those fonts and will display them all correctly unless you have a Macintosh or have installed the fonts, though it may display a subset of them. Because they are encoded according to the standard, though, they will display correctly on any system that is compliant and has the characters available.
These characters should not be used in MediaWiki pages unless they make no difference to the understanding of the text, and are just extra information.
See Unicode and HTML for character entities tables.
Most Wikimedia wikis have now switched to UTF-8, allowing direct entry of Unicode text. However, care must still be taken to avoid overuse of unusual Unicode characters in places where people are likely to be unable to see them.
The following additional entities are available. On some browsers, these are converted to Unicode equivalents.
Special Note: The Del symbol ("&nabla;"), among others, is not supported on Windows 95 or 98. On the English Wikipedia it has been uploaded as an image, and can there be referenced as [[Image:Del.gif]], or here and some other projects as http://en.wikipedia.org/upload/d/db/Del.gif, and looks like this: http://en.wikipedia.org/upload/d/db/Del.gif. On projects where this does not work, upload a copy of the image to that project.
However, the del symbol is usually found in formulæ which are better facilitated using MediaWiki User's Guide: Editing mathematical formulae.
Not all characters are displayed in all browsers. Also, since the font in the edit box may well be different from that of the rendered page, the browser may show the characters properly in one of the two areas and not in the other. For each, try to choose fonts which show all characters you need.
In the case of ISO-8859-1 encoding, special characters in the edit box are converted to code that consists of the common characters &, #, digits and a semi-colon, which are always displayed properly.
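The conversion described here, where characters outside ISO-8859-1 become references built from &, #, digits and a semicolon, resembles what Python's `xmlcharrefreplace` error handler does. The sketch below is an analogy rather than the wiki software's actual code, and the function name is invented.

```python
def to_latin1_with_references(text: str) -> bytes:
    """Encode as ISO-8859-1, turning unencodable characters into &#NNNN; references."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


# The Cyrillic letter is outside ISO-8859-1 and becomes a reference;
# the accented Latin letter fits in one byte and stays as it is.
print(to_latin1_with_references("\u0419 \u00e9"))
```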
At any rate, the HTML source code shows the codes of both the characters that are displayed and those that are not. The HTML source code of a preview webpage also shows these for the wikitext.
Note that as a reader, it is best to use a browser with maximum capabilities, but as an author the least capable of the common browsers is a better guideline.
Alternatives include using a similar, more common symbol, or using an image, e.g. eo:Ŝablono:El: http://eo.wikipedia.org/upload/d/db/Ikono_tero_malgranda.png.
Also you can describe the character.
"The workaround" accessible by choice?
"After en switched to utf-8 and interwiki bots started replacing html entities in interwikis with literal unicode text, edits that broke unicode characters became so common they could no longer be ignored. A workaround was developed to allow broken browsers to edit safely provided mediawiki knew they were broken."
- The only other disadvantage of allowing direct unicode characters that I can think of is that it causes problems in certain text editors and makes it harder to tell similar characters apart in the markup:
- —, –, −, -
- µ, μ
- ⋅, ·
- Would it be possible to access the HTML entities version of the edit text selectively? In other words, I'd like to be able to see the edit text either with or without literal unicode characters, depending on what I'm doing.
- I'm sure there's a way to fake out the server into thinking I have a blacklisted browser, but I'm thinking something more convenient like pressing a button to get the entities version.
- Any ideas? — Omegatron 16:38, 4 November 2005 (UTC)
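One way to tell such look-alike characters apart without switching the edit box to entities is to ask for their Unicode names. A minimal sketch using Python's standard `unicodedata` module follows; the helper name `describe` is made up, and the characters are written as escapes so the listing itself is unambiguous.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Dashes and minus, micro sign versus Greek mu, dot operator versus middle dot.
lookalikes = ["\u2014", "\u2013", "\u2212", "-", "\u00b5", "\u03bc", "\u22c5", "\u00b7"]
for ch in lookalikes:
    print(ch, describe(ch))
```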
- At the moment the only way to get the "safe mode" version is to send a user agent string (e.g. using Firefox's User Agent Switcher extension) that matches the bad browsers list (e.g. "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"). I wanted to add a preferences option, but I couldn't get my head around the preferences code (my PHP experience is somewhat limited). Plugwash 23:42, 25 February 2006 (UTC)
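In outline, the user-agent check discussed in this thread might look like the sketch below. It is not MediaWiki's code: the pattern list contains only the one example string quoted above, the function names are hypothetical, and the use of decimal references for the "safe" text is an assumption.

```python
import re

# Hypothetical blacklist; only the example quoted in the thread is included.
BAD_BROWSER_PATTERNS = [re.compile(r"MSIE 5\.0; Mac_PowerPC")]


def needs_safe_mode(user_agent: str) -> bool:
    """True if the user agent matches a browser known to break Unicode on save."""
    return any(p.search(user_agent) for p in BAD_BROWSER_PATTERNS)


def edit_text_for(user_agent: str, wikitext: str) -> str:
    """Serve entity-encoded text to broken browsers, literal text to everyone else."""
    if needs_safe_mode(user_agent):
        return wikitext.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    return wikitext


print(edit_text_for("Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)", "a \u2014 b"))
```

A preferences option, as proposed above, would amount to letting the user force the first branch regardless of the browser.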
Where did the special characters go?
Where are all the special characters on English Wikipedia? Almost all of them seem to have disappeared from the bottom of the edit screen. If they could be replaced, that would be great. 18.104.22.168 00:06, 14 January 2006 (UTC)
sorting chars with diacritics
I found it very confusing that e.g. Special:All pages does not sort "é" like "e". Shouldn't the software use an appropriate ISO Latin collation table?
PS: I found no place on Wikipedia related to this issue ; thanks in advance for a hint regarding this. MFH 17:39, 21 February 2006 (UTC)
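The behaviour asked for here can be approximated by sorting on a key with the diacritics stripped. The sketch below is a rough illustration, not a real collation table: it assumes Unicode decomposition (NFD) is good enough, ignores language-specific rules, and uses invented function names.

```python
import unicodedata


def fold_diacritics(title: str) -> str:
    """Decompose characters and drop combining marks, so that e-acute sorts like e."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def sort_titles(titles: list[str]) -> list[str]:
    """Sort by folded form first, then by the original title to break ties."""
    return sorted(titles, key=lambda t: (fold_diacritics(t).casefold(), t))


titles = ["zebra", "\u00e9t\u00e9", "eta"]
print(sorted(titles))       # plain code point order puts the accented title last
print(sort_titles(titles))  # folded order places it next to the other e-words
```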
Indexing Unicode
It seems the search function only works for Unicode characters stored as such. If I, for example, search for "杜甫", I do not find the article about en:Du Fu, where his name is entered as &#26460;&#29995;. However, I find the article for en:Chengdu, where he is mentioned and the "real" Unicode characters are used in the source.
Would it be possible to either change the search index to take the &#26460;&#29995; format into account, or to have a batch update of all the places where the format is used and convert it to native characters? Mlewan 10:38, 8 April 2006 (UTC)
Update of English Wikipedia page
The last time the English Wikipedia's copy of this page was updated was five months ago and quite a few edits here have gone under the bridge since. Who does the updates, please, so that I can give the necessary gentle prod? Thanks. --A bit iffy 10:08, 12 August 2006 (UTC)
For some time somebody used a bot for automatic updating, but he has stopped. Anybody, including you, can copy help pages.--Patrick 10:20, 12 August 2006 (UTC)
Problem with IPA character order
There’s a problem with the order of IPA characters given under the edit box on Wikipedia. I’m not sure if this is dealt with here or whether Wikipedia has its own version of that code. Nevertheless, the two characters U+02E1 MODIFIER LETTER SMALL L and U+02C8 MODIFIER LETTER VERTICAL LINE are given in an order which can be potentially confusing. The two have very similar glyphs in a number of popular fonts, particularly sans-serif fonts. Since their meanings are quite different it would be advantageous to move them apart so that they aren’t confused with each other. Moving U+02E1 a few characters previous in the list would solve the problem. If this isn’t the right place to discuss this please point me in the right direction. Thanks. — James Crippen 00:16, 27 August 2006 (UTC)
Maybe we could also provide a see also link to font packs for different languages here. --Emesee 07:52, 9 April 2008 (UTC)
formatting question
I need to write a caret notation in italic.