Language tagging

From Meta, a Wikimedia project coordination wiki

This page explains the best practice of language tagging, i.e. marking a certain text as being in a certain language and/or script.

Language tagging

It is a best practice to tag a web page, or a piece of text within a web page, with the correct language. An HTML element should carry a lang attribute identifying the language of its content, and a dir attribute identifying the writing direction.

Site-level

The MediaWiki software adds lang and dir attributes to the whole page according to the user's language.

Page-level

MediaWiki has a concept of "page content language". Each page is embedded in a <div id="mw-content-text" lang="xyz" class="mw-content-ltr/rtl">. See mw:Page content language for more information.

This is correct in most cases. However, it is not (yet) possible to change the page content language other than through a MediaWiki hook. See bug 9360 and bug 28970 for that.
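The wrapper described above can be inspected programmatically. The following is a minimal sketch, assuming the rendered HTML of a page is already available as a string; the class name ContentLanguageParser and the function page_content_language are hypothetical names chosen for illustration, not part of MediaWiki.

```python
from html.parser import HTMLParser


class ContentLanguageParser(HTMLParser):
    """Records lang and direction of the <div id="mw-content-text"> wrapper."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.direction = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("id") == "mw-content-text":
            self.lang = attributes.get("lang")
            # The direction is carried by the mw-content-ltr/rtl class.
            classes = (attributes.get("class") or "").split()
            if "mw-content-rtl" in classes:
                self.direction = "rtl"
            elif "mw-content-ltr" in classes:
                self.direction = "ltr"


def page_content_language(html_text):
    """Return (lang, direction), or (None, None) if no wrapper is found."""
    parser = ContentLanguageParser()
    parser.feed(html_text)
    return parser.lang, parser.direction


# Hypothetical sample markup, modelled on the wrapper shown above.
sample = '<div id="mw-content-text" lang="he" class="mw-content-rtl"><p>...</p></div>'
print(page_content_language(sample))  # prints ('he', 'rtl')
```

The sketch reads only the wrapper div; it does not follow lang attributes on nested elements.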

Pieces of text in a different language than the page language

When content on a Wikimedia wiki (be it Wikipedia, Wikisource, ...) contains pieces of text in a language and/or script different from the page-level language, it is recommended to mark this correctly in the content. This can be done by wrapping the respective text in a span or div element that has a lang attribute (and, where needed, a dir attribute).

The following examples summarize language tagging for easy reference by Wikimedians:

<span lang="el">Χαίρε, ω χαίρε Ελευθεριά!</span>

Greek is assumed to be written in the Greek script, so no script code is needed. If the text is transliterated into another script, e.g. Latin, add the script code:

<span lang="el-Latn">Haire, o haire, Eleftheria!</span>

If you want to further specify that this is from a specific country, say Greece:

<span lang="el-Latn-GR">Haire, o haire, Eleftheria!</span>

When the writing direction is different from the page's, use:

<span lang="he" dir="rtl">להיות עם חופשי בארצנו</span>

If this is a longer block of text, usually a div, also use the mw-content-rtl (or mw-content-ltr) class, which will properly align lists etc.:

<div lang="he" dir="rtl" class="mw-content-rtl">להיות עם חופשי בארצנו</div>
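The wrapping rules illustrated above can be sketched as a small helper. This is only an illustration under stated assumptions: tag_text is a hypothetical function name, and the rule of adding the mw-content-rtl or mw-content-ltr class only to block-level (div) wrappers follows the examples on this page rather than any official API.

```python
from html import escape


def tag_text(text, lang, direction=None, block=False):
    """Wrap text in a span (or a div if block=True) carrying lang and dir."""
    element = "div" if block else "span"
    attributes = [f'lang="{escape(lang)}"']
    if direction in ("ltr", "rtl"):
        attributes.append(f'dir="{direction}"')
        if block:
            # Block-level text also gets the alignment class.
            attributes.append(f'class="mw-content-{direction}"')
    # Escape only markup characters in the text itself.
    body = escape(text, quote=False)
    return f"<{element} {' '.join(attributes)}>{body}</{element}>"


print(tag_text("Haire, o haire, Eleftheria!", "el-Latn"))
print(tag_text("להיות עם חופשי בארצנו", "he", direction="rtl", block=True))
```

The first call produces a span with only a lang attribute; the second produces a div with lang, dir and the mw-content-rtl class, matching the last example above.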

The value of a lang attribute is built from an ISO 639 language code, optionally followed by an ISO 15924 script code and/or an ISO 3166-1 country code, in that order and separated by hyphens.
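As a rough sketch of how such a value is assembled, the hypothetical helper below joins the subtags in the order language, script, country and applies the conventional casing (lowercase language, title-case script, uppercase country). It does not check that the codes actually exist in the ISO lists.

```python
def make_lang_tag(language, script=None, region=None):
    """Join language, script and country codes into one lang value."""
    parts = [language.lower()]          # ISO 639 code, e.g. "el"
    if script:
        parts.append(script.title())    # ISO 15924 code, e.g. "Latn"
    if region:
        parts.append(region.upper())    # ISO 3166-1 code, e.g. "GR"
    return "-".join(parts)


print(make_lang_tag("el"))                # prints el
print(make_lang_tag("el", "latn"))        # prints el-Latn
print(make_lang_tag("el", "latn", "gr"))  # prints el-Latn-GR
```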

The language code is usually mentioned in the Wikipedia article about the respective language. SIL also maintains code tables and a simple text file.

For now, text should also be tagged with xml:lang, but this will become redundant once the wikis have switched to HTML5.

Benefits

Besides being a best practice, language tagging has a practical benefit: the WebFonts extension (enabled on Incubator, MediaWiki.org and many Indic wikis) relies on lang attributes to recognize the language of a piece of text and provide a font for its script. See the documentation for more information.

See also