Communications committee/Subcommittees/Translation/core langs

Thoughts on core langs for Transcom.

What are "core langs"?

From its start, Transcom has had a set of languages whose translations are our primary concern in spreading Foundation information. We call them "core langs". The set is roughly equivalent to the languages for which Transcom members may seek translators when a language version lacks certain information, press releases and so on.

On the Transcom page, we listed the points we consider when ranking languages by our necessity and capacity (capacity matters because, if we could simply find plenty of translators and coordinators, we would not need to worry about priorities), namely:

likelihood that a current Wikimedia editor can read it
likelihood that speakers of that language can read languages in lower levels
population of potential/future Wikimedia editors who can read it
population of current Wikimedia project readers in that language who can read it

As of December 2006, the Transcom core langs are English, French, Spanish, Russian, Chinese and Japanese. Transcom members think this set should be reviewed periodically and updated to reflect the actual status of our projects and community.

The idea of core langs doesn't mean that Transcom doesn't welcome other languages, nor that other languages are unimportant to the Wikimedia projects. In a perfect world, we would have no core lang idea and could find translators in all languages. Alas, we are not living in such a world. We need to publish in a limited time with a limited number of hands. That is why we need the notion of "core langs". Please note that this is not discrimination. The core lang notion aims to help as many readers as possible get information in their language as regularly as possible, and it is always considered a tentative goal. The core lang set is always provisional; our final goal is to share our knowledge in every language.

Getting a set of core langs

Recently I consulted some statistics and attempted to evaluate each of the points above in numbers. My method is rough, and you may propose a more sophisticated or scientific one. I will be very happy to hear your suggestions ;)

likelihood that a current Wikimedia editor can read it

I have no good data about this. I suppose involvement in Foundation-level activities, for example voting in the Board election, might be relevant, but I am not sure. From experience, I know that editors in some languages don't mind reading English documents, while others do. If I am sure that the average speaker of a certain language finds it difficult to read another language that already appears in the core lang set, I may give a point to that language.

General assumption #1: if a language project is still very small and all its editors are known to be highly educated, they would have no difficulty reading a certain other language: in most cases, English.
General assumption #2: if a language project is very large, a significant number of young (teenage) editors participate, and they are mainly educated in the language of that project, then literacy in other languages cannot be highly expected of the community as a whole, since those young editors may not be fluent in other languages. This is the case in Japanese society, and hence in its projects.

likelihood that speakers of that language can read languages in lower levels

See the above. The two criteria are strictly different, however. This criterion aims at informing the language society as a whole, not only the Wikimedia project community in that language. The Wikimedia Foundation website tells the public in a given language society what the Foundation is, what it does and so on. That public includes donors, actual and potential partners, and people affected by our activities whether willingly or not: in short, the whole world.

We may be able to find neat and credible statistics from some institute. SIL? If you know of something, please let us know!

population of potential/future Wikimedia editors who can read it

Equal to "the population of speakers". This does not necessarily mean the population of native speakers, but rather speakers including second and/or foreign-language speakers. If a language is taught and learned worldwide, we can expect a significant number of people who speak it as a foreign language. English is an example.

Here I would also like to consider how its speakers are spread geographically. For each lang, I count the continents on which we can find states that have chosen it as their official language. Asia and Oceania are counted as one area for convenience.

population of current Wikimedia project readers in that language who can read it

Here I use the number of Wikipedia edits per lang as of mid-2006 as the criterion. It isn't totally equivalent to the whole of "Wikimedia project editing", though I expect it is close to the whole ... Wikipedia is the project in which we can find the largest number of languages among our projects.

Evaluation

Here is a proposed core lang set for Transcom as of December 2006.

 pt. | langs
-----+------------------------
  +5 | en, fr
   4 | es
   3 | [pt], ja, [de], zh
   2 | [pl, nl], ru
   1 | ar, id, [it], hi, bn
-----+------------------------
  <1 | pa

Remark: In the table above, [ ] marks languages whose speakers, due to 1) a shortage of manpower and 2) their competence in another language, may not need their own version so strongly; hence the evaluated necessity of their language version for Wikimedia Foundation public relations may be reduced. However, the possibility of such a reduction is purely an assumption based on no factual data, so this evaluation may be altered by future review.

How to calculate?

First, we choose two external data: (a) speaker population and (b) geographical spread, that is, on how many continents the language in question is used as a governmental language. I have chosen the square root of a*b, rounded (.5 -> 1, .4 -> 0), as our basis for further evaluation of a language's significance for our current project.

 lang.    | pop. (100M) | continents | sqrt(a*b)
          | (a)         | (b)        |
----------+-------------+------------+-----------
 English  | 5.0         | 4          | 4.5
 French   | 1.7         | 4          | 2.6
 Spanish  | 3.9         | 3          | 3.3
...

Then I weight the big projects: those with over 250K articles get two points, and those with over 100K get one point. I also subtract one point from projects that have not yet reached the 10K milestone (Bengali therefore rose from its former rank of <1 to the rank with point 1 thanks to its recent 10K milestone). Finally we get the table in the section above.
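The calculation described above can be sketched in a few lines of Python. This is only a rough illustration, not the exact procedure used for the table. The function names are made up for this example, "square" is read as the square root of a*b (which matches the English and French rows), and the two-step rounding (first to one decimal as in the table, then to a whole number) is an assumption, since the page does not spell it out.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, step):
    """Round so that .5 goes up and .4 goes down (step is e.g. "0.1" or "1")."""
    return Decimal(str(value)).quantize(Decimal(step), rounding=ROUND_HALF_UP)


def base_score(population_100m, continents):
    """Basis value: square root of a*b, rounded to a whole number.

    Assumption: the value is first rounded to one decimal (as shown in the
    table) and then to an integer; this is inferred, not stated on the page.
    """
    root = math.sqrt(population_100m * continents)
    return int(round_half_up(round_half_up(root, "0.1"), "1"))


def project_bonus(article_count):
    """Weighting by project size, using the article thresholds from the text."""
    if article_count > 250_000:
        return 2
    if article_count > 100_000:
        return 1
    if article_count < 10_000:
        return -1
    return 0


def core_lang_points(population_100m, continents, article_count):
    """Total points for one language: basis value plus project-size bonus."""
    return base_score(population_100m, continents) + project_bonus(article_count)
```

For instance, `base_score(5.0, 4)` gives 5 for English. With a purely hypothetical article count of 150,000, `core_lang_points(3.9, 3, 150_000)` gives 4; actual article counts are not listed on this page and would have to be filled in from the project statistics.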

Please leave your comments on the talk page. If our language projects grow further and we still suffer from a shortage of translators, we will modify this method of calculation; however, if no one suggests an improvement, this calculation will be used for a while.