User:Alecmconroy/Language study

Start with the entire global user database

Select all from users

Remove users who are 'inactive'

where user_touched is sufficiently recent

Remove users who contribute to only one project

The remaining users should be those who are active on more than one project, which is the useful set. For any two projects, count how much their editing communities overlap.
Except for meta and commons, each project has a native language. For any two languages, how much do their editing communities overlap?
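The filtering and counting steps above can be sketched roughly as follows. This is only an illustration: the record layout, the user names, the cutoff date, and the sample data are all hypothetical and are not taken from the actual user database.

```python
from collections import Counter
from datetime import date
from itertools import combinations

# Hypothetical records: (user name, last activity date, projects edited).
users = [
    ("A", date(2010, 5, 20), {"en", "de"}),
    ("B", date(2010, 5, 2), {"en", "de", "fr"}),
    ("C", date(2009, 1, 15), {"en", "fr"}),  # inactive: touched too long ago
    ("D", date(2010, 5, 11), {"ja"}),        # active on only one project
]

def pairwise_overlap(users, cutoff):
    """Count, for every pair of projects, the active users who edit both."""
    overlap = Counter()
    for _name, touched, projects in users:
        if touched < cutoff:      # remove 'inactive' users
            continue
        if len(projects) < 2:     # remove users who contribute to only one project
            continue
        for pair in combinations(sorted(projects), 2):
            overlap[pair] += 1
    return overlap

overlap = pairwise_overlap(users, cutoff=date(2010, 5, 1))
for (a, b), n in sorted(overlap.items()):
    print(a, b, n)
```

Sorting each user's projects before pairing keeps every pair in a single canonical order, so (de, en) and (en, de) are never counted separately.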

Method

Data provided by Platonides [1]
Data imported into a php associative array via custom script
Data analysis script transformed the data into tables
Data analysis script created a visualization data file in the form of a .gdf file
.gdf files were imported into Gephi, which creates visualizations of the data.
Within Gephi, layout algorithms like ForceAtlas2 were used to find the ideal configuration of each visualization.
Also computed 'eigenvector centrality' to confirm that en is, in fact, the 'most central' by that measure (as we would expect from intuition, visualization, and the data tables).
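As an illustration of the .gdf step, here is a minimal sketch of writing node and edge tables in the GDF text layout that Gephi imports. The column choices (an "active" node column and a "weight" edge column) and the edge counts are assumptions for the example; the original script was written in PHP and its actual output columns are not documented here.

```python
def to_gdf(nodes, edges):
    """Return GDF text for nodes {name: active_users} and edges {(a, b): weight}."""
    lines = ["nodedef>name VARCHAR,label VARCHAR,active INTEGER"]
    for name, active in sorted(nodes.items()):
        lines.append(f"{name},{name},{active}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for (a, b), weight in sorted(edges.items()):
        lines.append(f"{a},{b},{float(weight)}")
    return "\n".join(lines) + "\n"

# Node sizes are the May 2010 active-editor counts listed on this page.
nodes = {"en": 39383, "de": 7339, "fr": 5048}
# Hypothetical overlap counts, for illustration only.
edges = {("de", "en"): 3, ("en", "fr"): 2}

gdf_text = to_gdf(nodes, edges)
print(gdf_text)
```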

Quick conclusions

Language barriers exist, but they are not as great as I imagined. The lack of a coherent global community cannot be explained by linguistic barriers alone-- overlap with en is quite substantial.
en repeatedly exhibited 'special' properties even in analyses where I did not a priori tell the algorithm to treat en as special. By a large number of unbiased measures, it was demonstrably the 'most central' language.
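One such measure is the eigenvector centrality mentioned under Method. The sketch below shows the general idea using plain power iteration; it is not the Gephi implementation, and the edge weights are hypothetical. The "+ identity" shift (each node keeps its own score before adding its neighbours') is a common trick to keep the iteration from oscillating.

```python
def eigenvector_centrality(weights, iterations=200):
    """Power iteration on an undirected weighted graph {(a, b): weight}."""
    nodes = sorted({n for pair in weights for n in pair})
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Start from the current score (identity shift), then add neighbours.
        new = dict(score)
        for (a, b), w in weights.items():
            new[a] += w * score[b]
            new[b] += w * score[a]
        norm = max(new.values())
        score = {n: v / norm for n, v in new.items()}
    return score

# Hypothetical overlap strengths, for illustration only.
weights = {
    ("en", "de"): 10, ("en", "fr"): 8, ("en", "ru"): 6,
    ("de", "fr"): 2, ("ru", "uk"): 3,
}
centrality = eigenvector_centrality(weights)
most_central = max(centrality, key=centrality.get)
print(most_central)
```

A node scores highly when it is strongly linked to other high-scoring nodes, which is why a hub language comes out on top without being told it is special.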

Note and Missing data

[2]

need worldwide population of each language
need wikimedia population of each language
need conditional probability tables for knowing one language given another.
Gauge autonomous translation difficulty for En->languages.
Identify Languages where a majority of speakers also speak En.
The UN languages? Arabic, Mandarin, En, Fr, Russian, Spanish. Of these,

Nations, sortable by non-"English Speakers"

Rank	Country	% En	En Speakers in mil.	Non-(En speakers) in mil.
1	United States	95.81%	251.4	11.0
2	India	10.01%	110.1	989.9
3	Nigeria	53.34%	79.0	69.1
4	United Kingdom	97.74%	59.6	1.4
5	Philippines	55.46%	53.8	43.2
6	Germany	56%	46.0	36.2
7	Canada	85.18%	28.4	4.9
8	France	36%	23.2	41.3
9	Pakistan	10.97%	18.0	146.1
10	Australia	97.03%	20.8	0.6
11	Italy	29%	17.3	42.3
12	The Netherlands	87%	14.3	2.1
13	South Africa	28.57%	13.7	34.2
14	Spain	27%	12.4	33.6
15	Turkey	17%	12.0	58.6
16	Poland	29%	11.1	27.1
17	China	0.77%	10.0	1290.0
18	Sweden	89%	8.2	1.0
19	Cameroon	41.51%	7.7	10.8
20	Malaysia	27.16%	7.4	19.8
21	Russia	4.90%	7.0	134.9
22	Thailand	10%	6.3	56.7
23	Belgium	59%	6.2	4.3
24	Israel	84.97%	6.2	1.1
25	Romania	29%	6.2	15.2
26	Zimbabwe	41.58%	5.6	7.8
27	Greece	48%	5.4	5.8
28	Sierra Leone	83.53%	4.9	1.0
29	Mexico	4.55%	4.9	101.8
30	Austria	58%	4.8	3.5
31	Denmark	86%	4.7	0.8
32	Switzerland	61.28%	4.7	3.0
33	Norway	91%	4.5	0.4
34	Ireland	98.37%	4.4	0.1
35	Singapore	80%	4.1	1.0
36	Tanzania	9.89%	4.0	36.5
37	New Zealand	97.82%	4.2	0.1
38	Bangladesh	2.21%	3.5	155.2
39	Finland	63%	3.4	2.0
40	Portugal	32%	3.4	7.2
41	Lebanon	80.51%	3.3	0.8
42	Papua New Guinea	49.76%	3.2	3.2
43	Liberia	82.67%	3.1	0.6
44	Kenya	7.19%	2.7	34.8
45	Jamaica	97.64%	2.6	0.1
46	Uganda	8.09%	2.5	28.4
47	Hong Kong	35.90%	2.5	4.5
48	Czech Republic	24%	2.5	7.9
49	Hungary	23%	2.3	7.7
50	Croatia	49%	2.2	2.3
51	Puerto Rico	48.61%	1.9	2.1
52	Sri Lanka	9.90%	1.9	17.4
52	Zambia	16.02%	1.9	10.0
53	Bosnia and Herzegovina	45%	1.8	2.2
54	Bulgaria	23%	1.8	5.9
55	Slovakia	32%	1.7	3.7
56	Ghana	5.96%	1.4	22.1
57	Slovenia	57%	1.2	0.9
58	Trinidad and Tobago	87.74%	1.1	0.2
59	Lithuania	32%	1.1	2.3
60	Latvia	39%	0.9	1.4
61	Guyana	90.55%	0.7	0.1
62	Botswana	38.42%	0.6	1.0
63	Estonia	46%	0.6	0.7
64	Cyprus	76%	0.6	0.2
65	Malawi	3.88%	0.5	13.4
66	Lesotho	27.86%	0.5	1.3
67	Suriname	87.09%	0.4	0.1
68	Malta	88%	0.4	0.1
69	Namibia	17.24%	0.3	1.5
70	Luxembourg	60%	0.3	0.2
71	Bahamas	87.13%	0.3	0.0
72	Barbados	98.57%	0.3	0.0
73	Belize	81.65%	0.2	0.1
74	Mauritius	15.97%	0.2	1.1
75	Vanuatu	83.55%	0.2	0.0
76	Fiji	20.62%	0.2	0.7
77	Solomon Islands	31.68%	0.2	0.4
78	Ethiopia	0.22%	0.2	78.1
79	Guam	91.09%	0.2	0.0
80	Brunei	37.76%	0.1	0.2
81	Saint Vincent and the Grenadines	95%	0.1	0.0
82	U.S. Virgin Islands	95.97%	0.1	0.0
83	Grenada	90.91%	0.1	0.0
84	Netherlands Antilles	50%	0.1	0.1
85	Samoa	49.86%	0.1	0.1
86	Isle of Man	99.93%	0.1	0.0
87	Bhutan	11.40%	0.1	0.6
89	Saint Lucia	43.04%	0.1	0.1
90	Northern Mariana Islands	83.33%	0.1	0.0
91	Antigua and Barbuda	80%	0.1	0.0
92	American Samoa	100%	0.1	0.0
93	Federated States of Micronesia	57.66%	0.1	0.0
94	Bermuda	96.92%	0.1	0.0
95	Dominica	94.03%	0.1	0.0
96	Marshall Islands		0.0	0.1
97	Swaziland	4.38%	0.0	1.1
98	Aruba	42.31%	0.0	0.1
99	Gambia	2.34%	0.0	1.7
100	Saint Kitts and Nevis	78%	0.0	0.0
101	Cayman Islands	76.70%	0.0	0.0
102	Seychelles	37.93%	0.0	0.1
103	Honduras	0.44%	0.0	7.1
104	Gibraltar		0.0	0.0
105	Tonga	30%	0.0	0.1
107	Kiribati	24.21%	0.0	0.1
108	Rwanda	0.21%	0.0	9.7
109	British Virgin Islands	86.96%	0.0	0.0
110	Palau	92.50%	0.0	0.0
111	Andorra	22%	0.0	0.1
112	Anguilla	92.31%	0.0	0.0
113	Dominican Republic	0.08%	0.0	9.8

The English-Speaking World

Nations where no more than about 15% of the population is unable to communicate in English. Interestingly, these numbers challenge our existing preconceptions:

Canada has long been considered a stronghold of the Anglosphere, but only about 85% of its population knows en.
Many nations without a history of English as a mother tongue nevertheless have fluency rates comparable to those of Canada. Israel's rate is only slightly lower than Canada's, while Denmark, The Netherlands, Sweden, and Norway all have rates higher than Canada's.
When multiproject editing patterns become available, I predict that we will see these nations editing very actively both in common spaces and in their mother-tongue spaces. After the En speakers, these nations probably represent those that are 'most tied in' to the movement.
Assuming the accuracy of this data, nations at the Canada-Israel level or above may not present a national challenge (although we don't organize by nation, so this exercise is mostly to pass the time until they announce the results :).

Country	% En	En Speakers in mil.	Non-(En speakers) in mil.
American Samoa	100%	0.1	0.0
Isle of Man	99.93%	0.1	0.0
Barbados	98.57%	0.3	0.0
Ireland	98.37%	4.4	0.1
New Zealand	97.82%	4.2	0.1
United Kingdom	97.74%	59.6	1.4
Jamaica	97.64%	2.6	0.1
Australia	97.03%	20.8	0.6
Bermuda	96.92%	0.1	0.0
U.S. Virgin Islands	95.97%	0.1	0.0
United States	95.81%	251.4	11.0
Saint Vincent and the Grenadines	95%	0.1	0.0
Dominica	94.03%	0.1	0.0
Palau	92.50%	0.0	0.0
Anguilla	92.31%	0.0	0.0
Guam	91.09%	0.2	0.0
Norway	91%	4.5	0.4
Grenada	90.91%	0.1	0.0
Guyana	90.55%	0.7	0.1
Sweden	89%	8.2	1.0
Malta	88%	0.4	0.1
Trinidad and Tobago	87.74%	1.1	0.2
Bahamas	87.13%	0.3	0.0
Suriname	87.09%	0.4	0.1
The Netherlands	87%	14.3	2.1
British Virgin Islands	86.96%	0.0	0.0
Denmark	86%	4.7	0.8
Canada	85.18%	28.4	4.9
Israel	84.97%	6.2	1.1

Our Wikipedia Languages

Source: active Wikipedia editors (more than 5 edits) per project in May 2010.

Lang	Active
en	39383
de	7339
fr	5048
ru	4432
es	4347
ja	4345
it	3129
pt	1808
zh	1796
pl	1762
nl	1562
sv	977
ko	778
hu	706
he	688
fi	680
no	679
cs	634
uk	572
ar	527
tr	521
ca	491
fa	488
ro	372
da	304
id	299
vi	288
th	267
bg	255
sl	216
el	202
hr	192
sr	179
sk	147
simple	147
et	143
lt	138
eo	114
ta	85
ms	75
lv	75
gl	74
mk	71
az	71
eu	69
ka	64
nn	63
ml	57
la	55
is	49
hy	48
sq	47
hi	38
be_x_old	34
bn	33
be	32
bs	31
te	31
zh_yue	30
af	28
br	27
ga	27
tl	26
bar	25
cy	23
ur	23
lb	22
mr	22
si	21
oc	19
arz	19
km	18
sh	16
als	16
fy	15
zh-min-nan	15
an	14
ast	14
lmo	14
kk	13
sw	12
kn	12
ku	12
gd	12
li	12
ckb	12
tt	11
scn	11
nds	11
hif	11
qu	10
gu	10
os	10
yi	9
mn	9
ne	9
pap	9
jv	8
ceb	8
sco	8
nds_nl	8
udm	8
vec	7
ia	7
fo	7
tg	7
diq	7
cv	6
su	6
zh_classical	6
sah	6
mt	6
my	6
mg	6
ps	6
fiu-vro	6
vo	5
war	5
pms	5
pnb	5
co	5
wuu	5
vls	5
ba	5
ug	5
szl	5
ace	5
dsb	5
stq	5
eml	5
bjn	5
pcd	5
io	4
ht	4
am	4
ang	4
hsb	4
se	4
rm	4
frp	4
so	4
nv	4
mzn	4
kl	4
uz	3
yo	3
bpy	3
nah	3
bat_smg	3
wa	3
gan	3
gv	3
ksh	3
kw	3
lad	3
sc	3
ln	3
ab	3
xal	3
hak	3
mwl	3
frr	3
zea	3
new	2
nap	2
bo	2
sa	2
tk	2
dv	2
csb	2
pa	2
roa-rup	2
krc	2
crh	2
pdc	2
ext	2
ce	2
bh	2
gn	2
kv	2
wo	2
mrj	2
cu	2
koi	2
tet	2
kab	2
mdf	2
kbd	2
cdo	2
bi	2
ltg	2
sd	2
rn	2
pam	1
mi	1
bcl	1
mhr	1
map_bms	1
fur	1
ilo	1
ky	1
rue	1
arc	1
myv	1
roa_tara	1
lo	1
ie	1
rw	1
to	1
tpi	1
haw	1
bm	1
ee	1
za	1
zu	1
pnt	1
pih	1
ha	1
st	1
cr	1
ik	1
sg	1
om	1
ki	1
nrm
ig
lij
nov
jbo
cbk_zam
ay
or
pag
glk
pi
bug
rmy
na
lbe
as
iu
chr
kaa
av
kg
ty
srn
ch
ss
sm
dz
ks
got
ts
bxr
ff
sn
ak
fj
tum
ti
xh
ny
ve
tn
lg
tw
chy

It is each project's duty to help send messages to and from those projects its members can directly communicate with. Each project needs to have a plan for sharing translations between it and the global community.
If any project prefers to receive messages in a language other than its own or english, please let the global community know which language(s) would also work best for your community.
Can you estimate what percentage of your contributors speak both English and your project language? Hard numbers are ideal, but even a general consensus helps.
Are there any other languages, other than your project's own, that your editor community is familiar with?

Tentative thoughts about language priorities.

(This work has already been done by smarter people than me, and the results are very similar.)

Our hub lang is En. (And we should always remind people that this is purely pragmatic.)
Our five next-largest projects are: de, fr, ru, es, and ja. These six languages give us about 3/4 of our current editors, though this is already out of date and will change with time.
zh and ar probably need to be included in the 'core' just out of sheer common sense: both are large, diverse, face linguistic barriers, and are underrepresented in WM.
This list is not necessarily exhaustive.
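The "share of editors" arithmetic above can be reproduced with a few lines. The sketch below copies only the eleven largest projects from the May 2010 table on this page, so the share it prints is relative to that subset and is larger than the figure for all projects; extending the dictionary to the full table gives the share the text refers to. The function name is just an illustrative choice.

```python
# Active editors (more than 5 edits) in May 2010, copied from the table above.
# Only the eleven largest projects are included, so the result is a share of
# this subset, not of all projects.
active = {
    "en": 39383, "de": 7339, "fr": 5048, "ru": 4432, "es": 4347, "ja": 4345,
    "it": 3129, "pt": 1808, "zh": 1796, "pl": 1762, "nl": 1562,
}

def top_share(counts, k):
    """Fraction of all listed editors who belong to the k largest projects."""
    ranked = sorted(counts.values(), reverse=True)
    return sum(ranked[:k]) / sum(ranked)

share = top_share(active, 6)
print(f"{share:.1%}")
```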

Misc

Find the "obscure language enthusiasts" and ask them to help projects communicate.
For 'official'/'important' translations, consider recruiting the best editors from the Simple En to 'pre-translate' / proofread En statements for brevity, simplicity, clarity.

Strategies for communicating with en speakers

There are two basic strategies for translation to/from all internet languages.

"Direct Translation Strategy" relies upon links to en.
"Indirect Translation Strategy" relies upon links to a project that is itself strongly linked to en.

See Strategies by language

Visualizing our languages

'Truest' visualizations

The truth is that our projects are very densely connected. If you let each project be on the outside of a circle and connect them with thin gray lines, the inside of the circle is so covered with lines that it appears essentially like a solid color. In small numbers of active users, at least, our languages do connect to each other.

Forceatlas2 all edges

If we give some languages more 'weight' than others, then some 'fall' to the center, while others are pushed outward. Those closest to the center are, in this visualization at least, more 'central' according to ForceAtlas2.
While this image of a fully-interconnected set of languages is inspiring, it is not particularly useful in developing an intercommunication strategy.

Thus, we realize that while nearly all projects are connected, not all connections are equally strong. In some cases we have lots of users who speak both languages, in others just a few. Some of the very weak connections therefore need to be dropped, so that we see only those connections with the largest numbers of bilingual speakers. At the same time, let's not forget that those 'weak' connections do in fact exist, even though we aren't showing them. All visualizations from here on out omit most of the connections we actually have.

Visualization 2-- 'ForceAtlas2' showing top few edges

Give each language a weight based on its active users, kind of like a star's mass. Some stars are big, some are small; some languages have many active users, some are still young. Since the point is for everyone to be able to intercommunicate, languages with more active speakers carry 'more weight', since if you can speak that language, there are more people to talk to, and thus more opportunities for inter-language communication.
Let each interconnection between languages act like a "gravity" pulling the two language "stars" closer together. The stronger the interconnection, the stronger the 'pull'. Thus, closely connected languages will 'tend' to be close to each other.
Let languages that are not closely linked tend to push each other apart, 'repelling' each other like magnets of the same charge. Thus, languages that are not closely linked will 'tend' to be far apart from each other.
If you 'pretend' all this is so, throw a random assortment of language "stars" onto a blank "universe" and let time run. After a while, you get what you see here:

Thus, we see:

Large projects that are closely connected to others tend to fall to the 'heart' of this 'nebula', while less-connected projects orbit the outer rings of this central galaxy of languages. Languages that are closely connected tend to attract each other, and thus tend to be closer to each other. Languages that are less connected tend to lie farther apart from each other.
Most languages are in a central cloud of densely-interconnected languages. En is at the center of this main "nebula" that contains most of our projects' total mass.
English and Simple English are tightly bound together, like a binary star system. This makes sense.
In 'orbit' around the two english projects lies a dense "asteroid belt" of densely packed languages. Most world languages, and those closest to them, lie in this belt. pt, es, fr, it, pl, de and nl to name just a few.
Surrounding this 'asteroid belt' is a less-dense but still-central "ring" of languages which are still tied to the mass of the projects, just not to the extent of the languages in the belt: ru, zh, ar, ko, ja, to name just a very few.
In this ring, we see a lot of closely related language pairs. ko, ja, and zh are near each other, as are es and cs, and ru and uk. Again, to name just the ones my eye noticed.
Around the edges of the central cloud, we see some "hub" languages that are connected to lots of smaller projects: eu, hu, ka, az, bg, oc.
Outside of this central cloud, we see "hub" languages that serve as the centers of their own clusters/systems. These languages are strongly connected to the central cloud, but many of the languages they are connected to are not strongly connected to any central cloud languages. These projects may be frequented by language enthusiasts. These languages might be ideally suited to recruit communicators to act as conduits for those smaller projects in their orbit. Examples include wuu, kn, uz, vo (Volapük), gv (Manx), gd (Gaelic).
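The star-and-gravity analogy behind this layout (mass, attraction along links, repulsion between everything else) can be sketched as a toy simulation. To be clear, this is not ForceAtlas2 and not what Gephi runs; the force formulas, constants, and link strengths below are arbitrary assumptions chosen only to show the mechanics. The masses are the May 2010 active-editor counts from this page.

```python
import math
import random

def toy_layout(mass, links, steps=500, seed=0):
    """Very small force-directed layout: returns {name: [x, y]}."""
    rng = random.Random(seed)
    names = sorted(mass)
    heaviest = max(mass.values())
    # Throw the "stars" onto a blank "universe" at random positions.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in names}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                dist = math.hypot(dx, dy) or 1e-9
                strength = links.get((a, b), links.get((b, a), 0.0))
                # Positive pulls the pair together (link); negative pushes apart.
                f = strength * dist - 0.1 / dist
                force[a][0] += f * dx / dist
                force[a][1] += f * dy / dist
                force[b][0] -= f * dx / dist
                force[b][1] -= f * dy / dist
        for n in names:
            # Gravity toward the origin, scaled by relative mass.
            g = mass[n] / heaviest
            fx = force[n][0] - g * pos[n][0]
            fy = force[n][1] - g * pos[n][1]
            size = math.hypot(fx, fy)
            if size > 0:
                step = min(0.01 * size, 0.05)  # cap movement for stability
                pos[n][0] += step * fx / size
                pos[n][1] += step * fy / size
    return pos

mass = {"en": 39383, "simple": 147, "de": 7339, "ru": 4432, "uk": 572}
# Hypothetical link strengths between 0 and 1.
links = {("en", "simple"): 1.0, ("en", "de"): 0.8, ("en", "ru"): 0.4, ("ru", "uk"): 0.6}

pos = toy_layout(mass, links)
for name, (x, y) in sorted(pos.items()):
    print(f"{name:7s} {x:+.2f} {y:+.2f}")
```

Using a fixed seed makes the run repeatable; with a different seed the picture is rotated or mirrored, but the same forces are at work.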

Visualization Three: Connectedness by favorite second language to EN

Keep only one connection per project: its strongest connection. Drop all others. Languages directly connected to en are blue. Languages connected to blue nodes are colored orange or yellow: yellow if they have children, orange if they do not. Languages connected to a yellow language are colored red. These languages are, by this method, 'most distant' from en.
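The coloring rule just described can be sketched as follows. The connection strengths are hypothetical, and two details are assumptions of this sketch rather than part of the original analysis: anything deeper than a red node is also colored red, and a language whose chain of strongest links never reaches en is labeled "unreached".

```python
def color_by_strongest_link(weights, root="en"):
    """Color each language by how many strongest-link hops separate it from root."""
    best = {}
    for (a, b), w in weights.items():
        for lang, other in ((a, b), (b, a)):
            if lang != root and (lang not in best or w > best[lang][1]):
                best[lang] = (other, w)
    # Keep only one connection per language: its strongest one.
    parent = {lang: other for lang, (other, _w) in best.items()}
    has_children = set(parent.values())

    def depth(lang):
        seen, hops = set(), 0
        while lang != root:
            if lang in seen or lang not in parent:
                return None  # chain never reaches the root
            seen.add(lang)
            lang = parent[lang]
            hops += 1
        return hops

    colors = {}
    for lang in parent:
        d = depth(lang)
        if d is None:
            colors[lang] = "unreached"   # assumption of this sketch
        elif d == 1:
            colors[lang] = "blue"
        elif d == 2:
            colors[lang] = "yellow" if lang in has_children else "orange"
        else:
            colors[lang] = "red"         # depth 3, or deeper (assumption)
    return colors

# Hypothetical connection strengths, for illustration only.
weights = {
    ("en", "de"): 50, ("en", "ru"): 30, ("ru", "uk"): 20, ("en", "uk"): 5,
    ("uk", "be"): 4, ("ru", "be"): 1, ("de", "nds"): 3,
}
colors = color_by_strongest_link(weights)
print(colors)
```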

Languages in blue prefer to speak in en over other languages. These projects should tend to have strong translation communities available, if we could mobilize them.
Languages in orange or yellow prefer to speak in a blue language over en. If these languages have trouble communicating with en, their blue favorite-second-language would be where they likely turn for indirect translation.
Languages in red prefer to speak in a yellow language. Red languages are where potentially 'isolated' projects will be, the ones that may pose the greatest communication difficulties. (But being red alone does not automatically mean any such difficulties actually exist, of course.)
The connections of small red languages do not necessarily seem to mirror the language commonalities of their real-world populations (again, just my impression). I think we should be a little suspicious about whether the language connections of small red languages will remain the same as those languages gain active users. Presumably, an infusion of active users would tend to make this data match real-world linguistic classifications, which this visualization does not preserve. This may well be due to a biasing effect of the high requirement for active users and the short duration (1 month) of the time window studied. I predict that, over the foreseeable future, the Erdős-number visualization will approach one in which all languages are connected to the language we currently call en; that is, all languages become blue languages whose favorite second language is en.
The en<-->de connection is, in absolute terms, a special connection: the strongest connection between our two most populous projects. Attempts to build a common lexicon (Wikilish?) should begin at that connection.
ru, de, and zh all jump out as important 'hubs' or 'branches' in this scheme; that is, they connect us to many languages that are not themselves most-directly connected to en.

Addendum: a fourth visualization-- 'binary tree'

Starting with en, pick two 'child' languages-- the two strongest second-languages on en. Link to them.

For each of those, pick their two strongest second languages (ignoring those languages already connected to en elsewhere in the graph). Keep doing this, so that each language has two "children" that it is the 'parent' of. (Parent and child are mathematical terms in this case, completely unrelated to any real-world meaning.)
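The greedy construction described above might look roughly like this. The breadth-first visiting order and the connection strengths are assumptions of the sketch; the original script may have walked the tree differently.

```python
from collections import deque

def binary_tree(weights, root="en"):
    """Greedily give each language up to two children: its strongest unplaced links."""
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, []).append((w, b))
        neighbors.setdefault(b, []).append((w, a))
    children = {}
    placed = {root}
    queue = deque([root])
    while queue:
        lang = queue.popleft()
        picks = []
        # Strongest connections first; skip languages already in the tree.
        for _w, other in sorted(neighbors.get(lang, []), reverse=True):
            if other not in placed:
                picks.append(other)
                placed.add(other)
            if len(picks) == 2:
                break
        children[lang] = picks
        queue.extend(picks)
    return children

# Hypothetical connection strengths, for illustration only.
weights = {
    ("en", "de"): 50, ("en", "ru"): 30, ("ru", "uk"): 20, ("en", "uk"): 5,
    ("uk", "be"): 4, ("ru", "be"): 1, ("de", "nds"): 3,
}
tree = binary_tree(weights)
print(tree)
```

Because each parent claims its children first come, first served, a language can end up under a parent that is not its own strongest link; in this sample, be lands under ru even though its strongest link is to uk.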

This style of connections, called a binary tree, is a very bad way to model our densely interconnected projects. It ignores most of our connections and arbitrarily imagines that each project can communicate with, at most, three other languages. This is not at all true: each language can communicate with as many languages as it has bilingual speakers, so the maximum of 3 connections is a very unusual and arbitrary one. Thus, I do not currently know of an application that requires such a layout.
That said, for the inner parts of the tree, a few meaningful patterns do emerge. en<-->de, ru->uk, es->ca. Mostly though, this kind of a structure isn't very useful for modeling our project.
We could improve upon this by trying to create a 'globally optimum' binary tree-- right now, node relationships are assigned rather arbitrarily. Since a binary tree is of no known current use to us, I didn't bother.

Tentative Conclusion and improvements

Contrary to my earlier concerns which prompted this study, the use of en as a central language does seem objectively defensible. Real-world populations, readership population, and our less active editors may all have dramatically different language preferences than our active users. But among active users, practically all languages have their strongest second-language ties to en.
While this may be a bias in the data caused by the high standard for "active user", it may also be fairly comprehensible. The Wikimedia Movement started in en, and being able to communicate with the existing movement, in some limited or indirect way, is a reality-imposed barrier to joining the movement.

In future, try to get a larger dataset that comes closer to getting all users.
In future, directly ask users for their language proficiencies, so we don't have to infer it.
In future, create a smallest-space matrix, factor analysis, other such stuff.
In future, do a similar analysis but use percentage-connection as a weight instead of strength-of-connection.
In future, do an edge layout looking just at links to en and run ForceAtlas2 on it. With no shared edges, projects' locations will depend entirely upon their relationship to en. This will produce a map of 'distance from en'. Do the same thing with just links from en/de, and so on.
If requested, do a 'distance from any given project' visualization, of the sort done for en.
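For the percentage-connection idea in the list above, one possible reading is "the share of a project's active editors who are also active on the other project". The sketch below follows that reading, which is my assumption rather than a settled definition. The active counts are from the May 2010 table on this page; the overlap counts are hypothetical.

```python
def percentage_weights(overlap, active):
    """Return {(a, b): share of a's active editors who are also active on b}."""
    pct = {}
    for (a, b), shared in overlap.items():
        pct[(a, b)] = shared / active[a]
        pct[(b, a)] = shared / active[b]
    return pct

active = {"en": 39383, "de": 7339, "ru": 4432, "uk": 572}
# Hypothetical shared-editor counts, for illustration only.
overlap = {("en", "de"): 500, ("ru", "uk"): 100}

pct = percentage_weights(overlap, active)
for (a, b), share in sorted(pct.items()):
    print(f"{a}->{b}: {share:.1%}")
```

Unlike raw strength-of-connection, this weight is directional: a small project can be heavily tied to a large one even when the raw count of shared editors is modest.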