Talk:List of Wikipedias by sample of articles/Archives/2011


Weight of is.

By the same calculation as above, Icelandic uses 1092 characters and ought, therefore, to be weighted 1162/1092 = 1.06410256..., which rounds to 1.1.

However, when comparing a longer stretch of continuous prose, e.g. Luke 1.1-25 in Icelandic with The New International Readers Version, I find that Icelandic uses 463 words vs 599 in English, 2100 vs 2484 characters without spaces and 2562 vs 3082 characters with spaces. All of which would indicate a more accurate weight of at least 1.2 --89.160.135.115 23:09, 7 February 2011 (UTC)
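The three comparisons in the comment above can be turned into ratios directly. The following is only a minimal sketch using the counts quoted there; the dictionary and function names are illustrative and do not come from the ranking script.

```python
# Counts quoted above for Luke 1.1-25, as (English count, Icelandic count).
counts = {
    "words": (599, 463),
    "chars_no_spaces": (2484, 2100),
    "chars_with_spaces": (3082, 2562),
}


def implied_weight(english, icelandic):
    """English length divided by Icelandic length, i.e. the weight this measure implies."""
    return english / icelandic


# Round to two decimals so the three measures can be compared side by side.
ratios = {name: round(implied_weight(en, isl), 2) for name, (en, isl) in counts.items()}

for name, value in ratios.items():
    print(f"{name}: {value}")
```

The character-based ratios come out close to 1.2, while the word-based ratio is noticeably higher, so the choice of unit matters as much as the choice of text.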

It is a generous round up but I agree with using a weight of 1.1 for Icelandic.
I am not sure about using something besides Genesis 11:1-9, since that would open the possibility of cherry-picking favorable texts. It is already possible to pick a favorable Bible translation of Genesis. For example, if we use the version of the Bible you provided, "The New International Readers Version", then the English version of Genesis 11:1-9 is about 1071 characters, as compared with 1162 characters in the "English Standard Version". That would mean the weight for Icelandic would already be correct at one. --MarsRover 04:22, 8 February 2011 (UTC)
Here Genesis 11.1-9 in Icelandic uses 998 characters. Nota bene, Icelandic doesn't have the plethora of translations that are available in English. I just picked what seemed like a recent and straightforward English translation (having no familiarity with any of them), but if I were cherry-picking, I'd pick the more recent Icelandic translation (linked to above) to compare Genesis 11.1-9, giving me 1162/998 = 1.16432866... = 1.2 ("generous" round up perhaps, but that's just how it rounds: 1.164 doesn't round to 1.1, just as 1.064 doesn't round to 1.0). However, the need for everyone to use the same criterion is understandable, so a weight of 1.1 seems reasonable. --89.160.135.115 09:50, 8 February 2011 (UTC)

This does not seem to have been taken into account before the last update on March 1. Perhaps it could be factored in before the next update? --Cessator 18:07, 30 March 2011 (UTC)

More weights

Malay uses 1,195 characters, therefore it should use a weight of 1162/1195 = 0.97238... = 1.0. (This changes nothing, but that star denoting a language with no data annoys me. :)

Macedonian uses 893 characters, therefore it should use a weight of 1162/893 = 1.30123... = 1.3.

Lithuanian uses 948 characters, therefore it should use a weight of 1162/948 = 1.22573... = 1.2.

Hindi uses 1,170 characters (if I'm not mistaken), therefore it should use a weight of 1162/1170 = 0.99316... = 1.0. (again this changes nothing, but is merely for statistical purposes.)

Latvian uses 1,042 characters, therefore it should use a weight of 1162/1042 = 1.11516... = 1.1. -- Prince Kassad 10:14, 19 February 2011 (UTC)
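The calculation repeated in the entries above is the English character count divided by the language's character count, rounded to one decimal. A minimal sketch, assuming the 1162-character English reference quoted earlier in this discussion; the names `REFERENCE_CHARS` and `babel_weight` are illustrative, not taken from the actual script.

```python
# English character count for Genesis 11:1-9 used as the reference above.
REFERENCE_CHARS = 1162

# Character counts reported in this section.
babel_chars = {
    "Malay": 1195,
    "Macedonian": 893,
    "Lithuanian": 948,
    "Hindi": 1170,
    "Latvian": 1042,
}


def babel_weight(chars, reference=REFERENCE_CHARS):
    """Weight = reference length / translation length, rounded to one decimal."""
    return round(reference / chars, 1)


weights = {lang: babel_weight(n) for lang, n in babel_chars.items()}

for lang, weight in weights.items():
    print(f"{lang}: {weight}")
```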

Weights for April

Aragonese uses 1,074 characters, therefore it should use a weight of 1162/1074 = 1.08194... = 1.1.

Estonian uses 954 characters, therefore it should use a weight of 1162/954 = 1.21803... = 1.2.

Kannada uses 1,241 characters, therefore it should use a weight of 1162/1241 = 0.93634... = 0.9.

Afrikaans uses 1,176 characters, therefore it should use a weight of 1162/1176 = 0.98810... = 1.0. -- Prince Kassad 09:42, 2 April 2011 (UTC)

+3.70 in Ukrainian wiki

This month the Ukrainian Wikipedia showed +3.7 growth in this project. Is this the maximum growth among all Wikipedias of all time? --Alex Blokha 17:12, 11 April 2011 (UTC)

I think that is the 7th highest of all time. See the awards page. --MarsRover 21:10, 11 April 2011 (UTC)
Yep, you are right. Catalans are first here too :). They hold the first 4 places, 5th is Panjabi and 6th is Winaray. --Alex Blokha 08:37, 14 April 2011 (UTC)

One thought about weights

I know many people have complained about using the Tower of Babel as a source of weights because it is biblical and thus many translations tend to be very literal (which gives Russian a really skewed score, for example). So I tried to find a better text which is not religious and available in many languages. And guess what, one such text that came to my mind is the Universal Declaration of Human Rights!

It is a really long text so it should give much more accurate results. As well, since it is not religious, it should be rather authentic.

If there is demand, I'll post up a bunch of weights. -- Liliana 22:52, 27 November 2011 (UTC)

Sounds interesting. I think seeing some of the differences between using this text vs the bible might be helpful in deciding whether to change. --MarsRover 04:34, 28 November 2011 (UTC)
Ok then! I counted everything after the heading "Preamble", for reference
New weights to three decimals. — Yerpo Eh? 08:45, 21 January 2012 (UTC)
Language Characters Weight (new) Weight (old)
English 10597 1.000 1.0
Catalan 10910 0.971 1.1
French 11853 0.894 1.0
German 11853 0.894 1.0
Spanish 11817 0.897 1.1
Russian 11675 0.908 1.4
Ukrainian 10658 0.994 1.3
Italian 11888 0.891 1.1
Chinese 2799 3.786 3.7
Portuguese 11315 0.937 1.1
Japanese 4154 2.551 1.9
Bulgarian 11332 0.935 1.1
Vietnamese 12816 0.827 1.1
Swedish 10557 1.004 1.1
Hungarian 11991 0.884 1.1
Czech 9788 1.083 1.3
Polish 11086 0.956 1.1
Finnish 11058 0.958 1.1
Korean 4706 2.252 2.5
Dutch 12721 0.833 0.9
Hebrew 7230 1.466 1.2
Norwegian Bokmal 10173 1.042 1.2
Turkish 10244 1.034 1.3
Serbian 9457 1.121 1.4
Arabic 7527 1.408 1.0
Greek 12361 0.857 1.1
Romanian 11860 0.894 1.1
Croatian 9826 1.078 1.3
Slovak 10056 1.054 1.3
Danish 10837 0.978 1.2
Belarusian 11308 0.937 1.4
Persian 9081 1.167 1.2
Indonesian 12455 0.851 0.9
Galician 11188 0.947 1.1
Macedonian 10653 0.995 1.3
Esperanto 9869 1.074 1.1
Slovenian 10324 1.026 1.2
Lithuanian 10848 0.977 1.2
Thai 9268 1.143 1.0
Malayalam 10555 1.004 1.1
Malay 12539 0.845 1.0
Latin 9907 1.070 1.1
Hindi 10831 0.978 1.0
Latvian 10423 1.017 1.1
Estonian 10749 0.986 1.2
Tamil 13244 0.800 0.9
Basque 10954 0.967 1.1
Kannada 10608 0.999 0.9
Icelandic 10182 1.041 1.1
Afrikaans 10336 1.025 1.0
Welsh 10090 1.050 1.2
Armenian 11716 0.904 1.2
Cebuano 12145 0.873 0.8

As you can see, many weights are corrected downwards, confirming my feeling that English is shorter than most other European languages. -- Liliana 06:40, 28 November 2011 (UTC)

I was under the impression that the current script checks for character size. Or does it just count the characters? (my Python skillz are not nearly l33t enough to be able to see that from the source code). — Yerpo Eh? 13:24, 28 November 2011 (UTC)
I think it counts just the characters, not how large they are in UTF-8. -- Liliana 14:31, 28 November 2011 (UTC)
Oh, ok, then this new set of weights makes a lot of sense. — Yerpo Eh? 21:09, 28 November 2011 (UTC)
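The distinction raised in this exchange, counting characters versus counting encoded size, is easy to see in Python. This is only an illustration of the difference; it does not show what the ranking script actually does, which is described above only as a belief.

```python
# Three short strings in different scripts.
samples = ["English", "Русский", "日本語"]


def char_and_byte_lengths(text):
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


for text in samples:
    chars, utf8_bytes = char_and_byte_lengths(text)
    print(f"{text}: {chars} characters, {utf8_bytes} bytes")
```

Latin letters take one byte each in UTF-8, Cyrillic letters two and CJK characters three, so a byte count and a character count would call for quite different weights.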
When are the new weights going to be implemented in the computations? They have much to be said for them. Jacob. 71.163.203.52 22:30, 3 January 2012 (UTC)
Probably need some more agreement before changing the weights since the results will drop half the wiki scores. Only a few will benefit (Japanese and Chinese wikis) so I predict a backlash. --MarsRover 01:56, 4 January 2012 (UTC)
Hasn't everybody been in agreement that the current weights have been temporary, until such time as an apter text could be found? And wow, is the "Universal Declaration of Human Rights" ever an apter text! (Unlike the current text, it has no religious bias, and—one might say this is a dispositive point—it's much longer.) Surely our contributors are people of goodwill, who wouldn't want to boost their rankings by false methods. Would it soften the blow if the new weights were good to the nearest hundredth (instead of tenth)? The size of the new sample might justify another significant digit. If hundredths were in play, some wikis now poised to drop by 20% would drop by only about 16 percent, or 17 percent, or whatever. Jacob 71.163.192.196 16:28, 5 January 2012 (UTC)
My biggest concern is where does one find these texts in case someone wants to double check the calculation or even the translation. --MarsRover 01:56, 4 January 2012 (UTC)
Well that shouldn't be a problem, the texts are right here! -- Liliana 07:08, 4 January 2012 (UTC)
Would a test run of the whole table be too much work for you? I honestly don't have a notion of how much work you have every month to check everything and transform the script's output into the neat table we see. — Yerpo Eh? 13:57, 5 January 2012 (UTC)
I agree that the UDHR is a better text; translations of the Bible that appear on line are often archaizing (or actually old), whereas this text is in more contemporary language in all the versions I've looked at. And the longer the sample we take, the better. Admittedly, I mostly work in Latin, whose weight wouldn't change, so I wouldn't be scrambling to recover my team's place in the league tables (so to speak). But I think it's a good idea. Another source, by the way, is here at the UN Office of the High Commissioner for Human Rights. Are there any Wikipedia languages that don't currently have versions of the UDHR? -- if so, what happens to them? Otherwise, I support this proposal. A. Mahoney 14:02, 13 January 2012 (UTC)
I suppose those languages would be treated the same way that those without a translation of the Tower of Babel are treated now - assigned a default weight of 1.0 or the weight of a closely related language. Anyway, I support this as well and I'm prepared to help with calculating weights if we decide to do this. — Yerpo Eh? 19:16, 14 January 2012 (UTC)
Using a longer text = good. Using a text where there is less risk of finding archaic versions = good. Using a non-religious text = at least not bad. So I also support the change. And using two more decimals on the weights also seems like a good idea to me. Boivie 12:31, 15 January 2012 (UTC)
I wonder, would it be possible to see what the weights would be, calculated to two decimals? And I'd also be curious to see a test run of the table under the new weights (maybe even under the proposed weights to the nearest 10th and also to the nearest 100th, but that's getting to be rather a lot of tests to analyze, to say nothing of run time). Is there a way to solicit responses from some of the editors from languages whose weights will change a lot? If they pay attention to the results, but aren't following this discussion, they may get surprised; is there a way to bring this discussion to broader attention? Or is that not typically done here? A. Mahoney 17:43, 19 January 2012 (UTC)
I changed the table to show new weights to three decimals. Some Wikipedias have specialized projects for improving vital articles (e.g. ca:Viquiprojecte:Els 1.000), we could go there, for example. — Yerpo Eh? 08:45, 21 January 2012 (UTC)
Great idea! That should help make the statistics more exact. -- Liliana 22:36, 21 January 2012 (UTC)
It is very interesting to look at the various UDHR translations and play around with Google Translate. I found that several trivial English words are repeated, and if your language doesn't have a concise wording that duplicates the meaning, you are in trouble. --MarsRover 05:57, 25 January 2012 (UTC)
That's the way languages work, and it's a strong argument for using the longest and (with regard to content) the most varied text possible, so as to minimize this effect. Jacob. 71.163.68.106 16:51, 26 January 2012 (UTC)
For example, the phrase "Whereas" may have to be translated to "Having regard to" or "Considering that".--MarsRover 05:57, 25 January 2012 (UTC)
Yes, but it may also have to be translated with something shorter ! In Latin, 'whereas' (7 letters) can be translated as quoniam (7 letters), or quando (5), or quod (4) or cum (3). Jacob. 71.163.68.106 16:51, 26 January 2012 (UTC)
The pronoun "Everyone" is repeated many times and some languages have to use "All persons" to get the same meaning. --MarsRover 05:57, 25 January 2012 (UTC)
And some languages can use a word shorter than 'everyone' (8 letters) ! In Latin, you can sometimes merely say omnes (5 letters)—but then again you've got the option of saying unusquisque (11 letters). Many languages have multiple ways of saying approximately the same thing. In Samoan, for 'everyone', you can say taai or taitasi or taitoatasi or taitoatasi uma, and if you're being supercorrect, you can insert a sign for the glottal stop and have ta'ito'atasi 'uma (17 characters). Jacob. 71.163.68.106 16:51, 26 January 2012 (UTC)
The advantage of the Bible over the UDHR is that it wasn't written in English to begin with. I am pretty sure the first language something is written in will always be more concise than the language forced to duplicate the meaning with a different vocabulary. --MarsRover 05:57, 25 January 2012 (UTC)
This speculation isn't self-evidently true. Almost every language, when compared with other languages, will have unique features (call them quirks if you like), some of which will tend toward brevity and some of which will tend toward length. The longer and more varied the reference text, the less likely these features are to skew the weightings. Jacob. 71.163.68.106 16:51, 26 January 2012 (UTC)
I just compared the first chapter of the Quran and Arabic had almost a 2.0 weight ([1],[2]), compared with 1.4 using the UDHR and 1.0 using the Bible. Whether it's tailoring the document to the language or taking more time picking shorter words, the first language seems to have a better score. Dare you to prove me wrong. --MarsRover 08:44, 27 January 2012 (UTC)
That's too small a sample to put much faith in. Even the UDHR text isn't long & varied enough to bring out some of the features of individual languages; however, Guinness World Records says the UDHR is the most-translated document in the world, and that fact adds to its potential utility here. 71.163.64.43 21:18, 28 January 2012 (UTC)
OK, let's compare the first verse of Genesis in the Samoan Bible with the English (KJV) text, from which it was translated. (From the Samoan point of view, it was effectively written in English.)
In the beginning God created the heaven and the earth.
Na faia e le Atua le lagi ma le lalolagi i le amataga.
The Samoan is exactly the same length as the quasi-original, not longer. Now verse 2, part 1:
And the earth was without form, and void;
Sa soona nunumi le lalolagi ma ua gaogao,
Hmm. Exactly the same length again! Now part 2:
and darkness was upon the face of the deep.
sa ufitia foi le moana i le pouliuli;
Uh-oh. The Samoan is shorter! Now for part 3:
And the Spirit of God moved upon the face of the waters.
na fegaoioiai le Agaga o le Atua i le fogatai.
Again, the Samoan is shorter. According to the list, the (assumed) weight of Samoan is 1.0, so your hypothesis expected the Samoan to be longer than the English here. Part of the brevity in the last phrase is that Samoan has a single word, fogatai, which = 'face of the waters' (or 'surface of the sea'); meanwhile, for 'moved', its fegaoioiai has the added connotation of 'moved about', 'moved this way and that', 'moved all over the place', which would be longer in English if the English tried to catch the dynamism of fegaoioiai. The reason a longer reference text is preferable to a shorter one is to give the diction a chance to "average out" these kinds of contrasts. 71.163.64.43 22:12, 28 January 2012 (UTC)
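The phrase-by-phrase comparison above can be checked mechanically by counting characters, spaces and punctuation included. A small sketch using exactly the four phrase pairs quoted in this comment; the variable names are illustrative.

```python
# (English KJV phrase, Samoan phrase) pairs as quoted above.
pairs = [
    ("In the beginning God created the heaven and the earth.",
     "Na faia e le Atua le lagi ma le lalolagi i le amataga."),
    ("And the earth was without form, and void;",
     "Sa soona nunumi le lalolagi ma ua gaogao,"),
    ("and darkness was upon the face of the deep.",
     "sa ufitia foi le moana i le pouliuli;"),
    ("And the Spirit of God moved upon the face of the waters.",
     "na fegaoioiai le Agaga o le Atua i le fogatai."),
]

# Character counts per phrase, including spaces and punctuation.
lengths = [(len(english), len(samoan)) for english, samoan in pairs]
total_english = sum(en for en, _ in lengths)
total_samoan = sum(sm for _, sm in lengths)

for en, sm in lengths:
    print(f"English {en} vs Samoan {sm}")
print(f"Total: English {total_english} vs Samoan {total_samoan}")
```

Four phrases are far too few to derive a weight from, which is the point being made: short samples swing on individual word choices.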
Oooh, there could be a paper in that! A. Mahoney 14:02, 27 January 2012 (UTC)
Yes, it's an interesting hypothesis. Let's state it for the record, in a testable form: the length of a translation of a text will exceed the length predicted by the weighting of its language in respect to the language of the text from which it's translated. To my ear, this sounds something like "everyone is above average," but maybe others will hear it differently. 71.163.64.43 22:12, 28 January 2012 (UTC)
One small question, though: are all our sample texts (roughly) the same length? The UDHR is about 1600 words in English, the Tower of Babel passage is closer to 250, and the first chapter of the Quran is only about 100. I think that discrepancy could skew the analysis. What about taking 1600 words of the Quran and comparing that? Or a similar amount of text from Genesis? I would conjecture that if we take half a dozen texts of the same general size, we'd get similar language weights. But it wouldn't surprise me that several different 250-word passages from the Bible would produce strikingly different results. A. Mahoney 14:02, 27 January 2012 (UTC)
But I do like the ease of finding the various translations. To wipe out the bias, I wonder if we could use Spanish instead of English as the "baseline" and just have English be the same weight as Spanish (i.e. "1.0")? --MarsRover 05:57, 25 January 2012 (UTC)
A feature that seems to have gone unremarked is the absence of the article the from the English version of the "Lord's Prayer" (stopping just after "deliver us from evil"). It's the commonest word in English, typically used (I'd guess) once every 10 words at most, but in the 48 to 52 (or so) words in that text, its frequency is zero. In this respect, the English text is quite un-English: it's terser than usual. Therefore, using the "Lord's Prayer" for a reference text is probably going to skew the weightings of languages, like Russian and Latin, that don't use an article. (It may give them a lower weight and disadvantage them in the rankings.) This is just one of those quirks mentioned above, and again the takeaway is the same: a longer reference text should be preferred to a shorter one. Jacob. 71.163.68.106 16:51, 26 January 2012 (UTC)

Sure, anything that makes catalan go down in the rankings is acceptable. But nevermind, we are used to it.--Arnaugir 12:57, 22 January 2012 (UTC)

If new weights were applied to Feb 2012 result...

Rank | Wiki | Language | Weight | Mean Article Size | Median Article Size | Absent (0k) | Stubs (< 10k) | Articles (10-30k) | Long Art. (> 30k) | Score | Growth
1 | en | English | 1.0 | 66,754 | 58,181 | 0 | 15 | 175 | 810 | 88.94 | +0.00
2 | ca | Català | 0.971 | 36,018 | 30,100 | 0 | 3 | 487 | 510 | 72.68 | -20.65
3 | fr | Français | 0.894 | 42,274 | 29,417 | 0 | 145 | 366 | 489 | 66.78 | -4.13
4 | de | Deutsch | 0.894 | 40,261 | 28,978 | 4 | 134 | 382 | 480 | 66.47 | -3.72
5 | ja | 日本語 | 2.551 | 38,617 | 26,316 | 0 | 153 | 411 | 436 | 63.57 | +8.19
6 | es | Español | 0.897 | 37,263 | 24,574 | 1 | 171 | 414 | 414 | 61.70 | -6.49
7 | zh | 中文 | 3.786 | 41,786 | 24,018 | 0 | 215 | 373 | 412 | 60.17 | +0.74
8 | it | Italiano | 0.891 | 29,482 | 19,870 | 1 | 254 | 416 | 329 | 54.21 | -7.19
9 | ru | Русский | 0.908 | 29,450 | 19,347 | 0 | 256 | 429 | 315 | 53.41 | -14.36
10 | pt | Português | 0.937 | 27,692 | 16,494 | 0 | 244 | 483 | 273 | 51.48 | -7.89
11 | uk | Українська | 0.994 | 21,707 | 14,691 | 0 | 297 | 498 | 205 | 45.93 | -19.48
12 | he | עברית | 1.466 | 18,581 | 12,865 | 1 | 397 | 427 | 175 | 40.89 | +5.53
13 | sv | Svenska | 1.004 | 17,031 | 11,094 | 1 | 442 | 395 | 162 | 38.67 | -3.62
14 | ar | العربية | 1.408 | 24,104 | 9,598 | 0 | 512 | 288 | 200 | 38.49 | +6.40
15 | pl | Polski | 0.956 | 19,374 | 10,821 | 4 | 470 | 363 | 163 | 37.66 | -3.33

The complete table is available at List of Wikipedias by sample of articles/UDHR test

Interesting. I notice a large difference for Japanese, from 1.9 to 2.551. It would be good to know why there is such a difference. Do the sample texts use different amounts of Kanji? If so, which sample is closest to an average Japanese Wikipedia article? Or are bytes counted in one case, and characters in the other? Boivie 08:06, 3 February 2012 (UTC)
I'd think that the Babel text uses lots of words that don't exist in Japanese, and have to be written using kana, as opposed to the UDHR. -- Liliana 13:31, 3 February 2012 (UTC)

Reaction to prototype

Thanks for this: it is interesting, and an initial read-over confirms my impression that the new weights are at least as sensible as the existing ones. A. Mahoney 13:16, 3 February 2012 (UTC)

Yes, thanks also. I think that the new weights make much more sense than the old ones which lead to questions about the weights that can't really be answered. These new weights are more consistent, for example Japanese doesn't differ as much as before from Chinese, and I think no one has ever understood why Catalan should differ so much from the other languages. So it's just right to change the weights and take a newer, longer text instead. I didn't understand the old weights anyway.
It's also correct that German texts are normally a bit longer than the English ones. As German speaker, I think it hasn't been right before, that the English weight was exactly the same as the old German one. --Geitost diskusjon 17:11, 3 February 2012 (UTC)
The decrease in score for each wiki seems proportional to how much work was put into improving its articles. This seems like an obvious discouragement. At the very least the weights need to be modified (e.g. choose another base language) so that the change doesn't mean more than half the wikis have their score drop. Initially this mini-project was just to be a better measurement of wiki quality than article count. But IMHO a more important goal of just encouraging improvement has superseded that goal. -MarsRover 17:21, 3 February 2012 (UTC)
I don't see why people would be discouraged. We must assume (as we're told elsewhere) that wikipedians are people of goodwill, and people of goodwill wouldn't want their wiki to be ranked higher than it deserves, as that might seem tantamount to cheating. The question is how to refine the methodology. Would it be useful to consider taking both samples together, rather than either of them separately? Just because one sample is much longer doesn't necessarily mean that the other sample doesn't have at least a little value. Ideally, we'd have a single sample of maybe 100,000 words, but that's not going to happen anytime soon. Jacob. 71.163.192.8 03:01, 4 February 2012 (UTC)
I'm not sure it matters which base language we take, does it? Given a particular base text to count, if Language A's version is 1000 characters long, Language B's is 1200 characters, and Language C's is 1400 characters, we could take A as base and make the weights 1, 0.83, 0.71. Or we could take B as base and get 1.2, 1, 0.86. Or we could take C as base and get 1.4, 1.17, 1. We will always have (A's weight) > (B's weight) > (C's weight), and the proportions won't change -- for example (B's weight) : (C's weight) = 7 : 6 in all cases.
What is making a difference here is the change of base text, from a short one to a longer one. A short text might co-incidentally have some unusual feature that makes Language A's version unexpectedly long (or short). In a long enough text, while there could be such quirks, they should be an insignificant proportion of the whole. I think it's the length, rather than the change from Genesis to the UDHR, that's important here; I would expect if we used not just the Babel story but a section of Genesis roughly the same length as the UDHR, we'd get results like the ones shown here. A. Mahoney 17:55, 3 February 2012 (UTC)
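The invariance argument with languages A, B and C can be verified with exact fractions. A minimal sketch using the hypothetical character counts from the comment above; `weights_for_base` is an illustrative name.

```python
from fractions import Fraction

# Hypothetical character counts from the example above.
chars = {"A": 1000, "B": 1200, "C": 1400}


def weights_for_base(base):
    """Weight of each language when `base` is the reference (weight exactly 1)."""
    return {lang: Fraction(chars[base], n) for lang, n in chars.items()}


for base in chars:
    weights = weights_for_base(base)
    rounded = {lang: round(float(w), 2) for lang, w in weights.items()}
    # The ratio B : C is printed as an exact fraction and is the same for every base.
    print(f"base {base}: {rounded}, B:C = {weights['B'] / weights['C']}")
```

Changing the base multiplies every weight by the same constant, so the ordering and all pairwise ratios stay fixed; only the position of the fixed 10K and 30K thresholds relative to the weighted sizes moves.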
Whatever the referential language is, the ratios of all the languages to each other will be same, I think. But we have good reasons to retain English as the referential language: the English wiki is the biggest of all the wikis, with by far the largest number of editors, and the English language is, as Wikipedia puts it, "the leading language of international discourse." Jacob. 71.163.192.8 03:01, 4 February 2012 (UTC)
The net result of using the UDHR instead of Babel is that English has a higher weight than the rest. Keeping it as the base means its weight will still be 1.0 and the rest will have to decrease. We can pick another language so that English will have a higher weight and the rest will stay as close as possible to their current weight. By not decreasing weights we can minimize the disruption to people who've already edited a bunch of articles that are barely at the threshold for the next classification (e.g. Stubs, Articles, Long Article). --MarsRover 03:35, 4 February 2012 (UTC)
Yes, that's a pertinent point: lowering the weight of some wikis will push certain of their articles below the 10K and/or 30K cutoffs. This fact won't surprise people working in earnest on the 1000 pages because they'll be following the discussion here and will be planning accordingly. For example, the new weight will downgrade the articles in the Latin wiki: some articles will fall below 10K, and a couple will fall below 30K; thanks, however, to Latin's canny programmers, editors in the Latin wikipedia know exactly which articles those are, and they're taking steps to beef them up. (Similarly, Latin already had an article in place for en:Menstruation, so as to be ready when it would be added to the list, in place of en:Watt.) In short: alert editors won't be surprised by the proposed changes and will be taking steps to accommodate themselves to them; lackadaisical editors, though they might be surprised, won't care. Jacob. 71.163.75.111 12:49, 4 February 2012 (UTC)
About choosing where to put 1.000. The three largest Wikipedias, English, German and French, all have 1.0 now. Maybe we should put the new 1.000 at the average of those three languages (11434 characters)? That would put English at 1.079, and German and French at 0.965. By the way, could someone double-check the numbers for German and French? It seems a bit unlikely that they ended up at the exact same number of characters. Boivie 12:45, 5 February 2012 (UTC)
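The rebasing proposed in this comment can be reproduced from the UDHR character counts listed in the table earlier in this section. A sketch only: it takes the counts as listed, including the German figure that the comment itself asks to have double-checked, and the variable names are illustrative.

```python
# UDHR character counts as listed in the weights table above.
udhr_chars = {"English": 10597, "German": 11853, "French": 11853}

# New baseline: the average length of the three largest Wikipedias' versions.
baseline = round(sum(udhr_chars.values()) / len(udhr_chars))

# Weight of each language relative to that baseline, to three decimals.
rebased = {lang: round(baseline / n, 3) for lang, n in udhr_chars.items()}

print("baseline:", baseline)
for lang, weight in rebased.items():
    print(f"{lang}: {weight}")
```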
Why would you even want to artificially alter the scores just so some Wikipedias don't drop down that hard? To me, this seems like falsification of statistics. -- Liliana 13:00, 5 February 2012 (UTC)
Well, in the language weight list above, on average the scores are reduced by 11.7%. So one could argue that the limit between stub and article should be changed from 10000 to 8827 characters, if we think that the limit should be, on average, as hard to reach as before for the languages. Or we could adjust the 1.000 point in a way that is less influenced by the difference between the text samples for only one of the languages (the English). Boivie 08:28, 6 February 2012 (UTC)
That's actually a very good idea. 10.000 characters is already a quite decently-sized article (amounting to some two printed pages of content with font 11 plus another for references, or more content if it isn't very well referenced) and not even all featured articles in various Wikipedias reach 30.000, so there's no sense in increasing this threshold. I feel quite silly not thinking about this myself. So I imagine "article" would be over, say, 8500 and "long article" over 26500 characters? Sounds very reasonable to me. — Yerpo Eh? 12:16, 6 February 2012 (UTC)
Instead of decreasing the thresholds I would be in favor of changing the base language. For example, changing the base language to German or French is the same as changing the thresholds to 8900 and 26800. I am also in favor of just using one decimal for the weight (UDHR's not really that much more accurate). I think the simpler the better when it comes to the formula. --MarsRover 21:36, 6 February 2012 (UTC)
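The equivalence claimed here between changing the base language and lowering the thresholds can be sketched as follows. This assumes, as the discussion above implies, that an article is classified by its raw size multiplied by the language weight; the function names and the sample sizes are illustrative, and 0.894 is the German/French weight as listed in the table.

```python
STUB_LIMIT = 10_000   # boundary between "stub" and "article"
LONG_LIMIT = 30_000   # boundary between "article" and "long article"
GERMAN_WEIGHT = 0.894  # UDHR weight listed above, with English = 1.000


def classify(raw_chars, weight, stub_limit=STUB_LIMIT, long_limit=LONG_LIMIT):
    """Classify an article by its weighted size (assumed rule: raw size * weight)."""
    size = raw_chars * weight
    if size < stub_limit:
        return "stub"
    if size <= long_limit:
        return "article"
    return "long article"


def rebased_limits(base_weight):
    """Thresholds that have the same effect as dividing every weight by base_weight."""
    return STUB_LIMIT * base_weight, LONG_LIMIT * base_weight


stub_limit, long_limit = rebased_limits(GERMAN_WEIGHT)
print(round(stub_limit), round(long_limit))

# A wiki with weight 0.971, checked both ways for a few raw article sizes.
for raw in (5_000, 9_000, 9_500, 28_000, 40_000):
    by_new_base = classify(raw, 0.971 / GERMAN_WEIGHT)
    by_new_limits = classify(raw, 0.971, stub_limit, long_limit)
    print(raw, by_new_base, by_new_limits)
```

The scaled thresholds come out at 8940 and 26820, close to the rounded figures of 8900 and 26800 quoted in the comment.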
As you say, in terms of score, changing the base language has roughly the same effect as lowering the threshold, so I don't see an advantage of one over the other. Unless you mean the argument of streamlining the text to the language in which it was originally written. In which case, yes, changing the base language would be a slightly better choice. As for decimals, three are perhaps indeed an overkill, but two would be sensible, IMO, because the new text is 5 times longer. — Yerpo Eh? 12:30, 7 February 2012 (UTC)
Three significant digits look too superfine for our current state of knowledge, but two are probably supported by the size of the sample, and one looks crude. Meanwhile, could someone recheck the numbers? Three of the listed languages have a weight of exactly 0.894, and that seems like a curious coincidence. Jacob. 71.163.76.123 21:44, 7 February 2012 (UTC)
I checked the three counts - German has actually the ratio 0.897, but the length of French and Romanian only differs by 2 characters and they both have 0.894. — Yerpo Eh? 07:49, 8 February 2012 (UTC)
That would make a difference, as German would round off to 0.90, but the others would round off to 0.89. Note: if German stands at 0.897, it's the same as Spanish—a new curious coincidence! In case the Spanish is a typo, it should be checked too. Jacob. 71.163.192.88 16:17, 11 February 2012 (UTC)

These results show rather clearly some Wikipedias in which editors contributed to vital articles with the primary goal of improving their score. Most obvious examples are the three that dropped by around 20 each - those have a lot of articles that are just over the threshold with the current weighting. Judging from the response by Catalan editors, those are the ones that will probably be the most offended, but I don't think their offence by itself should be regarded as an important argument against the change. Just as a curiosity, :sl dropped by a rather large percentage of its current score too (the change of weight is similar to :ca), which in this case comes from one or two editors sometimes contributing with this particular goal in mind. — Yerpo Eh? 20:58, 5 February 2012 (UTC)

"I don't think their offence by itself should be regarded as an important argument against the change". - You, Sir, have quite a funny view of the world. I, for one, understand that comment as meaning "I don't care what the others think or feel, only MY opinion is important". You are rubbishing the thoughts of an entire Wikipedian community. --Leptictidium 10:54, 7 February 2012 (UTC)
I stand behind my comment. In absence of arguments, personal feelings of an uninvolved group of editors are of secondary importance. — Yerpo Eh? 12:08, 7 February 2012 (UTC)
What does "uninvolved" mean? My poor English perhaps leads me to a wrong translation ... but I think that Catalan is involved in this ... Btw, a contribution from an anonymous IP at the Catalan wiki: Different articles can create very different ratios. A legal text is not the same as a text about a farm. If they took it seriously, then instead of switching from one text to the other, having both sources available, they should add up the number of characters in both texts. The result would be better adjusted to reality. It seems a reasonable point of view --Mafoso 12:26, 7 February 2012 (UTC)
I agree with Mafoso (to the extent that I understand him), in both respects: yes, different topics and different social contexts (what linguists call registers) may well call up different vocabularies, grammars, and spelling conventions, resulting in variation among the ratios; and yes, the Babel data are evidence, and throwing them out could be a mistake. For now, why not, as Mafoso seems to be suggesting, take the mean of the Babel data and the UDHR data and use that mean for the weight? And then let's seek other sets of translations to add into the mix as they become available. I've been meaning to keyboard some bitextual examples proving Mafoso's first point, but obligations elsewhere have taken up too much time. Maybe within a few days! Jacob. 71.163.76.123 21:44, 7 February 2012 (UTC)
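Two ways of combining the Babel and UDHR data have come up in this exchange: averaging the two weights, and adding the character counts of both texts before dividing. A minimal sketch comparing them for three languages whose counts for both texts appear on this page; the function names are illustrative and neither method is part of the current script.

```python
# English character counts for the two reference texts, as given on this page.
EN_BABEL, EN_UDHR = 1162, 10597

# (Babel characters, UDHR characters) for languages with both counts listed above.
counts = {
    "Icelandic": (1092, 10182),
    "Macedonian": (893, 10653),
    "Lithuanian": (948, 10848),
}


def mean_weight(babel, udhr):
    """Average of the two separately computed weights (each text counts equally)."""
    return (EN_BABEL / babel + EN_UDHR / udhr) / 2


def pooled_weight(babel, udhr):
    """Weight from the summed character counts (the longer text dominates)."""
    return (EN_BABEL + EN_UDHR) / (babel + udhr)


for lang, (babel, udhr) in counts.items():
    print(f"{lang}: mean {mean_weight(babel, udhr):.2f}, pooled {pooled_weight(babel, udhr):.2f}")
```

Because the UDHR is roughly nine times longer than the Babel passage, the pooled weight stays close to the UDHR weight, while the plain mean gives the short text as much influence as the long one.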
The obvious problem is, where would you find more texts that are translated to all those languages? You'd probably end up in a situation where some languages would have a larger sample of texts than others from which the average was calculated. That would create a lot of complications, not to mention calculating all that. I'm not too sure about throwing just the Babel in the mix either - we already established that it isn't very representative of the modern English, regardless of the register (don't know about other languages). Opinions? — Yerpo Eh? 07:49, 8 February 2012 (UTC)
"Uninvolved" means that Catalan editors haven't participated in any discussion about this proposal so far, apart from reacting offended. — Yerpo Eh? 12:31, 7 February 2012 (UTC)
The real offense came when you said that an uninvolved group of editors is of secondary importance. BTW, a second argument: the relative change in weight depends on the reference language used. The difference in relative weight if you take English as the reference is different than if you take Chinese. I do not know how the calculation is done, but certainly this affects the result.--Mafoso 15:28, 7 February 2012 (UTC)
No, I certainly didn't write "an uninvolved group of editors are of secondary importance". Please read again - it's an important distinction. Taking Chinese as a reference would only multiply character length from almost everybody else by 3.5. We are discussing taking German or French as reference because English version is probably the most streamlined (the declaration was written in English). But the difference, I think, would be minimal and English would still come ahead. The point of the whole proposal is, it's simply more realistic than using the Tower of Babel for the reasons stated before. If a different reference prevents Catalan from dropping by a quarter of its score, that's a minor bonus. Because, again, this is not a competition, but a rough measure of a certain Wikipedia's quality - which Catalan and a few other Wikipedias have really skewed by obsessing about it. — Yerpo Eh? 20:51, 7 February 2012 (UTC)
Please stop talking about obsessions or offense; that is your impression, and it is based on partial contributions (mine included). Let's leave it.--Mafoso 09:40, 8 February 2012 (UTC)

Think about it: what's the use of raising the scores back to what they were? Eventually, you'll reach 100 this way, which is a perfect score - the problem is that when you're at 100, you cannot improve any more, which makes the scores meaningless at this point. Therefore, I don't think that the scores dropping down a bit is really that bad. -- Liliana 17:40, 7 February 2012 (UTC)

The change is based on trying to adjust the measurement of length by changing the reference text. I brought the opinion of a third person, who argues that this adjustment would be more refined if it took into account both of the texts we have: then we would be improving instead of changing.
As for the question of reaching 100: in the ca:wiki project on the 1,000 we are also considering how many of the 1,000 are featured articles. This could be another complementary weight (it would be great to have 10,000-character articles with more weight than 30,000-character ones). --Mafoso 09:40, 8 February 2012 (UTC)
"In absence of arguments, personal feelings of an uninvolved group of editors are of secondary importance." - In absence of arguments? Are you kidding me? The long-running campaign to bring down the score of the Catalan Wikipedia is on its own a significant enough argument. This new scheme is but the latest in a long series of attempts to reduce our score by changing articles (with replacements whose importance is dubious at best). But you are sending a very clear message to the Catalan Wikipedia and to other small Wikipedias for which this project has been a driver for growth and improvement: The English Wikipedia must be at the top. One way... or the other.. Leptictidium 13:17, 8 February 2012 (UTC)
Yes, in absence of arguments. There is no "long-running campaign to bring down the score of the Catalan Wikipedia" whatsoever. I can understand how you got this feeling, but you fail to see the real reason (to be honest, at least one of you sees the big picture). Besides, is Catalan Wikipedia really more improved if it's in the first place, or less if it isn't? Again, the majority of editors active here (including me) come from other language projects and have nothing to gain by "forcing :ca down". So can we please stop with this nonsense now? — Yerpo Eh? 15:25, 8 February 2012 (UTC)
I like the suggestion from Mafoso about giving extra points to featured articles. And as for the status of Catalan as the highest-scoring Wikipedia, congratulations to you guys; I for one admire what you've done with the 1000 Pages. I don't think any of us considers this the only measure of how good a Wikipedia is, or the best measure, or even necessarily our own favorite measure, but as long as we are measuring it, I entirely understand the desire to get a better score! I'm from Latin, myself, and we, too, are deliberately working on raising our score -- though that's far from the only thing we're doing. We know we're not going to catch Catalan any time soon, but it's something to aspire to -- not by bringing you guys down, but by doing as well as you have. A. Mahoney 15:44, 8 February 2012 (UTC)
I'm not sure that's what Mafoso meant. In any case, featured criteria aren't comparable between Wikipedias and can change over time so perhaps it's more of a locally relevant measure. At :sl, we are also noting featured important articles in our list. — Yerpo Eh? 09:20, 9 February 2012 (UTC)
Yes, that's a tempting idea. The proposal to adjust the point scores by counting featured articles more heavily wouldn't be fair, for the reason Yerpo gives. However, perhaps a new column could be added to the table. For each wiki, it would show the number of topics from the 1000 that have been featured articles. Since the number would have no more than three digits, the column could be quite narrow and therefore wouldn't drastically affect the width of the table. Jacob. 71.163.199.210 19:04, 9 February 2012 (UTC)
The only one here talking nonsense is you, Mr Yerpo, who are desperately grasping at straws to try and justify what for all intents and purposes is a scheme designed to put the English Wikipedia back at the top of the list. Moreover, you're pretty much alone in your endeavour: skimming through the previous discussion, the number of people who fully support the change can be literally counted on the fingers of one hand. I strongly oppose such a far-reaching decision being made by such a small group of users. At the very least, it should be taken to vote by all active members at Wikimedia, otherwise the change has no legitimacy whatsoever. 109.14.96.147 11:49, 10 February 2012 (UTC)
Allow me to cite Mars Rover: "Initially this mini-project was just to be a better measurement of wiki quality than article count. But IMHO a more important goal of just encouraging improvement has superseded that goal.". Or, for example, look at Mr A Mahoney's comment above:
It's "Ms" A. Mahoney. A. Mahoney 14:20, 10 February 2012 (UTC)
"I'm from Latin, myself, and we, too, are deliberately working on raising our score -- though that's far from the only thing we're doing. We know we're not going to catch Catalan any time soon, but it's something to aspire to -- not by bringing you guys down, but by doing as well as you have.". The potential reward of climbing up the classification is encouraging the editors of the Latin Wikipedia to improve their language edition. However, if you strike the Catalan Wikipedia back to somewhere between the Top 10 and the Top 20, you are setting an extremely dangerous precedent: why would anyone want to waste efforts in editing these specific articles, if they know they will be beaten back down the classification once they start to challenge the English Wikipedia's leadership? You may retort that the editors of the Latin Wikipedia should be improving their language edition anyway, regardless of their classification. But I'll remind you that Wikipedians are people, not slaves or drones, and they will only improve Wikipedia if they feel rewarded. Remove the reward, and you're removing the incentive to improve those articles. So stop searching your bag full of excuses and accept that, while the English Wikipedia is the best language edition overall, smaller Wikipedias can and will be better than it in specific fields and/or projects.109.14.96.147 11:49, 10 February 2012 (UTC)
Your opinions will be taken more seriously if you stop looking for conspiracies where there are none. This is, for all intents and purposes, a scheme to make this measure more realistic. How evil, eh? I understand you're feeling disappointed because it might cost :ca the first place you were working hard for, but do try to see the broader picture, at least. Is it really a reward to have the first place with a broken measure? And what do you mean by this "strike the Catalan Wikipedia back to somewhere between the Top 10 and the Top 20" drama? :ca would be #2 at worst, because you have lots of long articles in any case. Possibly, it would even retain the first place if we change the reference language, which would be a good thing. This counting of proponents and opponents of yours doesn't make much sense, at least as long as the opposition's only "argument" is "we don't like it because our score will go down". — Yerpo Eh? 20:04, 10 February 2012 (UTC)


Are we in a rush? Why can't we keep debating the change? I think it shouldn't be applied in February, because it can hurt feelings and it could be implemented with more consensus --Barcelona 12:11, 10 February 2012 (UTC)
I agree, I can't see why a broad decision like this one should be decided only with the opinions of so few people, and as far as I see there is no consensus. I hope you wait (and search) for more opinions; maybe you should open a section in [3] in order to involve more Wikimedians?--Arnaugir 12:44, 10 February 2012 (UTC)
Regardless of the arguments for and against this change, such a far-reaching decision cannot be made unilaterally by a handful of users, especially when there seems to be no consensus (on this talk page, there are at least as many people opposing this change as supporting it). Since it has the potential to affect hundreds of Wikipedians across dozens of language editions, the only way to acquire the necessary legitimacy to carry out the changes is by obtaining a mandate from the Wikimedian community. Therefore, the changes must be subjected to a vote by the entire community.--Leptictidium 13:37, 10 February 2012 (UTC)
Why are you all taking such a silly page so seriously? I bet most of the Wikipedia community couldn't care less about what happens here. It's just a statistic, chill out! And yes, we are in a rush: because of the "change since last month" thing, this needs to be decided in February or the statistics will break. -- Liliana 15:14, 10 February 2012 (UTC)
1) If it's such a silly page, why are YOU taking it so seriously? You may bet whatever you want, but at the end of the day it's just your opinion and you have no right whatsoever to impose it on the rest of the Wikimedian community. A mandate from the community is necessary if we are to implement such a far-reaching change.
2) Rushing things reminds me of bad salesmen's hard sell tactics: "Only this week!", "Hurry up before our product runs out!", "Limited amount available!". Usually, the product is bad quality and the client ends up with buyer's remorse, and I suspect the same thing is applicable here. And I'm 100% sure we can avoid "breaking the statistics" without ramming your proposed changes through: Mars Rover has outstanding technical knowledge and I'm sure he can do it (or at least continue using the present language weights until this case has been resolved).Leptictidium 17:20, 10 February 2012 (UTC)
My home wiki is the Dutch Wikipedia, not the Catalan, but I agree that having just three or four users decide to make these changes without consulting anyone else goes right against the spirit of Wikipedia. Can't there be a vote or something? I think it's not fair to just decide something without asking the great majority of people what they think about it. And why such a rush, we can just keep on going with the old system until a consensus has been reached. --Laurita87 17:53, 10 February 2012 (UTC)
Voting tends to obscure arguments. The great majority of people don't even know about this page, and Catalan editors that seemed the most interested were notified of this discussion in good faith. They don't seem to disagree on the main argument for the change - that it would produce more realistic results. — Yerpo Eh? 20:04, 10 February 2012 (UTC)
On the contrary, voting tends to cast the spotlight on the arguments. If the proposed vote is preceded by a 3-week period in which both sides present their reasons to support/oppose the change and debate on a public arena, each voter will be able to analyse both sides' arguments and then cast his vote for the side which seems more reasonable to him. Therefore, I see no problem whatsoever with a vote, and indeed it is the only way for this decision to be made with the necessary legitimacy.Leptictidium 21:33, 10 February 2012 (UTC)
I think the idea of three weeks to present arguments is good, since both parties must debate each other seriously rather than trade accusations of conspiracy. As a user of Wikipedia, I want a much more serious debate than a discussion between four users for and four users against. The more people participate in the discussion, the better the result. And the end result is what we Wikipedians are most interested in, isn't it?--62.75.138.253 21:55, 10 February 2012 (UTC) (Der Speisner)
You're welcome to create an RFC, it could be useful to summarize the arguments and raise participation. But that's not the same as a poll. I still think that voting doesn't equate to reasoned debate and will usually only entrench the people involved. — Yerpo Eh? 08:13, 11 February 2012 (UTC)
I don't take this list seriously at all. For all I care it could be outright deleted, if there's so much drama over it. -- Liliana 23:56, 10 February 2012 (UTC)

So everyone agrees with stopping it and debating the changes?--Barcelona 09:06, 11 February 2012 (UTC)

We must stop this prototype and start debating it. --Lluis tgn 10:54, 11 February 2012 (UTC)
Yet another reason to decide this issue by debating and voting, not unilaterally: it would be very interesting to get external input on the possibility of combining the Babel text and the UDHR text to create a single corpus. A big corpus is better than a smaller one, so we should add texts together, not replace them. Leptictidium 16:45, 11 February 2012 (UTC)
Agreed! In general, the bigger the corpus, the more accurate the weight. For the perils of jumping to conclusions on the basis of a small corpus, see the new heading below, regarding the Latin versions of the UDHR. Jacob. 71.163.192.88 02:25, 12 February 2012 (UTC)
But as Yerpo said, the Babel text is Biblical, and thus unsuitable to represent the average Wikipedia article, since Wikipedia articles usually aren't written in Biblical style. -- Liliana 21:34, 11 February 2012 (UTC)
There's nothing wrong with "biblical" style per se (is there really such a thing? "In the beginning God created the heaven and the earth" is stylistically undistinguishable from quotidian English prose), if it's recognized as just one of several possibilities. A biblical sample can give us a datapoint, but maybe it shouldn't be the only datapoint used. Jacob. 71.163.192.88 02:25, 12 February 2012 (UTC)
That is apart from the argument I submitted in favor of adding the two texts together and taking both into account in the proposed new weights (this proposal follows the same principle, that a longer text gives a better measure), which would also cover different linguistic registers. We must also bear in mind that, for translated texts, the original language of the document is very important. This is one of the variables that should be taken into account when considering why "Babel" gives a different weight from the one the UDHR gives; it is not only the length of the text. --Mafoso 15:33, 13 February 2012 (UTC)
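The two ways of using both texts that have come up in this thread (taking the mean of the per-text weights, or adding the texts together into one corpus) do not give the same number. A minimal sketch of the difference, assuming each text contributes a pair of character counts (reference language, target language); the function names and the sample counts are hypothetical and are not real Babel or UDHR figures:

```python
def mean_of_ratios(pairs):
    """Average the per-text weights; each text counts equally."""
    return sum(ref / tgt for ref, tgt in pairs) / len(pairs)


def pooled_ratio(pairs):
    """Add the texts together first; longer texts count for more."""
    total_ref = sum(ref for ref, _ in pairs)
    total_tgt = sum(tgt for _, tgt in pairs)
    return total_ref / total_tgt


# Hypothetical (reference_chars, target_chars) pairs: one short text, one long text.
samples = [(1000, 800), (9000, 10000)]

print(round(mean_of_ratios(samples), 3))
print(round(pooled_ratio(samples), 3))
```

In the pooled version the longer text dominates the result, so which variant is preferable depends on whether each text or each character should count equally.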
I support the changing of weights; the new weights are much more realistic. E.g. 1.4 for Russian is absolutely unrealistic. Russian is longer than English. 0.9 sounds closer to the truth. Possible disappointment of non-English editors is not a good argument. The competition should be honest. I'm from ru-wiki, the language that will suffer from the new weights. But realistic and honest figures are much more important than favouring some wikis on the basis of incorrect weights.--Abiyoyo (talk) 12:45, 8 March 2012 (UTC)
If we take care to balance the threshold or use a different reference, the disruption of scores will be smaller. — Yerpo Eh? 18:34, 8 March 2012 (UTC)
Why not this? Leptictidium (talk) 14:13, 16 March 2012 (UTC)
Everybody wants it, but nobody is actually doing anything. We can't change the stats if we have no numbers. -- Liliana 15:01, 16 March 2012 (UTC)

Weighting factor for Alemannic

Alemannic uses 1079 characters (Alsatian is part of Alemannic). So the weighting factor should be 1162/1079 = 1.076923... ≈ 1.1. --Holder 11:24, 1 December 2011 (UTC)
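The calculation in this request can be sketched as follows. This is only an illustration, assuming the weight is the English character count for the reference passage (1162, as quoted in this request) divided by the count for the same passage in the target language, rounded to one decimal place. The name `language_weight` is hypothetical and is not taken from the script that generates the list.

```python
ENGLISH_BABEL_CHARS = 1162  # English character count quoted in this request


def language_weight(target_chars, reference_chars=ENGLISH_BABEL_CHARS):
    """Return reference length / target length, rounded to one decimal place."""
    if target_chars <= 0:
        raise ValueError("target_chars must be positive")
    return round(reference_chars / target_chars, 1)


# 1079 is the denominator used in the Alemannic calculation.
print(language_weight(1079))
```

A language whose translation is shorter than the English one gets a weight above 1, and a longer translation gets a weight below 1.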

Dec 2011 update

MarsRover, could you please run the script for :sl again? There was an interwiki conflict for en:Brussels which I resolved now.

Another thing, I noticed that the English filter isn't applied to the "popular articles" script. For example, :si wiki pops up a lot among top 3, but they only have a bunch of untranslated copies (see List of Wikipedias by sample of articles/Neglected#si සිංහල). — Yerpo Eh? 08:33, 4 December 2011 (UTC)

Ok, I fixed the score for Slovene. I will try to fix the filtering of untranslated articles out of the average size calculations for next month. The source code is unfortunately a mess at the moment so it takes a bit of time. --MarsRover 21:43, 4 December 2011 (UTC)
No problem, just wanted to let you know. Thanks for updating :sl. — Yerpo Eh? 07:15, 5 December 2011 (UTC)