Talk:List of Wikipedias by sample of articles/Archives/2007

From Meta, a Wikimedia project coordination wiki

Any "lists of absent articles of each major Wikipedia"?

If possible, I suggest generating the lists of absent articles for each major Wikipedia on a separate page for reference. In the Chinese Wikipedia's "List of articles every Wikipedia should have", all the articles listed have already been created (0 absent articles). However, since that list was created more than a year ago, we don't know which 34 articles have since been added to the list on Meta but not yet created in the Chinese version. -- Kevinhksouth 14:11, 8 November 2007 (UTC)

That's a good idea. My software didn't record the list of absent articles -- I'll change it so that it does, it isn't very difficult. So when I update this page (say, in a couple of months), I'll make sure I have a list of 'missing articles'. You're right that the list of articles every Wikipedia should have has changed in the last year, and may even change in the future. (How are the articles chosen, by the way? By voting? Or can anyone who thinks a certain article is important just add it?) --Smeira 17:27, 10 November 2007 (UTC)
Can you not check it from the diff? Hillgentleman 23:39, 11 November 2007 (UTC)
What do you mean? Smeira 01:40, 12 November 2007 (UTC)
E.g. [1] Hillgentleman 02:07, 12 November 2007 (UTC)
Oh, I thought you had meant I could recover the information about which articles from the list were not found on the Chinese Wikipedia. Yes, I can see new additions to the list of Wikipedias this way. Thanks! --Smeira 07:52, 12 November 2007 (UTC)
Hello, you announced above that you would have a list (or better, lists) of 'missing articles'. Do you have such lists? Where can we consult them? --Jauclair 22:40, 20 March 2008 (UTC)
List of Wikipedias by sample of articles/Absent Articles MarsRover 22:53, 20 March 2008 (UTC)

My Wikipedia samplings

I was testing creating statistics by sampling articles from various Wikipedias (a lot simpler than downloading gigabytes of data). Get 30 random articles and figure out what was covered, based on the topics in the List of articles every Wikipedia should have. Also, figure out what percent of the article edits were made by a robot. It's pretty interesting seeing the differences. If a wiki has a lot of articles with unknown categories, it just means that they have no interwiki links to the English Wikipedia.

# Language Language (local) Wiki Articles Sampling Human %
1 English English en 2 086 764 1-30 92
2 German Deutsch de 664 151 1-30 94
3 French Français fr 581 388 1-30 81
4 Polish Polski pl 441 111 1-30 65
5 Japanese 日本語 ja 434 477 1-30 92
6 Dutch Nederlands nl 378 985 1-30 67
7 Italian Italiano it 370 815 1-30 76
8 Portuguese Português pt 338 165 1-30 72
9 Spanish Español es 297 977 1-30 76
10 Swedish Svenska sv 260 357 1-30 68
11 Russian Русский ru 213 652 1-30 70
12 Chinese 中文 zh 152 730 1-30 83
13 Finnish Suomi fi 139 517 1-30 76
14 Norwegian (Bokmål) Norsk (Bokmål) no 139 251 1-30 65
15 Volapük Volapük vo 112 390 1-30 2
16 Romanian Română ro 95 981 1-30 60
17 Turkish Türkçe tr 94 966 1-30 82
18 Esperanto Esperanto eo 90 938 1-30 57
19 Catalan Català ca 86 627 1-30 55
20 Lombard Lumbaart lmo 84 518 1-30 16
21 Slovak Slovenčina sk 83 079 1-30 42
22 Czech Čeština cs 81 574 1-30 65
23 Ukrainian Українська uk 77 040 1-30 69
24 Hungarian Magyar hu 76 233 1-30 69
25 Danish Dansk da 73 498 1-30 59
26 Indonesian Bahasa Indonesia id 68 974 1-30 56
27 Hebrew עברית he 65 693 1-30 78
28 Lithuanian Lietuvių lt 56 433 1-30 58
29 Serbian Српски / Srpski sr 56 050 1-30 35
30 Slovenian Slovenščina sl 54 589 1-30 52
31 Bulgarian Български bg 48 195 1-30 73
32 Korean 한국어 ko 45 843 1-30 73
33 Arabic العربية ar 43 802 1-30 64
34 Estonian Eesti et 42 785 1-30 76
35 Telugu తెలుగు te 37 560 1-30 33
36 Croatian Hrvatski hr 36 940 1-30 63
37 Newar / Nepal Bhasa नेपाल भाषा new 37 013 1-30 20
38 Cebuano Sinugboanong Binisaya ceb 33 510 1-30 8
39 Galician Galego gl 29 187 1-30 54
40 Greek Ελληνικά el 29 071 1-30 59
41 Thai ไทย th 29 036 1-30 64
42 Norwegian (Nynorsk) Nynorsk nn 27 318 1-30 54
43 Persian فارسی fa 27 386 1-30 91
44 Vietnamese Tiếng Việt vi 26 305 1-30 67
45 Malay Bahasa Melayu ms 24 242 1-30 64
46 Bishnupriya Manipuri ইমার ঠার/বিষ্ণুপ্রিয়া মণিপুরী bpy 22 091 1-30 0
47 Basque Euskara eu 21 502 1-30 28
48 Bosnian Bosanski bs 20 919 1-30 44
49 Simple English Simple English simple 20 674 1-30 56
50 Icelandic Íslenska is 18 264 1-30 69


Category Legend
Biography History Geography Society Culture Science Technology Foodstuff Mathematics Unknown

I am still working on the categorization logic. Most of the problems come from chaotic en.wp categorizations. The Human % is the percentage of edits to the articles where the username doesn't contain "Bot". --MarsRover 10:45, 11 November 2007 (UTC)

A quite interesting approach, MarsRover! (By the way, how do you get automatic access to the editors' names? Are you using the py.wikipedia framework? Is there a function/variable/method there that will do this?) I agree that there is a problem with en.wp categorization logic, and further yet: with decisions about what becomes an article and what doesn't. One example: when I was doing de.wp, it turned out that (of all things) the article en:sex didn't have a link to de.wp. This is because en.wp has a general article for both human sexes (plus an article about male sex and another one about female sex), whereas de.wp doesn't: they preferred to have de:männliches Geschlecht (=male sex) and de:weibliches Geschlecht (=female sex) as independent articles, without a general one. This makes it look like de.wp forgot one of the articles in the list, but that's not true; in fact they only made a different choice about where to put the content (in two articles instead of one). If we were using the list of desirable articles from de.wp, the results might be different...
If you're interested, maybe we could talk more about the topic and try to come up with an interesting measure together. --Smeira 13:05, 11 November 2007 (UTC)
I am not using the python framework. I am just downloading the history page and parsing the HTML (actually just the history for the last 50 edits). I think your approach with a fixed list of articles (instead of random articles) is a better method. Otherwise, people can keep "rolling the dice" until they have a nice score. But it does show you what the wiki contains besides the important articles. MarsRover 23:11, 11 November 2007 (UTC)
I wonder if there isn't an automated way to get the list of contributors into some variable... that would make it easier. A refinement might be to check which users have bot flags and which don't -- some human user names like say Abbot might accidentally contain the word "bot" -- but people are usually pretty consistent in adding "Bot" to bot accounts, so probably this wouldn't affect your results very much.
I looked for a way. The "Special:Export" method is supposed to allow downloading different revisions with the user information, but this ability has been disabled. Also, a lot of bots are not setting the bot flag. By the way, I fixed it to be case-insensitive when looking for "Bot" in the username and regenerated the table. MarsRover 05:40, 12 November 2007 (UTC)
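
The username heuristic described above can be sketched in Python as follows. This is only a minimal illustration, not the script that generated the table: the function names and the sample editor names are hypothetical, and as noted in the discussion a human name such as "Abbot" is misclassified by this rule.

```python
def is_probable_bot(username):
    """Heuristic from the discussion: the name contains "bot", ignoring case."""
    return "bot" in username.lower()


def human_percent(editors):
    """Percent of edits whose editor name does not look like a bot."""
    if not editors:
        return 0.0
    human_edits = sum(1 for name in editors if not is_probable_bot(name))
    return 100.0 * human_edits / len(editors)


# Hypothetical editor names for the last few edits of one article.
sample_editors = ["Alice", "ExampleBot", "interwikibot", "Abbot", "Carol"]
print(human_percent(sample_editors))  # "Abbot" is counted as a bot here
```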

About the human %, I think it could be improved quite a lot. Taking just the last 50 contributions is quite a disadvantage for small Wikipedias where the content of an article is correct but not frequently updated. Then most of the last 50 contributions will be just bots adding interwikis. Here is an example of what I mean: ca:Essen. I created that article. It's quite a good article, and since it was "finished" there have been just minor changes and many, many bots adding interwikis. So I think you could never say that this article has 3% human content. I would estimate it at around 90% human content, but your algorithm would report something around 3-5%. So, proposals to improve it: calculate the difference in bytes of every contribution, or don't count interwiki bots. Moreover, I have a bot account which I use manually, and sometimes with AWB, to make systematic corrections, change categories, etc. It has a bot flag, but it is controlled manually. The changes it makes amount to very few bytes. So I think a bot that creates articles, a bot that adds interwikis and a bot that moves categories are not the same. The second and the third shouldn't count as much as the first when comparing the human/bot impact on Wikipedia. Anyway, I think you're doing good work! --Xtv 20:43, 12 November 2007 (UTC)

I was thinking of doing that, but it's more complicated since you need all the history (the first edit could be the most important) and then you need to parse the actual edits to see whether each was an interwiki edit or not. The way it is now, it's really "Human % of Activity" (not "Human % of Content"). It is still an interesting metric, and an admin who is working on building a community would want to increase the number. --MarsRover 03:48, 13 November 2007 (UTC)
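
As a rough illustration of the "difference in bytes" proposal, here is a minimal Python sketch. It assumes the full history is already available as a chronological list of (username, page size in bytes) pairs, counts only bytes added (removals are ignored), and reuses the name-based bot heuristic; the function name and the sample history are hypothetical.

```python
def human_content_percent(revisions):
    """revisions: chronological list of (username, page_size_in_bytes)."""
    human_bytes = 0
    bot_bytes = 0
    previous_size = 0
    for username, size in revisions:
        added = max(size - previous_size, 0)  # ignore edits that shrink the page
        if "bot" in username.lower():
            bot_bytes += added
        else:
            human_bytes += added
        previous_size = size
    total = human_bytes + bot_bytes
    return 100.0 * human_bytes / total if total else 0.0


# Hypothetical history: one large human edit, then three small bot edits.
history = [("Author", 9000), ("IwBot1", 9030), ("IwBot2", 9060), ("IwBot3", 9100)]
print(round(human_content_percent(history), 1))
```

Counted by number of edits, this hypothetical article would be 25% human; weighted by bytes added, it is almost entirely human, which is the distinction between "activity" and "content" made above.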

I find curious that there is a Wiki (bpy) with 0% human edits... Malafaya 13:57, 17 November 2007 (UTC)

This is just a sampling of 30 articles. I'm sure there is at least one human edit in the wiki. But if you click on any of the samples, they look like they were imported by a bot. -MarsRover 21:38, 17 November 2007 (UTC)

Biased against Asian languages

What is a character in your calculations? If you mean an actual character in the language, you need to normalize the value of each character since a Japanese character packs more information than an English character. It seems ja.wp should be ranked higher. Maybe you can do the following:

  • Use bytes instead of characters - if the article's text is encoded as UTF-8, this would mean an English character is worth 1 and a Japanese character would be worth 2 or 3.
  • Weight the character per language - weighted_char_count = char_count * alphabet_size / english_alphabet_size;
  • Use the compressed size - if you compress the text with the "deflate" algorithm, the resulting size is an approximation of the information value, e.g. a lot of spaces counts for less than a lot of unique letters and words.

--MarsRover 23:36, 11 November 2007 (UTC)
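
The three suggestions above can be compared with a minimal Python sketch. The function names are hypothetical, the alphabet sizes passed to weighted_chars are placeholders rather than measured values, and zlib.compress is used here as a stand-in for "deflate" (it wraps the deflate stream in a small zlib header and checksum).

```python
import zlib


def utf8_bytes(text):
    """Size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def weighted_chars(text, alphabet_size, english_alphabet_size=26):
    """char_count * alphabet_size / english_alphabet_size, as proposed above."""
    return len(text) * alphabet_size / english_alphabet_size


def deflated_size(text):
    """Size in bytes after zlib (deflate) compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


for sample in ["abc", "日本語", " " * 43, "the quick brown fox jumps over the lazy dog"]:
    print(repr(sample), len(sample), utf8_bytes(sample), deflated_size(sample))
```

Very short strings are dominated by the compression overhead, so the compressed size is only meaningful as a measure for whole articles.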

More. For the same content, a text in German is substantially longer than one in English or French. It would be best to set the mark (e.g. against the English language) by asking the locals if there are exact translations of a number of texts. To get a rough idea, you may dig into the translations on meta. Hillgentleman 00:04, 12 November 2007 (UTC)
Yes, I've been thinking about these issues. What I am using for length is the return of the len() function in python -- I am not sure whether or not it considers Chinese characters as one or more than one (when I try to calculate the length of, say, "日本語" in the python shell, I get only an "unsupported characters in input" error; if, however, my bot program gets the text of a page from zh.wp and calculates its len(), I do get a valid numeric result). I've considered weighing the results (so in my comparative corpus, German tends to be between 1.4 and 1.5 times longer than English); maybe I should do it for higher precision, but when I looked at the results it didn't seem so worthwhile (the distance between en.wp and de.wp is already high even without this weight; adding it to the calculation would mean simply increasing it; of course, Chinese and Japanese might be a different story, if the len() function treats Kanji characters as one normal ASCII character instead of two or three). I also noted that the comparability between results might not be so much increased, since the "information ratio" or "weight" is not a transitive operation (i.e. Weight(English/German) * Weight(German/French) won't be necessarily equal to Weight(English/French), because "translate" is not really a structure-preserving bijection and the domains in question -- the semantic range and its actualization in vocabulary items in the various languages -- are not exactly the same). But maybe I'll add a further column with a weighted score (with respect to English) or mean article length just to see how it compares to the absolute numbers. Of course, I'd need sufficiently many translations from English into all these languages to estimate the weights -- even with the Meta translations, I'll probably have to limit myself to the larger Wikipedias. Smeira 01:35, 12 Nov 2007
Another weight comparison could be based on a word count of raw article pages, less articles (grammar) for languages which have any, less some markup. Obviously, this is imprecise, too, and may be hard or expensive to calculate. Note that both determining word boundaries and eliminating articles, and possibly similar words, are highly language-dependent and not obvious for languages such as Thai, Japanese, the Chinese varieties, and some others. --Purodha Blissenbach 14:44, 12 November 2007 (UTC)

Proposed weighting of characters for formula (Option#1 using The Lord's Prayer)

Based on the article en:The Lord's Prayer in different languages, I came up with the following ratios. For the languages not listed I would just use a ratio of 1.0. We should at least use a rounded ratio of 0.5 for all Asian languages since they are the most distorted.

Language Local Characters English Characters English Ratio
German 374 365 1.0
French 438 332 1.3
Italian 311 261 1.2
Portuguese 362 365 1.0
Romanian 407 379 1.1
Spanish 391 345 1.1
Bulgarian 347 325 1.1
Croatian 323 346 0.9
Polish 358 325 1.1
Russian 324 305 1.1
Bengali 337 333 1.0
Hindi 323 333 1.0
Urdu 335 333 1.0
Armenian 259 287 0.9
Finnish 416 333 1.2
Hungarian 404 332 1.2
Chinese (Simplified) 127 333 0.4
Chinese (Traditional) 120 367 0.3
Indonesian 410 333 1.2
Korean 192 333 0.6
Japanese 165 391 0.4
Hebrew 233 333 0.7
Arabic 368 350 1.1
Georgian 451 362 1.2
Esperanto 363 341 1.1
Interesting. One question though: why is the number of English characters in the Lord's Prayer not the same in every case? Wasn't the same English text of the Lord's Prayer compared to its counterpart in each of the other languages? (The page you mention appears to have the same English text in all cases.) Some of the weights are surprising: 1.0 for English:German and for Portuguese:German? That goes against my (extensive) experience with both languages (the latter my mother tongue); could it be just because the Lord's Prayer is a somewhat old-fashioned text?
The translations are sometimes slightly different across languages especially the second to last line.
  • "For thine is the kingdom, the power and glory, forever and ever"
  • "For Thine is the kingdom, the power, and the glory, For ever and ever."
  • "For the kingdom, the power, and the glory are yours, now and forever"
  • "For the kingdom, the power, and the glory are yours, Both now and ever and unto the ages of ages."
I am assuming the English next to the local text is the exact wording used. A couple of the ratios seem odd, such as Hebrew being unusually short and French unusually long. But this text, although short and religious, seems to be popular for this sort of comparison. --MarsRover 23:45, 26 November 2007 (UTC)
Anyway, here is the table for the first 20 languages, with average article size corrected using the weights you proposed (for Dutch, Swedish, Vietnamese, Norwegian, Catalan and Serbian, I myself calculated the Dutch Lord's Prayer coefficient), for everybody's appreciation (for most of these languages, ordering by average article size and by score coincide):
Language Average article size (measured via len()) Weight (language-to-English) Average article size (divided by weight)
English 41 434 1.0 41 434
German 31 113 1.0 31 113
French 25 993 1.3 19 995
Spanish 20 694 1.1 18 813
Italian 18 195 1.2 15 162
Russian 14 131 1.1 12 846
Portuguese 13 140 1.0 13 140
Dutch 11 690 1.2 9 742
Polish 12 119 1.1 11 017
Finnish 11 123 1.2 9 269
Hungarian 10 717 1.2 8 930
Swedish 9 911 1.0 9 911
Hebrew 9 368 0.7 13 382
Vietnamese 9 912 1.0 9 912
Czech 8 846 1.0 8 846
Japanese 8 598 0.4 21 495
Norwegian 8 603 1.0 8 603
Catalan 8 063 1.4 5 759
Chinese 6 953 0.4 17 382
Serbian 7 250 1.0 7 250
Indeed, the changes for Chinese and Japanese (and to a lesser extent also for Hebrew) are dramatic: Japanese would jump ahead of Spanish and become #4, Chinese would be right after Italian as #6. The coefficient for more translated texts must be compared before we could arrive at better values (does anyone know other texts that could have sufficiently many translations? else we'd have to limit this table to only those languages for which there are translations to calculate the coefficients); however, this change in the result would probably remain. (Of course, in order to use these coefficients meaningfully, it would be necessary to refine article measurements: I am currently simply calculating the len() of the whole page, but simply dividing this by the appropriate coefficient would imply treating e.g. the interwiki links as if they were part of the text -- and in the case of the more popular articles, the len() of the interwiki links alone can be close to 1000 characters...).
A final question of some importance. Does anyone know whether the len() function in python, when applied to a page in Chinese or Japanese, actually counts every Chinese character as one character -- or as more than one? (If it counts Chinese characters as sequences of more than one character -- or more than one byte --, then of course the coefficients cannot be used.) --Smeira 22:45, 24 November 2007 (UTC)
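
The correction applied in the table above is a plain division of the measured size by the language weight. A minimal Python sketch, using a few (size, weight) pairs copied from that table; the dictionary and function names are hypothetical:

```python
# (average len(), weight) pairs copied from the table above.
sizes = {
    "English": (41434, 1.0),
    "French": (25993, 1.3),
    "Spanish": (20694, 1.1),
    "Japanese": (8598, 0.4),
}


def corrected_size(average_len, weight):
    """Average article size divided by the language-to-English weight."""
    return average_len / weight


for language, (average_len, weight) in sizes.items():
    print(language, round(corrected_size(average_len, weight)))
```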
1. To be accurate, we should first remove the templates calls.
2. On len():
print (u'早晨!')
print len(u'早晨!')
gives you 3.
Hillgentleman 23:00, 24 November 2007 (UTC)
OK, so using the coefficients on len() results makes sense. For some reason I always get "Unsupported characters in input" errors when I try the above commands from the python shell. --Smeira 23:21, 24 November 2007 (UTC)
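
For reference, a sketch of the same check in Python 3, where every str is a Unicode string (the snippet above uses Python 2 syntax with the u prefix). It assumes the source file or terminal is UTF-8; the variable names are only for illustration.

```python
text = "日本語"

code_points = len(text)               # counts characters (code points)
utf8_len = len(text.encode("utf-8"))  # counts bytes in the UTF-8 encoding

print(code_points, utf8_len)
```

So len() on a decoded string counts each Chinese or Japanese character once, while the same text takes several bytes per character once encoded.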


Reply:
  • I would just remove the interwiki links, since they are mostly foreign text. Categories and templates should be in the local language, so they should be correct for the ratio calculation. Also, since the keyword "Category" is different for each wiki, it's not that easy to remove. Templates can contain valuable text, as with infoboxes. The unimportant templates are usually short, like {{Stub}}. So I would include all templates.
  • Other text choices for comparison include Babel text, en:The North Wind and the Sun and en:Schleicher's fable. Finding all the translations will be a challenge and perhaps we can use the average ratio of the language group for the obscure ones.
MarsRover 00:40, 27 November 2007 (UTC)
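
Removing only the interwiki links before measuring could look roughly like the Python sketch below. It is not taken from the actual script: the set of language codes is a small assumed subset (a real run would need the full list of Wikipedia language codes), and the regular expression only handles simple links of the form [[xx:Title]].

```python
import re

# Assumed subset of language codes; a real run would need the full list.
LANGUAGE_CODES = {"en", "de", "fr", "ja", "zh", "ca", "simple"}

# A lowercase prefix, a colon, a title without brackets or pipes, then an
# optional newline so that removed links do not leave empty lines behind.
LINK = re.compile(r"\[\[([a-z-]+):[^\[\]|]+\]\]\n?")


def strip_interwiki(wikitext):
    """Remove [[xx:Title]] links whose prefix is a known language code."""
    def replace(match):
        return "" if match.group(1) in LANGUAGE_CODES else match.group(0)
    return LINK.sub(replace, wikitext)


page = "Essen is a city.\n[[Category:Cities]]\n[[de:Essen]]\n[[fr:Essen]]\n"
print(len(page), len(strip_interwiki(page)))
```

Categories and templates are left in place, as suggested in the reply above.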
Note that the weight given to the Catalan language is not accurate (Catalan is similar to French, Spanish and Italian, so it should have a weight similar to theirs). Taking into account the data from the second option, the weight would be 0.9. --Meldor 23:52, 9 December 2007 (UTC)

Proposed weighting of characters for formula (Option#2 using Babel text)

Note: please do not archive this section, as it is linked from the content page.

Using the "Babel text", i.e. the Biblical verses about the en:Tower of Babel, is another option. Translations in various languages are easily found [2][3]. The text is longer than in option #1, so it is hopefully more accurate.

Language Characters Language Weight (English / Local)
English 1162 1.0
German 1209 1.0
French 1157 1.0
Italian 1078 1.1
Portuguese 1038 1.1
Romanian 1060 1.1
Spanish 1095 1.1
Bulgarian 1016 1.1
Croatian 883 1.3
Polish 1057 1.1
Russian 840 1.4
Bengali - -
Hindi - -
Urdu - -
Armenian 932 1.2
Finnish 1096 1.1
Hungarian 1059 1.1
Chinese (Simplified) 314 3.7
Chinese (Traditional) 315 3.7
Korean 462 2.5
Japanese 598 1.9
Hebrew 964 1.2
Arabic 1195 1.0
Georgian - -
Esperanto 1083 1.1
Thai 1109 1.0
Farsi/Persian 957 1.2
Tamil 1357 0.9
Dutch 1297 0.9
Cebuano 1395 0.8
Vietnamese 1100 1.1
Swedish 1084 1.1
Norwegian (Bokmal) 931 1.2
Turkish 867 1.3
Slovak 869 1.3
Czech 911 1.3
Ukrainian 969 1.3
Danish 942 1.2
Lithuanian 949 1.2
Serbian 822 1.4
Slovenian 977 1.2
Greek 1293 0.9
Catalan 1090 1.1
Latin 1050 1.1
Welsh 1010 1.2
Basque 1076 1.1
Malayalam 1021 1.1
Belarusian (Taraškievica) 834 1.4
Indonesian 1256 0.9
Galician 992 1.1
It seems a better option than the one above. I don't know how you count the characters, though: it seems to me there are 1050 characters in the New International Version, if that's the one we would like to take as reference. The English Standard Version would be a better option in my opinion (it best reflects English, not simple English); there are 1181 characters there.
Note also that there is a Catalan version, which is 1108 characters long. That results in a completely different ratio from the one in the option above, so I don't know whether this way of calculating ratios really makes sense. Here, Catalan has a ratio of 0.9 (0.94). --Meldor 23:50, 9 December 2007 (UTC)
OK, I changed it to use the English Standard Version as the reference. The ratio is a rough figure and probably should be rounded even more (0.4, 0.6, 0.8, 1.0, 1.2) so that the choice of translation isn't a factor. I think it's correct for adjusting the Asian language problem but maybe not for calculating minor differences in European languages.
To calculate the count, I paste the text into MS Word and bring up the statistics window. The character count should cover all characters, including spaces but excluding carriage return and line feed characters. --MarsRover 06:04, 10 December 2007 (UTC)
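
The counting rule and the English / Local weight can be approximated in Python as below. This is a sketch under assumptions: the function names are hypothetical, and the count may differ slightly from the MS Word statistics window in edge cases. The counts in the usage lines are taken from the table above.

```python
def count_characters(text):
    """Count all characters, including spaces, but not CR or LF."""
    return sum(1 for ch in text if ch not in "\r\n")


def language_weight(english_count, local_count):
    """English / local ratio, rounded to one decimal as in the table."""
    return round(english_count / local_count, 1)


# Character counts taken from the table above (English = 1162).
print(language_weight(1162, 314))   # Chinese (Simplified)
print(language_weight(1162, 598))   # Japanese
print(language_weight(1162, 1209))  # German
```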
A potential problem with using biblical texts for samples is that biblical translation tends to be extremely literal, from an original that itself is extremely simple (in vocabulary & syntax), and so you don't get in the translation many of the kinds of idiomatic expressions that characterize the language as it's ordinarily spoken & written. 71.191.124.63 15:29, 8 April 2008 (UTC)
Well, that would already be less of a problem if you chose the original, the Classical Greek Septuaginta, as a base and not the English translation. But of course, it is a fact that you can more easily translate Greek into some languages (Modern Greek, for example) than into others (like Pirahã). But there are no texts as often translated as the Bible, so Biblical texts are better than no texts. --::Slomox:: >< 16:19, 8 April 2008 (UTC)
Agreed: the convenience of using a text (like the Bible) that's available in hundreds of languages is important. But in that case, two changes might improve the reliability of the results: (1) use a bigger sample size, and (2) include in the sample some passages in a different style, perhaps some poetical flourishes from Job, Isaiah, Jeremiah, or Revelation, and the more abstruse or tendentious discussions in some of Paul's letters. 96.231.99.229 01:01, 10 April 2008 (UTC)
You can find a true Min Nan translation on this site. There are about 960 characters, so the weight should be 1.2. 83.200.43.51 00:43, 18 April 2008 (UTC)

I don't know the reason, maybe the translation of a biblical text does tend to be too literal, but I noticed that the results of the weighting procedure are not very good. Being a native speaker of Russian and Ukrainian, I always notice that translated texts tend to be longer in those languages than their English equivalents. Just to check this, I did a character count using MS Word on a few Russian translations of English texts, and the results showed the Russian versions to be a few percent longer: for example, Jack London's The Call of the Wild has 143K characters, and its Russian translation has 147K. The table, on the contrary, gives weights above one to those languages. Thus, at least with certain languages, the table does not do a good job and gives worse results than even unweighted counting. I would strongly recommend finding a couple of other text examples to be used for the weighting procedure. --Oleksii0 16:59, 11 May 2008 (UTC)

Interesting. I did a quick check of the first paragraph of the book. The English version is 557 characters and the Russian is 529 characters, which is a ratio of 1.05293. That does seem to be above one, but far less than the biblical ratio of 1.4. I think a more modern text would be better, but shouldn't we use the same text for all languages? Otherwise people can just find the most favorable text. Also, does the language in which the text was first written have an advantage? --MarsRover 07:26, 12 May 2008 (UTC)

Using random samples?

I had the idea of doing the same statistics here with a random sample of articles (say 1000), to see how this would affect the results. As a preliminary test, I hit the "random article" link 20 times in the five larger Wikipedias (by number of articles) and counted how often I'd have links to other Wikipedias. The results surprised me a little:

  • en.wp: 15 out of 20 had no interwiki links.
  • de.wp: 14 out of 20 had no interwiki links.
  • fr.wp:  7 out of 20 had no interwiki links.
  • pl.wp:  7 out of 20 had no interwiki links.
  • ja.wp: 11 out of 20 had no interwiki links.

This suggests that random samples wouldn't be a good idea -- in too many cases there would be no interwiki-linked articles (not necessarily because they're non-existent; maybe the interwiki bots haven't found them yet). If these numbers are representative, then by choosing en.wp for the random sampling, all other wp's wouldn't get more than 25% of the sample (which would guarantee low scores); but even choosing another wp wouldn't help (choosing ja.wp would still keep all others -- including en.wp -- with less than 45% of the sample). Any ideas on how to solve this problem? Or should the random sample idea be dropped? --Smeira 23:19, 24 November 2007 (UTC)

One obvious possibility that has now occurred to me: drop articles without interwiki links, and only add to the random sample articles with interwiki links. (But wouldn't this leave out some important information for comparing the quality level of the various .wp?) --Smeira 23:23, 24 November 2007 (UTC)
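
The 25% and 45% ceilings mentioned above follow directly from the counts in the preliminary test. A small Python sketch of that arithmetic, with the counts copied from the list above and hypothetical names:

```python
SAMPLE_SIZE = 20

# Articles without any interwiki links, out of 20 random articles per wiki.
without_interwiki = {"en": 15, "de": 14, "fr": 7, "pl": 7, "ja": 11}


def linked_percent(missing, sample_size=SAMPLE_SIZE):
    """Upper bound on the share of the sample another wiki could cover."""
    return 100.0 * (sample_size - missing) / sample_size


for wiki, missing in without_interwiki.items():
    print(wiki, linked_percent(missing))
```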
Sidetrack: The interwikis can be a significant part of an article's size in bytes.
This (4021) contains more bytes than this (4011).
-Jorunn 00:10, 25 November 2007 (UTC)
I think the random sample metric isn't a lost cause, but I don't think it mixes well with this article. The biggest problem isn't the interwiki links but the lack of consistency. As for ignoring articles without interwiki links, it depends on what you plan to measure. I found that local villages or local politicians are often missing in en.wp, so it distorts the balance of topics. But if you were measuring the average article size, I guess it shouldn't matter. MarsRover 01:03, 27 November 2007 (UTC)


Perhaps, in effect, you have tried to measure how useful Wikipedia is to a common man, which is difficult. It is helpful to get some narrower and more specific viewpoints, such as "how useful is Wikipedia to a ... Icelandic historian"? Or to a mathematician? (Or is planetmath more useful?) A mathematician wants a quick, brief and accurate definition of everything in her specialty, and also some motivation and history in a nearby field. Now that may not be an easy thing to survey. But the lesson is that thinking in that way may help in getting more specific measures. Hillgentleman 07:54, 28 November 2007 (UTC)


Maximum score

Maybe this is obvious but... I just noticed that the maximum score is fixed (1092 x 9 = 9828). I like that it's almost 10000, so easy to remember, and it's cool that each wiki can eventually catch up to the larger wikis, unlike with the "article count" metric. MarsRover 01:22, 27 November 2007 (UTC)

This is true, MarsRover. I am also happy with that :-). But I note there is also a drawback: this score actually measures how well a certain Wikipedia has dealt with the List of articles at Meta. Now, even assuming that everybody would agree the list is "perfect" and thus "representative" of what should be in an encyclopedia (I personally would disagree, and I've seen many others do that, too), an encyclopedia is certainly not limited to that. So, if two projects get the maximum score of 9828, it doesn't follow they're two equally good Wikipedias: all we can say is they've done a good job with the articles in the list, but nothing else. One of them might have another 10,000 featured articles on various themes, and the other might have nothing else. I'm hoping new ideas for new measurements will also come up... --Smeira 14:57, 27 November 2007 (UTC)
There is an expanded list in en.wp that has 2000 articles. I suspect you will have similar rankings with a larger list. A team of editors can possibly write long articles on the core topics to generate an outstanding score. That is something worthwhile to encourage, not really a drawback. Since these seem to be the agreed-upon core articles, if you wanted to expand the list, it's possible to find the nine most common wiki links from each article for en.wp, de.wp and fr.wp, and then you would have an expanded list of 10000 articles. Those extra 9000 tangential articles should be weighted less since they are non-core articles. I personally would like this metric to be a measurement of quality, since we already have a ton of metrics for quantity. --MarsRover 21:56, 27 November 2007 (UTC)
I didn't know about this list -- thanks, MarsRover. I agree that an outstanding core is something to encourage; I only meant that a Wikipedia that did a good job on this list and also on another 10,000 articles has better quality than one that only did a good job on the list, and this table currently won't see this difference. Of course, since most Wikipedias haven't done a good job on the list yet (even en.wp doesn't have the maximum A+ score), this problem is probably not important for the time being. --Smeira 16:49, 29 November 2007 (UTC)
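One possible reading of the expansion idea above, as a rough sketch only. It assumes the links of each core article have already been collected per language and mapped to a common title (for example the English one); `top_links` and `expand_list` are hypothetical helpers, not part of the posted script.

```python
from collections import Counter


def top_links(links_by_language, n=9):
    """Return the n links that appear in the most language versions.

    links_by_language maps a language code ("en", "de", "fr") to the
    list of link targets found in that version of one core article.
    """
    counts = Counter()
    for links in links_by_language.values():
        counts.update(set(links))  # count each link once per language
    # Most shared links first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:n]]


def expand_list(core_links, n=9):
    """Append up to n tangential titles per core article, without duplicates."""
    expanded = list(core_links)
    seen = set(expanded)
    for core_title in core_links:
        for title in top_links(core_links[core_title], n):
            if title not in seen:
                seen.add(title)
                expanded.append(title)
    return expanded
```

Different core articles can share links, and links can point back into the core list, so the expanded list may come out shorter than 10000 entries.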

Could someone please do a difference between the first list and the expanded list of 2000 vital articles from the English Wiki to make a "first 1000 articles" and a "second 1000 articles" list? -- Yekrats 18:43, 28 November 2007 (UTC)

That's a good idea. I'll try to do this tonight and post the difference here. --Smeira 16:49, 29 November 2007 (UTC)
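For reference, the difference between the two lists could be computed along these lines. This is a minimal sketch assuming both lists are available as plain lists of article titles; the function name is made up for illustration.

```python
def split_vital_lists(first_list, expanded_list):
    """Split the expanded list into titles already in the first list and the rest.

    Returns (second_list, only_in_first):
      second_list   - titles in the expanded list but not in the first list
      only_in_first - titles in the first list that the expanded list lacks
    Both results keep the order of the list they were taken from.
    """
    first_titles = set(first_list)
    expanded_titles = set(expanded_list)
    second_list = [t for t in expanded_list if t not in first_titles]
    only_in_first = [t for t in first_list if t not in expanded_titles]
    return second_list, only_in_first
```

The second return value is worth checking, since the two lists are maintained separately and the first is not guaranteed to be a strict subset of the expanded one.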

Source code

Acting on a suggestion from User:Mxn, I've posted the source code of the Python script I'm using for this table here: List of Wikipedias by sample of articles/Source code. Feel free to comment, suggest, mention problems, bugs, etc. (NB: following my heart's desires, the variable and file names are in Volapük... Should this cause any problem for understanding the script, please let me know.) --Smeira 16:49, 29 November 2007 (UTC)

Can you include the "list of articles" or tell us when you copied it? The "list of articles" is constantly getting tweaked, so we need this information to duplicate your results. Thanks. --MarsRover 18:24, 29 November 2007 (UTC)
I've just added the list of articles to the List of Wikipedias by sample of articles/Source code page (after the source code). --Smeira 19:43, 1 December 2007 (UTC)
Thanks, I added some suggestions to the source code here: Talk:List of Wikipedias by sample of articles/Source code --MarsRover 11:07, 2 December 2007 (UTC)
Thanks for the suggestions! I'll try them out in the next few days! For my personal enlightenment (my programming experience is small): what do you mean by "cacheing English pages", and how does that optimize the program? --Smeira 16:11, 3 December 2007 (UTC)
For any language besides English, you must first get the English page, then get the interwiki link, then get the local page (since your "list of articles" is in English). So, if you process a bunch of languages at once, it now gets each English page only once and reuses it for the other languages. It should make it go twice as fast. --MarsRover 19:43, 3 December 2007 (UTC)
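The caching idea described above can be sketched roughly as follows. This is not the posted script: `fetch_page` is a hypothetical callable that takes a language code and a title and returns a dict with an `"interwiki"` mapping, and `process_languages` is an illustrative name.

```python
def process_languages(titles, languages, fetch_page):
    """Fetch the local page for every (language, title) pair.

    Each English page is downloaded once, kept in a dict, and reused
    to look up the interwiki link for every other language.
    """
    english_cache = {}  # English title -> English page
    results = {}
    for lang in languages:
        for title in titles:
            if title not in english_cache:
                english_cache[title] = fetch_page("en", title)
            english_page = english_cache[title]
            if lang == "en":
                results[(lang, title)] = english_page
                continue
            local_title = english_page["interwiki"].get(lang)
            if local_title is None:
                results[(lang, title)] = None  # article absent in this wiki
            else:
                results[(lang, title)] = fetch_page(lang, local_title)
    return results
```

Without the cache, every non-English lookup costs two requests (English page plus local page); with it, the English request is paid once per title, which is where the roughly twofold speed-up for many languages comes from.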

Is it OK if we update the table with the new code? (Note that it uses character weighting.) The list of articles has changed a little, but I suppose it's best to keep the old one until the list is somewhat stable. Even so, it would be good to know what happens with the old list of articles, since a month has passed since the last update. --Meldor 21:35, 29 December 2007 (UTC)

I can do that. Do you mean using option 2 weights? I wasn't able to find weights for all 150+ languages in the current table so I would use a weight of one for those missing a weight. But I don't think the weight will be a big factor for tiny wikis. I would use the current list of articles but I see no reason why the list couldn't be updated in the future. --MarsRover 20:30, 1 January 2008 (UTC)
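The fallback described above (a weight of one for any language without a known weight) could look like this. It is a minimal sketch: the language codes and weight values are invented placeholders, not the actual option 2 weights.

```python
DEFAULT_WEIGHT = 1.0  # used for any language missing from the weight table


def weighted_size(language, char_count, weights):
    """Scale a raw character count by the language's weight, defaulting to one."""
    return char_count * weights.get(language, DEFAULT_WEIGHT)


# Placeholder weights for illustration only; "aa" and "bb" are not real entries.
example_weights = {"aa": 1.5, "bb": 0.8}
print(weighted_size("aa", 1000, example_weights))  # 1500.0
print(weighted_size("zz", 1000, example_weights))  # 1000.0
```

With a default of one, a language that has no weight is simply scored on its raw character count.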

Good ranking

It promotes quality, as it involves a community in basic articles.