Talk:List of Wikipedias by sample of articles/Archives/2007

Please do not post any new comments on this page. This is a discussion archive first created in 2007, although the comments contained were likely posted before and after this date. See current discussion or the archives index.

Any "lists of absent articles of each major minor Wikipedias"?

If possible, I suggest that the lists of absent articles for each major Wikipedia be generated on another page for reference. Actually, in Chinese Wikipedia's "List of articles every Wikipedia should have", all articles listed there have been created already (0 absent articles). However, since that list was created more than a year ago, we don't know which 34 articles have been newly added to the list on Meta but not yet created in the Chinese version. -- Kevinhksouth 14:11, 8 November 2007 (UTC)

That's a good idea. My software didn't record the list of absent articles -- I'll change it so that it does, it isn't very difficult. So when I update this page (say, in a couple of months), I'll make sure I have a list of 'missing articles'. You're right that the list of articles every Wikipedia should have has changed in the last year, and may even change in the future. (How are the articles chosen, by the way? By voting? Or can anyone who thinks a certain article is important just add it?) --Smeira 17:27, 10 November 2007 (UTC)

Can you not check it from the diff? Hillgentleman 23:39, 11 November 2007 (UTC)

What do you mean? Smeira 01:40, 12 November 2007 (UTC)

E.g.[1]Hillgentleman 02:07, 12 November 2007 (UTC)

Oh, I thought you had meant I could recover the information about which articles from the list were not found on the Chinese Wikipedia. Yes, I can see new additions to the list of Wikipedias this way. Thanks! --Smeira 07:52, 12 November 2007 (UTC)

Hello, you announced above that you will have a list (or better, lists) of 'missing articles'. Do you have such lists? Where can we consult them? --Jauclair 22:40, 20 March 2008 (UTC)

List of Wikipedias by sample of articles/Absent Articles MarsRover 22:53, 20 March 2008 (UTC)

My Wikipedia samplings

I was testing creating statistics from sampling articles from various Wikipedias. (A lot simpler than downloading gigabytes of data). Get 30 random articles and figure out what was covered based on the List of articles every Wikipedia should have topics. Also, figure out what percent of the article edits were made by a robot. It's pretty interesting seeing the differences. If a wiki has a lot of articles with unknown categories, it just means that they have no interwiki links to the English Wikipedia.

№	Language	Language (local)	Wiki	Articles	Sampling																														Human %
1	English	English	en	2 086 764	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	92
2	German	Deutsch	de	664 151	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	94
3	French	Français	fr	581 388	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	81
4	Polish	Polski	pl	441 111	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	65
5	Japanese	日本語	ja	434 477	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	92
6	Dutch	Nederlands	nl	378 985	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	67
7	Italian	Italiano	it	370 815	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	76
8	Portuguese	Português	pt	338 165	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	72
9	Spanish	Español	es	297 977	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	76
10	Swedish	Svenska	sv	260 357	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	68
11	Russian	Русский	ru	213 652	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	70
12	Chinese	中文	zh	152 730	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	83
13	Finnish	Suomi	fi	139 517	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	76
14	Norwegian (Bokmål)	Norsk (Bokmål)	no	139 251	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	65
15	Volapük	Volapük	vo	112 390	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	2
16	Romanian	Română	ro	95 981	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	60
17	Turkish	Türkçe	tr	94 966	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	82
18	Esperanto	Esperanto	eo	90 938	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	57
19	Catalan	Català	ca	86 627	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	55
20	Lombard	Lumbaart	lmo	84 518	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	16
21	Slovak	Slovenčina	sk	83 079	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	42
22	Czech	Čeština	cs	81 574	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	65
23	Ukrainian	Українська	uk	77 040	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	69
24	Hungarian	Magyar	hu	76 233	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	69
25	Danish	Dansk	da	73 498	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	59
26	Indonesian	Bahasa Indonesia	id	68 974	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	56
27	Hebrew	עברית	he	65 693	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	78
28	Lithuanian	Lietuvių	lt	56 433	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	58
29	Serbian	Српски / Srpski	sr	56 050	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	35
30	Slovenian	Slovenščina	sl	54 589	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	52
31	Bulgarian	Български	bg	48 195	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	73
32	Korean	한국어	ko	45 843	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	73
33	Arabic	العربية	ar	43 802	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	64
34	Estonian	Eesti	et	42 785	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	76
35	Telugu	తెలుగు	te	37 560	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	33
36	Croatian	Hrvatski	hr	36 940	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	63
37	Newar / Nepal Bhasa	नेपाल भाषा	new	37 013	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	20
38	Cebuano	Sinugboanong Binisaya	ceb	33 510	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	8
39	Galician	Galego	gl	29 187	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	54
40	Greek	Ελληνικά	el	29 071	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	59
41	Thai	ไทย	th	29 036	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	64
42	Norwegian (Nynorsk)	Nynorsk	nn	27 318	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	54
43	Persian	فارسی	fa	27 386	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	91
44	Vietnamese	Tiếng Việt	vi	26 305	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	67
45	Malay	Bahasa Melayu	ms	24 242	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	64
46	Bishnupriya Manipuri	ইমার ঠার/বিষ্ণুপ্রিয়া মণিপুরী	bpy	22 091	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	0
47	Basque	Euskara	eu	21 502	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	28
48	Bosnian	Bosanski	bs	20 919	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	44
49	Simple English	Simple English	simple	20 674	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	56
50	Icelandic	Íslenska	is	18 264	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	69

Category Legend
Biography	History	Geography	Society	Culture	Science	Technology	Foodstuff	Mathematics	Unknown

I am still working on the categorization logic. Most of the problems are from chaotic en.wp categorizations. The Human % is the percent of edits for the articles where the username doesn't contain "Bot". --MarsRover 10:45, 11 November 2007 (UTC)

A quite interesting approach, MarsRover! (By the way, how do you get automatic access to the editors' names? Are you using the py.wikipedia framework? Is there a function/variable/method there that will do this?) I agree that there is a problem with en.wp categorization logic, and further yet: with decisions about what becomes an article and what doesn't. One example: when I was doing de.wp, it turned out that (of all things) the article en:sex didn't have a link to de.wp. This is because en.wp has a general article for both human sexes (plus an article about male sex and another one about female sex), whereas de.wp doesn't: they preferred to have de:männliches Geschlecht (=male sex) and de:weibliches Geschlecht (=female sex) as independent articles, without a general one. This makes it look like de.wp forgot one of the articles in the list, but that's not true; in fact they only made a different choice about where to put the content (in two articles instead of one). If we were using the list of desirable articles from de.wp, the results might be different...
If you're interested, maybe we could talk more about the topic and try to come up with an interesting measure together. --Smeira 13:05, 11 November 2007 (UTC)

I am not using the python framework. I am just downloading the history page and parsing the HTML. (Actually just the history for last 50 edits). I think your approach with a fixed list of articles (instead of random articles) is a better method. Otherwise, people can keep "rolling the dice" until they have a nice score. But it does show you what the wiki contains beside the important articles. MarsRover 23:11, 11 November 2007 (UTC)

I wonder if there isn't an automated way to get the list of contributors into some variable... that would make it easier. A refinement might be to check which users have bot flags and which don't -- some human user names like say Abbot might accidentally contain the word "bot" -- but people are usually pretty consistent in adding "Bot" to bot accounts, so probably this wouldn't affect your results very much.

I looked for a way. The "Special:Export" method is supposed to allow downloading different revisions with the user information. But this ability has been disabled. Also, a lot of bots are not setting the bot flag. btw, I fixed it to be case insensitive when looking for "Bot" in the username and re-generated the table. MarsRover 05:40, 12 November 2007 (UTC)
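The "Human %" rule described above (count an edit as human unless the username contains "bot", ignoring case) can be sketched roughly as follows. This is only an illustration, not MarsRover's actual script: the function name `human_percent` and the sample usernames are assumptions made for the example, and the parsing of the history page is left out.

```python
def human_percent(usernames):
    """Percentage of edits whose username does not contain 'bot' (case-insensitive)."""
    if not usernames:
        return 0.0
    human = [name for name in usernames if "bot" not in name.lower()]
    return 100.0 * len(human) / len(usernames)


# Hypothetical editors of the last few revisions of one article.
edits = ["Xtv", "SieBot", "Thijs!bot", "Abbot", "MarsRover"]
print(human_percent(edits))
```

Note that the human-sounding name "Abbot" is counted as a bot here, which is exactly the false positive mentioned earlier in the thread; checking the bot flag would avoid it, but only for bots that actually set the flag.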

About the human %, I think it should be improved quite a lot. Taking just the last 50 contributions is quite a disadvantage for small Wikipedias where the content of an article is correct but not frequently updated. Then, most of the last 50 contributions will be just bots adding interwikis. Here is an example of what I mean: ca:Essen. I created that article. It's quite a good article and since it was "finished", there have been just minor changes and many, many bots adding interwikis. So, I think you could never say that this article has 3% human content. I would estimate it at around 90% human content, but your algorithm would report something around 3-5%. So, proposals to improve it: calculate the difference in bytes of every contribution, or don't count interwiki bots. Moreover, I have a bot account which I use manually and sometimes with AWB to make systematic corrections, change categories, etc. It has a bot flag, but is controlled manually. The changes it makes amount to very few bytes. So, I think a bot that creates articles, a bot that adds interwikis and a bot that moves categories are not the same. The second and the third shouldn't count as much as the first when comparing the human/bot impact in Wikipedia. Anyway, I think you're doing good work!--Xtv 20:43, 12 November 2007 (UTC)

I was thinking of doing that, but it's more complicated since you need all the history (since the first edit could be the most important) and then you need to parse the actual edits to see whether each was an interwiki edit or not. The way it is now, it's really "Human % of Activity" (not "Human % of Content"). It is still an interesting metric, and an admin who is working on building a community would want to increase the number. --MarsRover 03:48, 13 November 2007 (UTC)

I find curious that there is a Wiki (bpy) with 0% human edits... Malafaya 13:57, 17 November 2007 (UTC)

This is just a sampling of 30 articles. I'm sure there is at least one human edit in the wiki. But if you click on any of the samples, they look like they were imported by a bot. -MarsRover 21:38, 17 November 2007 (UTC)

Biased against Asian languages

What is a character in your calculations? If you mean an actual character in the language, you need to normalize the value of each character since a Japanese character packs more information than an English character. It seems ja.wp should be ranked higher. Maybe you can do the following:

Use bytes instead of characters - if the article's text is encoded as UTF-8, this would mean an English character is worth 1 and a Japanese character would be worth 2 or 3.
Weight the character per language - weighted_char_count = char_count * alphabet_size / english_alphabet_size;
Use the compressed size - if you compressed the text with the "deflate" algorithm, the resulting size is an approximation of the information value. E.g. a lot of spaces counts less than a lot of unique letters and words.

--MarsRover 23:36, 11 November 2007 (UTC)
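A minimal sketch of the first and third proposals above, for comparison with a plain character count. It assumes Python 3 and the standard `zlib` module; `zlib.compress` uses deflate internally but adds a small header and checksum, so the compressed figure is only an approximation of the pure deflate size. The function name `size_measures` and the sample strings are made up for the example.

```python
import zlib


def size_measures(text):
    """Return (characters, UTF-8 bytes, zlib/deflate-compressed bytes) for text."""
    raw = text.encode("utf-8")
    return len(text), len(raw), len(zlib.compress(raw, 9))


print(size_measures("language"))    # ASCII letters: one byte per character
print(size_measures("日本語"))       # each of these characters takes 3 bytes in UTF-8
print(size_measures(" " * 1000))    # a long run of spaces compresses to very little
```

For very short strings the compressed size is dominated by the fixed overhead, so this measure only becomes meaningful on whole articles.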

More. For the same content, a text in German is substantially longer than one in English or French. It would be best to set the mark (e.g. against the English language) by asking the locals if there are exact translations of a number of texts. To get a rough idea, you may dig into the translations on meta.Hillgentleman 00:04, 12 November 2007 (UTC)

Yes, I've been thinking about these issues. What I am using for length is the return of the len() function in python -- I am not sure whether or not it considers Chinese characters as one or more than one (when I try to calculate the length of, say, "日本語" in the python shell, I get only an "unsupported characters in input" error; if, however, my bot program gets the text of a page from zh.wp and calculates its len(), I do get a valid numeric result). I've considered weighting the results (so in my comparative corpus, German tends to be between 1.4 and 1.5 times longer than English); maybe I should do it for higher precision, but when I looked at the results it didn't seem so worthwhile (the distance between en.wp and de.wp is already high even without this weight; adding it to the calculation would mean simply increasing it; of course, Chinese and Japanese might be a different story, if the len() function treats Kanji characters as one normal ASCII character instead of two or three). I also noted that the comparability between results might not be so much increased, since the "information ratio" or "weight" is not a transitive operation (i.e. Weight(English/German) * Weight(German/French) won't be necessarily equal to Weight(English/French), because "translate" is not really a structure-preserving bijection and the domains in question -- the semantic range and its actualization in vocabulary items in the various languages -- are not exactly the same). But maybe I'll add a further column with a weighted score (with respect to English) or mean article length just to see how it compares to the absolute numbers. Of course, I'd need sufficiently many translations from English into all these languages to estimate the weights -- even with the Meta translations, I'll probably have to limit myself to the larger Wikipedias. Smeira 01:35, 12 Nov 2007

Another weight comparison could be based on a word count of raw article pages, less articles (grammar) for languages which have any, less some markup. Obviously, this is imprecise, too, and may be hard or expensive to calculate. Note that determining word boundaries and eliminating articles, and possibly similar words, is highly language-dependent and not obvious for languages such as Thai, Japanese, the Chinese varieties, and some others. --Purodha Blissenbach 14:44, 12 November 2007 (UTC)

Proposed weighting of characters for formula (Option#1 using The Lord's Prayer)

Based on the article en:The Lord's Prayer in different languages, I came up with the following ratios. For the languages not listed I would just use a 1.0 ratio. We should at least use a rounded ratio of 0.5 for all Asian languages since they are the most distorted.

Language	Local Characters	English Characters	Ratio (Local / English)
German	374	365	1.0
French	438	332	1.3
Italian	311	261	1.2
Portuguese	362	365	1.0
Romanian	407	379	1.1
Spanish	391	345	1.1
Bulgarian	347	325	1.1
Croatian	323	346	0.9
Polish	358	325	1.1
Russian	324	305	1.1
Bengali	337	333	1.0
Hindi	323	333	1.0
Urdu	335	333	1.0
Armenian	259	287	0.9
Finnish	416	333	1.2
Hungarian	404	332	1.2
Chinese (Simplified)	127	333	0.4
Chinese (Traditional)	120	367	0.3
Indonesian	410	333	1.2
Korean	192	333	0.6
Japanese	165	391	0.4
Hebrew	233	333	0.7
Arabic	368	350	1.1
Georgian	451	362	1.2
Esperanto	363	341	1.1

Interesting. One question though: why is the number of English characters in the Lord's Prayer not the same in every case? Wasn't the same English text of the Lord's Prayer compared to its counterpart in each of the other languages? (The page you mention appears to have the same English text in all cases.) Some of the weights are surprising: 1.0 for English:German and for Portuguese:German? That goes against my (extensive) experience with both languages (the latter my mother tongue); could it be just because the Lord's Prayer is a somewhat old-fashioned text?

The translations are sometimes slightly different across languages especially the second to last line.

"For thine is the kingdom, the power and glory, forever and ever"
"For Thine is the kingdom, the power, and the glory, For ever and ever."
"For the kingdom, the power, and the glory are yours, now and forever"
"For the kingdom, the power, and the glory are yours, Both now and ever and unto the ages of ages."

I am assuming the English next to the local text is the exact wording used. A couple of the ratios seem odd, such as Hebrew being unusually short and French unusually long. But this text, although short and religious, seems to be popular for this sort of comparison. --MarsRover 23:45, 26 November 2007 (UTC)

Anyway, here is the table for the first 20 languages, with average article size corrected using the weights you proposed (for Dutch, Swedish, Vietnamese, Norwegian, Catalan and Serbian, I calculated the Lord's Prayer coefficient myself), for everybody's appreciation (for most of these languages, ordering by average article size and by score coincide):

Language	Average article size (measured via len())	Weight (language-to-English)	Average article size (divided by weight)
English	41 434	1.0	41 434
German	31 113	1.0	31 113
French	25 993	1.3	19 995
Spanish	20 694	1.1	18 813
Italian	18 195	1.2	15 162
Russian	14 131	1.1	12 846
Portuguese	13 140	1.0	13 140
Dutch	11 690	1.2	9 742
Polish	12 119	1.1	11 017
Finnish	11 123	1.2	9 269
Hungarian	10 717	1.2	8 930
Swedish	9 911	1.0	9 911
Hebrew	9 368	0.7	13 382
Vietnamese	9 912	1.0	9 912
Czech	8 846	1.0	8 846
Japanese	8 598	0.4	21 495
Norwegian	8 603	1.0	8 603
Catalan	8 063	1.4	5 759
Chinese	6 953	0.4	17 382
Serbian	7 250	1.0	7 250

Indeed, the changes for Chinese and Japanese (and to a lesser extent also for Hebrew) are dramatic: Japanese would jump ahead of Spanish and become #4, Chinese would be right after Italian as #6. The coefficients for more translated texts must be compared before we could arrive at better values (does anyone know other texts that could have sufficiently many translations? else we'd have to limit this table to only those languages for which there are translations to calculate the coefficients); however, this change in the result would probably remain. (Of course, in order to use these coefficients meaningfully, it would be necessary to refine article measurements: I am currently simply calculating the len() of the whole page, but simply dividing this by the appropriate coefficient would imply treating e.g. the interwiki links as if they were part of the text -- and in the case of the more popular articles, the len() of the interwiki links alone can be close to 1000 characters...).

A final question of some importance. Does anyone know whether the len() function in python, when applied to a page in Chinese or Japanese, actually counts every Chinese character as one character -- or as more than one? (If it counts Chinese characters as sequences of more than one character -- or more than one byte --, then of course the coefficients cannot be used.) --Smeira 22:45, 24 November 2007 (UTC)
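The correction used in the table above is simply the measured average size divided by the language-to-English weight. A small sketch, using three rows copied from the table; the dictionary layout and the name `adjusted_size` are assumptions for the example, not taken from the actual script:

```python
# (average article size via len(), Option #1 weight) for three rows of the table above
rows = {
    "French": (25993, 1.3),
    "Japanese": (8598, 0.4),
    "Catalan": (8063, 1.4),
}


def adjusted_size(size, weight):
    """Average article size expressed in 'English-equivalent' characters."""
    return size / weight


for language, (size, weight) in rows.items():
    print(language, round(adjusted_size(size, weight)))
```

Because the weights are rounded to one decimal place, the adjusted sizes are only rough figures; a small change in a weight (e.g. 0.4 versus 0.5 for Japanese) moves the result by thousands of characters.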

1. To be accurate, we should first remove the templates calls.

2. On len():

print(u'早晨！')
print(len(u'早晨！'))

gives you 3.

Hillgentleman 23:00, 24 November 2007 (UTC)

OK, so using the coefficients on len() results makes sense. For some reason I always get "Unsupported characters in input" errors when I try the above commands from the python shell. --Smeira 23:21, 24 November 2007 (UTC)

Reply:

I would just remove the interwiki links since they are mostly foreign text. Categories and Templates should be in the local language so should be correct for the ratio calculation. Also, since the keyword "Category" is different for each wiki, it's not that easy to remove. Templates can contain valuable text such as with Infoboxes. The unimportant templates are usually short like {{Stub}}. So, I would include all templates.
Other text choices for comparison include Babel text, en:The North Wind and the Sun and en:Schleicher's fable. Finding all the translations will be a challenge and perhaps we can use the average ratio of the language group for the obscure ones.

MarsRover 00:40, 27 November 2007 (UTC)
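One possible way to drop the interwiki links before calling len(), while keeping categories and templates as suggested above, is a regular expression over the wikitext. This is a sketch under stated assumptions: `LANG_PREFIXES` is a hypothetical and deliberately incomplete list of language codes (a real run would need the full list), and the sample page text is invented.

```python
import re

# Hypothetical, incomplete list of interwiki language prefixes.
LANG_PREFIXES = ("en", "de", "fr", "ja", "zh", "simple")

# Matches [[xx:Title]] plus a trailing newline; inline links written as
# [[:xx:Title]] start with a colon and are left untouched.
INTERWIKI_RE = re.compile(r"\[\[(?:%s):[^\[\]|]*\]\]\n?" % "|".join(LANG_PREFIXES))


def strip_interwiki(wikitext):
    """Remove interwiki links; categories and templates are kept."""
    return INTERWIKI_RE.sub("", wikitext)


page = "Essen is a city.\n{{Stub}}\n[[Category:Cities]]\n[[de:Essen]]\n[[ja:エッセン]]\n"
text = strip_interwiki(page)
print(len(page), len(text))
```

The category link survives because "Category" is not in the prefix list, which sidesteps the problem that the category keyword differs from wiki to wiki.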

Note that the weight given to the Catalan language is not accurate (Catalan is similar to French, Spanish and Italian, so it should have a weight similar to theirs). Taking into account the data from the second option, the weight would be 0.9. --Meldor 23:52, 9 December 2007 (UTC)

Proposed weighting of characters for formula (Option#2 using Babel text)

Note: please do not archive this section, as it is linked from the content page.

Using "Babel text" which are the Biblical verses about the en:Tower of Babel is another option. Translations in various languages are easily found [2][3]. The text is longer than the option #1 so is hopefully more accurate.

Language	Characters	Language Weight (English / Local)
English	1162	1.0
German	1209	1.0
French	1157	1.0
Italian	1078	1.1
Portuguese	1038	1.1
Romanian	1060	1.1
Spanish	1095	1.1
Bulgarian	1016	1.1
Croatian	883	1.3
Polish	1057	1.1
Russian	840	1.4
Bengali	-	-
Hindi	-	-
Urdu	-	-
Armenian	932	1.2
Finnish	1096	1.1
Hungarian	1059	1.1
Chinese (Simplified)	314	3.7
Chinese (Traditional)	315	3.7
Korean	462	2.5
Japanese	598	1.9
Hebrew	964	1.2
Arabic	1195	1.0
Georgian	-	-
Esperanto	1083	1.1
Thai	1109	1.0
Farsi/Persian	957	1.2
Tamil	1357	0.9
Dutch	1297	0.9
Cebuano	1395	0.8
Vietnamese	1100	1.1
Swedish	1084	1.1
Norwegian (Bokmal)	931	1.2
Turkish	867	1.3
Slovak	869	1.3
Czech	911	1.3
Ukrainian	969	1.3
Danish	942	1.2
Lithuanian	949	1.2
Serbian	822	1.4
Slovenian	977	1.2
Greek	1293	0.9
Catalan	1090	1.1
Latin	1050	1.1
Welsh	1010	1.2
Basque	1076	1.1
Malayalam	1021	1.1
Belarusian (Taraškievica)	834	1.4
Indonesian	1256	0.9
Galician	992	1.1

It seems a better option than the one above. I don't know how you count the characters, though; it seems to me there are 1050 characters in the New International Version, if that's the one we would like to take as reference. The English Standard Version would be a better option in my opinion (it best reflects English, not simple English); there are 1181 characters there.

Note also there is the Catalan version, which is 1108 characters long. That results in a completely different ratio than in the option above, so I don't know if that way of calculating ratios really makes sense. Here, Catalan has a ratio of 0.9 (0.94). --Meldor 23:50, 9 December 2007 (UTC)

Ok, I changed it to use the English Standard Version for the reference. The ratio is a rough figure and probably should be rounded even more (0.4, 0.6, 0.8, 1.0, 1.2) so the choice of translation isn't a factor. I think it's correct for adjusting the Asian language problem but maybe not for calculating minor differences in European languages.

To calculate the count I paste the text into MSWord and bring up the statistics window. The character count should be all characters including spaces but excluding carriage return and line feed characters. --MarsRover 06:04, 10 December 2007 (UTC)
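The counting rule just described (every character including spaces, but not carriage returns or line feeds) and the weight column (English count divided by local count, rounded to one decimal) can be sketched like this. The function names are assumptions for the example, and the MS Word count may differ slightly from Python's len() for some scripts.

```python
def char_count(text):
    """Count all characters, including spaces, but not CR or LF."""
    return len(text.replace("\r", "").replace("\n", ""))


def language_weight(english_chars, local_chars):
    """Weight as in the Option #2 table: English count / local count, one decimal."""
    return round(english_chars / local_chars, 1)


print(char_count("a b\r\nc"))
print(language_weight(1162, 314))    # Chinese (Simplified) row
print(language_weight(1162, 598))    # Japanese row
print(language_weight(1162, 1209))   # German row
```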

A potential problem with using biblical texts for samples is that biblical translation tends to be extremely literal, from an original that itself is extremely simple (in vocabulary & syntax), and so you don't get in the translation many of the kinds of idiomatic expressions that characterize the language as it's ordinarily spoken & written. 71.191.124.63 15:29, 8 April 2008 (UTC)

Well, that would already be less of a problem if you chose the original, the Classical Greek Septuaginta, as a base and not the English translation. But of course, it is a fact that you can more easily translate Greek into some languages (Modern Greek for example) than into some others (like Pirahã). But there are no texts as often translated as the Bible, so Biblical texts are better than no texts. --::Slomox:: >< 16:19, 8 April 2008 (UTC)

Agreed: the convenience of using a text (like the Bible) that's available in hundreds of languages is important. But in that case, two changes might improve the reliability of the results: (1) use a bigger sample size, and (2) include in the sample some passages in a different style, perhaps some poetical flourishes from Job, Isaiah, Jeremiah, or Revelation, and the more abstruse or tendentious discussions in some of Paul's letters. 96.231.99.229 01:01, 10 April 2008 (UTC)

You can find a true Min Nan translation on this site. There are about 960 characters, so the weight should be 1.2. 83.200.43.51 00:43, 18 April 2008 (UTC)

I don't know the reason, maybe the translation of a biblical text does tend to be too literal, but I noticed that the results of the weighting procedure are not very good. Being a native speaker of Russian and Ukrainian languages, I always notice that translated texts tend to be longer in those languages than their English equivalents. Just to check this I did character count using MS Word in a few Russian translations of English texts, the results showed a few percent longer Russian versions, for example Jack London. The Call of the Wild has 143K characters, and its Russian translation has 147K. The table, on the contrary, gives weights above one to those languages. Thus, at least with certain languages, the table does not do a good job and gives worse results than even unweighted counting. I would strongly recommend to find a couple of other text examples to be used for the weighting procedure.--Oleksii0 16:59, 11 May 2008 (UTC)

Interesting. I did a quick check of the first paragraph of the book. The English version is 557 characters and the Russian is 529 characters, which is a ratio of 1.05293. It does seem to be above one but far less than the biblical 1.4 ratio. I think a more modern text would be better but shouldn't we use the same text for all languages? Otherwise people can just find the most favorable text. Also, does the language the text was first written in have an advantage? --MarsRover 07:26, 12 May 2008 (UTC)

Using random samples?

I had the idea of doing the same statistics here with a random sample of articles (say 1000), to see how this would affect the results. As a preliminary test, I hit the "random article" link 20 times in the five larger Wikipedias (by number of articles) and counted how often I'd have links to other Wikipedias. The results surprised me a little:

en.wp: 15 out of 20 had no interwiki links.
de.wp: 14 out of 20 had no interwiki links.
fr.wp: 7 out of 20 had no interwiki links.
pl.wp: 7 out of 20 had no interwiki links.
ja.wp: 11 out of 20 had no interwiki links.

This suggests that random samples wouldn't be a good idea -- in too many cases there would be no interwiki-linked articles (not necessarily because they're non-existent; maybe the interwiki bots haven't found them yet). If these numbers are representative, then by choosing en.wp for the random sampling, all other wp's wouldn't get more than 25% of the sample (which would guarantee low scores); but even choosing another wp wouldn't help (choosing ja.wp would still keep all others -- including en.wp -- with less than 45% of the sample). Any ideas on how to solve this problem? Or should the random sample idea be dropped? --Smeira 23:19, 24 November 2007 (UTC)

One obvious possibility that has now occurred to me: drop articles without interwiki links, and only add to the random sample articles with interwiki links. (But wouldn't this leave out some important information for comparing the quality level of the various .wp?) --Smeira 23:23, 24 November 2007 (UTC)
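The percentages quoted above follow directly from the counts: the share of a random sample that any other Wikipedia could possibly cover is at most the share of sampled articles that have interwiki links at all. A short sketch using the five counts from the test (the names `no_interwiki` and `linked_share` are made up for the example):

```python
SAMPLE_SIZE = 20

# Articles without any interwiki link, out of 20 random articles per wiki.
no_interwiki = {"en": 15, "de": 14, "fr": 7, "pl": 7, "ja": 11}


def linked_share(missing, sample_size=SAMPLE_SIZE):
    """Upper bound (in %) on the part of the sample other wikis could cover."""
    return 100 * (sample_size - missing) / sample_size


for wiki, missing in no_interwiki.items():
    print(wiki, linked_share(missing))
```

With only 20 articles per wiki these figures are of course very noisy; they illustrate the arithmetic, not a reliable estimate.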

Sidetrack: The interwikis can be a significant part of an article's size in bytes.
This (4021) contains more bytes than this (4011).

-Jorunn 00:10, 25 November 2007 (UTC)

I think the random sample metric isn't a lost cause but I don't think it mixes well with this article. The biggest problem isn't the interwiki links but the lack of consistency. As far as ignoring articles without interwiki links, it depends on what you plan to measure. I found local villages or local politicians are often missing in en.wp so it distorts the balance of topics. But if you were measuring the average article size I guess it shouldn't matter. MarsRover 01:03, 27 November 2007 (UTC)

Perhaps, in effect, you have tried to measure how useful wikipedia is to a common man, which is difficult. It is helpful to get some narrower and more specific view points, such as "how useful is wikipedia to a ... Icelandic historian"? Or to a mathematician? (Or is planetmath more useful?) A mathematician wants a quick, brief and accurate definition of everything in her speciality, and also some motivation and history in a near-by field. Now that may not be an easy thing to survey. But the lesson is that thinking in that way may help in getting more specific measures. Hillgentleman 07:54, 28 November 2007 (UTC)

Maximum score

Maybe this is obvious but... I just noticed that the maximum score is fixed (1092 x 9 = 9828). I like that it's almost 10000, so easy to remember, and it's cool that each wiki can eventually catch up to the larger wikis, unlike with the "article count" metric. MarsRover 01:22, 27 November 2007 (UTC)

This is true, MarsRover. I am also happy with that :-). But I note there is also a drawback: this score actually measures how well a certain Wikipedia has dealt with the List of articles at Meta. Now, even assuming that everybody would agree the list is "perfect" and thus "representative" of what should be in an encyclopedia (I personally would disagree, and I've seen many others do that, too), an encyclopedia is certainly not limited to that. So, if two projects get the maximum score of 9828, it doesn't follow they're two equally good Wikipedias: all we can say is they've done a good job with the articles in the list, but nothing else. One of them might have another 10,000 featured articles on various themes, and the other might have nothing else. I'm hoping new ideas for new measurements will also come up... --Smeira 14:57, 27 November 2007 (UTC)

There is an expanded list in en.wp that has 2000 articles. I suspect you will have similar rankings with a larger list. A team of editors could write long articles on the core topics to generate an outstanding score. That is something worthwhile to encourage, not really a drawback. Since these seem to be the agreed-upon core articles, if you wanted to expand the list, it's possible to find the nine most common wiki links from each article in en.wp, de.wp and fr.wp, and then you would have an expanded list of 10000 articles. Those extra 9000 tangential articles should be weighted less since they are non-core articles. I personally would like this metric to be a measurement of quality, since we already have a ton of metrics for quantity. --MarsRover 21:56, 27 November 2007 (UTC)

I didn't know about this list -- thanks, MarsRover. I agree that an outstanding core is something to encourage; I only meant that a Wikipedia that did a good job on this list and also on another 10,000 articles has better quality than one that only did a good job on the list, and this table currently won't see this difference. Of course, since most Wikipedias haven't done a good job on the list yet (even en.wp doesn't have the maximum A+ score), this problem is probably not important for the time being. --Smeira 16:49, 29 November 2007 (UTC)

Could someone please do a difference between the first list and the expanded list of 2000 vital articles from the English Wiki to make a "first 1000 articles" and a "second 1000 articles" list? -- Yekrats 18:43, 28 November 2007 (UTC)

That's a good idea. I'll try to do this tonight and post the difference here. --Smeira 16:49, 29 November 2007 (UTC)
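The difference requested above could be computed roughly as follows. This is only a sketch: it assumes both lists are available as plain lists of article titles and that titles match exactly between the two lists; `split_lists` is a hypothetical helper name.

```python
def split_lists(core_titles, expanded_titles):
    """Split the expanded list into titles shared with the core list and new ones.

    Returns (first, second, only_core):
      first     - expanded titles that are also in the core list
      second    - expanded titles that are not in the core list
      only_core - core titles that are missing from the expanded list
    """
    core = set(core_titles)
    expanded = set(expanded_titles)
    # Keep the order of the expanded list so the output is easy to compare by eye.
    first = [t for t in expanded_titles if t in core]
    second = [t for t in expanded_titles if t not in core]
    only_core = sorted(core - expanded)
    return first, second, only_core
```

The third result is worth checking, because the two lists are maintained separately and the core list is not necessarily a subset of the expanded one.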

Source code

Acting on a suggestion from User:Mxn, I've posted the source code of the Python script I'm using for this table here: List of Wikipedias by sample of articles/Source code. Feel free to comment, suggest, mention problems, bugs, etc. (NB: following my heart's desires, the variable and file names are in Volapük... Should this cause any problem for understanding the script, please let me know.) --Smeira 16:49, 29 November 2007 (UTC)

Can you include "list of articles" or tell us when you copied the article? The "list of articles" is constantly getting tweaked so we need this information to duplicate your results. Thanks. --MarsRover 18:24, 29 November 2007 (UTC)

I've just added the list of articles to the List of Wikipedias by sample of articles/Source code page (after the source code). --Smeira 19:43, 1 December 2007 (UTC)

Thanks, I added some suggestions to the source code here: Talk:List of Wikipedias by sample of articles/Source code --MarsRover 11:07, 2 December 2007 (UTC)

Thanks for the suggestions! I'll try them out in the next few days! For my personal enlightenment (my programming experience is small): what do you mean by "cacheing English pages", and how does that optimize the program? --Smeira 16:11, 3 December 2007 (UTC)

For any language besides English, you must first get the English page, then get the interwiki link, then get the local page (since your "list of articles" is in English). So, if you process a bunch of languages at once, it now gets each English page only once and reuses it for the other languages. It should make it go twice as fast. --MarsRover 19:43, 3 December 2007 (UTC)
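The caching idea described above can be sketched like this. It is not the code from the actual script: `fetch_page` stands for whatever function downloads a page (a hypothetical name), and `count_requests` simply counts page requests under the assumption that every non-English language needs one English page and one local page per article.

```python
def make_cached_fetcher(fetch_page):
    """Wrap fetch_page(lang, title) so each (lang, title) is downloaded only once."""
    cache = {}

    def fetch(lang, title):
        key = (lang, title)
        if key not in cache:
            # First request for this page: download it and remember the result.
            cache[key] = fetch_page(lang, title)
        return cache[key]

    return fetch


def count_requests(num_articles, num_languages, cached):
    """Page requests needed to process num_languages non-English Wikipedias."""
    local_pages = num_articles * num_languages
    # Without a cache the English page is fetched again for every language.
    english_pages = num_articles if cached else num_articles * num_languages
    return english_pages + local_pages
```

With 1092 articles and 10 languages, the request count drops from 21840 to 12012, which approaches half as the number of languages grows.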

Is it OK if we update the table with the new code? (Note that it uses character weighting.) The list of articles has changed a little, but I suppose it's best to keep the old one until the list is somewhat stable. Even so, it would be good to know what happens with the old list of articles, since a month has passed since the last update. --Meldor 21:35, 29 December 2007 (UTC)

I can do that. Do you mean using option 2 weights? I wasn't able to find weights for all 150+ languages in the current table, so I would use a weight of one for those missing a weight. But I don't think the weight will be a big factor for tiny wikis. I would use the current list of articles, but I see no reason why the list couldn't be updated in the future. --MarsRover 20:30, 1 January 2008 (UTC)
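The fallback to a weight of one could look like the sketch below. It assumes the weight is a per-language multiplier applied to an article's character count; the function name and the weight values in `example_weights` are made up for illustration and are not the "option 2" weights.

```python
DEFAULT_WEIGHT = 1.0  # used for any language that has no weight listed


def weighted_size(char_count, lang, weights, default=DEFAULT_WEIGHT):
    """Scale an article's character count by its language weight."""
    return char_count * weights.get(lang, default)


# Made-up values, for illustration only.
example_weights = {"en": 1.0, "zh": 3.0}
```

Here `weighted_size(1000, "vo", example_weights)` falls back to the default and leaves the count unchanged.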

Good ranking

It promotes quality, as it involves a community in basic articles.