Top Ten Wikipedias

From Meta, a Wikimedia project coordination wiki

In 2008, it was decided that the Top Ten Wikipedias are defined by usage rather than article count.


The following is a discussion about rearranging the top ten Wikipedias that are displayed on the main Wikipedia portal (http://www.wikipedia.org).

The objective is to discuss and define a clear set of criteria for inclusion in the top ten Wikipedias around the logo. Please keep in mind that a top-ten approach should be robust and scalable (for example, expanding the number of Wikipedias around the logo to include a few more is not scalable, since more and more Wikipedia editions would have to be added as they grow). An up-to-date top ten should be consistently maintained.

Also note that most proposals in the table below are followed by a version with a threshold of 100K+ articles, to avoid some strange results.

Please, before commenting, read the discussion on Talk:Www.wikipedia.org template/2008#rethinking the top ten

Note: THIS IS NOT A VOTE. Feel free to add your comments, but avoid subscribing to a proposal without adding any new thoughts to it, since that would make this page less readable.

Table

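Up, Down and Steady mark each entry's position relative to the article-count ranking in the first column. Columns labelled (100K+) repeat the preceding criterion, restricted to Wikipedias with more than 100,000 articles.

# | article count | most speakers | most speakers (100K+) | lang rank × article rank | most visited | most visited (100K+) | 1-stub ratio | 1-stub ratio (100K+) | articles / speakers | articles / speakers (100K+) | articles / users | articles / users (100K+) | compressed DB size
1 | English | Chinese (Up) | Chinese (Up) | English (Steady) | English (Steady) | English (Steady) | Afar (Up) | English (Steady) | Volapük (Up) | Volapük (Up) | Volapük (Up) | Volapük (Up) | English (Steady)
2 | German | English (Down) | English (Down) | Chinese (Up) | Spanish (Up) | Spanish (Up) | Ripuarian (Up) | French (Up) | Ido (Up) | Norwegian (Up) | Newar / Nepal Bhasa (Up) | Swedish (Up) | German (Steady)
3 | French | Hindi (Up) | Spanish (Up) | German (Down) | French (Steady) | French (Steady) | Kanuri (Up) | Portuguese (Up) | Aragonese (Up) | Swedish (Up) | Bishnupriya Manipuri (Up) | Polish (Up) | French (Steady)
4 | Polish | Spanish (Up) | Russian (Up) | French (Down) | Japanese (Up) | Japanese (Up) | Chamorro (Up) | Russian (Up) | Breton (Up) | Finnish (Up) | Cebuano (Up) | Dutch (Up) | Japanese (Up)
5 | Japanese | Russian (Up) | Portuguese (Up) | Spanish (Up) | German (Down) | German (Down) | Sesotho (Up) | Chinese (Up) | Icelandic (Up) | Dutch (Up) | Tarantino (Up) | Japanese (Steady) | Italian (Up)
6 | Dutch | Arabic (Up) | French (Down) | Russian (Up) | Polish (Down) | Polish (Down) | Luganda (Up) | Italian (Up) | Norwegian (Nynorsk) (Up) | Polish (Down) | Piedmontese (Up) | Russian (Up) | Spanish (Up)
7 | Italian | Portuguese (Up) | German (Down) | Portuguese (Up) | Portuguese (Up) | Portuguese (Up) | Abkhazian (Up) | German (Down) | Luxembourgish (Up) | Italian (Steady) | Lombard (Up) | Norwegian (Up) | Polish (Down)
8 | Portuguese | Bengali (Up) | Japanese (Down) | Japanese (Down) | Russian (Up) | Russian (Up) | Muscogee (Up) | Spanish (Up) | Bishnupriya Manipuri (Up) | German (Down) | Ido (Up) | Romanian (Up) | Dutch (Down)
9 | Spanish | French (Down) | Italian (Down) | Polish (Down) | Chinese (Up) | Chinese (Up) | Ndonga (Up) | Romanian (Up) | Esperanto (Up) | Romanian (Up) | Walloon (Up) | Finnish (Up) | Russian (Up)
10 | Russian | Indonesian (Up) | Polish (Down) | Italian (Down) | Arabic (Up) | Italian (Down) | Hiri Motu (Up) | Finnish (Up) | Estonian (Up) | Japanese (Down) | Haitian (Up) | French (Down) | Portuguese (Down)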

Thoughts

Which criteria to use? How to measure them?

  • Add your comments below (please include it in the correct section, or create a new one if it doesn't fit any of the existing ones)

Size

number of articles

  • pros:
    1. I believe the ten Wikipedia editions with the largest number of articles deserve their placement around the puzzle ball, because they've worked hard to generate all these articles. If anything, the fact that most of the Top 10 languages are from Europe indicates that our non-Latin editions have a lot of work to do. (user:Mxn on Talk:Www.wikipedia.org template#Russian Wikipedia will have 100K articles soon)
    2. None of the Wikipedias in question really consists "mainly of short articles", and in none of them can you observe an excessive creation of poor quality stubs. There is, however, a tendency to bot-generate cross-Wikipedia town stubs, but this happens in the Dutch Wikipedia as much as in the Polish one, so the effect may level out.
    3. This is the easiest way to measure size, and seems to be working right now (don't fix what isn't broken). A system based on traffic (usefulness to visitors) is a better way to go though.
  • cons:
    1. It's not so meaningful given the fact that some Wikipedias contain mainly short articles, or even many automatically generated articles, while other Wikipedias contain less but much longer articles, all handwritten. (Erik Zachte, on http://stats.wikimedia.org/EN/Sitemap.htm)
    2. It makes it look like a competition for which language has the most articles, and this would definitely influence the creation of poor quality stubs to increase those article counts. (user:Lenev on Talk:Www.wikipedia.org template#Russian Wikipedia will have 100K articles soon)
    3. This factor paints a false picture of the wikis, because inflating the number of articles takes very little work: one only has to program a stub bot.
    4. It does not further the purpose of the Wikimedia Foundation; the purpose is to bring information to people. (GerardM on his blog)
    5. Every page in the main namespace other than a redirect counts as an article as long as it contains any sort of internal link, garbage included. 1) Vandalized pages, disambiguation pages and, on some wikis, plain lists of articles are counted as articles. 2) Links leading out of the main namespace are counted as regular links. Given these two points, "articles" is still not a very well defined set. Mashiah Davidson 21:51, 5 July 2008 (UTC)

size of database

  • pros:
    1. Database dump size has the advantage of measuring the amount of actual text, independent of whether it is distributed over a few long articles or many short ones. As pointed out below, measuring by raw database size has some drawbacks with respect to character encoding (UTF-8) and also with respect to the ratio between articles and talk pages. The latter problem can be addressed by measuring the size of the database dump pages-articles.xml, which doesn't contain the talk pages. Further, if you measure by the size of the compressed database dump pages-articles.xml.bz2, you remove the impact of the character encoding. This also shrinks the impact of machine-generated articles, such as those on the Volapük Wikipedia, because those articles follow a pattern and are very easy to compress. The result, as shown in one of the columns in the table above, is just some minimal adjustments to the current ranking. The top three are left in place. Large languages such as Japanese, Spanish and Russian gain. The losers are Polish, Dutch and Swedish. Most users would find those adjustments fair and reasonable. The preceding unsigned comment was added by LA2 (talk • contribs) 17:56, 24 March 2008.
    2. Following up on @MariusM's comment: this can be easily understood as well. The compression ratio reached with bz2 means that, "as far as we can recognize", the dump contains no more than this number of bytes of independent data/useful info. Mashiah Davidson 21:57, 5 July 2008 (UTC)
  • cons:
    1. Database size depends on coding system (unicode characters take several bytes) and on how much meaning can be conveyed by one character (e.g. Chinese characters are whole words). (Erik Zachte, on http://stats.wikimedia.org/EN/Sitemap.htm)
    2. For a person who just discovered Wikipedia, this measure is too difficult to understand. Keep things simple for the average person, while we all know that WIKIPEDIA SECRET CABAL is deciding everything.--MariusM 23:28, 5 April 2008 (UTC)
  • comments:
    1. @MariusM: I think it is more important for the criteria used to be exact and fair, than to be easily understood. And I think LA2 has a good point on this. --Waldir 20:22, 6 April 2008 (UTC)
    2. Would this be with or without counting the images? Wikis like es:Wiki have no images any more, so they could be penalized if images are counted. --Ecelan 22:44, 5 July 2008 (UTC)
    3. @Ecelan: LA2 referred to pages-articles.xml and pages-articles.xml.bz2, so I believe it's safe to assume that the images would not be counted. --Waldir 00:45, 6 July 2008 (UTC)
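
For illustration, LA2's point that pattern-based articles compress very well can be sketched in Python. This is only a rough sketch, not the method used for the table column: the stub template and the random "varied" text are made-up stand-ins (random words overstate how incompressible real prose is), and a real measurement would run on the actual pages-articles.xml dump.

```python
import bz2
import random


def compressed_size(text: str) -> int:
    """Return the size in bytes of the bz2-compressed UTF-8 encoding of text."""
    return len(bz2.compress(text.encode("utf-8")))


# Hypothetical bot-style stubs: one template, only the name and numbers change.
template_stubs = "\n".join(
    f"Town{i} is a municipality in Region{i % 7} with {1000 + i} inhabitants."
    for i in range(500)
)

# Hypothetical stand-in for varied text: random "words" (seeded for repeatability).
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
varied_text = " ".join(
    "".join(rng.choice(letters) for _ in range(rng.randint(2, 9)))
    for _ in range(len(template_stubs) // 6)
)

for name, text in [("template stubs", template_stubs), ("varied text", varied_text)]:
    raw = len(text.encode("utf-8"))
    packed = compressed_size(text)
    print(f"{name}: raw={raw} bytes, bz2={packed} bytes, ratio={packed / raw:.2f}")
```

The templated text should show a much smaller compressed/raw ratio than the varied text, which is the effect LA2 describes for machine-generated articles.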

number of words

  • pros:
    1. This method is, at least, more realistic than the amount of articles, since it does measure the amount of information, Poco a poco 23:04, 5 July 2008 (UTC)
  • cons:
    1. The length and number of words used per equivalent sentence is not invariant between languages (besides the fact that several languages lack the concept of "words"), so this measure should at least be adjusted by some empirical figure (wherever to get this from...)
    2. In some languages it is difficult to count words automatically, so this is not feasible. Not every language's orthography separates words with spaces. --Aphaia 13:16, 13 April 2008 (UTC)
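
A small Python sketch of the problem raised in the cons above. It assumes a naive counter that splits on whitespace (the function name is hypothetical), and the Chinese sentence is only a rough equivalent of the English one.

```python
def naive_word_count(text: str) -> int:
    """Count words by splitting on whitespace, as a simple script might."""
    return len(text.split())


english = "Wikipedia is a free encyclopedia"
chinese = "维基百科是一部自由的百科全书"  # roughly the same sentence, written without spaces

print(naive_word_count(english))  # 5
print(naive_word_count(chinese))  # 1
```

The same information is counted as five words in one language and one word in the other, so a raw word count would need a per-language tokenizer or a correction factor before it could be compared across Wikipedias.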

number of internal links

  • pros:
  • cons:
    1. A very high ratio of links per word leads to a lower quality of the whole text, because the important links are overlooked. (By the way, 1.0 isn't even the maximum for this ratio.)
    2. This is useful to keep the reader hooked to wikipedia and to make him jump often to more pages, but this does not have much to do with the importance of a wiki, Poco a poco 23:05, 5 July 2008 (UTC)

number of articles with a minimum size of 3000 (or any other number) bytes

  • pros:
    1. All the bot-created articles and stubs are filtered out. 217.123.242.31 18:33, 24 March 2008 (UTC) (Gebruiker:Rubietje88 on the Dutch Wikipedia)
  • cons:
    1. You could split long articles into several smaller ones, each just a few bytes over the limit. --84.147.109.24 11:22, 7 April 2008 (UTC)
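
Both the pro and the con above can be sketched in a few lines of Python. The article sizes are invented placeholders and the function names are hypothetical; the 3000-byte threshold is the one from the section title.

```python
MIN_BYTES = 3000  # threshold from the section title; any other value works


def count_substantial(article_sizes, min_bytes=MIN_BYTES):
    """Count the articles whose size is at least min_bytes."""
    return sum(1 for size in article_sizes if size >= min_bytes)


def split_to_game(size, min_bytes=MIN_BYTES):
    """Split one article into as many pieces of at least min_bytes as possible."""
    pieces = size // min_bytes
    if pieces == 0:
        return [size]  # too small to split into qualifying pieces
    base, extra = divmod(size, pieces)
    return [base + (1 if i < extra else 0) for i in range(pieces)]


# Made-up article sizes in bytes, for illustration only.
sizes = [120, 450, 2999, 3000, 12500, 48000]
print(count_substantial(sizes))  # 3

# The same text, re-cut into pieces that each just clear the limit.
gamed = [piece for size in sizes for piece in split_to_game(size)]
print(count_substantial(gamed))  # 21
```

The amount of text is identical in both cases, yet the count rises from 3 to 21, which is the manipulation the anonymous comment warns about.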

number and/or length of articles in a standard list (e.g. List of articles)

  • pros:
    1. A well chosen list gets rid of topic bias (i.e. if in a Wikipedia there is a bias towards certain topics -- lots of articles on them, much fewer on others -- then this Wikipedia gets a lower score).
    2. If length is taken into account, then stubs (bot-created or not, there's no real difference) are all filtered out and only longer articles (presumably containing more and richer information, from people and/or bots) contribute to the final evaluation score (see List of Wikipedias by sample of articles for a possible way of doing that -- there are many other possibilities of course).
  • cons:
    1. It is hard to define a really good list; even though any sensible list is probably good enough for a first approximation, cultural biases seem to be almost unavoidable. (One might avoid that by doing several lists with different biases and comparing results.)
    2. The above problems concerning length (some languages have longer words than others and get an unfair advantage; others, like Chinese, have often one-character words and thus get unfair handicaps) apply (but also the solutions; cf. the discussion on this topic at the Talk:List of Wikipedias by sample of articles, where "correcting factors" or "weights" were suggested as a way of dealing with this problem.)
    3. Work done on topics outside of the list is not taken into account. In other words, working with this list classifies Wikipedias on the basis of a sample of their results, not on the basis of their total results. (This fact may have bad consequences, but also good consequences.)
      --Smeira 17:44, 26 March 2008

Quality

average article length

  • pros:
  • cons:
    1. How would we measure the quality of a language edition without bias? The only statistical way I can think of is measuring the length of the articles in each edition, but that would not be a good method because an article that is long isn't necessarily good quality. (user:216.106.103.3 on Talk:Www.wikipedia.org template#Russian Wikipedia will have 100K articles soon)
    2. Imho articles shouldn't be too long, considering the special medium we're building Wikipedia with. Web users and the technical constraints of the Internet tend to push us to write "compact" articles with many links, rather than long pages to scroll through (and to load). So for this particular medium, length could be considered a disadvantage. --83.190.116.206 21:37, 23 March 2008 (UTC)

non-stub ratio

  • pros:
    1. limited to the 100k+ wikipedias, seems to be a good measure of quality. i'd still keep the wikipedias sorted by number of articles for simplicity, though. 213.140.22.65 14:28, 24 March 2008 (UTC)
  • cons:
    1. I believe that the meaning of stub could change from one wiki to another. This is only locally significant. Zil 22:53, 25 March 2008 (UTC)
    2. Even on top ten wikis, people hesitate to tag or untag articles as stubs. Lots of very poor articles are not marked as stubs, and lots of good articles are marked as stubs. Mashiah Davidson 21:42, 5 July 2008 (UTC)

edits / article

  • pros:
  • cons:
    1. The ratio of edits/article is also sensitive to bot activity: corrections, automatic expansions of articles, addition of further information etc. (as those necessary in the Volapük Wikipedia) increase the edits/article ratio.
    2. Vandalism and vandalism reversal are also counted.
    3. Especially in small Wikipedias, the users often translate from big Wikipedias (like the English). This is not the most elegant way to create a new article, but absolutely legitimate and welcome. But a translated article has as a result only a few edits, in spite of a high quality of the text itself.--Ziko-W 14:58, 12 April 2008 (UTC)

Statistics based on standard list of articles (see under "Size" above)

  • pros:
    1. Using a well-chosen standard list of articles as the basis for a measure of quality (based on length, edits/article, etc. -- cf. List of Wikipedias by sample of articles for one implementation) -- would be a way of actually measuring how much like an encyclopedia the given Wikipedia is, in the sense of having articles on topics that everybody agrees are encyclopedic. A very large Wikipedia with lots of long articles but a low score on this well-chosen list would apparently not really be the kind of reference work that we'd call an encyclopedia.
  • cons:
    1. As mentioned above, a really bias-free list is quite hard to make.

Referenced articles

  • pros:
    1. I think a good measure for ordering the top 10 Wikipedias is this one: the percentage of all (local) articles that are referenced (basically, that use <ref> tags). This tells us how accurate a Wikipedia is, and accuracy is (I think) the best scale for measuring the quality of Wikipedia (quality, not quantity!) –QWerk 14:44, 5 July 2008 (UTC)
  • cons:
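
As a rough illustration of QWerk's proposal, the share of referenced articles could be computed along these lines. This is a minimal Python sketch under stated assumptions: the page titles and wikitext are invented, the function name is hypothetical, and the regular expression is only one simple way of spotting <ref> tags (it deliberately ignores a bare <references/> list).

```python
import re

# Matches "<ref>", "<ref name=...>" and "<ref name=... />", but not "<references/>".
REF_TAG = re.compile(r"<ref[\s>/]", re.IGNORECASE)


def referenced_share(articles):
    """Return the fraction of articles whose wikitext contains a <ref> tag."""
    if not articles:
        return 0.0
    referenced = sum(1 for wikitext in articles.values() if REF_TAG.search(wikitext))
    return referenced / len(articles)


# Invented sample pages, for illustration only.
sample = {
    "Page A": "Some text.<ref>A source</ref>",
    "Page B": "Some text without sources.",
    "Page C": 'Some text.<ref name="x" />\n<references/>',
    "Page D": "== Notes ==\n<references/>",
}
print(referenced_share(sample))  # 0.5
```

Note that this only detects the presence of a tag; it says nothing about whether the cited source is reliable or even relevant.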

Usefulness

most visited wikipedias

some data for reference:
- Alexa (shows controversial figures)
- stats.wikimedia.org (by Erik Zachte)
- infodisiac (also by Erik Zachte)
- stats.grok.se by en:User:Henrik (see the rough summary by Robert Rohde)
- Wikimedia page traffic toplists (compare search counts)

  • pros:
    1. I'd use the most visited ones with a 100K+ threshold. As the table above shows, the practical result is almost exactly what people have been asking for (it adds Russian and Chinese and drops Dutch and Swedish; see the discussion on Talk:Www.wikipedia.org template), but most importantly I think it fulfills the function of a portal, which is to direct people to the content they want as quickly and efficiently as possible. The portal is built for the visitors, not for Wikipedians, so I think it is more important to be useful to visitors than to reward the efforts of the Wikipedians (not that they're not worth rewarding, but somewhere else, quality.wikimedia.org perhaps, would be a better place). Quality could be measured by us, using several approaches, but I think those who should ultimately decide about it are the people who use the content: the readers. Waldir 18:09, 15 January 2008 (UTC)
    2. This is the only alternative, IMHO, to the number-of-articles options (in fact, size of database, could also do the trick). www.wikipedia.org should be used so that readers can find their language version easier. This option does that. --MiCkEdb 19:51, 23 March 2008 (UTC)
    3. The main page is where I connect to the Spanish and French WPs from. It helps a lot to have both links together. If one or both are removed from the main page, I will reorder the favorites on my computer so that I can reach both quickly. I use some other WPs too (oc, frp, bz, ca), but not as often, so their being on the main page wouldn't help me very much. That said, I think main pages are for what consumers find most important, so traffic is the real thing: traffic would tell us what's important for more people in aggregate. Maybe a lot of people are using the Arabic or Farsi or Bengali WP, but their use isn't being taken into account as a measure of who deserves to be on the main page. B25es (from es) 22:22, 23 March 2008 (UTC)
    4. Waldir is right. If they are the most visited ones, it is because they are also the most useful for readers. I am not totally sure about using the threshold, though (if the Arabic Wikipedia is used even more than the Italian one, in spite of having fewer articles, why should we remove it from the portal?). --Racso 22:33, 23 March 2008 (UTC)
    5. I also support this idea. The main reason for .org portal is helping people to find the wikipedia they are looking for. --FAR 12:27, 24 March 2008 (UTC)
    6. I think this must be THE way to measure how important a wiki is, because without people we are nothing; the more the better. Those Wikipedias attracting more people should be rewarded with a better ranking. This is the way the capitalist world works. I would only create a new language version of a commercial site if there were potentially enough users to view it and do business. So, if I have to narrow it down to the top ten, I would take those which reach the most readers, Poco a poco 23:12, 5 July 2008 (UTC)
    7. The natural factor: ranking Wikipedias according to their Alexa traffic places them where they deserve to be. The Customer is Always Right.
    8. The front page is above all a navigational tool. Traffic is the most natural first criterion. If the difference is too small to be significant in marginal cases, secondary factors such as size and depth can be taken into account. Since the design of the front page is an important PR matter, we should apply our good sense and not rely entirely on dead numbers. It would also be a Good Thing to reserve a few places for "winners" in terms of quality, community or other exotic criteria. Hillgentleman 21:42, 13 April 2008 (UTC)
    9. When people start to fight to get a higher rating measured in number of visitors, they actually work towards the aim of the Wikimedia Foundation. We want to bring information to people. The more readers we have the more we are achieving our goal. The growth of the number of readers can be stimulated in many ways, the number of articles, localisation, writing the most requested missing articles, improving the most read articles, ensuring good coverage of those subjects that provide background information to the news ... All these actions further our aim. As to what numbers to use, it should be our own traffic numbers as they are the only ones that are relevant.
  • cons:
    1. I don't think that the absolute number of visitors to the different WPs should be a criterion for sorting the language editions, as absolute numbers won't represent the "importance" of a Wikipedia language edition. As already said, many of the languages with lots of speakers have rather poor Wikipedia editions. If the Chinese WP is visited by, let's say, 200,000,000 readers, this will represent 1/4 of the total number of speakers. Well, that's a lot, but if e.g. the Spanish WP is also visited by 200,000,000 readers, this means that more than 2 out of 3 Spanish speakers use Wikipedia. Nevertheless, both WPs would have the same rank, disregarding a) that es.wp has more than twice the number of articles of zh.wp, b) that the articles on es.wp are (afaict) of higher quality than those of zh.wp and c) that es.wp is clearly more "important" for Spanish-speaking people than zh.wp is for Chinese-speaking ones. 91.13.220.249 10:20, 24 March 2008 (UTC)
      I don't think the example above, even if all of it is factual, "clearly" shows that es.wp is more important for Spanish users. For those 200 million people of each language, es.wp and zh.wp are of the same importance. And dividing by the whole population is not necessarily the correct choice, as other factors, such as the illiteracy rate, access to the Internet, access to Wikipedia web sites (due to the national firewall in China), and so on, are not considered. It is probably more difficult for 200 million Chinese speakers to access zh.wp than for 200 million Spanish-speaking people to access es.wp. --Mongol 17:11, 16 July 2008 (UTC)
    2. The current measure (article count) can be manipulated by creating articles with a bot, as we have seen with the Volapük Wikipedia, or by creating many small stubs instead of writing longer articles, something the Swedish Wikipedia is sometimes accused of. The measure suggested here (visitor count) can be manipulated by launching a distributed denial of service (DDoS) attack against your favorite language of Wikipedia, to make it appear as having many visitors. That's not something we want to encourage. --LA2 01:30, 18 April 2008 (UTC)
      LA2, Using a bot to create articles is constructive. Wasting precious bandwidth is destructive, and worthy of a ban. They are not analogous. Hillgentleman 14:21, 19 April 2008 (UTC)
  • comments:
    1. Actually, es: is one of the most visited ones (2nd, I think), while the Chinese one has had some problems due to competition from Baidu Baike and restrictions in mainland China. But the question is: if 100% of the speakers of, let's say, Luxembourgish are interested in their Wikipedia (although they are few people), should they take precedence over Chinese, with many more people interested? Ranking by the relative interest of linguistic communities favours small languages over widely spoken (and probably more useful in the portal) languages. --FAR 12:27, 24 March 2008 (UTC)
    2. I agree that 'the most benefit to the highest number' is a noble ideal, and that this would lead to a preferential treatment for Wikipedias having more potential users (i.e. languages with more speakers). So be it. I can also sympathize with the feelings of Luxemburgish (or Volapük) speakers; maybe one could satisfy them by having a link to "Wikipedias in languages with fewer than (100M? 10M? 1M?) speakers" somewhere around the ball, leading to a page with all the other languages? In this way speakers of smaller languages could also quickly find their way to their favorite Wikipedia. --Smeira 17:02, 26 March 2008 (UTC)
    3. Sorry, Alexa seems not to be very popular in Germany. If you check the main pages of the different projects with [1], you get 16215810 hits for the French main page but 37546224 for the German one, which is more than twice as many. The Spanish Wikipedia has even fewer: 13100098 hits on the main page. --80.133.139.29 00:27, 27 March 2008 (UTC)
    4. I think that it is the best alternative: if the most visited is the most visible, we are going to help more users. However, we shouldn't rely on a single site as the source (e.g. Alexa), but on a combination of several. Other parameters can also be used to a lesser extent. Can the server detect the HTTP header "Accept-Language" in order to modify the order? --Eloy 03:07, 28 March 2008 (UTC)
    5. Is the list in the table above based on Alexa or on Domas' statistics? Alexa is useless. The Alexa toolbar only works for MSIE and is mostly used inside the United States, or so I'm told. Domas' page view statistics would be the proper numbers to use. We could discuss if non-article namespaces should be filtered out. --LA2 01:45, 18 April 2008 (UTC)
    6. @80.133.139.29: The main page traffic is not a very good measure. Many articles are visited through Google or other search engines. Best regards, Alpertron 18:47, 25 June 2008 (UTC)
    7. @Alpertron: I believe alexa doesn't count only the main page for the traffic, but rather all pages from that subdomain. --Waldir 22:13, 25 June 2008 (UTC)
    8. @Waldir: My comment was directed to 80.133.139.29 because he posted the main page hit numbers taken from [2]. They do not represent how the entire site is reached. In my opinion the vast majority of Wikipedia pages are reached through Google so the main page count is irrelevant. Best regards, Alpertron 13:54, 1 July 2008 (UTC)
    9. @Alpertron: I'm sorry, I don't know what I was thinking :) Thanks for correcting me. Of course I agree with you, it makes perfect sense (I must have been sleepy or something, lol). Cheers, Waldir 00:18, 5 July 2008 (UTC)
      It's easy to fake the figures if we count only the number of visitors! We can easily use any kind of bots and open-proxy bots to inflate the stats. Using a bot to create new stubs isn't bad: remember that a normal encyclopedia uses the number of articles (as a sign of which is better), and Wikipedia stubs are nothing other than normal encyclopedia articles (they have almost the same length, or are even longer).
      In my opinion we ought to compute an INDICATOR that includes both the total pages and the total visitors factors and, as a result, shows us real statistics about which Wikipedia ranks better, e.g. 'articles per visitor' or something like that! MonteChristof 13:51, 9 July 2008 (UTC)
    10. @MonteChristof: The idea of an encyclopedia is to have as much information as possible about different subjects. If I read an article and find only one paragraph, the encyclopedia fails, because the information we need must be retrieved from another resource. So the Wikipedias with a lot of ministubs are not useful to their readers. This means that the number of articles is not an important variable. Best regards, Alpertron 17:29, 11 July 2008 (UTC)
    11. This parameter can be used in conjunction with the current criteria (number of articles). We can use a threshold of visits to prove the usefulness of a wikipedia. Or we could weight both variables the same (number of articles, number of visits) --MarsRover 05:10, 12 July 2008 (UTC)
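
MarsRover's two suggestions (a visit threshold, or equal weights for article count and visits) could be sketched as follows. This is a hedged Python sketch, not an agreed formula: the wiki names and all figures are invented placeholders, the function names are hypothetical, and combining rank positions rather than raw numbers is just one possible choice.

```python
def rank_positions(scores):
    """Map each key to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: position for position, key in enumerate(ordered, start=1)}


def combined_ranking(articles, visits, w_articles=0.5, w_visits=0.5, min_visits=0):
    """Order wikis by a weighted mix of article rank and visit rank.

    Wikis below min_visits are dropped; ties are broken alphabetically.
    Both dicts are assumed to cover the same wikis.
    """
    article_rank = rank_positions(articles)
    visit_rank = rank_positions(visits)
    eligible = [k for k in articles if k in visits and visits[k] >= min_visits]
    return sorted(
        eligible,
        key=lambda k: (w_articles * article_rank[k] + w_visits * visit_rank[k], k),
    )


# Invented figures for four placeholder wikis.
articles = {"wiki_a": 900, "wiki_b": 500, "wiki_c": 300, "wiki_d": 100}
visits = {"wiki_a": 50, "wiki_b": 10, "wiki_c": 80, "wiki_d": 40}

print(combined_ranking(articles, visits))
# ['wiki_a', 'wiki_c', 'wiki_b', 'wiki_d']
print(combined_ranking(articles, visits, min_visits=20))
# ['wiki_a', 'wiki_c', 'wiki_d']
```

Setting w_visits=1.0 and w_articles=0.0 reduces this to a pure traffic ranking, so the same sketch covers the "most visited with a threshold" variant as well.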

Kudos to whoever did that! THAT IS how it should have been done in the first place. This way promotes quality (over quantity), making editors more likely to think about usability than about overall "article number" standings. I hate using one of the top 10 Wiktionaries: it has an excellent number of articles, most of which are bot-generated stubs. And THAT quickly teaches users not to click "blue links" even when they see one. Vitall 23:27, 12 August 2008 (UTC)
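
On Eloy's question in the comments above: a server can read the Accept-Language request header, and a reordering based on it might look roughly like this. It is a simplified Python sketch with hypothetical function names; it handles only the common "tag;q=value" form, and the default order is the article-count column of the table above, expressed as language codes.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        tag = tag.strip().lower()
        params = params.strip()
        quality = 1.0  # default weight when no q-value is given
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0 and tag != "*":
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def reorder_portal(top_ten, header):
    """Move the reader's preferred languages to the front, keep the rest in order."""
    preferred = []
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]  # "pt-br" -> "pt"
        if primary in top_ten and primary not in preferred:
            preferred.append(primary)
    return preferred + [code for code in top_ten if code not in preferred]


default_order = ["en", "de", "fr", "pl", "ja", "nl", "it", "pt", "es", "ru"]
print(reorder_portal(default_order, "pt-BR,pt;q=0.9,en;q=0.8,de;q=0.5"))
# ['pt', 'en', 'de', 'fr', 'pl', 'ja', 'nl', 'it', 'es', 'ru']
```

This only reorders languages that are already in the top ten; whether a language outside the list should be promoted for such a reader is a separate design decision.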

most spoken languages

  • pros:
    1. I think we should have the languages with the greatest number of speakers in those ten spots. So, even if Finnish and Esperanto make it to 100,000, I think we'd be well justified in not including them in the ten featured on the default page. But keeping Russian and Chinese off simply because they arrived late is absurd. If anything, Wikipedia is far more valuable to these countries (and their hundreds of millions of speakers) than otherwise groovy languages like Dutch. Using this criterion (and the en:List of languages by number of speakers) we would have the following list of languages on the front page: Chinese, English, Spanish, Portuguese, Russian, Japanese, French, German, Italian, Polish. Dutch and Swedish (beautiful languages though they are, and likewise accomplished in their article counts) would be left off. I think this is both practical (as far as our aims of spreading information) and reasonably fair. (User:Perceval on Talk:Www.wikipedia.org template#Chinese Wikipedia will have 100k acticles)
    2. I would use "number of active speakers" instead of "number of native speakers" for planned languages, as the number of native speakers is quite an irrelevant figure in their case. (User:Marcos on Talk:Www.wikipedia.org template#rethinking the top ten)
  • cons:
    1. Many of the most-spoken languages happen to have very small Wikipedia editions, so linking to them so prominently would be unfair to the Wikipedia editions that represent fewer people but have more articles. (User:Mxn on Talk:Www.wikipedia.org template#Russian Wikipedia will have 100K articles soon)
    2. en:List of languages by number of speakers = several conflicting (and obsolete) sources. Bourrichon 20:46, 23 March 2008 (UTC)
    3. Nonsense; if there are millions of Chinese and Russian speakers who aren't active on Wikipedia, why should we prefer these users to other, more active user communities? Furthermore, if it's just a matter of being late, sooner or later these larger communities will be listed in the top ten (if they're interested in Wikipedia, of course). --83.190.116.206 21:30, 23 March 2008 (UTC)
    4. I tend to think number of actual users is more important than number of potential users (= number of speakers), since the idea here is current usefulness. Languages with lots of speakers who for some reason don't use Wikipedia so often shouldn't be preferred over languages with fewer speakers who do generate a lot of traffic. Of course, as computers and the internet become a daily reality all over the world, the two criteria should eventually align and yield the same languages. So if we start by listing the languages that have most traffic (actual users) and change it as other languages increase their share, this won't be unfair. --Smeira 17:08, 26 March 2008 (UTC)
  • comments
    1. Most spoken language shows the potential of wikipedia, not the quality of it. Vlsergey 17:21, 31 March 2008 (UTC)
    2. Vlsergey, please note that this section lists criteria for measuring usefulness, not quality. Thus it is normal, and expected, that the criteria listed in this section do not describe quality properly. Waldir 10:27, 1 April 2008 (UTC)
    3. Most spoken language is not necessarily the most-used language online, as access to Internet and the protection of free speech vary widely between nations. There is plenty of population in mainland China and in India, but many websites are blocked in China and many of the poorest Indians cannot afford computers and Internet. Meanwhile Japan, a group of tiny islands, is represented quite well online. --Carlb 20:37, 4 April 2008 (UTC)
    4. Your statement would have been true if you had said it a few days earlier. China unblocked Wikipedia yesterday.[3] Also, your rebuttal has 2 points: access to computers/the Internet and Internet censorship. Both are only part of a w:straw man strategy and not the main points of this suggestion. OhanaUnited Talk page 04:42, 5 April 2008 (UTC)
    5. It is often difficult to tell how many speakers a language has, e.g. Frisian or Sorbian. --Ziko-W 15:01, 12 April 2008 (UTC)
    6. @Ziko: The objective here was to deal with the top ten wikipedias (since the other ones in the sections below the globe are not sorted by any criteria but alphabetical order) and the top wikipedias would arguably be the versions in languages about which such ambiguity in the speaker count would not exist. Your comment is relevant, though, if later it is decided to make a redesign of the page -- for instance, a tag cloud version, as has been suggested before, in which all wikipedias would have to have a score associated. --Waldir 00:17, 13 April 2008 (UTC)
      You're right, I didn't have the top-ten-ness in mind. Nevertheless, it is not that easy to find out how many speakers a language has, and there are a lot of side questions. --Ziko-W 17:25, 21 April 2008 (UTC)
    7. A tag cloud version would be too confusing and would not have much utility. This version with top ten most spoken languages is the best way to go, appeals to the greatest number of people to contribute to a project. Cirt 09:17, 13 April 2008 (UTC)
    8. Ranking by number of speakers was my first preference but then I noted that Bengali and Hindi were in the top 10. Here are my unscientific observations:
      1. Just a guess but I'd say 10 to 20% or more of en.wikipedia's content comes from the subcontinent from what I can tell. If you add in the diaspora, that number gets even larger.
      2. Our Indian language projects tend to be small -- the largest is ranked 30-something. Hindi is 56th, Bengali 57th
      3. I clean up a lot of cross-wiki spam. Even spam from Bollywood sites about movies made in Tamil or some other language is not spammed to the respective projects.
      I am not from the subcontinent nor am I a linguist. My gut hunch is that when it comes to Wikipedia and the Internet in general, there may be a form of diglossia in play here: a person who speaks Hindi all day expects to use English when she/he comes home, boots up the computer and goes to Wikipedia. From looking at List of Wikipedias by speakers per article, I suspect this is true of other countries where English is a common second language among those with regular access to a computer. So I am resistant to having Bengali and Hindi in our top 10; on the other hand, others might take my observation and say that's exactly why they should be in the top 10.
      It's too bad we can't rank Wikipedias based on page views by language of the viewer. From our raw server data, we could get the country of each page view based on the IP. It would be hard to sort out languages within a country, but you could perhaps assign weights to each language within a country based on census data.
      Most of these approaches are valid in their own way. Just, please, keep Volapük out of the top 10 and I can otherwise live with the community's consensus.--A. B. (talk) 18:47, 19 April 2008 (UTC)
    9. Top ten by highest ratio of speakers per article: this choice will strongly encourage the creation of stub-grade and bot-generated articles. Some discussion on quality requirements.
      Top ten languages: it is very difficult to measure the number of speakers of a language, especially the number of second-language speakers for languages like English, French or Russian. DonaldDuck 02:50, 6 July 2008 (UTC)
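
The idea raised in comment 8 (taking page views per country and splitting them across languages using census-style weights) could be sketched roughly as follows. This is only an illustration: the function name, the country codes, the view counts and the language shares are all made-up assumptions, not real server or census data.

```python
from collections import defaultdict


def estimate_views_by_language(views_by_country, language_shares):
    """Split each country's page views across languages by assumed share.

    views_by_country: {country code: page views} (hypothetical input)
    language_shares:  {country code: {language code: share, summing to 1}}
    Countries without share data are collected under "unknown".
    """
    totals = defaultdict(float)
    for country, views in views_by_country.items():
        shares = language_shares.get(country)
        if not shares:
            totals["unknown"] += views
            continue
        for lang, share in shares.items():
            totals[lang] += views * share
    return dict(totals)


# Made-up numbers, for illustration only.
views_by_country = {"IN": 1000, "JP": 500, "XX": 10}
language_shares = {
    "IN": {"hi": 0.5, "en": 0.25, "bn": 0.25},
    "JP": {"ja": 1.0},
}
estimates = estimate_views_by_language(views_by_country, language_shares)
```

The weights only describe who lives in a country, not which language each viewer would prefer to read, so the result would be a rough estimate at best.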

By IP

Another option is to create IP-dependent start pages. Most IP addresses are easy to assign to a region (USA, Europe, Asia and so on), and if an IP has no corresponding region, we can have a fallback solution.

Example: In Asia you have Chinese, Russian and so on at the top; in Europe, English, German, French, Italian, maybe Turkish, Russian... --84.147.109.24 11:28, 7 April 2008 (UTC)

  • pros:
    1. More languages can be displayed on the front page. The preceding unsigned comment was added by Micke (talk • contribs) 11:40, 7 April 2008.
  • cons:
    1. This isn't a solution in itself; it only moves the problem a bit further along. How should we decide which languages make it to, for example, the European site? The preceding unsigned comment was added by Micke (talk • contribs) 11:40, 7 April 2008.
  • comments:
    1. See also this discussion in the talk page, it is relevant to the subject. Anyway, I agree with Micke above, that is not a solution itself, should we have the fallback versions. We still need well defined criteria. I don't get the "pro" argument, though... maybe you mean that more languages get to be in the top ten, depending on the regional version? --Waldir 00:50, 12 April 2008 (UTC)
      Exactly, more language versions will have the opportunity to attract visitors from the frontpage. --MiCkEdb 15:15, 20 April 2008 (UTC)
    2. I think this is the best idea. It's the fairest and most user friendly. --A. B. (talk) 20:14, 19 April 2008 (UTC)
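
A minimal sketch of the IP-dependent start page with a fallback, as proposed in this section. The network ranges below are reserved documentation addresses standing in for real regional allocations, and the fallback list is an arbitrary assumption. Only the regional language examples (Asia: Chinese, Russian; Europe: English, German, French, Italian, Turkish, Russian) come from the proposal itself.

```python
import ipaddress

# Hypothetical region-to-network mapping (documentation ranges, not real data).
REGION_NETWORKS = {
    "Europe": [ipaddress.ip_network("192.0.2.0/24")],
    "Asia": [ipaddress.ip_network("198.51.100.0/24")],
}

# Languages shown on top per region, following the example in the proposal.
REGION_TOP_LANGUAGES = {
    "Asia": ["zh", "ru"],
    "Europe": ["en", "de", "fr", "it", "tr", "ru"],
}

# Assumed fallback for IPs with no corresponding region.
FALLBACK_LANGUAGES = ["en", "de", "fr"]


def region_for_ip(ip):
    """Return the region whose networks contain the IP, or None."""
    addr = ipaddress.ip_address(ip)
    for region, networks in REGION_NETWORKS.items():
        if any(addr in net for net in networks):
            return region
    return None


def portal_languages(ip):
    """Pick the regional language list, falling back when no region matches."""
    return REGION_TOP_LANGUAGES.get(region_for_ip(ip), FALLBACK_LANGUAGES)
```

As noted in the cons, this does not settle which languages belong in each regional list; those lists still need well-defined criteria.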

Hard work

users / article

  • pros:
  • cons:

very active users / speakers of the language

  • pros:
  • cons:

edits per user

  • pros:
  • cons:

very active users × gini coefficient

Relevant link for understanding the Gini index: [4] (page 12)

  • pros:
  • cons:

number of manual edits (excl. bots)

  • pros:
    1. Possibly the sort order should be changed to number of manual edits (excl. bots) as best comparison of efforts. (Erik Zachte on http://stats.wikimedia.org/EN/Sitemap.htm)
    2. Might be a good idea, but I think taking the number of articles created by non-bot users might be a better measure (don't know if it would be useful to any user to know that en.wp has 200.000.000 non-bot edits - that number isn't imaginable...)--91.13.220.249 10:25, 24 March 2008 (UTC)
  • cons:
    1. Considering lots of non-bot edits (including article creation) are vandalism, I'm not so sure this would be a good measure. Besides, well-used bots increase the quality of an article at least as much as small manual edits and corrections do. It would be fairer to exclude all small edits (e.g. orthographic corrections, infobox format changes, adding pictures), regardless of whether or not they were done by bots, plus also all vandalism. (But then again, the infinitesimal increases in quality made by those small edits might actually add up to a non-negligible contribution... Wouldn't it be better to simply weight it somehow?) --Smeira 17:12, 26 March 2008 (UTC)
    2. Not all bots have the bot flag set on all Wikipedias - for example, on rmwiki, bot changes appear on RecentChanges even if bots are filtered out, because the bots are not recognised. So this would only work if all Wikipedias recognised bots in the same way. -- pne 15:25, 22 July 2008 (UTC)
    3. Why did so many wikipedias grow so slowly? Because using bots to create series of useful pages is not so free and easy. 220.129.120.17 19:17, 26 October 2008 (UTC)
  • comments:
    1. Number of not reverted manual edits? Przykuta 20:08, 25 June 2008 (UTC)
    2. That would make the calculation much harder, we don't want that. We want a simple measure, yet efficient in the sense that it conveys the nature of what we consider to be the best qualities of a wikipedia. --Waldir 22:12, 25 June 2008 (UTC)

number of active editors

Set a limit (e.g. 200 edits per month in articles, excluding bots) and count active editors. #!George Shuklin 23:59, 27 March 2008 (UTC)

  • pro:
    1. Bots and their work are not counted.
    2. Slowing down or accelerating will be visible
  • cons:
    1. Counting vandals, unless you exclude non-registered editors.
    2. Less active editors (even a large group with a big total contribution) will be ignored.
    3. If we don't normalize this value (for example by dividing it by the number of speakers, as suggested above), we end up privileging wikipedias which have a bigger potential user base (lots of people speak English, for instance). Other communities might never reach this number of editors simply because there are not enough speakers of the language. Waldir 14:42, 28 March 2008 (UTC)
    4. Any "active" editor should produce articles. Let's measure the result, not the tool size. Vlsergey 17:23, 31 March 2008 (UTC)
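
The counting rule proposed above (a monthly edit threshold, bots excluded) could look roughly like this. The user names and edit counts are invented for the example, and the sketch assumes the set of bot accounts is already known, which, as noted elsewhere on this page, is not guaranteed on every wiki.

```python
def count_active_editors(monthly_edits, bots, threshold=200):
    """Count non-bot users with at least `threshold` article edits in a month.

    monthly_edits: {user name: article edits this month} (hypothetical input)
    bots:          set of user names known to be bots (assumed available)
    """
    return sum(
        1
        for user, edits in monthly_edits.items()
        if user not in bots and edits >= threshold
    )


# Invented sample data.
monthly_edits = {"Alice": 250, "Bob": 199, "Carol": 200, "ExampleBot": 5000}
bots = {"ExampleBot"}
active = count_active_editors(monthly_edits, bots)
```

In this sample, Bob's 199 edits contribute nothing to the count, which is the effect described in con 2.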

Combined ratios

Hard work * Quality of service

Rating something is a way to send a message about what is expected. It is very difficult to get a fair rating and avoid perverting the system by shifting community behavior towards the easiest way to get the maximum value with the minimum effort. Some ratings measure total value: number of articles, size of the database, number of hits (like rating a company by its profits). This unfairly overvalues the biggest communities and has the drawback of discouraging the smaller ones, which will never be able to reach those standards. (Economists rate companies by the ratio between profits and capital invested: getting 1m€ profit by investing 0.1m€ is a great deal, but getting 10m€ by investing 1000m€ is a very poor business.) There is also the risk of encouraging low-quality automatically created articles.

To try to avoid those effects, I propose a system that combines two ratios:

a) INTENSITY OF EFFORT: Effort made by the community. This can be measured by dividing the size of the database by the size of the community. The size of the database could be measured, for example, by the number of articles larger than 500 bytes, to exclude the low-quality automatic ones.

b) QUALITY OF SERVICE: How intensively the outside community uses it. This can be measured by the number of hits (ideally hits by non-wikipedists, to avoid endogamic activity, which can be very large in some artificial languages) divided by the size of the community.

The final rate could be: Rate = EFFORT * QUALITY. So if
Na: number of articles with size >= 500 bytes
Hit: number of hits
SPk: number of people who can speak the language
Wiki: number of active wikipedists
K: constant to estimate the endogamic activity generated by the editors themselves
the final formula could be:

RATIO: (Na/SPk)*(Hit-K*Wiki)/SPk
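
For illustration, the formula can be written out as a small function. The parameter names mirror Na, Hit, SPk, Wiki and K from the proposal; the sample numbers are invented, and the value of K in particular is an arbitrary placeholder, since the proposal leaves it to be tuned.

```python
def combined_ratio(na, hits, speakers, wikipedists, k):
    """Compute (Na/SPk) * (Hit - K*Wiki) / SPk as defined in the proposal.

    na:          articles above the size threshold (Na)
    hits:        number of hits (Hit)
    speakers:    people who can speak the language (SPk)
    wikipedists: active wikipedists (Wiki)
    k:           estimated hits generated per editor (K), to be tuned
    """
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    effort = na / speakers
    quality = (hits - k * wikipedists) / speakers
    return effort * quality


# Invented figures, for illustration only.
example = combined_ratio(
    na=500_000, hits=2_000_000, speakers=1_000_000, wikipedists=500, k=1000
)
```

Because the product divides by the number of speakers twice, a wikipedia with few speakers and a comparatively large article count can score far above the large ones, so the choice of normalisation matters a great deal.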
  • pros:
    1. Sends the message that hard work is valued.
    2. Discourages the creation of low-quality automatically created articles.
    3. Defines quality as quality of service, i.e. how intensively the community uses the tool, not just the length of the content regardless of how meaningful it is.
    4. Gives small communities (that are highly active in creating and using Wikipedia) the chance to reach the top ten.
    5. Sends big communities the message that being big is not enough: they can drop out of the top ten if they are not active enough.
  • cons:
    1. The formula is slightly complicated and can be difficult to understand.
    2. Parameter K has to be fine-tuned.
    3. And the biggest con: huge communities with low activity and low quality of service can be in the top 10 under any other system, but their position is at risk under this one. In any voting system, and in any consensus system, if they act selfishly, they will never accept this system.
  • comment:
    Thanks a lot for your input. It was actually my intention to propose a combined system, but only after the discussion had reached some consensus on the precise way to measure each criterion. Doing one thing at a time is the best way to move forward, otherwise we risk having too many things in discussion and it might not get anywhere.
    Also, about the biggest "con" you refer: that is not actually a disadvantage of the system in the true sense, as you surely realize, and since it was made clear from the very beginning of this discussion that this is not a vote, only the strength of arguments should prevail, not the strength of numbers.
    Finally: given what I said above, if you agree I'd like you to move this section to the discussion or to a subpage (keeping a note on the discussion in case the idea occurs to someone else), and only introduce it back when we reach some consensus in the first part of the discussion. Waldir 01:44, 25 March 2008 (UTC)

Combination of The Table

I used The Table from the beginning of this page. I chose the columns: article count, most speakers 100K+, most visited 100K+, 1-stub ratio 100K+, speakers /article 100K+, users / article 100K+, and compressed DB size. Then, in each column, I gave 10 points to the first language, 9 points to the second and so on, down to 1 point for the tenth. The result is (if I made no mistake):

  1. en 49 points
  2. fr 39
  3. de 35
  4. ja 30
  5. pl 30
  6. es 27
  7. ru 25
  8. it 22
  9. nl 22
  10. pt 22
  11. vo 20
  12. zh 18
  13. sv 17
  14. no 13
  15. fi 10
  16. ro 7

The top 10 languages are: English, French, German, Japanese, Polish, Spanish, Russian, Italian, Dutch and Portuguese. I provide no deeper analysis of this evaluation method because it is quite straightforward. It would probably be better not to count articles smaller than 500 B (or 1 kB). If the speakers /article 100K+ and compressed DB size columns are omitted (they are a bit redundant), the top 10 is: en, fr, de, ru, ja, es, pl, pt, zh, nl. Miraceti 12:43, 11 July 2008 (UTC)
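
The point scheme described above (10 points for first place in a column, down to 1 point for tenth, summed over the chosen columns) can be sketched as follows. The column contents below are a shortened, invented example rather than the real table, and breaking ties alphabetically by language code is an assumption, since the method does not say how ties are resolved.

```python
from collections import defaultdict


def combine_rankings(columns, top_n=10):
    """Sum positional points over several ranked columns.

    columns: {column name: list of language codes, best first}
    The first entry of a column earns top_n points, the next top_n - 1,
    and so on; entries beyond top_n earn nothing.
    Returns (language, points) pairs, highest score first, with ties
    broken alphabetically (an assumption, not part of the proposal).
    """
    points = defaultdict(int)
    for ranking in columns.values():
        for position, lang in enumerate(ranking[:top_n]):
            points[lang] += top_n - position
    return sorted(points.items(), key=lambda item: (-item[1], item[0]))


# Shortened, made-up columns with top_n=3 to keep the example small.
columns = {
    "article count": ["en", "de", "fr"],
    "most visited": ["en", "es", "fr"],
    "compressed DB size": ["en", "de", "fr"],
}
combined = combine_rankings(columns, top_n=3)
```

Dropping a column, as suggested for the redundant ones, just means leaving it out of the dictionary before calling the function.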

More useful links