Talk:List of Wikipedias by sample of articles/Archives/2010

Please do not post any new comments on this page. This is a discussion archive first created in 2010, although the comments contained were likely posted before and after this date. See current discussion or the archives index.

Untranslated pages shouldn't be counted

I was going to add the following logic to filter out articles that are not translated and are incorrectly receiving high scores.

import re

def IsArticleEnglish(article):

    # convert article to lower case word list
    word_list = re.split(r'\s+', article.lower())

    # create dictionary of word:frequency pairs
    freq_dic = {}

    # punctuation marks to be removed
    punctuation = re.compile(r'[.?!,":;{}]')
    for word in word_list:
        word = punctuation.sub("", word)
        freq_dic[word] = freq_dic.get(word, 0) + 1

    # usually English is ~20% these words and non-English at most a few percent
    common_english_words = ['the','of','on','a','is','in','his','have','by','but','that','to','with','for','an','from','are','was','he','which','be','as','it','this','first','new']
    en_word_count = 0
    for word in common_english_words:
        if word in freq_dic:
            en_word_count = en_word_count + freq_dic[word]

    percent_thats_common_english = 100.0 * en_word_count / len(word_list)

    # flag if more than 10% of the words are in the list, which means more than half the article is English
    return percent_thats_common_english > 10 and en_word_count > 10

It is not the most elaborate code but it seems to identify the articles that are mostly English and shouldn't be counted. It shouldn't affect the scores too much except for a few wikis that have "cut and pasted" from the en.wp article for their initial article (a dubious practice IMO) and let the untranslated English languish. Is this okay with everyone? MarsRover 02:16, 24 January 2010 (UTC)

Sounds like a good idea to me, at least in general. The logic could be problematic, though. For example, two of the words on your list are conjunctions in the Slovene language ("in" and "a"), and another is potentially quite common in articles as well ("to", meaning "this"). Those three words almost surely won't make up 10% of the text, but please make a test run and list excluded articles somewhere so we can check if there are problems. Shouldn't be too much work. Also, the code has to exclude commented out text (like the script does, if this function isn't inherited). Yerpo 10:44, 25 January 2010 (UTC)

I like this idea too. ...Aurora... 11:25, 29 January 2010 (UTC)

Yes, some of these words are common in Latin as well ("in," "a," "is"), but others like "the" and "was" are all but impossible in that language. Similarly, a closely related language like German has different forms for pretty much all of them ("ist" for "is," or "war" for "was") and won't get significantly many false matches. So while Yerpo's suggestion of a test run is sensible and prudent, I don't think this is going to cause problems in general. I also support the proposal. A. Mahoney 17:26, 26 January 2012 (UTC)

oops, missed the fact that this happened two years ago! Well, I still agree with it. :-) A. Mahoney 17:28, 26 January 2012 (UTC)

The function only seemed to have false positives with the Scots language. So, I disabled the check for that wiki along with the obvious English and Simple English wikis. Also, I bumped up the criteria percent from 10% to 20% in the actual code. You can see the various errors found in this document. Even the errors with some Old English articles look correct. --MarsRover 08:09, 27 January 2012 (UTC)
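A minimal sketch of how the adjustments discussed in this thread might fit together: hidden comments are stripped first, the English, Simple English and Scots wikis are skipped, and the threshold is 20%. The function name, the wiki codes and the assumption that commented-out text uses HTML-style `<!-- ... -->` markers are illustrative and not taken from the actual script.

```python
import re

# Hypothetical skip list based on the thread: English, Simple English, Scots.
SKIPPED_WIKIS = {"en", "simple", "sco"}

# Same word list as the function posted above (with the missing comma restored).
COMMON_ENGLISH_WORDS = {
    "the", "of", "on", "a", "is", "in", "his", "have", "by", "but", "that",
    "to", "with", "for", "an", "from", "are", "was", "he", "which", "be",
    "as", "it", "this", "first", "new",
}

# Assumption: commented-out text is wrapped in HTML-style comment markers.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
PUNCTUATION = re.compile(r'[.?!,":;{}]')


def looks_untranslated(wiki_code, wikitext, min_percent=20.0, min_count=10):
    """Return True if the visible text looks like mostly untranslated English."""
    if wiki_code in SKIPPED_WIKIS:
        return False
    # Drop commented-out text before counting words.
    visible = HTML_COMMENT.sub(" ", wikitext)
    words = [PUNCTUATION.sub("", w) for w in visible.lower().split()]
    if not words:
        return False
    hits = sum(1 for w in words if w in COMMON_ENGLISH_WORDS)
    percent = 100.0 * hits / len(words)
    return percent > min_percent and hits > min_count


english = "the history of the city is in the book that was written by the author for the people " * 3
print(looks_untranslated("sl", english))   # English text on a non-English wiki
print(looks_untranslated("sco", english))  # check disabled for this wiki
```

Keeping the thresholds as parameters makes it easy to rerun the check at 10% and 20% and compare the lists of flagged articles, as suggested in the replies.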

An idea about adding one more column

I was reading about how some people are trying to expand the Wikipedia editions in Africa. (Jimmy Wales seeks the development of Wikipedia in African languages, Building Wikipedia in African languages, Swahili Wikipedia now the largest African-language). It seems a shame the table cannot be sorted to list the "African languages" together. I am just curious how those wikis are doing with the score. Including a new column called "region" that is the continent where the language originated would enable this. Note that this would not be the same as a "language macrofamily" classification, which would divide Africa into multiple categories. We could also include some exceptions, such as combining all constructed languages together as "Constructed". Anybody else think that would be useful or at least interesting? --MarsRover 21:36, 8 February 2010 (UTC)

Interesting, sure, but I think it's about time to start making separate pages instead of adding more columns. You know, analogous to List of Wikipedias and List of Wikipedias by language family. Else it might get overcrowded here. Yerpo 20:34, 11 February 2010 (UTC)

Yup, interesting. But how about languages used widely on more than one continent, e.g. English, French? Table width is also an issue, like Yerpo said. ...Aurora... 13:36, 17 February 2010 (UTC)

Article metric

MarsRover seems to enjoy programming this, so I have another idea for measuring vital articles coverage, this time from the perspective of articles. A simple script could sort the articles according to the number of Wikipedias in which they are short/normal/long. For example,

Article            Number of Wikipedias in which this article is
                   long    medium    short    non-existent
en:Earth           34      122       101      15
en:World War II    28      83        130      31
etc.               #       #         #        #

(the numbers are made-up, of course)

I think it should be sorted by the number of Wikipedias in which it's long, then the number in which it's medium, etc. What do you think? — Yerpo ^Eh? 11:26, 18 May 2010 (UTC)

Interesting idea. I could probably just enhance List of Wikipedias by sample of articles/Neglected#Popular Articles to be a table with a few more columns. --MarsRover 15:33, 18 May 2010 (UTC)

Right, forgot about that one. The average size could also stay, of course. — Yerpo ^Eh? 06:17, 20 May 2010 (UTC)

MarsRover, I said it before and I'll say it again: you rock! On a side note, this table could potentially be quite useful in determining what to replace in the list of important articles. I know popularity isn't a perfect measure of importance, but at least in the cases where only 1 Wikipedia managed to gather more than 10.000 characters of information on a particular topic in all these years (hint, hint), we could seriously consider replacing that topic with a better-represented one. — Yerpo ^Eh? 07:02, 3 June 2010 (UTC)
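The ordering proposed at the top of this section (by the number of Wikipedias in which the article is long, then medium, then short) could be sketched roughly as follows. The `ArticleCoverage` type and `rank_articles` function are hypothetical names, the first two rows reuse the made-up numbers from the example table, and the third row is invented only to show the tie-break.

```python
from typing import NamedTuple


class ArticleCoverage(NamedTuple):
    title: str
    long: int     # Wikipedias in which the article is long
    medium: int   # Wikipedias in which it is medium-sized
    short: int    # Wikipedias in which it is short
    absent: int   # Wikipedias in which it does not exist


def rank_articles(rows):
    """Sort by long count, then medium, then short (all descending)."""
    return sorted(rows, key=lambda r: (-r.long, -r.medium, -r.short, r.title))


rows = [
    ArticleCoverage("en:World War II", 28, 83, 130, 31),
    ArticleCoverage("en:Earth", 34, 122, 101, 15),
    ArticleCoverage("en:Example", 28, 90, 100, 54),  # invented row for the tie-break
]

for row in rank_articles(rows):
    print(row.title, row.long, row.medium, row.short, row.absent)
```

Articles at the bottom of such a ranking would be the candidates mentioned above for replacement in the list of important articles.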

02-08-2010 results have errors

The calculated results for August include Toronto, which was added without prior discussion. In addition, a vandalizing modification has forced a lot of wikis to point to :en:Fespa instead of en:Printing. --Loupeter 06:53, 3 August 2010 (UTC)

~~And now Absent + Stubs + Art. + Long Art. = 1001 instead of 1000. --Nk 12:58, 3 August 2010 (UTC)~~

The two problems should cause at most a 0.2% error. JAnDbot changed the en:Printing article's iwlinks, so not really a vandal, right? --MarsRover 17:09, 3 August 2010 (UTC)

Calculating the mean and the median

Are the mean and median article sizes being calculated in the most appropriate manner? These metrics, which could be quite useful in interpreting interwiki progress, seem to be describing their universes without regard to missing articles. For example, Tarandíne (roa-tara) has a mean article size of 23,448: that makes it the wiki with the tenth-largest mean article size and therefore, it would seem, a wiki quite well on the way to being an adequately compiled encyclopedia. However, it has a score (stubs + articles*4 + long.articles*9) of 3.14, and by that metric is the 122nd-most-complete wiki. This is a rather sharp dissonance! Surely the true mean size of all 1000 designated articles in that wiki is far below what the current calculations are showing? A similar situation exists with many other wikis, by this metric and that of the median. § Consider an extreme hypothetical case: a wiki with one article 75 000 long and all 999 other designated articles missing. If zeros for the missing articles aren't included in the calculations, this wiki's mean article size would be 75 000, making it look like the most complete of all present wikis, since the mean size of the designated articles in the English wiki is only about 51 000. Its median size would also be 75 000. But including the sizes of the missing articles would reduce the mean to 75 and the median to zero, much more appropriate descriptors of that hypothesized reality. Jacob. 71.178.147.198 13:24, 22 August 2010 (UTC)

Originally the average included absent articles but it basically results in the same metric as the "score" if you do that. Also, if we did include absent articles, "median" would be very boring for every wiki with more than 500 absent articles. --MarsRover 17:21, 22 August 2010 (UTC)
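The two ways of computing the statistics discussed in this section can be compared with a small sketch. It assumes a list of 1000 designated articles and treats a missing article as size 0; `size_stats` is a hypothetical helper, not the function used by the actual script.

```python
from statistics import mean, median

TOTAL_ARTICLES = 1000  # size of the designated article list


def size_stats(existing_sizes, total=TOTAL_ARTICLES):
    """Mean and median with and without zeros for absent articles."""
    padded = list(existing_sizes) + [0] * (total - len(existing_sizes))
    return {
        "mean_existing": mean(existing_sizes),
        "median_existing": median(existing_sizes),
        "mean_all": mean(padded),
        "median_all": median(padded),
    }


# The extreme hypothetical case: one article of 75 000 characters, 999 absent.
stats = size_stats([75000])
print(stats["mean_existing"], stats["median_existing"])  # 75000 and 75000
print(stats["mean_all"], stats["median_all"])            # 75 and zero

# With more than 500 absent articles the padded median is always zero.
print(size_stats([5000] * 499)["median_all"])
```

The last line illustrates the objection in the reply: once absent articles are counted, the median carries no information for any wiki missing more than half of the list.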

Weight for be-x-old wiki

Hello. The Babel text for Belarusian (in Taraškievica orthography variant) is available at [1]. According to it the character count for Belarusian (Taraškievica) is 834, so the weight appears to be 1162 / 834 = 1.39 → 1.4. Could you please add this weight for be-x-old wiki? (The weight for be-wiki will be nearly the same, but the original text for Belarusian at omniglot is presented in a variant of Taraškievica orthography used in be-x-old wiki.) —zedlik 20:14, 17 November 2010 (UTC)

I've updated the Source code and the table in Talk:List of Wikipedias by sample of articles/Archives/2007#Proposed weighting of characters for formula (Option#2 using Babel text) with the weight of 1.4 according to the Babel text. —zedlik 20:35, 18 November 2010 (UTC)

I double-checked the weights and you're correct that it is 834 characters. The scores will reflect this change at the end of the month. --MarsRover 06:53, 19 November 2010 (UTC)

Great, thank you! —zedlik 15:03, 19 November 2010 (UTC)
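The weight calculation used in this section can be sketched as below. It assumes the reference length of 1162 characters quoted above and rounding to one decimal place; `language_weight` is a hypothetical name, not a function from the actual source code.

```python
REFERENCE_LENGTH = 1162  # reference character count quoted in the discussion


def language_weight(babel_text_length, reference_length=REFERENCE_LENGTH):
    """Ratio of the reference length to the language's Babel text length."""
    return round(reference_length / babel_text_length, 1)


# Belarusian (Taraškievica): 834 characters in the Babel text.
print(REFERENCE_LENGTH / 834)   # about 1.39
print(language_weight(834))     # 1.4
```

A language whose Babel text is shorter than the reference gets a weight above 1, so its article sizes are scaled up before being compared.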

sh wikipedia

I do not understand what has happened. During November I wrote or added tens of great articles on the Srpskohrvatski / Српскохрватски (Serbo-Croatian or, as I like to say, Yugoslav) Wikipedia, but the number of great articles for this Wikipedia has not changed. On the 2 November list the sh wiki has 108 long articles, but on the 1 December list it again has only 108 long articles.

During November I wrote, or added from other wikis, many great articles because I believe that the "Yugoslav" wiki can become an archive of the best South Slavic articles. Examples of articles that I wrote, or took from other wikis with the needed changes, during November:

Airbus A380 (new article 67.000),
Uzbekistan (from 10.000 to 43.000)
Henry VI, Holy Roman Emperor (from 0 to 42.000)
Subotica (from 11.000 to 35.000).

I have used these examples because all of these articles have interwiki links to the English Wikipedia, so it is hard for me to understand why they were not "used" this month by the bot that examines the articles. --Rjecina2 20:14, 4 December 2010 (UTC)

Hello! Please look at List of articles every Wikipedia should have. These articles are counted for this table, so only changes to these will be reflected here. -- Prince Kassad 20:17, 4 December 2010 (UTC)

Weights

Indonesian uses 1,256 characters, therefore it should use a weight of 1162/1256 = 0.92515... = 0.9.

Galician uses 992 characters, therefore it should use a weight of 1162/992 = 1.13508... = 1.1.

-- Prince Kassad 19:01, 7 December 2010 (UTC)