Talk:List of Wikipedias by sample of articles/Archives/2008

January 08 update

The current intro states that the listing was updated on 24 January 2008. I'm not sure whether the listing was recalculated or whether the preexisting values were only reweighted according to Proposed weighting of characters for formula. Could somebody clarify that? It's slightly misleading if the values are old, and if they are, it would be nice to have a more up-to-date listing. --193.2.89.2 14:39, 14 February 2008 (UTC) (Yerpo @ sl)

Yes, it was fully recalculated again using the same list of articles as before. I used the modified source code which has a few changes such as weighting the character size and excluding the "interwiki text" from the size calculation. Also, the "average article size" now excludes absent articles since it seemed more intuitive (that change doesn't alter the scores). --MarsRover 03:27, 15 February 2008 (UTC)
Thanks for the clarification. And, as the commenter before me said, Good ramking! :P Too bad I only got around to updating the interwikis of vital articles for "my" WP after I saw that the listing was updated. Ah well, maybe the position will improve the next time. --193.2.89.2 11:52, 15 February 2008 (UTC)
Is the list updated regularly? And if so, how often? --140.180.20.42 19:01, 27 February 2008 (UTC)
I was thinking of updating it again (I guess that makes it updated every month). It takes two days to run the program, so that's why I was waiting until it gets more out of date. Also, I was going to enhance the program a bit to keep track of and publish the list of missing articles for each wiki. --MarsRover 20:58, 27 February 2008 (UTC)
Oh good, I'd very much like to see this list updated regularly :) --140.180.20.42 23:39, 27 February 2008 (UTC)
There's something wrong with the methodology used. The Icelandic Wikipedia has many articles which are claimed to be missing and has had them for a long time, e.g. Epistemology. How is it determined whether or not an article exists? --140.180.2.62 23:46, 2 March 2008 (UTC)
Out of the 207 articles that the list claims are missing from the Icelandic Wikipedia, only 76 really are. So 63% of the "absent" articles aren't absent at all. I imagine many of the other languages are in a similar situation: they're not missing all of those articles; rather, what's missing are the iw-links from the English Wikipedia. Also, the list of articles is badly chosen, as some of the articles that all Wikipedias should have are disambiguation pages! For example en:Processor and en:Cell. But there may not be a need to disambiguate in the same way in all the languages, and therefore many of the languages will be missing an iw-link on this particular page on the English Wikipedia. These pages should clearly not be taken into account, since it would be absurd to maintain that all Wikipedias ought to have these same disambiguation pages. --140.180.2.62 03:04, 3 March 2008 (UTC)
I noticed that problem with some of the pages being too generic (ex. en:Square should have been en:Square (geometry) and en:China should have been en:People's Republic of China), but I'm not sure there are enough problems to redo the list again. Also, you are correct that the important thing is the iw link in the en.wp article. --MarsRover 03:18, 3 March 2008 (UTC)
Certainly we can't expect all languages to have the same disambiguation pages. Like en:Processor and en:Cell. That's what I find absurd. And since these words won't be ambiguous in all languages there won't be links on the English disambiguation pages. --140.180.2.62 03:22, 3 March 2008 (UTC)
I agree, but we should probably first fix the problems in List of articles every Wikipedia should have. Then I will update this list for the next run. --MarsRover 03:28, 3 March 2008 (UTC)
Well, I've taken out the disambig pages. --140.180.2.62 03:36, 3 March 2008 (UTC)

Relative Weights Estimates for some Germanic / Franconian languages.

Knowing enough about the languages in question, I can say that the weight factors for the following can be safely assumed to be pretty much the same, or at least very close together: bar (Austro-Bavarian), de (Standard German), en (English), fy ((West) Frisian*), gsw (named als in the Wikipedias) (Alsatian/Alemannic/Swiss German), ksh (Kölsch/Ripuarian), lb (Luxembourgish), li (Limburgish*), nds (Low German/Plattdüütsch), nds-NL (Nedersaksisch), nl (Dutch*), pdc (Pennsylvania Dutch*), pfl (Palatinate German), vls (West Flaams*), stq (Saterland Frisian), zea (Zeelandic), and likely a few more that are closely related to one of those mentioned here. They may deviate by imho ±0.5 or so at most, with those marked with an asterisk* more likely tending to the lower end of the scale. --Purodha Blissenbach 15:15, 14 March 2008 (UTC)

Estimations can be applied for all Chinese-related languages (zh-xx) too. I suppose they are more or less similar to Chinese (zh). -- Kevinhksouth 14:47, 17 March 2008 (UTC)

I agree but zh-min-nan: doesn't use chinese characters so seems to be the exception to having same weight as zh:. --MarsRover 05:23, 4 April 2008 (UTC)
Yes, Min Nan (zh-min-nan:) should have a much lower weight (1.2), as I stated above. This should also apply to two other Chinese languages, Min Dong (cdo:) and Hakka (hak:), as both Wikipedias use the Latin alphabet. I agree that modern Chinese languages using Chinese characters, such as Cantonese (zh-yue:) and Wu (wuu:), may have weights close to Mandarin Chinese (zh:) (3.7). But Classical Chinese (zh-classical:) is a more succinct language, so it should have an even higher weight. 83.200.43.51 01:29, 18 April 2008 (UTC)

Other possibility of maximum score

Note: please do not archive this section, as it is linked from the content page.

Current Formula: ?

For example ukrainian score for April 2010 is calculated as follows: ??+??/?? = 39.48

The maximum score is about 1092 x9 = 9828.

I propose to replace this score with the old score divided by the number of articles. This way, the maximum score is clearly 9.

And a new score of 4 seems clearer than an old score of 4368.--Jauclair 18:41, 19 March 2008 (UTC)

Interesting idea. To be useful you would need about 3 decimals for the score (ex. "9.000"). I also was thinking of a way to normalize the score since we potentially can have a different number of articles for each run. And, it would be nice if the scores use the same scale to tell if the wikis are improving with each update.
  • We can normalize to 9 but that might be a little counter intuitive (people may think 10 is the best)
  • We can normalize to 10000 since the maximum score is almost that number ( 1092 x9 / 9828 x 10000 = 10000)
  • We can normalize to 100 since that is the usual maximum score for most scales ( 1092 x9 / 9828 x 100 = 100)
--MarsRover 19:44, 19 March 2008 (UTC)
Another approach is to reduce the number of basic articles back to 1000 and change the score of long articles to 10, i.e. 1000 x 10 = 10000. (The list was planned to have 1000 articles, but it now has 92 more and nobody on Meta wants to reduce that, while the list in the English Wikipedia has already been reduced to exactly 1000.) -- Kevinhksouth 16:19, 20 March 2008 (UTC)
You would also have to tweak Smeira's original formula to have long articles be worth 10 (not 9) else the maximum score would be 9000. But the biggest problem is what you mentioned "nobody in meta wants to reduce" the list to 1000 articles (and have it stay at that number). --MarsRover 19:37, 20 March 2008 (UTC)
There are three different problems.
  • The number of articles in the list. I think it is a false problem, because it's obvious that this list has to change over time and that it will not always be possible to take away an article when a new article has to be added. So the size of the list will grow in the future, and that is normal (and the final maximum score must be independent of the size of the list).
  • The list itself. The problem is that there is not ONE list, but many lists. There is the meta list used here and there is a list in each Wikipedia (at least in the more important ones). The meta list is not maintained and discussed by everybody but only by Meta users ... who have to speak English. For me it's a real problem because the links for this list are links to the English Wikipedia. So it's easy for the English Wikipedia to have all the articles created, and not possible for other Wikipedias to easily find which articles are missing. This is almost the same problem as the one discussed in the first section of this talk page, Any "lists of absent articles of each major Wikipedias"?
  • The maximum score itself, which we are discussing here!
Now I think the best maximum score is simply 100%. The percent is important because it makes the result more obvious to read. The goal is to have 100% of the articles in the list be long articles.--Jauclair 22:09, 20 March 2008 (UTC)
I also agree that "a maximum index of 100.00" is better than "a maximum score of 9828". Since the number of articles is unlikely to be reduced to exactly 1000 in the near future, converting the score to an index would probably be the best solution. -- Kevinhksouth 06:02, 21 March 2008 (UTC)
Ok, I will do that in the next run. You are correct that this would be an index. It would not be the "percent of long articles", since you may have 1000 stubs, which results in a score of about 10.00, and that doesn't mean you have about 100 long articles. MarsRover 17:03, 21 March 2008 (UTC)
What I said is just that if you have 100(%), it means that 100% of the articles in the list are long articles. Of course other scores are an average between short and long articles, but even if 10% could have many meanings, it means that about 10% of the work has been done, and I think that this "%" is clearer for people who discover this list for the first time (as was my case two weeks ago).--Jauclair 00:57, 22 March 2008 (UTC)
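
The index discussed in this section can be illustrated with a minimal Python sketch. It assumes the point values mentioned elsewhere on this page (1 per stub, 4 per article, 9 per long article) and the list size of 1092; the function names are hypothetical and are not taken from the actual script.

```python
# Point values as quoted in the discussion: stub = 1, article = 4, long article = 9.
POINTS = {"stub": 1, "article": 4, "long": 9}


def raw_score(stubs, articles, long_articles):
    """Unnormalized score: the sum of the points of all existing articles."""
    return (stubs * POINTS["stub"]
            + articles * POINTS["article"]
            + long_articles * POINTS["long"])


def index(stubs, articles, long_articles, list_size, scale=100.0):
    """Rescale the raw score so that 'every listed article is long' maps to `scale`."""
    max_score = list_size * POINTS["long"]
    return raw_score(stubs, articles, long_articles) / max_score * scale


LIST_SIZE = 1092  # number of articles in the list at the time of the discussion

print(raw_score(0, 0, LIST_SIZE))              # 9828, the old maximum score
print(index(0, 0, LIST_SIZE, LIST_SIZE))       # 100.0
print(round(index(1000, 0, 0, LIST_SIZE), 2))  # 10.18 (1000 stubs, no long articles)
```

Because the maximum is derived from `list_size`, the index stays comparable between runs even if the list grows or shrinks. The last line also shows why the result is an index rather than a "percent of long articles".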

Problems with the list which is actually used

The list which is used by the program is at the end of List of Wikipedias by sample of articles/Source code. It is supposed to be the same list as List of articles every Wikipedia should have, but they are different. For example, you have "China" and "People's Republic of China" in the program but only "People's Republic of China" in the article list. For the anatomy part, you have "Gustatory system" and "Olfactory system" on one side and "Ear", "nose", "eye" on the other. Have I misunderstood something?--Jauclair 22:54, 22 March 2008 (UTC)

I've just understood that it is the version marked 1.0 (Version 1.0) which is used, not the current version!--Jauclair 15:33, 24 March 2008 (UTC)
False conclusion, but the problem of differences between the two lists remains. The history shows that the list must be the list of the 1st of December 2008. So it is clearly out of date.--Jauclair 19:49, 24 March 2008 (UTC)

Based on version 1.0?

Is it really based on version 1.0? I have just split this version off to a new subpage so that I can add a statistics table. Unless I miscounted, there should be 1365 articles instead of the 1092 claimed on this page. -- Kevinhksouth 16:23, 24 March 2008 (UTC)

As far as I know, the list is just out of date and several problems in the original list have already been fixed (ex. "China" --> "People's Republic of China"). I don't believe it's version 1.0, and if you look at the history of the list, it should be the version of the list from when the program was created. -- MarsRover 16:33, 24 March 2008 (UTC)
I think I concluded a little too fast ... the fact that it was version 1.0 was just a possibility ... but not the only one ...
So the problem is that this version is out of date. What must be clear is which version of the list is used. I think a good solution is to have a program update the list which is in the code and to state in the results which version was used. It would avoid this kind of confusion! --Jauclair 16:42, 24 March 2008 (UTC)
The history shows that the list must be the list of the 1st of December 2008. So it is clearly out of date (and in this list you have both "China" and "People's Republic of China").--Jauclair 19:48, 24 March 2008 (UTC)

Would it be possible to list the "missing articles"?

I'm active in the Esperanto Wikipedia, and I'm pretty sure we have written articles for all 1000-odd articles in the list of articles. However, I think we might be missing some interwiki links in and out. Would it be possible to have the bot spit out a list of the articles which are lacking? Would a "scoring page" be too much to ask? If we had something like a score page, it might give our Wikipedians something to strive for. Thanks... -- Yekrats 13:16, 25 March 2008 (UTC)

See List of Wikipedias by sample of articles/Absent Articles#eo_Esperanto; the absent count column in this list would be the score. MarsRover 18:55, 25 March 2008 (UTC)
Thanks, you're awesome! -- Yekrats 17:14, 26 March 2008 (UTC)
I am preparing a tool on the toolserver that will be capable to provide such a list (and more) for all wikipedias (and more), see http://tools.wikimedia.de/~purodha/tool/alwipagsh/batch.php which however is not really working yet. --Purodha Blissenbach 20:30, 28 March 2008 (UTC)

Update?

I'm wondering whether the list will be updated soon. Will there be an update at the beginning of this month? --140.180.3.65 00:10, 3 April 2008 (UTC)

I am waiting for it too. -- 158.182.99.155 10:40, 3 April 2008 (UTC)

Bugs in detecting articles?

I have checked the latest version of List of Wikipedias by sample of articles/Absent Articles. Although all basic articles exist in the Chinese (zh) Wikipedia, there are still 5 articles claimed to be "absent". One of them is en:Voodoo, which is a disambiguation page. However, for the other 4 "absent" articles (en:Akbar, en:Hundred Years War, en:Carribean Sea and en:Drugs), I have checked that the zh interwiki links were added a long time ago. So I wonder whether there are any bugs in detecting articles. -- Kevinhksouth 15:02, 3 April 2008 (UTC)

Funny, Hundred Years War, Carribean sea and Drugs are incorrectly missing for sl: too. This is probably because all are misspelling redirects. --84.41.32.68 16:50, 3 April 2008 (UTC) (Yerpo @ sl:)
Same for is:, although even more articles are incorrectly considered missing, such as Akbar. And in addition some iw links that had been put in the English articles had been removed since last month (e.g. epistemology, Random access memory etc.) --140.180.3.65 18:11, 3 April 2008 (UTC)
There might be a bug with having redirects in the article list. I'll check it out. As far as the missing iw links it appears to be vandals and bots running amok (ex. JAnDbot did this) --MarsRover 05:03, 4 April 2008 (UTC)
Found the bug (it occurred when the #redirect was lower case), but it shouldn't have altered the scores too much. I will update the source code in case someone wants to run it themselves. Otherwise we have to wait until next month. --MarsRover 05:58, 4 April 2008 (UTC)
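
The kind of fix described above can be sketched as follows. This is a hypothetical illustration, not the script's actual code: it assumes the bot inspects the raw wikitext of a page and needs to recognize a redirect regardless of how the `#redirect` keyword is capitalized.

```python
import re

# Match "#REDIRECT", "#redirect", "#Redirect", ... followed by a [[target]] link.
REDIRECT_RE = re.compile(r"^\s*#redirect\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the title a redirect page points to, or None for a normal page."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


print(redirect_target("#REDIRECT [[Hundred Years' War]]"))  # Hundred Years' War
print(redirect_target("#redirect [[Caribbean Sea]]"))       # Caribbean Sea
print(redirect_target("The Caribbean Sea is a sea ..."))    # None
```

A check written only for the upper-case spelling would treat the second page as an ordinary article with no interwiki links, which matches the symptom reported in this section.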
Thank you. -- Kevinhksouth 14:02, 4 April 2008 (UTC)
For Hundred Years War, it was due to vandalism in the article, so it was missing for all Wikipedias!!! To avoid this kind of problem, it is necessary to start from many different Wikipedias, not only from the English one!!!--Jauclair 21:51, 4 April 2008 (UTC)
I don't think this is an option. To start from all the different wikipedias, you'd need up-to-date translated list for all languages, which is impossible when it's changing so much. Hundred years' war wasn't due to vandalism, but due to a missed lower case in the redirect (as for the other three), so it isn't really such a problem. --84.41.32.68 11:28, 5 April 2008 (UTC)

Size calculation bug?

Do (potentially large) HTML comments (<!-- something -->) add to article sizes? I am afraid I found at least one suspicious case, even though in this case the final relations seem unaffected. --Purodha Blissenbach 14:08, 4 April 2008 (UTC)

Ok, that should be easy to do. Do you have an example of an article with comments for me to test? --MarsRover 04:09, 5 April 2008 (UTC)
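
A minimal sketch of such an exclusion, assuming the size is measured on raw wikitext and that the caller supplies the set of language codes to treat as interwiki prefixes. The function name, the `lang_codes` parameter and the sample text are hypothetical, and the character weighting is left out.

```python
import re

# HTML comments may span several lines, hence DOTALL and a non-greedy match.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def countable_size(wikitext, lang_codes):
    """Length of the wikitext without HTML comments and interwiki links."""
    text = COMMENT_RE.sub("", wikitext)
    if lang_codes:
        # Try longer codes first so "zh-min-nan" is not cut short by "zh".
        codes = "|".join(re.escape(code)
                         for code in sorted(lang_codes, key=len, reverse=True))
        interwiki_re = re.compile(r"\[\[(?:%s):[^\]\n]*\]\]\n?" % codes)
        text = interwiki_re.sub("", text)
    return len(text)


sample = ("Iron is a metal.<!-- a long\nhidden note -->\n"
          "[[de:Eisen]]\n[[zh-min-nan:Thih]]\n")
print(len(sample))                                     # raw size
print(countable_size(sample, {"de", "zh-min-nan"}))    # 17
```

Only the visible sentence and its line break are counted in the second call, so neither a large hidden comment nor a long block of interwiki links inflates the size.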

Warning about InterWiki links

I examined the edit by JAnDbot. And the Bot was correct in deleting the interwiki link. If you are trying to fill-in absent articles by adding interwiki links, make sure the articles are exact matches otherwise Bots will revert them. For example:

en:Random access memory does not appear to be the same as is:Lesminni, since all of its interwiki links refer to en:Read-only memory. It simply appears to be an article about the wrong type of computer memory. It looks like the link was fixed to now point toward is:Vinnsluminni, which appears to be en:Computer data storage (computer memory that includes both RAM and ROM). So that isn't an exact match either, and I would not be surprised if it's reverted by a bot. --MarsRover 04:00, 6 April 2008 (UTC)

Latin weight

At the moment, the formula gives Latin the default weight of 1.0 (see the asterisk). But in every example I've seen where Latin & English appear on facing pages, the Latin is more compact: if the parallel texts are printed in identical faces, sizes, and spacings, the Latin always takes up (say) 80 to 90 percent of the space that the English does. Shouldn't Latin's weight in this formula therefore be greater than 1.0? 71.178.144.67 17:03, 6 April 2008 (UTC)

Yes, you are correct. The weight is about 1.1 (Latin takes about 90% of the space English does). I'll change it in the next run. --MarsRover 17:22, 6 April 2008 (UTC)
Thanks! A user in Vicipaedia (http://la.wikipedia.org/wiki/Usor:Harrissimo/Weight) has compared a sample of long Latin texts with their English translations: his conclusion is that "overall, the Latin pages are on average (mean) 79% the length of the English pages." The reciprocal of that, 1.2658, might suggest that a more accurate weight for Latin would be 1.3, or at least 1.2. 71.191.124.63 15:20, 8 April 2008 (UTC)
You might be correct but I think we need to use a common text to calculate the weights. I was using the babel text. If you can find a better translation of this text on a webpage I will change the weight. --MarsRover 13:52, 11 April 2008 (UTC)

Vietnamese weight

From the Tower of Babel program above, it appears that Viet is 1100 chars, which should give a weight of 1.057, which is rounded to 1.1 so it seems...Can this be incorporated in there? Blnguyen 05:46, 7 April 2008 (UTC)

I examined the text with MSWord and calculated 1129 chars which results in a weight of 1.029 which rounds down to 1.0 How are you calculating the characters? --MarsRover 13:59, 11 April 2008 (UTC)
Actually it should be 1103. When I pasted in, I cut the 1. 2. 3. up to 9. which should account for about 27 characters. It appears that the en weight of 1163 also cuts off the numbering. With the numbering, it would be about 1190, but in this case, the table appears to be ignoring the numbering. Blnguyen 06:25, 16 April 2008 (UTC)
Strange MSWord bug: If you select everything including the last space, it pastes in the bullet numbers (1129 characters). But if you skip the trailing space, it doesn't paste in the numbers and it is only 1102 characters. I'll use the count that doesn't include the numbers in the next run. --MarsRover 04:02, 17 April 2008 (UTC)
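
The arithmetic in this thread can be sketched in a few lines of Python. The figures quoted above (1100 characters giving 1.057, and an English count of 1163) suggest that a weight is the English character count of the Babel text divided by the count for the language in question; that reading is an inference from this discussion, not a statement about the script's source code.

```python
EN_BABEL_CHARS = 1163  # English count quoted in the thread (numbering excluded)


def weight(lang_chars, en_chars=EN_BABEL_CHARS):
    """Relative weight of a language: English length divided by its own length."""
    return en_chars / lang_chars


print(round(weight(1100), 3))  # 1.057, the value quoted for 1100 characters
print(round(weight(1100), 1))  # 1.1
print(round(weight(1102), 1))  # 1.1, using the count without the numbering
```

Since the published weights are rounded to one decimal, a difference of a few characters rarely matters, but whether the "1." to "9." numbering is counted can move a borderline value across a rounding boundary.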

Disambiguation pages

What will happen, if the script checks for interwiki links on page en:Railroad? This is a disambiguation page. --81.189.26.196 13:05, 13 April 2008 (UTC)

There are NOT supposed to be any disambiguation pages on the list of articles that all Wikipedias should have. If it's in the source code, it's by mistake; and if it's on the list, then that's a mistake. --140.180.10.67 13:31, 13 April 2008 (UTC)
There are several disambiguation pages on the list if we take enwiki as the basis... - 81.183.216.212 12:53, 14 April 2008 (UTC)
The problem is not in using "enwiki as basis" but in not having a clean list. I think we have cleaned up all known disamb pages (removed en:Voodoo, corrected en:Railroad). If you see any more in the List of articles every Wikipedia should have, describe them in the talk section. --MarsRover 20:53, 14 April 2008 (UTC)

Still 1 missing?

I just realized that it was updated on 18th April. However, it is funny that it claims that zh still has 1 absent article - en:Skeleton. I have checked the history of that article in both en and zh, and their interwikis have existed for a long time. The article was possibly considered absent because the en article was vandalized while the program was running. After this incident, I hope that the next time the program is run, someone checks whether something is wrong in the result (it is impossible for en:Skeleton to be missing in most languages) and, if so, corrects it and runs the program once more to avoid incorrect results. -- Kevinhksouth 01:39, 24 April 2008 (UTC)

Some vandal butchered the en:Skeleton article the day zh: was calculated. I will rerun this again at the end of the month so we can have one other wiki besides en: with zero absent. --MarsRover 05:11, 24 April 2008 (UTC)

I am too lazy to open a new heading and I just report it here. In 2nd May's update, there are still some errors. For Chinese (zh), still 1 absent article was listed in the table, and List of Wikipedias by sample of articles/Absent Articles even listed 2. Another problem is found. Besides zh, there are also other languages in List of Wikipedias by sample of articles/Absent Articles having abnormal strings (started with <class 'wikipedia.NoPage'>:). -- Kevinhksouth 16:42, 3 May 2008 (UTC)

Cool. The problem seems to be fixed less than 1 minute before I submitted the above message. Thanks. -- Kevinhksouth 16:45, 3 May 2008 (UTC)
Yeah, I recalculated 4 of the wikis to see how many with 0 absent we could get. The "abnormal string" after the article name is the error message raised during the calculation. The one above means that at that moment the interwiki link was pointing at a missing article in zh:. Also, someone vandalized en:Word during the calculation, so you'll notice that one missing a lot. Oh well. --MarsRover 17:05, 3 May 2008 (UTC)

Norwegian Nynorsk weight

At the moment, the formula gives Nynorsk the default weight of 1.0. Nynorsk has the same characteristics as Norwegian Bokmål (weight 1.2); for instance, the Nynorsk word for the title "the county deputy chairperson" is fylkesvaraordføraren. Hogne

Ok, I'll fix that the next time it's calculated. -MarsRover 16:01, 8 May 2008 (UTC)
Thank you! Hogne 9 May 2008.

Script extension

Note: please do not archive this section, as it is linked from the content page.

MarsRover, I have an idea for what else this script might do (if you get really bored one day): calculate how much a Wikipedia improved since the last update and give the option to sort by that. Although I admit the table with results is pretty crowded already...

Second thing, what list was used for generating the current set? The one from 3rd of April as it says at the top or newer? Thanks, --84.41.32.68 05:22, 9 May 2008 (UTC) (Yerpo @ sl)

I was thinking the same thing. It would be cool to see the growth rate. There are a few things holding me back. The table is already wrapping the text in some cells. Also, I wasn't sure whether to use the growth amount, growth percent or rank change (who overtook whom). Definitely not all three. Also, I want anybody to be able to update the table with the script. So I didn't want to make it too complicated to run.
The current table used the article list from May 2nd, which hasn't changed since the Apr 18th version. --MarsRover 07:20, 9 May 2008 (UTC)
I think growth amount would be the most useful value. Rank doesn't change so much in the larger Wikipedias to be of any significant use, and growth percent depends heavily on the absolute value (meaning that the value would always be very small in most of the larger Wikipedias). Maybe those two would be useful (or at least exciting to watch) for the smallest Wikipedias, but if I had to pick one, it would still be growth amount. --193.2.89.2 08:17, 9 May 2008 (UTC) (Yerpo @ sl)
This is really good. You rock! --Yerpo 18:06, 2 June 2008 (UTC)
Yes, an excellent idea! 71.178.149.148 16:41, 14 June 2008 (UTC)

Removing minor figures

When is the list going to be revised so as to get rid of minor figures (like Bedřich Smetana)? 71.178.149.148 16:41, 14 June 2008 (UTC)

First discuss removals here and just remove them from here: list of articles.--MarsRover 18:31, 14 June 2008 (UTC)

Chance to fix missing articles (for July 24)

I'm going to run the script tomorrow, so if you care about keeping your zero absent articles, you need to fix these:

ca Català

  1. en:Delhi

en English

eo Esperanto

  1. en:Delhi
  2. en:Microprocessor

ru Русский

  1. en:Microprocessor

simple Simple English

  1. en:Rail transport

zh 中文

--MarsRover 07:39, 24 July 2008 (UTC)

Notes on Finnish

I went through all the Finnish missing articles, created a few and found several missing interwikis. I'm not sure if you are interested, but I could elaborate a bit on the difficulty of translating certain terms and concepts from English to other languages – it might help improve this list and similar lists in the future.

Very clear cases that are still missing in Finnish are Arnaut Daniel, Rubén Darío, Ferdowsi, Fuzûlî, New Age music (New Age -musiikki), Horse Racing (Hevosurheilu), Respiration (physiology) (Hengitys (fysiologia)), Mental illness (mielisairaus), Heart disease (sydänsairaudet), Natural disaster (luonnonkatastrofi) and Millennium (vuosituhat), and these should be created. However, there are several difficult cases. Many lemmas are overlapping terms that don't really have a Finnish equivalent. For example, "city" is usually translated into Finnish as "kaupunki", which has an interwiki to the English article "town". Finnish only has one word for these two concepts (a possible solution would be to create an article "suurkaupunki", similar to the German de:Großstadt). "Country" is usually translated as "maa" into Finnish, but this term has several meanings including "earth" and "soil": the more official term for country is fi:valtio, which has an interwiki to the English article "state". I fail to see how creating an artificial article corresponding to the English term "country" would improve anything, as these terms are synonyms in Finnish.

Next there's "Military", which doesn't seem to have a good translation into Finnish – I thought of fi:asevoimat, which does have an article in the Finnish Wikipedia, but it has a link to the English article en:Armed forces. Is there really a need for two separate articles about the same thing in the English Wikipedia, as they seem to be fairly synonymous? Then we have "pottery". Finnish has an all-encompassing word fi:keramiikka, which includes pottery and all types of ceramic art. The article has very few interwikis, as most languages seem to separate these two concepts. Another similar case is "drug", which can be translated as "lääke" (for healing purposes) or "huume/päihde" (for narcotic purposes), but I can't think of a Finnish word including both. "Legume" also seems a bit redundant, as well as "butterfly", as the Finnish wiki already has the articles fi:hernekasvit and fi:perhoset, respectively, and the two English articles about these concepts seem to be fairly synonymous with "legume" and "butterfly".

Then there are also a few cases of mergism, such as "deity" (fi:jumaluus) included in "god" (fi:jumala), and a reverse example with "Russian revolution" split into two articles. "Colon" is an interesting case: the Finnish translation seems to be "lynkkysuoli", but this term appears to be fairly rare and I hadn't even heard it before.

Anyway, these were just musings and a note for myself, nothing to take seriously. But they go to show that this list has some problems when it comes to cultural and linguistic differences. --Orri 14:06, 15 June 2008 (UTC)

Some of the choices for the list of articles have been debated because of the problems you mentioned. In English there is a fine difference between en:Town and en:City, some are slight supersets of the other like en:Large Intestine and en:Colon (anatomy), and some Wikipedias like to use scientific names and so don't match the common name (en:Butterfly is a subset of the scientific en:Lepidoptera classification, which includes en:Moths).
I think some encyclopedias have not grown enough to include all the fine differences (Is there an article regarding both en:Tobacco as a product and Tobacco as a plant?). If there are cases where something is just impossible to create, then I would discuss it on the Talk page of the list of articles and have the article removed. --MarsRover 05:59, 16 June 2008 (UTC)


Just FYI, Esperanto has a similar problem with "butterfly". Our word for butterfly includes moths. Frankly I don't think "butterfly" belongs on this list, but that's a discussion for another page. -- Yekrats 18:11, 2 July 2008 (UTC)
You could solve this by linking eo:Papilio with en:Lepidoptera and en:Butterfly with the article describing Lepidopterans that fly during the day. You would still have to write the general article about all Lepidopterans, though. But I agree that en:Lepidoptera would be more suitable for the list. --Yerpo 06:45, 3 July 2008 (UTC)
Without further elaboration: I confirm, that I found similar issues with ksh, nds, de, and that solving them towards having comparable, and thus interwiki-linkable articles, at times appears a bit artificial to me. Wanting interwiki links, which I almost always prefer over not having them, makes us shift our style of describing the world slightly towards the way, the others do it. This applies to English, too, there is some backscatter.
Remark: some secondary meanings, imho, do not prevent interwiki linking a concept. In Kölsch, e.g. we have tons of abstract concepts, like "love affair", "(in)decency", "(un)reasonability", etc. that can as well be "a person, who …". There is no real need to split e.g. "love affair (trait)" and "love affair (person)". Even though interwiki links of "love affair" may go to articles in foreign languages that indisputably exclude "love affair (person)", such interwiki links should be good. --Purodha Blissenbach 09:53, 29 October 2008 (UTC)

Suggestion: A more granular curve

I'd like to see a more granular approach to the points awarded for this list. A lot of good can come from improving an article in even 5k or 1k increments, and small granular improvements seem to be what Wikipedia is all about, right? Also, some articles just look ridiculous amplified to 30k bytes. They could still get a respectable score without being artificially inflated with strange fluff to build them to over 30k.

In the current system, there are plateaus at 1, 10000, and 30000 bytes. What if we were to keep the same general progression, but make it more granular, making the plateaus occur more often, so that a single Wikipedian can make a visible difference with just a little effort?

  • 1k = 1 point (roughly the same as current system, but makes a minimum threshold for an "article")
  • 4k = 2 points
  • 7k = 3 points
  • 10k = 4 points (same as current system)
  • 14k = 5 points
  • 18k = 6 points
  • 22k = 7 points
  • 26k = 8 points
  • 30k = 9 points (same as current system)

Also, I think it would be important to *not* count interwiki links in the article length count, because a decently linked article can add 4000+ bytes of interwiki onto it. Is this doable? Any thoughts? -- Yekrats 18:08, 2 July 2008 (UTC)
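
The proposed curve can be compared with the current one in a short Python sketch. It assumes that "k" means 1000 weighted bytes and that each threshold is inclusive (an article of exactly 4000 bytes earns 2 points); both function names are hypothetical.

```python
from bisect import bisect_right

# Proposed thresholds in weighted bytes; the n-th threshold reached gives n points.
THRESHOLDS = [1000, 4000, 7000, 10000, 14000, 18000, 22000, 26000, 30000]


def granular_points(weighted_size):
    """Points under the proposed curve: the number of thresholds reached."""
    return bisect_right(THRESHOLDS, weighted_size)


def current_points(weighted_size):
    """Points under the current curve: plateaus at 1, 10000 and 30000 bytes."""
    if weighted_size >= 30000:
        return 9
    if weighted_size >= 10000:
        return 4
    if weighted_size >= 1:
        return 1
    return 0


for size in (500, 4000, 9000, 10000, 25000, 30000):
    print(size, current_points(size), granular_points(size))
```

Under these assumptions a 9000-byte article moves from 1 point to 3 and a 25000-byte article from 4 points to 7, while the scores at 10k and 30k are unchanged. An article under 1000 bytes drops from 1 point to 0, which is the labeling issue raised in the reply below.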

The mathematics are doable. But I do have a concern that your breakdown would result in 6 more columns in an already cramped table. And I think you need those columns so people can see how the score is calculated (so it's not some mysterious number). I'm not sure how it would actually be displayed.
The modification would change the whole labeling scheme. An article that is 500 characters couldn't truly be called "Absent". So we would have to dump the current labels and just use number ranges. These numbers are not true characters but weighted characters. So saying that an article is at 4K really means that in Chinese it is about 1000 characters, and going to the next level means adding about 1000 more characters. I liked the labels "Absent", "Stubs", "Articles" and "Long Articles" since they apply to all languages. But I wouldn't care for labels like "Negligible", "Micro-stub", "Tiny-stub", "Stub", "Small article", etc. just to be more granular.
As far as making editors feel like they are making a difference: any improvement gets reflected in the "Average Article Size" column, but it's disconnected from the score. I recently added a link so that if you click on the Average Size number it will display 10 articles that, if either created or enhanced, would almost surely improve the score.
The current logic excludes interwiki links from the size calculation (and also comments).
I agree with your point that some articles would look ridiculous at 30k. For example, en:Measurement, although it sounds important, is really just a definition that would result in a single paragraph. I think we should weed these articles out and only keep ones that can potentially be 30k or more and not look silly. If a topic is that important and essential, there should be enough material. --MarsRover 23:50, 2 July 2008 (UTC)
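
To make the point about excluding interwiki links and comments concrete, here is a minimal Python sketch of one way such a size measurement could work. It is not the script's actual code: the function name and both regular expressions are assumptions, and the interwiki pattern is a simplification that only recognises short lowercase language prefixes.

```python
import re

# HTML comments, possibly spanning several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Interwiki links such as [[de:Eisen]] or [[zh-min-nan:Thih]].
# Simplified assumption: a 2-3 letter lowercase prefix with optional
# hyphenated parts; longer codes such as "simple" are not handled.
INTERWIKI_RE = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)*:[^\]\|]+\]\]\n?")


def measured_size(wikitext):
    """Count characters after dropping comments and interwiki links."""
    text = COMMENT_RE.sub("", wikitext)
    text = INTERWIKI_RE.sub("", text)
    return len(text)


if __name__ == "__main__":
    page = "Iron is a metal.<!-- needs sources -->\n[[de:Eisen]]\n[[uk:Залізо]]\n"
    print(measured_size(page), len(page))
```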

Article names changed

Some of the articles in the list were moved in en.wiki and their names aren't up to date. For example Pyramids of Giza -> Giza pyramid complex, Pieter Brueghel the Elder -> Pieter Bruegel the Elder, and there are many more. Doesn't the bot notice it? --Amir E. Aharoni 20:25, 30 June 2008 (UTC)

The bot can handle one redirect. So, the article list does not have to be exact. I learned the first time I ran it that with 1000 articles there will be at least one renamed article between the time you correct the list and the time you run it again. --MarsRover 22:49, 30 June 2008 (UTC)

why it does not see articles?

List_of_Wikipedias_by_sample_of_articles/Absent_Articles#uk

en:Ramayana, en:St. Peter's Basilica, en:Sugar have uk interwikis, but listed as absent --Ilya K 12:11, 3 July 2008 (UTC)

Also List_of_Wikipedias_by_sample_of_articles/Neglected#uk Says en:Iron(223), but the article is obviously several times longer --Ilya K 12:18, 3 July 2008 (UTC)

The interwiki links for the 3 articles mentioned were added recently. I started the calculation process several days ago (around June 28th) and it just finished July 2nd. So I think it was just bad timing and it will get fixed the next time it's calculated. The Iron article size problem might be a bug in the program with excluding comments from the size calculation. I have to look at it some more. --MarsRover 16:32, 3 July 2008 (UTC)
After looking at the history of uk:Залізо, I found some vandalism that has since been fixed where the text "#Redirect" was in the article. This confused the script and resulted in the low size. Next time the size will be about 6000 characters. There should be no reason other than a real redirect for this text to be in an article, so I didn't change the script. --MarsRover 21:51, 5 July 2008 (UTC)
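
For illustration only, the difference between a loose and a stricter redirect check can be sketched in Python. How the script actually detects redirects is not stated above, so both functions are hypothetical. The stricter one assumes that a redirect marker only counts at the very start of the page text.

```python
import re

# Assumption: a real redirect starts the page, e.g. "#REDIRECT [[Target]]".
REDIRECT_AT_START = re.compile(r"\s*#redirect\s*:?\s*\[\[([^\]\|#]+)",
                               re.IGNORECASE)


def loose_is_redirect(wikitext):
    """Treat any occurrence of '#redirect' as a redirect (easily fooled)."""
    return "#redirect" in wikitext.lower()


def anchored_redirect_target(wikitext):
    """Return the redirect target only if the page starts with a redirect."""
    match = REDIRECT_AT_START.match(wikitext)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    vandalized = "Iron is a chemical element.\n#Redirect\nMore article text."
    print(loose_is_redirect(vandalized), anchored_redirect_target(vandalized))
```

A stray "#Redirect" at the very top of a page would still look like a redirect to both checks, so the stricter check only narrows the problem.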

Queue.Empty

In List of Wikipedias by sample of articles/Absent Articles the string "Queue.Empty" appears in a few places. --Amir E. Aharoni 09:14, 5 July 2008 (UTC)

It means some error occurred reading the wiki article. There was a bug in the script where all types of errors (timeouts, invalid article name, etc.) were converted to the incorrect Queue.Empty message. I fixed it, so next time should be better. --MarsRover 18:56, 5 July 2008 (UTC)
OK. If it's relevant, you may want to know that it appeared in the Hebrew section at Giacomo Puccini. he.wiki had an article about him (he:ג'אקומו פוצ'יני), but is now temporarily deleted for maintenance. (Its main author doesn't like to use "under construction" templates and the community respected his request to edit it in his sandbox.) So, maybe because it's deleted the http response for it is different and it caused Queue.Empty to appear. (Just a wild guess.) --Amir E. Aharoni 10:37, 7 July 2008 (UTC)
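
The kind of bug described above, where every failure surfaces as Queue.Empty, can be illustrated with a short Python sketch. This is an assumption about the general pattern, not the script's real code: the helper name, the use of a worker thread and the `fetch` callable are all hypothetical. The idea is to pass the real exception back through the queue so that only a genuine timeout is reported as one.

```python
import queue
import threading


def fetch_with_reason(fetch, title, timeout=5.0):
    """Run fetch(title) in a worker thread and report the real failure reason."""
    results = queue.Queue()

    def worker():
        try:
            results.put(("ok", fetch(title)))
        except Exception as exc:
            # Hand the actual error back instead of leaving the queue empty.
            results.put(("error", f"{type(exc).__name__}: {exc}"))

    threading.Thread(target=worker, daemon=True).start()
    try:
        return results.get(timeout=timeout)
    except queue.Empty:
        # Only a worker that never answered ends up here.
        return ("error", "timeout")


if __name__ == "__main__":
    def broken_fetch(title):
        raise ValueError("invalid article name")

    print(fetch_with_reason(broken_fetch, "Giacomo Puccini"))
```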

Size Calculation Bug?

The page in ml.wiki corresponding to en:India is ml:ഇന്ത്യ. It is 61KB in size. Removing interwiki, comments and even infobox, templates and category lists places the article at 48.5KB as shown here. However, the bot lists (here) this article as having a size of less than 30KB. There is a similar issue with other pages as well, e.g. the page corresponding to en:United States is 80KB. I wonder how it has been calculated as being of size <30KB. --Jacob 21:33, 9 July 2008 (UTC)

The script measures characters in the article (not bytes). If you select the text of the article and paste it into Microsoft Word you will see it is about 26K characters of text in the statistics window. --MarsRover 23:48, 9 July 2008 (UTC)
Thanks, I got it ! --Jacob 15:33, 10 July 2008 (UTC)
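
The gap between characters and bytes discussed above can be shown with a small Python sketch, assuming the page size is reported in UTF-8 bytes. The function name is hypothetical, and the Malayalam title is written with escapes so that its code points are unambiguous.

```python
def char_and_byte_size(text):
    """Return (characters, UTF-8 bytes) for a piece of article text."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    # The six code points of the ml.wiki title for India.
    india_ml = "\u0d07\u0d28\u0d4d\u0d24\u0d4d\u0d2f"
    print(char_and_byte_size(india_ml))
    print(char_and_byte_size("India"))
```

Each Malayalam code point takes three bytes in UTF-8, while ASCII markup and digits take one, so a mixed page can easily have about twice as many bytes as characters.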

Number of polled Wikipedias

In this list there are 202 Wikipedias. List of Wikipedias has over 250. Why? --Amir E. Aharoni 14:59, 3 August 2008 (UTC)

I limited the list to wikis that have at least 250 total articles. Several are so small they would have a score of zero. But it might be simpler to include them all. --MarsRover 04:48, 17 August 2008 (UTC)

Currency

The list of missing articles incorrectly lists Currency for several Wikipedias, for example Dansk, Eesti and Slovene (I didn't check very thoroughly, but some are OK, such as the German). I don't see any messing with interwiki links in the English article for the period the listing was updated, so I assume there's a problem with the script. Could you check? Thanks, --Yerpo 19:02, 30 September 2008 (UTC)

The problem was that someone mangled the en:Currency article the day before the script was run [1]. The bogus "REDIRECT" statement caused the script to not find the interwiki links. The article has since been fixed. --MarsRover 20:27, 30 September 2008 (UTC)
Ah, that was it, thanks. I was confused because the interwikis themselves weren't touched. --Yerpo 06:11, 1 October 2008 (UTC)

"Last item" issue?

In the latest Update of 25-26 October 2008, the last line in the list became:

264  yo  Yorùbá  1.0*  0    870  0  0  0  0.00  -1.37

while a handful of preceding ones all look like this:

26x  zxx  Name…  1.0*  0  1,009  0  0  0  0.00  +0.00

Imho, missing and present articles should add up to 1009, the total number of articles. This happens to be true for all but the Yorùbá entry. Does not look healthy to me. --Purodha Blissenbach 10:07, 29 October 2008 (UTC)

I corrected yo.wiki in the table. The script had a problem reading the valid articles in yo.wiki because that wiki adds a linefeed in the HTML in an unexpected place. I had to change a line of code in the "pywikipedia" library. --MarsRover 05:14, 30 October 2008 (UTC)

Whale problem?

Hi! I removed Catalan from the listing on this page the other minute, as cawp clearly had the article in question - ca:Balena - which does point out both definitions of the word in Catalan. Now I see Swedish is present in the list as well, with the very same "missing" article. But the iw link to svwp exists at en:Whale, and furthermore svwp contains articles about both definitions of English 'whale' (sv:Valar and sv:Bardvalar). BTW, the listing here contains an "Absent: 1" count for arwp as well, very possibly referring to the missing counterpart of... en:Whale! Do we have a problem?--Paracel63 18:03, 13 November 2008 (UTC)

ca:Balena is actually about the Baleen whale. So iw bots will eventually revert your change. The Swedish article sv:Valar is actually about the Cetacea biological order, and sv:Bardvalar is also about the Baleen whale. Both are different from the en:Whale article, which covers an unscientific subset of Cetacea.
The problem of using the English word 'Whale', which sometimes doesn't have a good foreign-language translation, has been discussed here. It has been decided to switch the entry from en:Whale to en:Cetacea. --MarsRover 20:04, 13 November 2008 (UTC)
Thanks for the input. Yes, I realise there could be a specific article on the concept in Catalan. However, looking at the cawp article discussion on the subject, there doesn't seem to be a clear-cut idea about what it should be called. Similar problem in Swedish, as 'Valar' (plural of 'Val') is representing both English 'Whale' and 'Cetacea'. So I think the English language is the one creating most of the problems here. I thanked Yerpo at your link, for changing 'Whale' for 'Cetacea' at the list. I think this will remove more problems than it will create.--Paracel63 15:27, 15 November 2008 (UTC)

Once more: more differentiation on the small side

I appreciate the above arguments under "A more granular curve". Still, I am not happy with the span from 1-10,000. On some topics 6,000-8,000 bytes already make a meaningful entry (depending on how it is written). Lots of "long" articles essentially carry less information because they are just blown up and chaotic. But in this first span, short articles are just lumped together with mini-stubs that only say "xx is a place in yy and has zz inhabitants". Is it not possible to set a line that will keep most robot-created mini-stubs either out or lower than a short entry? --Kipala 23:44, 27 December 2008 (UTC)

Your example of a "robot-created mini-stub" meets the en.wp definition of a true "stub". So I'm not sure how you would tell whether to disregard it or not. I think you will always find good articles classified as stubs and vice versa. But since the score is generated from a thousand articles, these cases should cancel each other out. If you have a new scoring scheme, I can try it, but I don't think it will change the score much. --MarsRover 08:40, 28 December 2008 (UTC)