Talk:List of Wikipedias by expanded sample of articles

From Meta, a Wikimedia project coordination wiki

Stats

I see the mean and median sizes are identical for every WP here. I realize this is statistically possible, but it seems a bit implausible!  :-) I also like the idea of using the alternate language weights. This will be useful. A. Mahoney (talk) 12:25, 21 August 2013 (UTC)[reply]

Oops, that was a mistake. Thanks for spotting it. Boivie (talk) 06:25, 22 August 2013 (UTC)[reply]

Hello, Boivie. How long does your bot make this sheet? :) Zemliakov (talk) 09:38, 6 September 2013 (UTC)[reply]

I don't understand exactly what you're asking for. But it takes a few hours to run the code, and I intend to run it once a month for a while. When I have found time to clean up the messy parts of the code I plan to publish it here somewhere, so it will be easy for someone else to update this page when I no longer do it. Boivie (talk) 16:57, 6 September 2013 (UTC)[reply]

Any ideas why (760*2 + 1453*3 + 7721*4) / 400 ≠ 92.30 for enwiki? Since I see the same for other wikis I have no complaints, but I am curious. --Igel B TyMaHe (talk) 19:18, 17 March 2014 (UTC)

That score should show the percentage of the maximum possible points. The formula as written assumes there are 10000 articles in the list. So it should really be 100 * (stubs*2 + articles*3 + long.articles*4) / (total.items*4). That means enwiki should get 100 * (760*2 + 1453*3 + 7721*4) / (9957*4) = 92.30. Boivie (talk) 20:37, 17 March 2014 (UTC)
16 May 2014. enwiki: 100*(755*2+1457*3+7765*4)/(9957*4) = 92.75 ≠ 92.35. (755*2+1457*3+7765*4)/400 = 92.35. total.items is now 10000? --Igel B TyMaHe (talk) 09:18, 25 May 2014 (UTC)[reply]
Yes, it was 10000 on the 16th of May. I forgot to update the number in the top of the page. Boivie (talk) 19:43, 25 May 2014 (UTC)[reply]
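The corrected formula from this thread can be sketched in Python as follows. This is only an illustration, not the bot's actual code: the function name `score` and its parameter names are assumptions chosen here, and the figures are the enwiki counts quoted above.

```python
def score(stubs, articles, long_articles, total_items):
    """Percent of the maximum possible points (4 points per listed item)."""
    raw = stubs * 2 + articles * 3 + long_articles * 4
    return 100 * raw / (total_items * 4)

# enwiki counts from March 2014, when the list held 9957 items
print(f"{score(760, 1453, 7721, 9957):.2f}")

# Dividing the raw score by 400 silently assumes total_items == 10000
print(f"{(760 * 2 + 1453 * 3 + 7721 * 4) / 400:.2f}")
```

The two printed values differ because the fixed divisor 400 is only correct when the list contains exactly 10000 items, which is the mismatch discussed above.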

"Shortest"[edit]

I wonder what the point of the "shortest articles" listing is. At this scale, it only displays a 200-entry subset of missing articles anyway (except for :enwiki). Perhaps something like the Neglected article list from the List of Wikipedias by sample of articles would be more useful. — Yerpo Eh? 09:56, 31 March 2015 (UTC)

The point is to answer the question (that no one has asked): "If I want to improve my Wikipedia, where should I start?". So I suppose it's similar to the point of the Neglected page. I see some problems with using the Neglected page here. First, I see the Neglected page as a complement to the Absent Articles page: "What can I do besides creating the absent articles?" And here we don't have a page for (all) absent articles, because it would be too large. Secondly, I don't really like the edge factor. It seems to be more focused on improving scores than on improving Wikipedia. But the popularity factor is carried over to this page in a way. The absent articles are sorted with the most popular first, so you get the 200 most popular articles that are absent in each Wikipedia. Popularity is here counted by the number of languages that have the article. Boivie (talk) 12:53, 31 March 2015 (UTC)
Oh, if they are sorted by popularity, then it makes much more sense, yes. Sorry, I didn't look at it too closely, so I thought they were only selected by name or position within the expanded list of articles. — Yerpo Eh? 14:14, 31 March 2015 (UTC)[reply]
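The ordering described above (absent articles sorted by the number of languages that have them, cut off at 200) could look roughly like this. It is a sketch under assumptions: the function `most_popular_absent`, the dictionary layout, and the item names "A" to "D" are hypothetical and are not taken from the script.

```python
def most_popular_absent(wikis_by_item, wiki, limit=200):
    """Return up to `limit` items absent from `wiki`, most widely covered first.

    `wikis_by_item` maps an item name to the set of wikis that have the
    article (a hypothetical in-memory structure used only for this sketch).
    """
    absent = [item for item, wikis in wikis_by_item.items() if wiki not in wikis]
    # Popularity = number of languages that have the article
    absent.sort(key=lambda item: len(wikis_by_item[item]), reverse=True)
    return absent[:limit]

sample = {
    "A": {"enwiki", "dewiki", "frwiki"},
    "B": {"enwiki"},
    "C": {"enwiki", "svwiki"},
    "D": {"enwiki", "dewiki"},
}
print(most_popular_absent(sample, "svwiki", limit=2))
```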

Maithili

I suggest adding :maiwiki to the list, the pywikimedia framework has been finally updated this month so the wiki doesn't register as missing anymore. Plus, the community seems to be quite active. — Yerpo Eh? 07:10, 16 June 2015 (UTC)[reply]

Please update

Please update the list early each month. It would be more useful. --AJITH MS (talk) 17:16, 7 September 2015 (UTC)

I've been trying to update this list the 16th each month. Why would it be more useful if it was updated on another date? Boivie (talk) 05:56, 8 September 2015 (UTC)[reply]

Internet access here is very limited, so we only get it early each month. I understand the situation now. Sorry for my suggestion, and thanks for the information. --AJITH MS (talk) 10:11, 8 September 2015 (UTC)

Gothic Wikipedia

Why the language column for Gothic Wikipedia is ðミフᄇðミフ﾿ðミヘトðミフᄚðミヘツðミフᄚðミフᄊðミフᄈðミフᄚ, and not 𐌲𐌿𐍄𐌹𐍃𐌺 as in List of Wikipedias by sample of articles? Hanif Al Husaini (talk) 13:23, 26 February 2017 (UTC)[reply]

It's a code table issue. I fix the List of Wikipedias by sample of articles by hand every month (but not the sub-pages - see e.g. List of Wikipedias by sample of articles/Stubs). — Yerpo Eh? 15:56, 26 February 2017 (UTC)[reply]

Absent Articles page

It would be helpful if the Absent Articles page (https://meta.wikimedia.org/wiki/List_of_Wikipedias_by_expanded_sample_of_articles/Shortest) could be extended to the wikis in all languages. This would help editors to easily identify missing articles and start them. Currently this page is populated for the first 40 wikis only. — The preceding unsigned comment was added by 132.183.13.69 (talk) 13:30, 5 July 2017 (UTC)

Unfortunately, such a page would be huge, so it is not practically possible. If the community is active and diverse, I encourage someone to figure out how to run the script locally and make a separate list somewhere in the project space. It can be easily modified to show all absent articles for one language. — Yerpo Eh? 16:49, 6 July 2017 (UTC)[reply]
I've done this for Latin -- I set it up as a copy of this list, but with links to the Latin pages if they exist, or a selection of other languages if they don't. See la:Vicipaedia:Paginae_quas_omnibus_Wikipediis_contineri_oportet/Expansio for the list, and see la:Usor:Amahoney/Myrias_epitome for our statistics. I'm happy to share the Perl code if it's useful. A. Mahoney (talk) 16:57, 12 July 2017 (UTC)[reply]

Weights of Chinese wikipedias

I noticed that the weights of zh.wiki and zh-classical.wiki are both 3.786. I think the weight for zh-classical.wiki should be higher, because Classical Chinese uses much shorter sentences to express the same thing.

| Language | Example 1 | Example 2 | Example 3 |
| --- | --- | --- | --- |
| Chinese | 走一千里路,是从迈第一步开始的。(14) | 我怎么能够将你比作夏天? 你比夏天更美丽温婉(20) | 过氧乙酸可以通过乙醛的自氧化反应制得。(18) |
| Classical Chinese | 千里之行,始于足下。(8) | 卿如夏日,载欣载和。 西风列列,众芳独嗟。(16) | 过氧乙酸者,乙醛自氧化制之。(12) |
| English | A journey of a thousand li begins with a single step. | Shall I compare thee to a summer's day? Thou art more lovely and more temperate | Peracetic acid is produced industrially by the autoxidation of acetaldehyde. |

--Leiem (talk) 15:53, 8 July 2018 (UTC)[reply]

Redirects are not counted in the absent column

For example if I click on absent Russian wiki articles the first will be an "elephant". This article doesn't exist and redirects to elephantine. Yanpas (talk) 21:51, 20 July 2018 (UTC)[reply]

The script completely relies on Wikidata, so if a redirect is included there, it will be counted as an article. I'm not sure what's current policy about listing redirects in Wikidata items, but it could probably be removed. In a wider context, it's a problem of content organization. Do we describe organisms in line with the common (usually English) use of their name or in line with taxonomy? We haven't really come to a consensus about it yet. — Yerpo Eh? 06:43, 22 July 2018 (UTC)[reply]

Please help updating this

The list is supposed to be updated around 16 August, but a week later it has still not been updated. Could somebody help update it? Thank you very much. --Yaukasin (talk) 04:47, 23 August 2018 (UTC)

It seems like I don't have time to get the script working on my computer, so I won't be able to keep on updating this list monthly anymore. If someone else would like to run the script and update the list, please do! A version of the script is at List of Wikipedias by expanded sample of articles/Source code. Boivie (talk) 09:59, 23 October 2018 (UTC)[reply]
@Boivie: I have tried to run this with pywikibot, but it seems that the print statements in the code are out of date, and it requires a json module that I don't have. -Theklan (talk) 10:02, 28 October 2018 (UTC)
Yes, that kind of print statement was fine in Python 2, but not in Python 3, which is mostly used nowadays. I don't think it should be too difficult to install a module if you can control your environment. But I can't guarantee that you won't run into more problems along the way. Boivie (talk) 04:29, 31 October 2018 (UTC)
I've taken the liberty of updating the script so that it doesn't return a ton of 'rvslots' notifications and so that it includes language editions that were started since the last update. Unfortunately, I cannot take responsibility for updating both sample rankings, but I can help occasionally. — Yerpo Eh? 14:33, 2 November 2018 (UTC)

Does anyone have any idea why the script would return "Q902 has no wikidata item" and then quit (also: "UnboundLocalError: local variable 'pagetext' referenced before assignment")? I stumbled upon that error with the expanded list for Djibouti (Q977), which is why I didn't update it two weeks ago, but now it's happening with the 1000 list too. I would imagine this to happen if there was a redirect linked, but I clicked through all the interwikis and didn't find any such case. — Yerpo Eh? 18:51, 6 November 2018 (UTC)[reply]

@Yerpo: @Boivie: I can't get it running. I get this error message:
Traceback (most recent call last):
  File "####\core\pwb.py", line 263, in <module>
    if not main():
  File "####\core\pwb.py", line 256, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "####\core\pwb.py", line 121, in run_python_file
    main_mod.__dict__)
  File ".\scripts\ListExpandedSample.py", line 15, in <module>
    import simplejson as json
ModuleNotFoundError: No module named 'simplejson'
<class 'ModuleNotFoundError'>
CRITICAL: Closing network session.
I don't know what to do now. -Theklan (talk) 18:58, 8 November 2018 (UTC)[reply]
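One common way around a missing `simplejson` is to fall back to the standard-library `json` module, which offers the same `loads`/`dumps` calls used below. The following is a generic sketch of that pattern, not a tested patch for ListExpandedSample.py; whether the rest of the script runs unchanged with the standard module is an assumption. It also uses `print()` as a function, which is the form Python 3 requires.

```python
try:
    # Third-party package; can be installed with `pip install simplejson`
    import simplejson as json
except ImportError:
    # Standard-library module, used here as a fallback
    import json

# Small made-up payload, only to show that either module parses it
data = json.loads('{"stubs": 760, "articles": 1453}')
print("stubs:", data["stubs"])
```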

Fixing

Let's fixing.--Jacek Janowski nr2 (talk) 09:08, 16 March 2019 (UTC)[reply]

@Jacek Janowski nr2: It is not clear what you are requesting. - dcljr (talk) 21:19, 10 April 2019 (UTC)[reply]

Redirects for section

How is the size calculated if a redirect points to a section of an article? For example, "Railroad" (d:Q22667) in the Russian Wiki redirects to a section of the "Rail transport" article (d:Q3565868). In this case, is the size of the full article taken into account, or only that of the section? --Rg102 10:55, 12 January 2020 (UTC)

@Регион102: This is intended to be a list of representative articles. If a topic does not warrant a stand-alone article in the majority of languages, it should probably be replaced with a more significant one. Therefore, the code does not slice the text in an attempt to extract relevant content, which could be spread across multiple sections (e.g. references might reside in other sections, etc.). --Dcirovic (talk) 00:51, 16 January 2020 (UTC)
See also the answer above. — Yerpo Eh? 12:31, 16 January 2020 (UTC)[reply]

About "has too much untranslated English"

I found a potential issue in the calculation of the scores. When I reviewed m:List of Wikipedias by expanded sample of articles/Shortest#zh 中文, No. 24 states: "Puppet state 0 Wrong language, zh:傀儡政權 has too much untranslated English." But when I opened zh:傀儡政權, I didn't find any text in English, except in the references. I think the references shouldn't be counted, since we can't avoid using English in references. So I think the reason may be that there are a bunch of images with English names. So I think the calculation should be refined, right? -- Ma3r (talk) 09:08, 26 June 2020 (UTC)

I agree, the main reason for this is the file names of all the images. It would be quite difficult to exclude file names from the English language common words word count. But if someone can do it, I agree that it would be good. Boivie (talk) 15:05, 26 June 2020 (UTC)[reply]

I found it also happens with the Internal Link Assistant (like {{link-en|中文|English}}). I learned it from zh:政治局, which is No. 49 (Politburo 0 Wrong language, zh:政治局 has too much untranslated English) in m:List of Wikipedias by expanded sample of articles/Shortest#zh 中文. So I suggest excluding the text in various templates and template-like items (such as [[File:filename]]) when deciding whether an article "has too much untranslated English". -- Ma3r (talk) 08:15, 28 June 2020 (UTC)

That may be easier said than done, though, since the "File" keyword won't always be in English. One would probably need a list of all the ways to link to a file at Commons. A. Mahoney (talk) 15:09, 28 June 2020 (UTC)[reply]
Actually, what I mean is that we can use different counts for calculating scores and for deciding "has too much untranslated English". For the latter, maybe we can consider ignoring all the text in {{}} and [[]]. Yeah, I know it's much more difficult to do. So this is only a suggestion, offered as an FYI, and it may well be immature. I hope we can solve this issue eventually, and thanks for all the efforts. -- Ma3r (talk) 03:51, 30 June 2020 (UTC)
Can we just count "File"? If so, at least I can replace all "文件" with "File" to avoid this fault. Doing something is better than doing nothing, right? Ma3r (talk) 06:41, 3 April 2023 (UTC)
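The suggestion of ignoring everything inside {{}} and [[]] for the language check could be sketched with regular expressions as below. This is not how the script currently works; `strip_markup` and the sample wikitext are assumptions made for illustration. Because it also drops the visible text of ordinary links, it would only suit the "too much untranslated English" test, not the size count.

```python
import re

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")  # an innermost {{...}} span
LINK = re.compile(r"\[\[[^\[\]]*\]\]")    # an innermost [[...]] span

def strip_markup(wikitext):
    """Remove {{...}} and [[...]] spans, repeating until nested ones are gone."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE.sub("", wikitext)
        wikitext = LINK.sub("", wikitext)
    return wikitext

# Made-up wikitext with a link template and a file link that nests another link
sample = "傀儡政權{{link-en|中文|English}}[[File:Example map.png|thumb|說明[[連結]]]]是一種政權。"
print(strip_markup(sample))
```

Matching only innermost spans and looping avoids having to know the localized "File" keyword, which addresses the concern raised above about file links not always using the English prefix.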

About score values, 16 Dec. 2020

I manually recalculated the values in the table and have some questions. The table gives score values of 82.46 and 82.47 for the Russian and Chinese Wikipedias, respectively. These are the formulas for the calculation:

rawscore = stubs*2 + articles*3 + long_articles*4

score = rawscore / (total_items * 0.04)

Here are my calculations:

For Russian:

stubs = 2,227

articles = 2,526

long_articles = 5,238

rawscore = 2227*2 + 2526*3 + 5238*4 = 32984

score = 32984 / 40000 * 100 = 82.46

For Chinese:

stubs = 2,136

articles = 1,857

long_articles = 5,778

rawscore = 2136*2 + 1857*3 + 5778*4 = 32955

score = 32955 / 40000 * 100 = 82.39

I see a contradiction between the actual table values and the results of my recalculation. In the table, the Chinese wiki has a higher score than the Russian one, although my recalculated value for Chinese is lower. How can that be? Maybe I misunderstand the algorithm?

P.S. I recalculated the values for the French wiki and they are right.

P.P.S. In the previous month there was the same situation with the Chinese Wikipedia. Below you can see the calculations:

stubs = 2,151

articles = 1,858

long_articles = 5,759

rawscore = 2151*2 + 1858*3 + 5759*4 = 4302 + 5574 + 23036 = 32912

score = 32912 / 40000 * 100 = 82.28

But the actual value in the table was 82.36. Ковалевич Тимофей (talk) 18:22, 17 December 2020 (UTC)[reply]

I believe there is a bug in the code for pages with "Wrong language, ... has too much untranslated English." See List of Wikipedias by expanded sample of articles/Shortest#zh_中文. I don't remember exactly, but I think those articles are not included in the total number of articles for the calculation. So, if you have 10 "untranslated" articles, the maximum score isn't 4*10000, but 4*9990. That leads to the percentage getting higher than it should be. Boivie (talk) 10:31, 22 December 2020 (UTC)[reply]
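The effect of a shrinking denominator can be checked numerically. The sketch below reuses the formula from this thread; the function name `score` is an assumption, and the figure of 10 excluded items is only the example number from the comment above, not a confirmed count. With that assumption, the zhwiki counts quoted above give 82.47 for December 2020, which matches the table value.

```python
def score(stubs, articles, long_articles, total_items):
    """Percent of the maximum possible points (4 points per counted item)."""
    raw = stubs * 2 + articles * 3 + long_articles * 4
    return 100 * raw / (total_items * 4)

zh = (2136, 1857, 5778)  # zhwiki counts quoted above, December 2020

# Full list of 10000 items as the denominator
print(f"{score(*zh, total_items=10000):.4f}")

# Hypothetical: 10 "wrong language" items left out of the total
print(f"{score(*zh, total_items=9990):.4f}")
```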
Note that the article page states that the 8000 are characters, NOT bytes (...weighted size in characters...), and that the count does not include comments within the text or any interwiki text at the end of the article. These aspects make the result differ from a size computed in bytes. Best regards, --Uruk (talk) 18:24, 4 April 2023 (UTC)
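The difference between characters and bytes is easy to see in Python. This is a small illustration only: the sample string and the comment-stripping pattern are assumptions, not the script's own code.

```python
import re

text = "千里之行<!-- editor note -->"

# Drop HTML comments before measuring, as the count is said to ignore them
visible = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)

print(len(visible))                  # size in characters
print(len(visible.encode("utf-8")))  # size in bytes when stored as UTF-8
```

For scripts such as Chinese, where each character typically takes three bytes in UTF-8, the two measures diverge considerably.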

Update

Greetings everyone,

It's been almost a month since the last update. Has anyone ever experienced a wait this long?

Just curious if this is something that happens very often.

Боки (talk) 22:06, 10 October 2022 (UTC)[reply]

Yes, it happens every month. Boivie (talk) 14:08, 11 October 2022 (UTC)[reply]

Adding non-existing articles for the calculation of the median and the mean

Hello! Some Wikipedias with very few articles in the list have a better result on the "mean article size" and "median", because non-existing articles are not counted. I think they should be counted as 0 bytes, so the statistics are fairer for languages with more articles. Imagine that a language creates every article in the list but all of them are around 5,000 bytes. Its mean and median would be far below those of languages with only 50-60 articles done, but longer ones. What do you think? Theklan (talk) 21:05, 15 October 2022 (UTC)
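The proposal can be illustrated with a short sketch. The function `size_stats`, its parameters, and the article sizes are made up for this example and do not come from the script.

```python
from statistics import mean, median

def size_stats(sizes, total_items, count_absent=False):
    """Mean and median article size; optionally count absent articles as size 0."""
    if count_absent:
        sizes = list(sizes) + [0] * (total_items - len(sizes))
    return mean(sizes), median(sizes)

existing = [30000, 40000, 50000]  # made-up sizes: 3 of 10 listed articles exist

print(size_stats(existing, total_items=10))
print(size_stats(existing, total_items=10, count_absent=True))
```

With absent articles padded in as zeros, both the mean and the median drop sharply for a wiki that covers only a small part of the list, which is the effect the proposal is after.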

Changing color code to Viridis

Dear @Dcirovic. Two months ago I proposed a change to the List of Wikipedias by sample of articles color code. You can see the discussion and code here. I think that Viridis is better because it gives better information for color blind people, and the two lists would then present their information more uniformly. What do you think about changing this in the code? Thanks. Theklan (talk) 09:40, 7 April 2023 (UTC)

@Theklan: In my opinion the Viridis colors are ugly. The color blind people could use the first column with ordinal numbers instead of colors. However, if the user community overwhelmingly prefers those unsightly colors, I will implement them. --Dcirovic (talk) 14:19, 7 April 2023 (UTC)[reply]
I think there's no "overwhelming community" in this discussion. The color scheme is more logical than the current one, where green and orange don't have the meaning they usually have. Theklan (talk) 14:49, 7 April 2023 (UTC)