2017 Community Wishlist Survey/Miscellaneous/Word count on statistics

From Meta, a Wikimedia project coordination wiki

The survey has concluded.


  • Problem: We haven't had an up-to-date word count since 2014, and this is a basic statistic for measuring Wikipedia's size.
  • Who would benefit: Statistics lovers and everyone who wants to show the size of Wikipedia.
  • Proposed solution: Generate a word count from the dump.
  • More comments:
  • Phabricator tickets:
  • Proposer: Theklan (talk) 22:40, 13 November 2017 (UTC)

Discussion

This is relatively straightforward. We already have the per-article word counts broken out (they are shown in search results); there just isn't a public way to ask for a sum. FWIW, a sum over the en.wikipedia.org content index currently reports 3.049711774E9 (about 3.05 billion) words. EBernhardson (WMF) (talk) 03:17, 18 November 2017 (UTC)

Where did you find this number? -Theklan (talk) 17:57, 20 November 2017 (UTC)
I wrote a custom query against the Elasticsearch cluster to aggregate the stored word count (I'm a developer working on search at WMF). I've put up a patch in code review to integrate this into Special:Statistics. I would expect this to be merged and rolled out sometime in December. This is only the raw word count of pages considered articles, not any of the more advanced things discussed below. EBernhardson (WMF) (talk) 19:09, 28 November 2017 (UTC)
@Theklan: This has now rolled out to all wikis. You can get the counts from the Special:Statistics page. 2601:648:8402:C015:307E:5334:1490:C6B9 19:09, 15 December 2017 (UTC)
@EBernhardson (WMF): Are you sure that this number is correct? The number was considerably higher in 2014 according to Wikistats. -Theklan (talk) 00:57, 16 December 2017 (UTC)
@Theklan: Wikistats may have been calculating something different; I would have to dig into what they counted. This particular count takes the content (main namespace), removes some non-content portions (tables, hatnotes, etc.) and then counts the number of individual words (as determined by tokenization with Lucene, the same tokenization used for full-text search). If we were to include non-content pages, the value would increase from 3.1 billion to 11.3 billion. EBernhardson (WMF) (talk) 17:58, 17 January 2018 (UTC)
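
For readers who want to see the shape of the calculation described above, here is a minimal sketch in Python. It is not the CirrusSearch or Lucene implementation: the regular expressions for tables and hatnotes, the `\w+` tokenizer, and the function names `count_words` and `total_word_count` are all assumptions made for illustration, and real wikitext would need a proper parser. It only shows the three steps: strip some non-content markup, tokenize, and sum the per-article counts.

```python
import re

# Assumed, simplified patterns for illustration only.
# Wikitext tables: everything from "{|" to the first following "|}".
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# A few hatnote-style templates with no nested templates inside them.
HATNOTE_RE = re.compile(r"\{\{\s*(?:hatnote|about|main)\b[^{}]*\}\}", re.IGNORECASE)
# Stand-in tokenizer: runs of word characters. Lucene's analyzers differ.
WORD_RE = re.compile(r"\w+")


def count_words(wikitext: str) -> int:
    """Count word tokens in one article after dropping tables and hatnotes."""
    text = TABLE_RE.sub(" ", wikitext)
    text = HATNOTE_RE.sub(" ", text)
    return len(WORD_RE.findall(text))


def total_word_count(articles) -> int:
    """Sum the per-article word counts over an iterable of wikitext strings."""
    return sum(count_words(text) for text in articles)


if __name__ == "__main__":
    sample = [
        "{{hatnote|See also X}}\nAlpha beta gamma.",
        'Intro text.\n{| class="wikitable"\n| cell one\n|}\nOutro.',
    ]
    print(total_word_count(sample))
```

Because the result depends on what is stripped and how words are tokenized, two tools counting "the same" wiki can legitimately report different totals, which may explain part of the gap with the 2014 Wikistats figure.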

I would suggest taking this further with basic readability statistics. There are various well-established metrics, but even simple things like average words per sentence and syllables per word would be helpful. T.Shafee(Evo﹠Evo)talk 11:02, 18 November 2017 (UTC)
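
As a rough illustration of the simplest metric mentioned here, the sketch below computes average words per sentence in Python. The sentence splitter (split on `.`, `!`, `?`) and the `\w+` tokenizer are assumptions chosen for brevity; they will miscount abbreviations, decimals, and scripts that do not use these punctuation marks. Syllables per word is left out because it needs language-specific rules.

```python
import re

# Assumed, naive definitions of "sentence" and "word" for illustration.
SENTENCE_END_RE = re.compile(r"[.!?]+")
WORD_RE = re.compile(r"\w+")


def average_words_per_sentence(text: str) -> float:
    """Return the mean number of word tokens per sentence, or 0.0 for empty text."""
    # Keep only chunks that contain at least one word.
    sentences = [s for s in SENTENCE_END_RE.split(text) if WORD_RE.search(s)]
    if not sentences:
        return 0.0
    words = sum(len(WORD_RE.findall(s)) for s in sentences)
    return words / len(sentences)


if __name__ == "__main__":
    print(average_words_per_sentence("One two three. Four five!"))
```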

Readability metrics are misleading and bullshit. Source: I built one. --Dispenser (talk) 18:03, 20 November 2017 (UTC)

Note: This idea was also suggested at wikitech-l a few days ago, and a reply pointed out a userscript that does a very simple version. Quiddity (WMF) (talk) 19:52, 20 November 2017 (UTC)

User:Dr pda made a byte and word counter years back and lists issues with counting "article text". The reason why people like word count is "100 words = 1 minute of reading" (without regard to textual difficulty). Naturally excludes infoboxes, tables, images, navboxes, etc. --Dispenser (talk) 21:10, 20 November 2017 (UTC)
Yes, but I don't want a script that measures the word count of a given article; I want the global number of words in the whole Wikipedia project. There's a difference there! -Theklan (talk) 12:03, 21 November 2017 (UTC)

Voting