User:Dcljr/Article counts


The information on this page is from 2012. The article counts for all Wikipedia, Wiktionary, Wikiquote, Wikisource, Wikinews, Wikiversity, and Wikivoyage languages (but not Wikibooks) were recalculated on 29 March 2015. For more current information, see Article counts revisited.

On May 10, 2012, a bug report requesting that the "updateArticleCount.php" maintenance script be run on all Wiktionaries and Wikisources was acted upon, resulting in 60 of those wikis surpassing or falling below one or more of the article-count milestones tracked at Wikimedia News. Some of the changes were quite large and therefore questionable.

A preliminary investigation revealed only one obvious pattern in the count changes: most Wiktionaries lost articles while most Wikisources gained. The gains can be explained by the fact that most Wikisources now count more namespaces as "content" than they used to; in addition to articles in the main namespace ("ns0"), many Wikisources now count qualifying pages in 1, 2, or 3 additional namespaces (more about this later). The losses were harder to explain.

Neither the gains nor the losses seemed to be related to the writing system the wiki was using (e.g., Latin script vs. Brahmic scripts), whether it was an older or newer wiki, bigger or smaller, and so forth. Most worryingly, it wasn't at all clear whether the new or old counts were "more correct". I (User:Dcljr) tried to estimate the "true" article counts based on random samples of pages at each wiki (or as close to random as could be reasonably achieved). Sometimes the resulting count was closer to the new one, sometimes closer to the old, and sometimes it was right in the middle between them. This (incomplete) preliminary information is collected at Talk:Wikimedia News#May 10 article count updates.

To collect more in-depth and "reliable" information, I wrote a Perl script to download and parse relevant database dumps needed to count the articles for a given wiki. Initially it seemed that the "updateArticleCount.php" script was consistently undercounting articles, but it turns out I was using the wrong (or, more accurately, an out-of-date) definition of what counts as an article. Once I used the right definition, I started to get the same counts as those given by "updateArticleCount.php". (For more context, see bug 37291.)

But more about all that later. First, a summary of how article counts have been determined in the past, how they are determined now, and how the article counts actually changed when the "updateArticleCount.php" script was run on May 10, 2012.

How article counting used to be done

When wiki article counting first began, it was based on whether a page contained a comma or not. This worked fine for the English Wikipedia, but once other projects in other languages started up, people realized that this method would not work for all wikis. A very quick (one week!) discussion and vote was held here at Meta in March 2003, the details of which can be found at:

Based on the results of the vote, it was decided that a page would be counted as an article if it was:

a non-redirect in the main namespace (ns0), containing at least one [[wikilink]]

Unfortunately, the implementation of this definition left a little to be desired, and it ended up counting not only five different types of legitimate wikilinks (1–5 below), but also two types of "false" wikilinks (6 and 7) and one type of non-wikilink (8):

  1. page links: e.g., [[Babel]] or [[Talk:Babel]], etc.
  2. category links: [[Category:Software]]
  3. image/file links: [[File:Yes.png]]
  4. interlanguage links: [[de:Wikipedia:Hauptseite]] or [[:de:Wikipedia:Hauptseite]]
  5. interwiki links: [[species:]]
  6. hidden links: <!-- [[don't look at me]] -->
  7. deactivated links: <nowiki>[[look at me]]</nowiki>
  8. any text containing the string "[[": wikilinks start with "[["...

(Note that links like [[:Category:Software]] and [[:File:Yes.png]], which start with an initial colon, are regular page links of type 1.)

In fact, number 8 describes exactly what was checked for to count a page as an article (assuming it wasn't a redirect and was in the main namespace)!
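The check described above can be sketched in a few lines. This is only an illustration in Python (MediaWiki itself is written in PHP); the function name, its arguments, and the sample page texts are hypothetical, and the real code may have differed in detail.

```python
def counted_as_article_old(wikitext: str, namespace: int, is_redirect: bool) -> bool:
    """Old de facto test: a main-namespace non-redirect whose source contains "[["."""
    return namespace == 0 and not is_redirect and "[[" in wikitext


# Hypothetical page sources, all assumed to be non-redirects in ns0.
samples = {
    "page link": "See [[Babel]].",
    "hidden link": "<!-- [[don't look at me]] -->",
    "deactivated link": "<nowiki>[[look at me]]</nowiki>",
    "no link": "Plain text, with a comma.",
}
results = {name: counted_as_article_old(text, 0, False) for name, text in samples.items()}
```

Under this sketch the hidden and deactivated links count just as well as a real page link, which is the loophole described in this section.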

Eventually, this shortcoming led some wikis to routinely place "hidden" links (of type 6) on their main-namespace pages, just to get them counted as articles.

In June 2006, the $wgContentNamespaces configuration variable was introduced (in revision 14738) to enable namespaces other than the main one (ns0) to count as "content".

At this point, the de facto definition of an article was:

a non-redirect in a content namespace, containing the string "[["

In November 2007, bug 11868 was submitted requesting that links provided by templates be counted, too. In the course of the ensuing discussion, it was pointed out that links other than page links (types 2, 3, etc.) were being counted, and that in fact three different counting methods (all of which started with "non-redirect in a content namespace") were being employed at different places in the code:

  • every time a page was saved, the "[["-string criterion was used to see whether the page would count as an article
  • when the "initStats.php" maintenance script was run, it just checked to see whether the pages were non-empty
  • when the "updateArticleCount.php" maintenance script was run, it checked whether the "pagelinks.sql" table actually contained page links originating from each page in question (type 1 only, but also type 1 links provided by templates)

In addition, when pages were imported into a wiki, the article count was not updated correctly (see bugs 2483, 5703, and 6600).

These inconsistencies allowed the on-wiki article counts (e.g., {{NUMBEROFARTICLES}}) to diverge from the "correct" count (however that was defined!) over time.
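To see how the three methods could drift apart, here is a rough Python sketch that applies each criterion to the same handful of pages. Everything in it is an assumption made for illustration: the `Page` class, the function names, and the sample pages are hypothetical, and every page is taken to be a non-redirect in a content namespace.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str           # raw wikitext as saved
    has_pagelink: bool  # True if the pagelinks table holds a link from this page


def on_save(page: Page) -> bool:
    """Save-time check: the source contains the string "[["."""
    return "[[" in page.text


def init_stats(page: Page) -> bool:
    """initStats.php-style check: the page is non-empty."""
    return page.text != ""


def update_article_count(page: Page) -> bool:
    """updateArticleCount.php-style check: a tracked page link exists."""
    return page.has_pagelink


pages = [
    Page("See [[Babel]].", True),
    Page("[[Category:Software]]", False),
    Page("{{hypothetical template with a link}}", True),
    Page("<!-- [[hidden]] -->", False),
    Page("Plain text.", False),
]
counts = {
    check.__name__: sum(check(p) for p in pages)
    for check in (on_save, init_stats, update_article_count)
}
```

With these five sample pages the three checks give three different totals, even though nothing about the pages themselves changes.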

At some point, the "meat" of the "updateArticleCount.php" script was moved elsewhere.

How article counting is done now

In May 2011, a developer finally acted to "rationalize" the way articles were counted, and in revision 88113 introduced the $wgArticleCountMethod configuration variable to specify which type of (non-empty) content-namespace non-redirect would count as articles: all such pages ("any"), only those containing a true page link ("link"), or only those containing a comma ("comma"). Article.php and SiteStats.php were modified to reflect this change.

So now, assuming $wgArticleCountMethod is set to "link" for a wiki (which it is for all but the English and Portuguese Wikibooks), a page counts as an article (presumably at all places in the MediaWiki code) if it is:

a non-redirect in a content namespace, containing (after parsing) at least one true [[wikilink]] to another page on the same wiki

Note how different this definition is from the one actually in effect before the change was made! Unfortunately, the extreme nature of the change wasn't apparent to most people until the article counts were recalculated on May 10, 2012.

Because of the "after parsing" part of the new definition, one can no longer tell whether a page will count as an article simply by examining its page source; if the page contains templates, it must be fully parsed first in order for any links created by those templates to be accounted for. Fortunately, this is done when pages are saved, so as long as the "pagelinks.sql" database is maintained correctly, the article count should no longer get "out of sync" as it did in the past.
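The three settings of $wgArticleCountMethod can be sketched roughly as follows. This is a Python illustration of the description above, not the actual MediaWiki code: the function name and parameters are hypothetical, and `parsed_link_targets` stands in for whatever page links the parser found after expanding templates.

```python
def counts_as_article(method, namespace, is_redirect, wikitext,
                      parsed_link_targets, content_namespaces=(0,)):
    """Illustrative version of the "any" / "link" / "comma" criteria."""
    if namespace not in content_namespaces or is_redirect:
        return False
    if wikitext == "":
        # The text above says "(non-empty)"; treated here as an assumption.
        return False
    if method == "any":
        return True
    if method == "comma":
        return "," in wikitext
    if method == "link":
        # Only true page links found after parsing count, including
        # links that come from templates.
        return len(parsed_link_targets) > 0
    raise ValueError(f"unknown count method: {method!r}")


hidden = counts_as_article("link", 0, False, "<!-- [[hidden]] -->", [])
templated = counts_as_article("link", 0, False, "{{some template}}", ["Babel"])
comma = counts_as_article("comma", 0, False, "One, two", [])
anything = counts_as_article("any", 0, False, "Plain text", [])
```

Note how the page with only a hidden link no longer qualifies under "link", while the page whose only link comes from a template does.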

Changes to article counts on May 10, 2012

Apart from isolated requests here and there (for example, bug 34184), the article counts of the various Wikimedia content wikis have not been updated to reflect all of these changes in how articles have been counted over time. The May 10 running of "updateArticleCount.php" on all the Wiktionaries and Wikisources was the first concerted effort to "fix" the article counts across an entire project. The changes in article counts seen that day for these two projects are shown in the tables below. (Note that none of these counts are based on database dumps; see key below for details.)

Key for both tables:

  • wiki name – linked to the Main Page of the wiki
  • articles before / articles after – on-wiki article count at c. 00:30 UTC on 2012-05-10 and c. 00:30 UTC on 2012-05-11, respectively (collected via API request, equivalent to {{NUMBEROFARTICLES}} and the count seen at Special:Statistics on the given wiki)
  • change – after minus before
  • pct change – relative change in article count, as a percentage of the "before" count
  • level before / level after – which milestone level (tracked at Wikimedia News) the wiki would be at based on the article count
  • level change – whether there was a change in milestone level
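For concreteness, the "change" and "pct change" columns can be computed as in this small Python sketch. The function name and the sample numbers are made up for illustration; milestone levels are left out because their thresholds are defined at Wikimedia News, not on this page.

```python
def count_change(before: int, after: int):
    """Return (change, pct change), with the percentage relative to 'before'."""
    change = after - before
    pct_change = 100.0 * change / before
    return change, pct_change


# Hypothetical wiki that went from 1,200 to 900 articles.
change, pct_change = count_change(1200, 900)
```

A wiki with a "before" count of zero would need special handling, which this sketch does not attempt.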

Note that the tables are initially shown "collapsed" (to expand one, select the "[show]" link) and are sorted by the "level after" column, then "level before", then (unfortunately) alphabetically by language code. To sort by a different column, click on the "up-down" arrows next to the column heading. For help with sorting on a "secondary sort key", see Help:Sorting#Secondary sortkey.

Wiktionary

Note: 8 Wiktionaries rose up to new milestone levels and 24 fell to lower milestone levels.

Wikisource

Note: 15 Wikisources rose up to new milestone levels and 13 fell to lower milestone levels.

Changes to article counts in other projects

Eventually the article counts will need to be updated on the other Wikimedia wikis. (Note: This happened on 29 March 2015.) The tables below show the changes that would have occurred if the "updateArticleCount.php" script were run on each of the other "content wikis" on the day that wiki's database was most recently dumped (as of the time the tables were filled in).

The columns are almost the same as in the previous section, except that the "articles before" and "articles after" counts are both based on database dumps made on the indicated dates (shown in the "date dumped" column). Each "articles before" count is from the appropriate "site_stats.sql" dump, and is equivalent to the on-wiki count given by an API request for "statistics", by the {{NUMBEROFARTICLES}} magic word, and by Special:Statistics. The "articles after" count is based on parsing the "page.sql" and "pagelinks.sql" dumps, using the current definition of what constitutes an article.
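The dump-based "articles after" count boils down to the following logic, shown here as a Python sketch rather than the original Perl script. It assumes the dumps have already been parsed into simple rows; the function name and the sample data are hypothetical, and the set of content namespaces differs from wiki to wiki.

```python
def count_articles(page_rows, pagelink_sources, content_namespaces):
    """Count content-namespace non-redirects that have at least one page link.

    page_rows: iterable of (page_id, page_namespace, page_is_redirect)
    pagelink_sources: iterable of pl_from values, one per pagelinks row
    """
    linking_pages = set(pagelink_sources)
    return sum(
        1
        for page_id, namespace, is_redirect in page_rows
        if namespace in content_namespaces
        and not is_redirect
        and page_id in linking_pages
    )


# Hypothetical rows standing in for parsed "page.sql" and "pagelinks.sql" dumps.
pages = [(1, 0, False), (2, 0, True), (3, 0, False), (4, 1, False), (5, 102, False)]
link_sources = [1, 1, 2, 4, 5]

main_only = count_articles(pages, link_sources, {0})
with_extra_ns = count_articles(pages, link_sources, {0, 102})
```

Counting an extra namespace as content raises the total here, which is the same effect described earlier for the Wikisources.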

Unlike the tables above, initial sorting for these tables is by "articles before" in reverse numerical order (since no secondary sort key was used, any "ties" are listed in an arbitrary order).

Wikipedia

Note:

  • 3 Wikipedias would rise to new milestone levels and 26 would fall to lower levels.

Wikibooks

Things to note:

Wikiquote

Things to note:

  • No Wikiquotes would rise to new milestone levels and 30 would fall to lower levels.
  • The Alemannic Wikiquote exists as a separate namespace within that language's Wikipedia, so it is not included in this analysis.

Wikinews

Things to note:

  • No Wikinews languages would rise to new milestone levels and 11 would fall to lower levels.
  • The Alemannic Wikinews and Low German/Low Saxon Wikinews exist as separate namespaces within their respective language Wikipedias and so are not included in this analysis.

Wikiversity

Note: 1 Wikiversity would rise to a new milestone level and 2 would fall to lower levels.

Other possible article counting criteria

Clearly there are big differences between the old and new definitions of what constitutes an article. While the new definition may be closer to the original intent of the "Article count reform" voters (although even this is not entirely clear), people have gotten used to the old way of doing things and might be disturbed by large changes in article counts. In particular, some might consider it a "bug" in the new method that, say, category links are no longer considered.

For this reason, it might be time to think about what other criteria could be used to count articles.

For convenience, I repeat here the list of different types of links (now including "template links") that have been used — or could possibly be used — to count articles, along with the associated SQL databases that currently track such links (note that "page.sql" contains the page IDs that each of these other databases refer to):

link type                  database           examples
page (on same wiki)        pagelinks.sql      [[Babel]], [[Talk:Babel]], [[:Category:Software]], [[:Image:Cat.jpg]], [[:File:Cat.jpg]]
category                   categorylinks.sql  [[Category:Software]]
image/file                 imagelinks.sql     [[Image:Cat.jpg]], [[File:Cat.jpg]]
interlanguage              langlinks.sql      [[de:Wikipedia:Hauptseite]], [[:de:Wikipedia:Hauptseite]]
interwiki                  iwlinks.sql        [[species:]], [[wookieepedia:]]
template                   templatelinks.sql  {{fact}}, {{fact|date=June 2012}}
hidden×                    (none)             <!-- [[don't look at me]] -->
deactivated×               (none)             <nowiki>[[look at me]]</nowiki>
any text containing "[["×  (none)             Wikilinks start with two open-brackets (<tt>[[</tt>).

Note ×: Not a real wikilink, so not contained in any "links" database.

Note that a "template link" does not mean a wikilink provided by a template; it simply refers to any {{template call}}, regardless of whether the template provides any wikilinks (or, indeed, any content at all, since the target template may not even exist).

Now for the various definitions of what might constitute an article — all of which should be understood to begin with the phrase "non-redirect in a content namespace, containing at least one…":

  • P = "…page link" (the new definition, used since May 2011)
  • P-C = "…page or category link"
  • P+C = "…page and category link" (i.e., at least one of each — note that this is the only one of these definitions that uses an "intersection" of two criteria)
  • P-C-I = "…page, category, or image/file link"
  • P-C-L = "…page, category, or interlanguage link" (the idea here is that since interlanguage links connect "equivalent" content in different languages, they should be treated similarly to links between articles on the same wiki [i.e., page links])
  • P-C-T = "…page or category link, or any template call" (regardless of whether the template provides any links — the idea behind this definition is that all of these somehow refer to other pages definitely on the same wiki)
  • P-C-I-L-W = "…page, category, image/file, interlanguage, or interwiki link" (this more or less stands in for the "old" way of counting articles, although it's not the same — as explained above, as with all of these dump-based counting methods, it counts wikilinks that are provided via template calls, which the old method couldn't do, and doesn't count hidden or deactivated links, nor any text simply containing the string "[[")
  • P-C-I-L-W-T = "…page, category, image/file, interlanguage, or interwiki link, or any template call"
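Each of these definitions is just a union (or, for P+C, an intersection) of the sets of pages that appear as link sources in the relevant databases. The Python sketch below shows that idea; it is not the original Perl script, and the function name, the single-letter keys, and the sample page IDs are all assumptions made for illustration.

```python
def alternative_counts(candidates, sources):
    """Count articles under each definition.

    candidates: set of page IDs of content-namespace non-redirects
    sources: dict mapping a letter to the set of page IDs with such a link
             (P pagelinks, C categorylinks, I imagelinks, L langlinks,
              W iwlinks, T templatelinks)
    """
    def any_of(letters):
        pages = set()
        for letter in letters:
            pages |= sources.get(letter, set())
        return pages

    definitions = {
        "P": any_of("P"),
        "P-C": any_of("PC"),
        "P+C": sources.get("P", set()) & sources.get("C", set()),
        "P-C-I": any_of("PCI"),
        "P-C-L": any_of("PCL"),
        "P-C-T": any_of("PCT"),
        "P-C-I-L-W": any_of("PCILW"),
        "P-C-I-L-W-T": any_of("PCILWT"),
    }
    return {name: len(candidates & pages) for name, pages in definitions.items()}


# Hypothetical data: six candidate pages; page 7 is not a candidate.
candidates = {1, 2, 3, 4, 5, 6}
sources = {"P": {1, 2}, "C": {2, 3}, "I": {4}, "L": {3, 5}, "W": set(), "T": {6, 7}}
counts = alternative_counts(candidates, sources)
```

Intersecting with `candidates` at the end keeps the common "non-redirect in a content namespace" baseline in one place, so adding another definition only means adding one more entry to the dictionary.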

If anyone wants to suggest Yet Another definition, I can modify my Perl script to use it (as long as it's based on some combination of the database-tracked link types listed above — in particular, I can't use the "comma" counting method, since that would require parsing the actual raw wiki code for each page).

The following table contains alternative article-count statistics, based on the definitions listed above, for several wikis from each project: those that either have shown (Wiktionaries and Wikisources) or would show if updated (the rest) the "most significant" changes in article counts.

To be more specific: the "significance" of a change in article count is defined here as the (absolute) change multiplied by the percent change; this is based on the idea that a large percent change isn't actually significant if it reflects a small actual change in count, and vice-versa. The table includes 5 or 10 wikis from each project (5 each from the smaller Wikinews and Wikiversity projects, 10 each from the rest) that showed the most significant changes in the tables above. (Note, therefore, that these are not the 60 most significant changes overall, because of the "stratified" manner in which they were selected.)
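The selection just described can be sketched in Python as follows. The function names and the sample counts are hypothetical, and taking the absolute value of the percent change as well (so that gains and losses rank alike) is an assumption about how the ranking was done.

```python
def significance(before: int, after: int) -> float:
    """Absolute change multiplied by the (absolute) percent change."""
    change = after - before
    pct_change = 100.0 * change / before
    return abs(change) * abs(pct_change)


def most_significant(rows, per_project):
    """Pick the top wikis within each project separately ("stratified").

    rows: iterable of (project, wiki, before, after)
    per_project: dict mapping project code to how many wikis to keep
    """
    by_project = {}
    for project, wiki, before, after in rows:
        by_project.setdefault(project, []).append((significance(before, after), wiki))
    return {
        project: [wiki for _, wiki in sorted(scored, reverse=True)[:per_project[project]]]
        for project, scored in by_project.items()
    }


# Made-up counts: "bb" doubles in size but only gains 10 articles.
rows = [
    ("WT", "aa", 1000, 500),
    ("WT", "bb", 10, 20),
    ("WT", "cc", 100000, 98000),
    ("WN", "dd", 200, 100),
    ("WN", "ee", 50, 49),
]
selected = most_significant(rows, {"WT": 2, "WN": 1})
```

In this example the wiki with a 100% change ("bb") ranks below one with a 2% change ("cc"), because the latter reflects a far larger absolute change.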

The rows of the table are initially sorted by project (chronologically, by date the project debuted — same order as the other tables above on this page), then alphabetically by language code.

As for the meanings of the columns…

Table key:

  • project – two-letter abbreviation for each project, for sorting purposes (WP = Wikipedia, WT = Wiktionary, etc. — the rest should be sufficiently obvious)
  • wiki name – linked to the Main Page of each wiki
  • dump date – matches the dump date used in the appropriate table above; if the wiki is a Wiktionary or Wikisource, the "before" date is used (i.e., before the on-wiki article counts were fixed)
  • stats articles – on-wiki article counts according to {{NUMBEROFARTICLES}}, etc. (see the discussion of article counts for Wikipedia, etc., above for more explanation)
  • content-ns non-redirs – number of non-redirects in all content namespaces (in other words, pool of "possible articles" based on common "baseline" criterion for all article counting methods)
  • P article count – number of articles based on "P" definition above (this is the current official definition of an article)
  • P-C article count, etc. – number of articles based on alternative definitions listed above
  • (%) – all percents are out of "content-ns non-redirs" and apply to the column immediately to their left (for example, the first "(%)" column is the percent of content-namespace non-redirects that qualified as articles under the "pagelink" criterion)

Note: 58 more wikis to be added to table…

Note: Progress in this section has come to a halt because I lost the working version of my Perl script that counted the articles in these various other ways. Although it's possible I could reconstruct this script, it doesn't look likely in the foreseeable future. Sorry...