User:Isaac (WMF)/Analysis gotchas

From Meta, a Wikimedia project coordination wiki

This is a (likely incomplete) compendium of all the ways in which we find the correct API or HDFS table, build the query and analysis, and feel great about our accomplishment, only to learn that we weren't aware of some oddity of the MediaWiki landscape and are missing a bunch of data (or are swimming in false positives). 'Tis a shame and I'm hoping that this list slowly dwindles down to nothing. There is no particular order to what is below.

Redirects

Any given article in Wikipedia might have many alternative titles that will redirect the reader to the canonical article name. How these redirects are accounted for in pageview counts varies, though, and can have a large impact on pageview analyses. See this short research paper[1] for more details on why accounting for redirects is important. Here's how to handle it for various data sources:

  • wmf.webrequest (and any HDFS derived tables)
    • Page ID is post-redirect -- i.e. you can count who viewed a given article by grouping by page ID regardless of what redirects / page moves happened
    • Page title reflects redirects -- i.e. you can see what links people requested by using page title (if you want to study redirect-specific trends) but generally page ID is the much better choice for joins, aggregation, etc.
  • Pageviews API
    • Articles are grouped only by title so redirects have to be compiled and separately queried
    • Subscribe to T159046 for updates on when the API might support grouping by page ID
  • Pageviews Tool
    • The redirects=1 parameter will automatically gather pageviews for the article and all associated redirects -- e.g., [1]
      • This still breaks on the langviews tool and presumably others for querying pageviews from old titles but works pretty well -- the pageID fix to the underlying API would presumably lead to support here as well.
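
Because the Pageviews API groups by title, a redirect-aware count means querying the canonical title and each redirect title, then summing the results. The following is only a minimal sketch: it assumes the per-article REST endpoint path shown in `BASE`, and it assumes the redirect titles have already been compiled separately. The function names and the sample counts are hypothetical.

```python
from urllib.parse import quote

# Assumed base path of the per-article Pageviews REST endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, title, start, end, access="all-access", agent="user"):
    """Build a daily per-article Pageviews API URL for a single title."""
    # Titles use underscores instead of spaces and must be percent-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{article}/daily/{start}/{end}"


def total_views(views_by_title, canonical, redirects):
    """Sum views for the canonical title plus all of its redirect titles."""
    return sum(views_by_title.get(t, 0) for t in [canonical, *redirects])


titles = ["COVID-19 pandemic", "2019-20 coronavirus outbreak"]
urls = [pageviews_url("en.wikipedia", t, "20200301", "20200307") for t in titles]

# Hypothetical per-title counts, as if already fetched from the URLs above.
counts = {"COVID-19 pandemic": 900, "2019-20 coronavirus outbreak": 100}
combined = total_views(counts, titles[0], titles[1:])
```

Grouping by page ID in wmf.webrequest-derived tables avoids this bookkeeping entirely, so the sketch is mainly relevant when only the API is available.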

Page Moves

Page moves are a special case of redirects where the canonical title is changed for a given article -- e.g., the "2019-20 Coronavirus Outbreak" becomes the "2019-20 Coronavirus Pandemic". This should have no impact on pageview analyses if you're already handling redirects as described above, but it can affect many other data sources / analyses too. How to handle for various data sources:

  • The "easy" fix to handling page moves is to use page IDs instead of titles as the primary key for any article-level analyses. Page IDs do not change in page moves and therefore are stable identifiers for articles -- excepting page deletions and page merges, which are relatively infrequent or at least not generally associated with "high-importance" articles.
    • For instance, with the MediaWiki API, use either the pageids parameter (which most endpoints support) or the titles parameter together with the redirects parameter in case the title you are using is out-of-date.
  • wmf.wikidata_item_page_link HDFS table
    • This table maintains a mapping of page title, page ID, and wikidb to wikidata item ID. Joining in Wikidata IDs supports language agnostic modeling / research and is an increasingly important component of our research. Using page ID when joining against this table is best practice, but there are some caveats to also be aware of that relate to how the table is built:
      • The creation of this table relies on a join between titles from two separate datasets, which is fragile in that the data source for sitelinks must have the same vintage as the data source of page titles + page IDs, which is not currently true. This causes issues for snapshots that are later in the month as one data source is from the start of the month and the other is from the end of the month. See task T249773 for up-to-date information on the status of fixing this.
  • Gathering the page move history of an article is not trivial, but the following query on the wmf.mediawiki_page_history table works pretty well where the page_title is the current title for that snapshot and the page_title_historical is the title at the time of the move:
 SELECT page_id, page_title, caused_by_event_type, start_timestamp, page_title_historical
   FROM wmf.mediawiki_page_history
  WHERE snapshot = '2020-03' AND page_id = <page-id> AND wiki_db = <wiki-db> AND page_namespace = <page-namespace> AND page_is_redirect != true
  ORDER BY start_timestamp
  LIMIT 10000;
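
As a rough illustration of the MediaWiki API advice above, the sketch below builds the query-string parameters for an action=query request in both styles. The helper names and the page IDs are hypothetical, and the request itself is left out so the example stays offline.

```python
def query_params_by_id(page_ids):
    """Parameters for an action=query request keyed on stable page IDs."""
    return {
        "action": "query",
        "format": "json",
        "prop": "info",
        # Multiple values are separated with a pipe character.
        "pageids": "|".join(str(i) for i in page_ids),
    }


def query_params_by_title(titles):
    """Parameters keyed on titles; redirects=1 asks the API to resolve
    redirects, which helps when a title predates a page move."""
    return {
        "action": "query",
        "format": "json",
        "prop": "info",
        "titles": "|".join(titles),
        "redirects": 1,
    }


by_id = query_params_by_id([111, 222])  # hypothetical page IDs
by_title = query_params_by_title(["2019-20 Coronavirus Outbreak"])
```

Either dictionary can be passed as the query string to a wiki's api.php endpoint with whatever HTTP client is at hand.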

Unidentified Bots

When analyzing readership, we generally are only interested in what people are reading and not what pages are being scraped by programs. Out of all the server requests to view a given article, we must filter out requests associated with these bots. Wikimedia's bot detection is fairly straightforward -- flag all bots that appropriately identify themselves and identify some additional likely-automated traffic via a few heuristics. This presumably catches most bot traffic but some still slips through. At an aggregate level, this is not of high concern, but unfortunately this bot traffic often concentrates on single pages and so can lead to very odd spikes in traffic for certain articles. This can cause seemingly random articles to appear in top-k lists, affect recommender systems that use pageviews as a feature, and skew reader behavior metrics. For more details, see these discussions of bots:

There are a few simple-ish but relatively effective approaches that can further help filter this traffic:

  • Remove pages that get <5% or >95% of pageviews from mobile.
    • Bots tend to use either desktop or mobile but not both, whereas real human readership is usually split across the two.
  • Use Wikimedia app views only.
    • Bots don't use the app, so app views are almost guaranteed to be human.
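
The mobile-share heuristic above can be sketched as follows. This assumes per-page desktop and mobile view counts have already been aggregated; the function name, the input shape, and the sample numbers are all hypothetical.

```python
def likely_human_pages(views, low=0.05, high=0.95):
    """Keep pages whose mobile share of pageviews lies within [low, high].

    `views` maps page ID -> (desktop_views, mobile_views).
    Returns a dict of page ID -> mobile share for the pages that are kept.
    """
    kept = {}
    for page_id, (desktop, mobile) in views.items():
        total = desktop + mobile
        if total == 0:
            continue  # no traffic at all, so there is nothing to judge
        share = mobile / total
        if low <= share <= high:
            kept[page_id] = share
    return kept


sample = {
    101: (400, 600),  # mixed traffic, kept
    102: (5000, 10),  # almost entirely desktop, dropped
    103: (3, 997),    # almost entirely mobile, dropped
}
kept = likely_human_pages(sample)
```

Low-traffic pages can land outside the thresholds by chance, so a minimum-views cutoff before applying the filter may be worth considering.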

Page protections

Articles can be put under a variety of editing restrictions with the effect of substantially reducing vandalism and, often as collateral damage, good-faith editing on those pages. When analyzing edit behavior at the article level, it is therefore important to account for which articles are under the various forms of protection. Hill and Shaw[2] demonstrate the importance of accounting for page protections in research. Page protection statistics can be gathered from a few places:

  • MediaWiki API (Info)
  • page_restrictions MediaWiki table
    • This table is also provided as a .sql.gz file with the monthly dumps
  • In HDFS, there is the wmf_raw.mediawiki_page_restrictions table, which contains monthly snapshots of the MediaWiki table from MariaDB. There is also a protections event log at event.mediawiki_page_restrictions_change to see when protections were applied, but it's hard to get all the information (e.g., expiration date) out of that table.
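
For the API route, a minimal sketch of requesting and reading protection data is below. It assumes the Info module is queried with inprop=protection and that the response follows the default JSON layout, where pages are keyed by page ID. The helper names and the sample response are hypothetical and hand-written to mirror that assumed layout.

```python
def protection_params(page_ids):
    """Parameters for an action=query request that includes protection info."""
    return {
        "action": "query",
        "format": "json",
        "prop": "info",
        "inprop": "protection",
        "pageids": "|".join(str(i) for i in page_ids),
    }


def protections(response):
    """Map page ID -> {restriction type: (level, expiry)}."""
    result = {}
    for page in response.get("query", {}).get("pages", {}).values():
        result[page["pageid"]] = {
            p["type"]: (p["level"], p["expiry"])
            for p in page.get("protection", [])
        }
    return result


# Hypothetical response, shaped like the assumed default JSON format.
sample_response = {
    "query": {
        "pages": {
            "123": {
                "pageid": 123,
                "title": "Example",
                "protection": [
                    {"type": "edit", "level": "autoconfirmed", "expiry": "infinity"},
                    {"type": "move", "level": "sysop", "expiry": "infinity"},
                ],
            },
            "456": {"pageid": 456, "title": "Unprotected example", "protection": []},
        }
    }
}
result = protections(sample_response)
```

An empty dictionary for a page means no restrictions were reported at query time; it says nothing about past protections, for which the snapshots or event log are needed.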

Article Content (text or links)

When you're doing research / analyses that depend on the text or links in a page, there are two sources that you can go to that can vary somewhat substantially in the data they provide: wikitext dumps and the fully parsed page. The difference arises from the usage of templates and modules to dynamically add content to Wikipedia articles. A very good overview of this difference can be found in Mitrevski et al.[3] where they compare wikitext to article HTML for the entire history of English Wikipedia. The wikitext is on average missing half of the links in the actual article (anecdotally, these are generally metadata templates), though those links are clicked on substantially less by readers so their importance is case-dependent. The pagelinks table holds the links from the fully parsed page (with the caveat that if links from a transcluded template are changed, these changes will not be reflected in the pagelinks table until an edit is made to the article).

  • APIs
    • Both the raw wikitext and parsed HTML can be extracted via the Parse API.
    • Just the links in parsed HTML can be gathered via the pagelinks API.
  • Dumps
    • Wikitext: the dumps store the wikitext history for all (non-deleted) pages. This is by far the easiest way to access Wikipedia content in bulk, though it is missing many links as noted above.
      • A hack to get partway between wikitext and fully parsed is to gather the wikitext from static templates (e.g., navigation templates that sit at the end of articles) and include that when the template is transcluded
    • Parsed: the full English Wikipedia history through 1 March 2019 is available on Zenodo.
    • Pagelinks: the pagelinks table is available as a sql dump.
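
To make the wikitext gap concrete, here is a rough sketch that pulls link targets straight out of wikitext with a regular expression. The pattern is a simplification I am assuming for illustration, not a full wikitext parser: it ignores nested markup and does not filter namespace prefixes such as File: or Category:. The sample text and template name are hypothetical.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def wikitext_links(wikitext):
    """Return the set of link targets written explicitly in the wikitext."""
    return {m.group(1).strip() for m in WIKILINK.finditer(wikitext)}


sample = "'''Foo''' is a [[Bar|bar]] related to [[Baz#History]]. {{Navbox foo}}"
links = wikitext_links(sample)
```

Any links that the hypothetical `{{Navbox foo}}` template would render are absent from the result, which is exactly the kind of link that only shows up in the parsed HTML or the pagelinks table.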

Sub-articles

Many concepts, depending on their importance and information that is available, are broken into multiple articles. See Lin et al. [4] for a detailed study of this issue. This has two major impacts:

  • If you are interested in the level of attention to specific topics e.g., Barack Obama, then you shouldn't just measure pageviews to the Barack Obama article but also you should include pageviews to the Early life and career of Barack Obama, Family of Barack Obama, etc.
  • If you are looking for content that exists in one language but not another -- e.g., for the purposes of recommending that that content be translated or more basic research -- you need to understand that sitelinks do not always capture that there might be a full subarticle in one language but only a section in another.
    • This can maybe be addressed by determining what templates are used for identifying subarticles and assuming that if a given article is a subarticle in any language, the content is too specific to recommend for creation and likely already described in a section of a more general article (but not properly sitelinked).
    • See task T207406#5890503 for more details.

Page merges

I'm pretty sure that page merges are also relatively rare and I suspect usually one of the pages in the merge is much more developed than the other, so "throwing out" the history / pageviews associated with the article that is being merged into the other is unlikely to affect analyses. The source page in the merge is then turned into a redirect so the above guidelines apply.

Page / Revision deletions

Page and revision deletions are relatively rare so I suspect they don't affect too many analyses. They may, however, be pertinent to research around harassment, disinformation, or other sensitive topics. The only way to get deleted page content is via the API and with special user rights (e.g., Researcher). More details / statistics:

Querying web request data on Special pages

The Special namespace is a virtual namespace. This means that pages in that namespace do not have page IDs because they do not exist in MediaWiki's page database table. Secondly, their names are localized: it's "Special:CreateAccount" in English, "Especial:Crear una cuenta" in Spanish, "Spécial:Créer un compte" in French, and so on. Lastly, these pages can be accessed through two different URL paths. In English, both /wiki/Special:CreateAccount and /w/index.php?title=Special:CreateAccount lead to the account creation page.

Querying pages in the Special namespace in the webrequest dataset is facilitated by the x_analytics_map field having the key special set to the canonical English name of the special page. For example, a request to Especial:Crear una cuenta on Spanish Wikipedia will have x_analytics_map["special"] set to "CreateAccount". Below is an example Presto query demonstrating this:

 SELECT
     date(from_iso8601_timestamp(dt)) AS log_date,
     access_method,
     agent_type,
     count(1) AS num_page_views
 FROM wmf.webrequest
 WHERE webrequest_source = 'text' -- as opposed to 'upload'
 AND year = 2022
 AND month = 5
 AND day = 3
 AND normalized_host.project_family = 'wikipedia'
 AND normalized_host.project = 'es'
 AND element_at(x_analytics_map, 'special') = 'CreateAccount'
 AND http_status IN ('200', '304') -- 'success' used in Pageview Definition
 AND content_type LIKE '%text/html%' -- also defines a "pageview"
 GROUP BY date(from_iso8601_timestamp(dt)), access_method, agent_type

Stubs and Spikes

It can be very difficult to choose a measure of central tendency for analyses of content on the wikis. The content of some wikis is dominated by stub articles creating a long-tail that will greatly skew metrics about pageviews per article, average number of sections, etc. towards zero. When these stub articles are predominantly created by bots and receive relatively few pageviews, it is hard to argue that they are representative of the wiki. Conversely, some wikis have articles that receive a massive amount of attention or edits or are extremely long, biasing metrics to much higher values than are typical for that wiki.

Likely, the most appropriate metrics are going to be some sort of truncated mean. The spikes are easier -- even just removing the topmost 1% of data should greatly diminish power-law dynamics while retaining most of the data. The stubs are much harder and the right threshold for removing them is likely wiki-dependent. For example, wikis with large numbers of bot-generated articles might need the bottommost 90% of content to be thrown out to get to a more representative measure while others shouldn't throw out any content as it's all human-generated. This challenge suggests that rather than identifying a percentage of content to remove, it would be better to split articles on wikis into three categories and report metrics individually for each (categories should be calculated exclusively and spikes calculated first):

  • Spikes: content created by either bots or humans in the top 1% for a metric -- e.g., edits or pageviews. This is clearly important content but it is not expected that most articles will ever become this popular.
  • Bot-generated: articles that were generated by bots (many of which are likely stubs though some may have grown organically)
  • Standard human-generated: the bread-and-butter of the wikis. An article created by a user or anonymous editor that receives some normal amount of attention.
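
The three-way split can be sketched as below, with spikes assigned first so that the categories stay exclusive. This assumes each article already carries a metric value and a flag for whether a bot created it. The function name, the field names, and the synthetic data are hypothetical.

```python
def categorize(articles, metric="pageviews", spike_percent=1):
    """Split articles into exclusive 'spike', 'bot' and 'standard' groups.

    `articles` is a list of dicts with 'page_id', a metric value and a
    'bot_created' flag. Spikes (the top `spike_percent` percent by metric)
    are assigned first, regardless of who created the article.
    """
    ranked = sorted(articles, key=lambda a: a[metric], reverse=True)
    # Integer arithmetic avoids floating-point surprises at the cutoff.
    n_spikes = len(ranked) * spike_percent // 100
    groups = {"spike": [], "bot": [], "standard": []}
    for rank, article in enumerate(ranked):
        if rank < n_spikes:
            groups["spike"].append(article["page_id"])
        elif article["bot_created"]:
            groups["bot"].append(article["page_id"])
        else:
            groups["standard"].append(article["page_id"])
    return groups


# Synthetic data: 200 articles, even-numbered ones marked as bot-created.
articles = [
    {"page_id": i, "pageviews": i, "bot_created": i % 2 == 0}
    for i in range(1, 201)
]
groups = categorize(articles)
```

Metrics such as a median or mean can then be reported per group rather than for the wiki as a whole.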

Below is related data on what proportion of each wiki are articles with only 0 or 1 sections (a proxy for stubs) and what percentage of articles don't receive pageviews on any given day (a complementary proxy for stubs). Eventually, it would be good to have data on bot-generated articles but that is a more complicated analysis.

SQL query for generating table below
WITH wikipedia_projects AS (
    SELECT DISTINCT
      dbname,
      SUBSTR(hostname, 0, LENGTH(hostname) - 4) AS project
     FROM wmf_raw.mediawiki_project_namespace_map
     WHERE
       snapshot = '2021-02'
       AND hostname LIKE '%wikipedia%'
),
stub_counts AS (
    SELECT
      wiki_db,
      COUNT(1) AS num_pages,
      SUM(IF(num_headings <= 1, 1, 0)) AS num_stubs
    FROM isaacj.qual_features
    GROUP BY
      wiki_db
),
pages_with_pageviews AS (
    SELECT COUNT(DISTINCT(pv.page_id)) AS num_pages_with_pvs,
           wp.dbname AS wiki_db
      FROM wmf.pageview_hourly pv
     INNER JOIN wikipedia_projects wp
           ON (pv.project = wp.project)
     WHERE year = 2021
           AND month = 2
           AND day = 15
           AND namespace_id = 0
           AND agent_type = 'user'
     GROUP BY wp.dbname
)
SELECT sc.wiki_db,
       sc.num_pages,
       sc.num_stubs,
       pv.num_pages_with_pvs
  FROM stub_counts sc
  LEFT JOIN pages_with_pageviews pv
       ON (sc.wiki_db = pv.wiki_db)
ORDER BY num_pages DESC
wikidb # articles # stubs (<2 sections) % stubs # articles with daily pageviews % daily unseen
enwiki 6260556 1501117 24.0% 4427190 29.3%
cebwiki 5546111 3282539 59.2% 319163 94.2%
svwiki 3398512 1958366 57.6% 403634 88.1%
dewiki 2543038 506616 19.9% 1458139 42.7%
frwiki 2304459 333644 14.5% 1217089 47.2%
nlwiki 2046933 1433262 70.0% 537226 73.8%
ruwiki 1703247 263286 15.5% 1014094 40.5%
itwiki 1677064 244919 14.6% 862455 48.6%
eswiki 1609903 202110 12.6% 937099 41.8%
plwiki 1460488 554146 37.9% 702992 51.9%
warwiki 1265000 1041815 82.4% 169816 86.6%
viwiki 1261985 142433 11.3% 186334 85.2%
jawiki 1256122 133662 10.6% 950616 24.3%
arzwiki 1206509 12804 1.1% 20446 98.3%
zhwiki 1180437 413934 35.1% 476579 59.6%
arwiki 1104446 291506 26.4% 324650 70.6%
ukwiki 1077033 165429 15.4% 282264 73.8%
ptwiki 1058222 462339 43.7% 535078 49.4%
fawiki 771124 179654 23.3% 258039 66.5%
cawiki 672521 206156 30.7% 256662 61.8%
srwiki 643311 68081 10.6% 119783 81.4%
idwiki 563587 270466 48.0% 252159 55.3%
nowiki 551441 157286 28.5% 260682 52.7%
kowiki 534857 179061 33.5% 256617 52.0%
fiwiki 504132 158314 31.4% 290441 42.4%
huwiki 484462 50942 10.5% 246044 49.2%
cswiki 475348 88225 18.6% 277824 41.6%
shwiki 454726 68874 15.1% 105740 76.7%
zh_min_nanwiki 430768 144640 33.6% 33152 92.3%
rowiki 417277 139657 33.5% 185909 55.4%
trwiki 393166 131572 33.5% 241732 38.5%
euwiki 368295 72591 19.7% 125974 65.8%
cewiki 353901 9302 2.6% 2622 99.3%
mswiki 347140 137007 39.5% 133066 61.7%
eowiki 293038 129918 44.3% 123296 57.9%
hewiki 289617 40871 14.1% 169349 41.5%
hywiki 281603 59117 21.0% 42161 85.0%
bgwiki 269533 72385 26.9% 104225 61.3%
ttwiki 265785 17596 6.6% 4642 98.3%
dawiki 265146 107347 40.5% 164181 38.1%
azbwiki 240077 134198 55.9% 3836 98.4%
skwiki 236047 84114 35.6% 100897 57.3%
kkwiki 232242 93437 40.2% 39270 83.1%
minwiki 224563 164103 73.1% 4189 98.1%
etwiki 216990 97603 45.0% 124040 42.8%
hrwiki 210879 58018 27.5% 122595 41.9%
bewiki 201783 105828 52.4% 17996 91.1%
ltwiki 199105 74960 37.6% 92822 53.4%
elwiki 189017 45425 24.0% 101323 46.4%
simplewiki 183418 104358 56.9% 109879 40.1%
azwiki 178772 52397 29.3% 76909 57.0%
glwiki 171663 27154 15.8% 79110 53.9%
slwiki 171521 47722 27.8% 90149 47.4%
urwiki 163782 28825 17.6% 18564 88.7%
nnwiki 157495 77725 49.4% 87517 44.4%
hiwiki 149556 61822 41.3% 59043 60.5%
kawiki 149466 62127 41.6% 43914 70.6%
thwiki 142601 41208 28.9% 85489 40.1%
uzwiki 139916 56106 40.1% 36729 73.7%
tawiki 139574 39806 28.5% 37328 73.3%
lawiki 134936 67404 50.0% 77598 42.5%
cywiki 132611 30375 22.9% 48164 63.7%
vowiki 126354 93610 74.1% 6974 94.5%
mkwiki 113258 32523 28.7% 24308 78.5%
astwiki 108296 15181 14.0% 46400 57.2%
zh_yuewiki 107931 60798 56.3% 26888 75.1%
lvwiki 106263 44890 42.2% 52135 50.9%
bnwiki 104447 13559 13.0% 43933 57.9%
mywiki 102874 85981 83.6% 7827 92.4%
tgwiki 102867 11598 11.3% 5135 95.0%
afwiki 96861 28018 28.9% 49721 48.7%
mgwiki 93801 17708 18.9% 30390 67.6%
sqwiki 91160 38834 42.6% 40729 55.3%
ocwiki 86657 33630 38.8% 47572 45.1%
bswiki 85100 13695 16.1% 55816 34.4%
ndswiki 82479 60219 73.0% 19069 76.9%
kywiki 80802 49606 61.4% 14627 81.9%
be_x_oldwiki 73484 22216 30.2% 9149 87.5%
mlwiki 73178 20322 27.8% 33226 54.6%
newwiki 73046 3826 5.2% 1665 97.7%
tewiki 70773 10542 14.9% 17096 75.8%
mrwiki 70768 37390 52.8% 17686 75.0%
brwiki 69389 32649 47.1% 41683 39.9%
vecwiki 67315 4369 6.5% 32251 52.1%
pmswiki 65780 47440 72.1% 17459 73.5%
jvwiki 62818 30576 48.7% 19213 69.4%
htwiki 62483 5800 9.3% 7349 88.2%
pnbwiki 61321 33155 54.1% 3590 94.1%
swwiki 60857 26677 43.8% 43955 27.8%
suwiki 60788 35255 58.0% 10561 82.6%
lbwiki 59368 29882 50.3% 35752 39.8%
tlwiki 58487 34939 59.7% 29815 49.0%
bawiki 55679 2588 4.6% 3240 94.2%
gawiki 54763 38718 70.7% 20972 61.7%
szlwiki 53097 52309 98.5% 2574 95.2%
iswiki 52077 29732 57.1% 24769 52.4%
cvwiki 45779 4986 10.9% 4364 90.5%
lmowiki 45566 21120 46.4% 26104 42.7%
fywiki 45319 21386 47.2% 27829 38.6%
scowiki 42582 24951 58.6% 26928 36.8%
wuuwiki 41464 35595 85.8% 4388 89.4%
diqwiki 39948 17912 44.8% 18924 52.6%
anwiki 39551 17099 43.2% 23671 40.2%
kuwiki 38575 23626 61.2% 8938 76.8%
pawiki 37310 16190 43.4% 5685 84.8%
yowiki 33614 29061 86.5% 11883 64.6%
newiki 32167 10675 33.2% 4549 85.9%
barwiki 31631 11080 35.0% 13479 57.4%
iowiki 30521 19069 62.5% 17278 43.4%
guwiki 29677 21720 73.2% 5415 81.8%
ckbwiki 29093 11498 39.5% 5752 80.2%
alswiki 27698 4584 16.5% 20105 27.4%
knwiki 27608 10384 37.6% 10200 63.1%
nostalgiawiki 27375 26722 97.6% -- --
scnwiki 26421 19952 75.5% 17500 33.8%
bpywiki 25249 1102 4.4% 1519 94.0%
iawiki 23127 19862 85.9% 12681 45.2%
quwiki 23031 5977 26.0% 14600 36.6%
mnwiki 22141 9844 44.5% 7952 64.1%
siwiki 20628 12400 60.1% 7297 64.6%
bat_smgwiki 16997 15024 88.4% 2163 87.3%
nvwiki 16651 16523 99.2% 408 97.5%
sdwiki 15765 8937 56.7% 1437 90.9%
xmfwiki 15685 9463 60.3% 5143 67.2%
orwiki 15637 1938 12.4% 1787 88.6%
cdowiki 15513 9887 63.7% 1915 87.7%
amwiki 15398 13087 85.0% 3738 75.7%
ilowiki 15390 6596 42.9% 11993 22.1%
gdwiki 15332 6817 44.5% 9965 35.0%
yiwiki 15223 7727 50.8% 3009 80.2%
napwiki 14736 10049 68.2% 11939 19.0%
sahwiki 14565 7779 53.4% 2566 82.4%
maiwiki 14485 2365 16.3% 1095 92.4%
bugwiki 14191 14134 99.6% 586 95.9%
wawiki 13891 7597 54.7% 5393 61.2%
map_bmswiki 13781 11396 82.7% 3785 72.5%
hsbwiki 13765 3784 27.5% 4516 67.2%
pswiki 13671 7809 57.1% 1647 88.0%
mznwiki 13562 3174 23.4% 1276 90.6%
fowiki 13559 5445 40.2% 7966 41.2%
liwiki 13209 6776 51.3% 10107 23.5%
oswiki 12942 8190 63.3% 1774 86.3%
frrwiki 12675 5644 44.5% 7680 39.4%
emlwiki 12656 9406 74.3% 4950 60.9%
avkwiki 12420 1819 14.6% 1917 84.6%
acewiki 12348 11932 96.6% 1629 86.8%
gorwiki 11864 11106 93.6% 1030 91.3%
bowiki 11726 8284 70.6% 1320 88.7%
sawiki 11643 4929 42.3% 1536 86.8%
bclwiki 11011 4751 43.1% 5617 49.0%
zh_classicalwiki 10666 6548 61.4% 2477 76.8%
mrjwiki 10527 8466 80.4% 650 93.8%
mhrwiki 10321 3320 32.2% 3032 70.6%
hifwiki 10125 7494 74.0% 3552 64.9%
kmwiki 10107 5302 52.5% 4362 56.8%
hakwiki 9525 4257 44.7% 1672 82.4%
roa_tarawiki 9314 8242 88.5% 7838 15.8%
testwiki 9227 6000 65.0% 56 99.4%
pamwiki 8985 3852 42.9% 3440 61.7%
crhwiki 8895 6505 73.1% 1929 78.3%
hywwiki 8853 1615 18.2% 1029 88.4%
shnwiki 8798 1937 22.0% 567 93.6%
nsowiki 8356 5866 70.2% 4287 48.7%
aswiki 8164 1192 14.6% 3129 61.7%
ruewiki 8073 2815 34.9% 3376 58.2%
sewiki 7954 4833 60.8% 3457 56.5%
zuwiki 7659 7193 93.9% 2980 61.1%
hawiki 7616 4994 65.6% 3192 58.1%
lijwiki 7608 1632 21.5% 5084 33.2%
ugwiki 7606 4748 62.4% 1317 82.7%
bhwiki 7437 3655 49.1% 2405 67.7%
vlswiki 7384 3452 46.7% 5954 19.4%
tkwiki 7308 4145 56.7% 4258 41.7%
miwiki 7205 3773 52.4% 1917 73.4%
nds_nlwiki 7203 3092 42.9% 5672 21.3%
nahwiki 7170 3567 49.7% 4165 41.9%
sowiki 7137 4847 67.9% 4617 35.3%
scwiki 7085 5444 76.8% 3469 51.0%
snwiki 7074 5341 75.5% 3624 48.8%
vepwiki 6658 2261 34.0% 2701 59.4%
ganwiki 6505 3115 47.9% 1333 79.5%
banwiki 6475 1575 24.3% 2631 59.4%
glkwiki 6455 5107 79.1% 644 90.0%
myvwiki 6408 2185 34.1% 843 86.8%
abwiki 6237 1380 22.1% 2586 58.5%
kabwiki 6115 4101 67.1% 2907 52.5%
cowiki 5973 2341 39.2% 4271 28.5%
satwiki 5862 744 12.7% 472 91.9%
fiu_vrowiki 5786 3991 69.0% 2298 60.3%
iewiki 5548 4083 73.6% 2558 53.9%
kvwiki 5522 2677 48.5% 709 87.2%
csbwiki 5404 3600 66.6% 1511 72.0%
pcdwiki 5172 2010 38.9% 1439 72.2%
aywiki 5139 1066 20.7% 1003 80.5%
udmwiki 5050 3885 76.9% 740 85.3%
gvwiki 5043 3258 64.6% 3594 28.7%
pagwiki 4946 2890 58.4% 1183 76.1%
zeawiki 4774 1967 41.2% 2306 51.7%
lfnwiki 4677 3798 81.2% 3095 33.8%
frpwiki 4613 2352 51.0% 1750 62.1%
lowiki 4584 3014 65.8% 956 79.1%
nrmwiki 4581 2221 48.5% 2721 40.6%
kwwiki 4539 4000 88.1% 2202 51.5%
dvwiki 4314 2835 65.7% 946 78.1%
lezwiki 4198 1122 26.7% 723 82.8%
gomwiki 4195 1582 37.7% 618 85.3%
gnwiki 4134 2396 58.0% 1981 52.1%
mwlwiki 4111 2038 49.6% 1093 73.4%
stqwiki 4107 2694 65.6% 2846 30.7%
olowiki 3903 2244 57.5% 1812 53.6%
szywiki 3858 1250 32.4% 488 87.4%
mtwiki 3772 1094 29.0% 2257 40.2%
rmwiki 3762 2249 59.8% 2542 32.4%
awawiki 3710 3314 89.3% 263 92.9%
dtywiki 3604 1474 40.9% 730 79.7%
ladwiki 3586 1580 44.1% 1968 45.1%
bjnwiki 3584 1925 53.7% 1581 55.9%
arywiki 3571 1088 30.5% 908 74.6%
furwiki 3556 2029 57.1% 2326 34.6%
koiwiki 3505 1953 55.7% 367 89.5%
extwiki 3420 2315 67.7% 1469 57.0%
angwiki 3374 2857 84.7% 1348 60.0%
dsbwiki 3311 1654 50.0% 1489 55.0%
lnwiki 3304 2081 63.0% 1059 67.9%
cbk_zamwiki 3243 1037 32.0% 1796 44.6%
piwiki 3216 1034 32.2% 782 75.7%
tyvwiki 3180 1186 37.3% 381 88.0%
kshwiki 2905 1571 54.1% 1885 35.1%
gagwiki 2888 1507 52.2% 1168 59.6%
pflwiki 2716 244 9.0% 920 66.1%
avwiki 2587 1141 44.1% 677 73.8%
hawwiki 2429 2038 83.9% 551 77.3%
lgwiki 2425 2215 91.3% 418 82.8%
gcrwiki 2378 382 16.1% 802 66.3%
xalwiki 2321 1334 57.5% 802 65.4%
rwwiki 2219 1828 82.4% 863 61.1%
igwiki 2214 923 41.7% 1000 54.8%
bxrwiki 2198 1231 56.0% 732 66.7%
papwiki 2193 1752 79.9% 1270 42.1%
zawiki 2116 1679 79.3% 1017 51.9%
pdcwiki 2103 1930 91.8% 866 58.8%
krcwiki 2074 717 34.6% 508 75.5%
test2wiki 2041 1316 64.5% 32 98.4%
kaawiki 2040 1679 82.3% 890 56.4%
kbpwiki 1916 1763 92.0% 474 75.3%
arcwiki 1811 1706 94.2% 678 62.6%
novwiki 1801 1180 65.5% 1222 32.1%
towiki 1753 1348 76.9% 592 66.2%
inhwiki 1722 1177 68.4% 417 75.8%
jamwiki 1720 1651 96.0% 1089 36.7%
tcywiki 1691 362 21.4% 461 72.7%
wowiki 1671 1156 69.2% 880 47.3%
tpiwiki 1664 1565 94.1% 999 40.0%
kbdwiki 1612 759 47.1% 269 83.3%
kiwiki 1612 1543 95.7% 441 72.6%
tetwiki 1586 664 41.9% 1176 25.9%
nawiki 1580 1378 87.2% 882 44.2%
akwiki 1571 1303 82.9% 449 71.4%
atjwiki 1470 763 51.9% 497 66.2%
xhwiki 1415 1103 78.0% 591 58.2%
lldwiki 1414 824 58.3% 918 35.1%
biwiki 1407 1376 97.8% 466 66.9%
mdfwiki 1355 1133 83.6% 211 84.4%
mnwwiki 1343 415 30.9% 233 82.7%
jbowiki 1334 863 64.7% 935 29.9%
tywiki 1332 1107 83.1% 329 75.3%
roa_rupwiki 1279 964 75.4% 650 49.2%
kgwiki 1271 1203 94.6% 746 41.3%
lbewiki 1257 1044 83.1% 220 82.5%
omwiki 1197 966 80.7% 821 31.4%
srnwiki 1188 830 69.9% 374 68.5%
fjwiki 1156 1115 96.5% 551 52.3%
smwiki 1038 790 76.1% 739 28.8%
ltgwiki 1008 619 61.4% 438 56.5%
nqowiki 992 579 58.4% 194 80.4%
chrwiki 972 428 44.0% 501 48.5%
stwiki 959 887 92.5% 413 56.9%
gotwiki 957 891 93.1% 436 54.4%
klwiki 869 460 52.9% 721 17.0%
pihwiki 850 774 91.1% 627 26.2%
tnwiki 844 510 60.4% 288 65.9%
nywiki 830 606 73.0% 369 55.5%
twwiki 791 697 88.1% 183 76.9%
chywiki 783 752 96.0% 190 75.7%
cuwiki 780 537 68.8% 448 42.6%
bmwiki 759 576 75.9% 368 51.5%
tswiki 729 492 67.5% 526 27.8%
tumwiki 723 704 97.4% 305 57.8%
rmywiki 716 575 80.3% 368 48.6%
rnwiki 715 692 96.8% 168 76.5%
ikwiki 674 650 96.4% 299 55.6%
iuwiki 634 592 93.4% 366 42.3%
kswiki 569 539 94.7% 266 53.3%
adywiki 566 350 61.8% 334 41.0%
sswiki 560 181 32.3% 393 29.8%
chwiki 547 499 91.2% 242 55.8%
pntwiki 523 265 50.7% 383 26.8%
vewiki 451 436 96.7% 264 41.5%
eewiki 388 327 84.3% 231 40.5%
tiwiki 373 349 93.6% 182 51.2%
ffwiki 368 281 76.4% 226 38.6%
dinwiki 305 285 93.4% 131 57.0%
sgwiki 295 274 92.9% 129 56.3%
dzwiki 295 226 76.6% 203 31.2%
crwiki 175 167 95.4% 97 44.6%

Referral Data

See Research:Referrer.

Reverts (Patrolling and Vandalism)

Main article: Research:Revert

Revert(ed) edits on the Wikimedia projects can comprise between 5 and 20% of non-bot edits and thus have substantial impact on many edit-related analyses. Accounting for them -- either via filtering or just separating out -- can be simple, but no method is perfect and the best approach is multi-faceted as described below.

A few notes:

  • I use reverts as a general term here but any revert actually has two components: the edit(s) that were reverted and the edit that did the revert. Any approach that identifies reverts should simultaneously give both of these, though, so further distinguishing is not necessary.
  • While reverts are often associated with vandalism (and subsequent patrolling), there are actually many reasons why editors might revert an edit that have nothing to do with those activities, such as self-reverts or just standard collaboration.[5] Various additional heuristics can be employed to deal with these -- e.g., limiting the time between edits to be counted as reverts, excluding self-reverts by requiring at least two different editors be involved in a revert, potentially adding a filter for self-revert in the edit summary.

There are several complementary approaches to detecting reverts.[6] Generally I combine the first two methods (identity and edit tags) -- i.e. a revert is any set of edits identified as a revert by either shasums or edit tags:

  • Identity-based reverts: a common form of reverting is to return the page to the exact state it previously was in. All edits have a shasum pre-calculated that appears in various data sources, so this is often very cheap to detect.[7] This approach is language-agnostic so will work for any Wikipedia language edition.
  • Tool-based reverts: Wikipedians can use various tools / buttons within the UI to make reverts. When this happens, the tool also adds an edit tag. The tags can be found in a special dump where they're called change tags -- e.g., enwiki-latest-change_tag.sql.gz for English Wikipedia dumps -- and cross-compared with revision IDs from the dumps.[8] This approach is actually language-agnostic too once you identify the right edit tag(s). These generally should be mw-rollback, mw-undo, and mw-manual-revert for edit tags that are doing the revert of past edits and edits that were reverted would be under mw-reverted. You can manually look for these tags by going to https://<langcode>.wikipedia.org/wiki/Special:Tags where e.g., "en" would replace "<langcode>" for English etc.
  • Self-declared reverts: editors (or tools) often indicate when they are reverting an edit within the edit summary. While imperfect, this approach can capture edits where someone manually reverted a past edit by making a new edit but also made other changes, which the other approaches would miss. This approach is not language-agnostic though and could easily lead to false positives.
  • Model-based: research has explored more probabilistic approaches for identifying reverts, which could be useful depending on the application.[9]
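
A minimal sketch of combining the first two methods is below. It is not the mwreverts implementation: the look-back radius of 15 revisions, the function names, and the input shapes are my own assumptions, and the sample history is made up. Only the tag names come from the list above.

```python
# Edit tags that mark the edit doing the revert (per the list above).
REVERTING_TAGS = {"mw-rollback", "mw-undo", "mw-manual-revert"}


def identity_reverts(revisions, radius=15):
    """Detect identity reverts in one page's chronological history.

    `revisions` is a list of (rev_id, sha1) tuples, oldest first.
    Returns (reverting_id, reverted_ids, reverted_to_id) tuples.
    """
    reverts = []
    for i, (rev_id, sha1) in enumerate(revisions):
        start = max(0, i - radius)
        # Skip the immediate predecessor: an identical checksum there
        # means nothing was undone. Scan the most recent matches first.
        for j in range(i - 2, start - 1, -1):
            if revisions[j][1] == sha1:
                reverted = [r for r, _ in revisions[j + 1:i]]
                reverts.append((rev_id, reverted, revisions[j][0]))
                break
    return reverts


def reverting_ids(revisions, tags_by_rev, radius=15):
    """Union of reverting edits found by checksum and by edit tag."""
    by_identity = {r[0] for r in identity_reverts(revisions, radius)}
    by_tag = {
        rev_id for rev_id, tags in tags_by_rev.items()
        if REVERTING_TAGS & set(tags)
    }
    return by_identity | by_tag


# Made-up history: revision 3 restores the content of revision 1.
history = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
found = identity_reverts(history)
combined = reverting_ids(history, {4: ["mw-undo"]})
```

Self-reverts and time limits are not handled here; those heuristics would need user and timestamp fields on each revision.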

References

  1. Hill, Benjamin Mako; Shaw, Aaron (August 2014). "Consider the Redirect: A Missing Dimension of Wikipedia Research" (PDF). OpenSym '14. doi:10.1145/2641580.2641616. Retrieved 19 May 2020. 
  2. Hill, Benjamin Mako; Shaw, Aaron (August 2015). "Page protection: another missing dimension of Wikipedia research" (PDF). OpenSym '15. doi:10.1145/2788993.2789846. Retrieved 19 May 2020. 
  3. Mitrevski, Blagoj; Piccardi, Tiziano; West, Robert (21 April 2020). "WikiHist.html: English Wikipedia's Full Revision History in HTML Format" (PDF). ICWSM 2020. Retrieved 19 May 2020. 
  4. Lin, Yilun; Yu, Bowen; Hall, Andrew; Hecht, Brent (February 2017). "Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia" (PDF). CSCW '17. doi:10.1145/2998181.2998274. Retrieved 19 May 2020. 
  5. Geiger, R. Stuart; Halfaker, Aaron (2017-12-06). "Operationalizing Conflict and Cooperation between Automated Software Agents in Wikipedia: A Replication and Expansion of 'Even Good Bots Fight'". Proceedings of the ACM on Human-Computer Interaction 1 (CSCW): 49:1–49:33. doi:10.1145/3134684. 
  6. Details on their overlap (for English Wikipedia) can be found at task T266374.
  7. The Python library mwreverts can be used to do this for you and this notebook has examples.
  8. For analyzing, see the mwsql Python library and this example notebook.
  9. Flöck, Fabian; Vrandečić, Denny; Simperl, Elena (2012-06-25). "Revisiting reverts: accurate revert detection in wikipedia". Proceedings of the 23rd ACM conference on Hypertext and social media. HT '12 (New York, NY, USA: Association for Computing Machinery): 3–12. ISBN 978-1-4503-1335-3. doi:10.1145/2309996.2310000.