User:Isaac (WMF)/Analysis gotchas
This is a (likely incomplete) compendium of all the ways in which we find the correct API or HDFS table and build the query and analysis and are feeling great about our accomplishment just to learn that we weren't aware of some oddity of the Mediawiki landscape and are missing a bunch of data (or are swimming in false positives). ‘Tis a shame and I'm hoping that this list slowly dwindles down to nothing. There is no particular order to what is below.
Redirects
Any given article in Wikipedia might have many alternative titles that redirect the reader to the canonical article name. How these redirects are accounted for in pageview counts varies, though, and can have a large impact on pageview analyses. See this short research paper[1] for more details on why accounting for redirects is important. Here's how to handle it for various data sources:
- wmf.webrequest (and any HDFS derived tables)
- Page ID is post-redirect -- i.e. you can count who viewed a given article by grouping by page ID regardless of what redirects / page moves happened
- Page title reflects redirects -- i.e. you can see what links people requested by using page title (if you want to study redirect-specific trends) but generally page ID is the much better choice for joins, aggregation, etc.
- Pageviews API
- Articles are grouped only by title so redirects have to be compiled and separately queried
- Subscribe to T159046 for updates on when this change might be made
- Pageviews Tool
- The redirects=1 parameter will automatically gather pageviews for the article and all associated redirects -- e.g., [1]
- This still breaks on the langviews tool and presumably others for querying pageviews from old titles but works pretty well -- the pageID fix to the underlying API would presumably lead to support here as well.
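Because the Pageviews API groups by title, one way to approximate per-article counts is to query the canonical title and each redirect title separately and sum the results. The sketch below only builds the request URLs and adds up already-fetched responses. It assumes the list of redirect titles has been compiled separately (for instance via the MediaWiki API), and the helper names pageviews_url and total_views are hypothetical.

```python
from urllib.parse import quote

# Public per-article endpoint of the Wikimedia Pageviews REST API.
PAGEVIEWS_ENDPOINT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, title, start, end, access="all-access", agent="user"):
    """Build the daily per-article URL for one title (dates as YYYYMMDD strings)."""
    # Titles use underscores instead of spaces and must be URL-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_ENDPOINT}/{project}/{access}/{agent}/{article}/daily/{start}/{end}"


def total_views(responses):
    """Sum the 'views' field over the parsed JSON responses for all titles."""
    return sum(item["views"] for resp in responses for item in resp.get("items", []))


# Hypothetical usage: one URL for the canonical title plus one per redirect.
titles = ["Barack Obama", "Obama"]  # second title stands in for a compiled redirect
urls = [pageviews_url("en.wikipedia", t, "20200301", "20200307") for t in titles]
```

Each URL would then be fetched with an HTTP client of your choice and the parsed JSON passed to total_views. A title with no views in the range may return an error rather than an empty list, so real code should handle that case.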
Page Moves
Page moves are a special case of redirects where the canonical title is changed for a given article -- e.g., the "2019-20 Coronavirus Outbreak" becomes the "2019-20 Coronavirus Pandemic". This should have no impact on pageview analyses if you're already handling redirects as described above, but it can affect many other data sources / analyses too. How to handle them for various data sources:
- The "easy" fix to handling page moves is to use page IDs instead of titles as the primary key for any article-level analyses. Page IDs do not change in page moves and therefore are stable identifiers for articles -- excepting page deletions and page merges, which are relatively infrequent or at least not generally associated with "high-importance" articles.
- For instance, with the MediaWiki API, either use the pageids parameter (which most endpoints support) or pair the titles parameter with the redirects parameter in case the title you are using is out-of-date.
- wmf.wikidata_item_page_link HDFS table
- This table maintains a mapping of page title, page ID, and wikidb to wikidata item ID. Joining in Wikidata IDs supports language agnostic modeling / research and is an increasingly important component of our research. Using page ID when joining against this table is best practice, but there are some caveats to also be aware of that relate to how the table is built:
- The creation of this table relies on a join between titles from two separate datasets, which is fragile in that the data source for sitelinks must have the same vintage as the data source of page titles + page IDs, which is not currently true. This causes issues for snapshots that are later in the month as one data source is from the start of the month and the other is from the end of the month. See task T249773 for up-to-date information on the status of fixing this.
- Gathering the page move history of an article is not trivial, but the following query on the wmf.mediawiki_page_history table works pretty well where the page_title is the current title for that snapshot and the page_title_historical is the title at the time of the move:
SELECT
    page_id,
    page_title,
    caused_by_event_type,
    start_timestamp,
    page_title_historical
FROM wmf.mediawiki_page_history
WHERE snapshot = '2020-03'
    AND page_id = <page-id>
    AND wiki_db = <wiki-db>
    AND page_namespace = <page-namespace>
    AND page_is_redirect != true
ORDER BY start_timestamp
LIMIT 10000;
Unidentified Bots
When analyzing readership, we generally are only interested in what people are reading and not what pages are being scraped by programs. Out of all the server requests to view a given article, we must filter out requests associated with these bots. Wikimedia's bot detection is fairly straightforward -- flag all bots that appropriately identify themselves and identify some additional likely-automated traffic via a few heuristics. This presumably catches most bot traffic but some still slips through. At an aggregate level, this is not of high concern, but unfortunately this bot traffic often concentrates on single pages and so can lead to very odd spikes in traffic for certain articles. This can cause seemingly random articles to appear in top-k lists, affect recommender systems that use pageviews as a feature, and skew reader behavior metrics. For more details, see these discussions of bots:
There are a few simple-ish but relatively effective approaches that can further help filter this traffic:
- Remove pages that get <5% or >95% of pageviews from mobile.
- Bots tend to use either desktop or mobile but not both, whereas traffic from real humans to a page usually comes from a mix of the two.
- Use Wikimedia app views only.
- Bots don't use the app so the app views are almost guaranteed to be human
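The mobile-share heuristic above can be sketched as a simple filter over per-page counts. This is a minimal illustration: the input format (a dict of page to (desktop_views, mobile_views)) and the function names are assumptions, and the 5% / 95% thresholds are the ones suggested in the list.

```python
def mobile_share(desktop_views, mobile_views):
    """Fraction of a page's views that came from mobile, or None if it has no views."""
    total = desktop_views + mobile_views
    return mobile_views / total if total else None


def looks_automated(desktop_views, mobile_views, low=0.05, high=0.95):
    """Flag pages whose traffic is almost entirely desktop or almost entirely mobile."""
    share = mobile_share(desktop_views, mobile_views)
    return share is not None and (share < low or share > high)


def filter_pages(views_by_page, low=0.05, high=0.95):
    """Keep only pages whose mobile share falls inside [low, high]."""
    return {
        page: counts
        for page, counts in views_by_page.items()
        if not looks_automated(counts[0], counts[1], low, high)
    }


# Hypothetical counts: page "A" is nearly all desktop, page "B" is mixed.
example = {"A": (1000, 10), "B": (400, 600)}
kept = filter_pages(example)
```

Note that a page with only a handful of views will often land at 0% or 100% mobile by chance, so it may make sense to apply this filter only to pages above some minimum traffic level.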
Page protections
Articles can be put under a variety of editing restrictions with the effect of substantially reducing vandalism and, often as collateral damage, good-faith editing on those pages. When analyzing edit behavior at the article level, it is therefore important to account for which articles are under the various forms of protection. Hill and Shaw[2] demonstrate the importance of accounting for page protections in research. Page protection statistics can be gathered from a few places:
- Mediawiki API (Info)
- page_restrictions Mediawiki table
- This table is also provided as a .sql.gz file with the monthly dumps
- In HDFS, there is the wmf_raw.mediawiki_page_restrictions table, which contains monthly snapshots of the MediaWiki table from MariaDB. There is also a protections event log at event.mediawiki_page_restrictions_change to see when protections were applied, but it's hard to get all the information (e.g., expiration date) out of that table.
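For the MediaWiki API route, protection details come back from the query module's info property when inprop=protection is set. The sketch below builds such a request URL and flattens the protection entries from a parsed response. The helper names are hypothetical, the sample response is an illustrative stand-in for a real reply, and the code assumes formatversion=2 (where pages is a list).

```python
from urllib.parse import urlencode


def protection_query_url(lang, titles):
    """Build an Action API URL asking for protection info on the given titles."""
    params = {
        "action": "query",
        "prop": "info",
        "inprop": "protection",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def parse_protections(response):
    """Map each page title to a list of (type, level, expiry) protection tuples."""
    result = {}
    for page in response.get("query", {}).get("pages", []):
        result[page["title"]] = [
            (p["type"], p["level"], p["expiry"]) for p in page.get("protection", [])
        ]
    return result


# Illustrative response shape only; not copied from a live API call.
sample = {
    "query": {
        "pages": [
            {
                "pageid": 1,
                "title": "Example",
                "protection": [
                    {"type": "edit", "level": "autoconfirmed", "expiry": "infinity"}
                ],
            }
        ]
    }
}
protections = parse_protections(sample)
```

Unprotected pages simply map to an empty list, which makes it easy to separate protected from unprotected articles before analyzing edit behavior.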
Article Content (text or links)
When you're doing research / analyses that depend on the text or links in a page, there are two sources that you can go to that can vary somewhat substantially in the data they provide: wikitext dumps and the fully parsed page. The difference arises from the usage of templates and modules to dynamically add content to Wikipedia articles. A very good overview of this difference can be found in Mitrevski et al.[3] where they compare wikitext to article HTML for the entire history of English Wikipedia. The wikitext is on average missing half of the links in the actual article (anecdotally, these are generally metadata templates), though those links are clicked on substantially less by readers so their importance is case-dependent. The pagelinks table holds the links from the fully parsed page (with the caveat that if links from a transcluded template are changed, these changes will not be reflected in the pagelinks table until an edit is made to the article).
- APIs
- Both the raw wikitext and parsed HTML can be extracted via the Parse API.
- Just the links in parsed HTML can be gathered via the pagelinks API.
- Dumps
- Wikitext: the dumps store the wikitext history for all (non-deleted) pages. This is by far the easiest way to access Wikipedia content in bulk, though it is missing many links as noted above.
- A hack to get partway between wikitext and fully parsed is to gather the wikitext from static templates (e.g., navigation templates that sit at the end of articles) and include that when the template is transcluded
- Parsed: the full English Wikipedia revision history in HTML, through 1 March 2019, is available on Zenodo.
- Pagelinks: the pagelinks table is available as a sql dump.
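To make the wikitext-versus-parsed gap concrete, the sketch below pulls explicit [[...]] link targets out of raw wikitext and separately lists the templates that are transcluded. Any links those templates generate are invisible to the first function, which is the gap described above. This is a crude regex-based illustration, not a full parser: the function names are hypothetical and the skipped prefixes (File, Image, Category) are an English-only assumption.

```python
import re

# Link target: text after "[[" up to the first "|", "#", or bracket.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")
# Template name: text after "{{" up to the first "|" or brace.
TEMPLATE = re.compile(r"\{\{([^{}|]+)")
# Assumed English-only prefixes for non-article links.
SKIPPED_PREFIXES = ("file:", "image:", "category:")


def wikitext_links(wikitext):
    """Return explicit article link targets found directly in the wikitext."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target and not target.lower().startswith(SKIPPED_PREFIXES):
            links.append(target)
    return links


def transcluded_templates(wikitext):
    """Return names of templates whose rendered links would be missed above."""
    return [m.group(1).strip() for m in TEMPLATE.finditer(wikitext)]


text = (
    "'''Foo''' is a [[bar]] in [[Baz (place)|Baz]].\n"
    "{{Navbox foo}}\n"
    "[[Category:Things]]"
)
links = wikitext_links(text)
templates = transcluded_templates(text)
```

In this toy input the two explicit links are found, while whatever links the (hypothetical) navigation template would render are only discoverable by fetching that template's own wikitext or by using the parsed HTML.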
Sub-articles
Many concepts, depending on their importance and information that is available, are broken into multiple articles. See Lin et al.[4] for a detailed study of this issue. This has two major impacts:
- If you are interested in the level of attention to specific topics (e.g., Barack Obama), then you shouldn't just measure pageviews to the Barack Obama article; you should also include pageviews to Early life and career of Barack Obama, Family of Barack Obama, etc.
- If you are looking for content that exists in one language but not another -- e.g., for the purposes of recommending that that content be translated or more basic research -- you need to understand that sitelinks do not always capture that there might be a full subarticle in one language but only a section in another.
- This can maybe be addressed by determining what templates are used for identifying subarticles and assuming that if a given article is a subarticle in any language, the content is too specific to recommend for creation and likely already described in a section of a more general article (but not properly sitelinked).
- See task T207406#5890503 for more details.
Page merges
I'm pretty sure that page merges are also relatively rare and I suspect usually one of the pages in the merge is much more developed than the other, so "throwing out" the history / pageviews associated with the article that is being merged into the other is unlikely to affect analyses. The source page in the merge is then turned into a redirect so the above guidelines apply.
Page / Revision deletions
Page and revision deletions are relatively rare so I suspect they don't affect too many analyses. They may, however, be pertinent to research around harassment, disinformation, or other sensitive topics. The only way to get deleted page content is via the API and with special user rights (e.g., Researcher). More details / statistics:
- Deletion Tool: https://www.mediawiki.org/wiki/Help:RevisionDelete
- Some statistics on revision deletion from 2011: https://www.andrew-g-west.com/docs/wikisym_11_revdel_final.pdf
- Deletion Log (enwiki): https://en.wikipedia.org/wiki/Special:Log/delete
- Suppression Log (enwiki): https://en.wikipedia.org/wiki/Special:Log/suppress
Querying web request data on Special pages
The Special namespace is a virtual namespace. This means that pages in that namespace do not have page IDs; they do not exist in MediaWiki's page database table. Secondly, their names are localized: it's "Special:CreateAccount" in English, "Especial:Crear una cuenta" in Spanish, "Spécial:Créer un compte" in French, and so on. Lastly, these pages can be accessed through two different URL paths. In English, both /wiki/Special:CreateAccount and /w/index.php?title=Special:CreateAccount lead to the account creation page.
Querying pages in the Special namespace in the webrequest dataset is facilitated by the x_analytics_map field having the key special set to the canonical English name of the special page. For example, a request to Especial:Crear una cuenta on Spanish Wikipedia will have x_analytics_map["special"] set to "CreateAccount". Below is an example Presto query demonstrating this:
SELECT
    date(from_iso8601_timestamp(dt)) AS log_date,
    access_method,
    agent_type,
    count(1) AS num_page_views
FROM wmf.webrequest
WHERE webrequest_source = 'text' -- as opposed to 'upload'
    AND year = 2022
    AND month = 5
    AND day = 3
    AND normalized_host.project_family = 'wikipedia'
    AND normalized_host.project = 'es'
    AND element_at(x_analytics_map, 'special') = 'CreateAccount'
    AND http_status IN ('200', '304') -- 'success' used in Pageview Definition
    AND content_type LIKE '%text/html%' -- also defines a "pageview"
GROUP BY date(from_iso8601_timestamp(dt)), access_method, agent_type
Stubs and Spikes
It can be very difficult to choose a measure of central tendency for analyses of content on the wikis. The content of some wikis is dominated by stub articles creating a long-tail that will greatly skew metrics about pageviews per article, average number of sections, etc. towards zero. When these stub articles are predominantly created by bots and receive relatively few pageviews, it is hard to argue that they are representative of the wiki. Conversely, some wikis have articles that receive a massive amount of attention or edits or are extremely long, biasing metrics to much higher values than are typical for that wiki.
Likely, the most appropriate metrics are going to be some sort of truncated mean. The spikes are easier -- even just removing the topmost 1% of data should greatly diminish power-law dynamics while retaining most of the data. The stubs are much harder and the right threshold for removing them is likely wiki-dependent. For example, wikis with large numbers of bot-generated articles might need the bottommost 90% of content to be thrown out to get to a more representative measure while others shouldn't throw out any content as it's all human-generated. This challenge suggests that rather than identifying a percentage of content to remove, it would be better to split articles on wikis into three categories and report metrics individually for each (categories should be calculated exclusively and spikes calculated first):
- Spikes: content created by either bots or humans in the top 1% for a metric -- e.g., edits or pageviews. This is clearly important content but it is not expected that most articles will ever become this popular.
- Bot-generated: articles that were generated by bots (many of which are likely stubs though some may have grown organically)
- Standard human-generated: the bread-and-butter of the wikis. An article created by a user or anonymous editor that receives some normal amount of attention.
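The three-way split above can be sketched as follows, with spikes assigned first so that the categories are mutually exclusive. This is a minimal illustration under stated assumptions: each article is a dict with a numeric metric and a created_by_bot flag (how that flag is derived is outside the scope of the sketch), and the function name is hypothetical.

```python
from statistics import mean


def categorize(articles, metric="pageviews", spike_fraction=0.01):
    """Split articles into spike / bot / standard, assigning spikes first."""
    ranked = sorted(articles, key=lambda a: a[metric], reverse=True)
    # At least one spike so tiny inputs still exercise every category.
    n_spikes = max(1, int(len(ranked) * spike_fraction))
    categories = {"spike": ranked[:n_spikes], "bot": [], "standard": []}
    for article in ranked[n_spikes:]:
        key = "bot" if article["created_by_bot"] else "standard"
        categories[key].append(article)
    return categories


# Synthetic data: 200 articles with pageviews 1..200, every second one bot-created.
articles = [
    {"title": f"Article {i}", "pageviews": i, "created_by_bot": i % 2 == 0}
    for i in range(1, 201)
]
cats = categorize(articles)
# Report the metric separately for each category.
means = {name: mean(a["pageviews"] for a in group) for name, group in cats.items() if group}
```

Reporting one mean per category avoids having to pick a single wiki-specific truncation threshold, at the cost of needing a reliable signal for which articles were bot-generated.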
Below is related data on what proportion of each wiki are articles with only 0 or 1 sections (a proxy for stubs) and what percentage of articles don't receive pageviews on any given day (a complementary proxy for stubs). Eventually, it would be good to have data on bot-generated articles but that is a more complicated analysis.
SQL query for generating the table below:
WITH wikipedia_projects AS (
SELECT DISTINCT
dbname,
SUBSTR(hostname, 0, LENGTH(hostname) - 4) AS project
FROM wmf_raw.mediawiki_project_namespace_map
WHERE
snapshot = '2021-02'
AND hostname LIKE '%wikipedia%'
),
stub_counts AS (
SELECT
wiki_db,
COUNT(1) AS num_pages,
SUM(IF(num_headings <= 1, 1, 0)) AS num_stubs
FROM isaacj.qual_features
GROUP BY
wiki_db
),
pages_with_pageviews AS (
SELECT COUNT(DISTINCT(pv.page_id)) AS num_pages_with_pvs,
wp.dbname AS wiki_db
FROM wmf.pageview_hourly pv
INNER JOIN wikipedia_projects wp
ON (pv.project = wp.project)
WHERE year = 2021
AND month = 2
AND day = 15
AND namespace_id = 0
AND agent_type = 'user'
GROUP BY wp.dbname
)
SELECT sc.wiki_db,
sc.num_pages,
sc.num_stubs,
pv.num_pages_with_pvs
FROM stub_counts sc
LEFT JOIN pages_with_pageviews pv
ON (sc.wiki_db = pv.wiki_db)
ORDER BY num_pages DESC
wikidb | # articles | # stubs (<2 sections) | % stubs | # articles with daily pageviews | % daily unseen |
---|---|---|---|---|---|
enwiki | 6260556 | 1501117 | 24.0% | 4427190 | 29.3% |
cebwiki | 5546111 | 3282539 | 59.2% | 319163 | 94.2% |
svwiki | 3398512 | 1958366 | 57.6% | 403634 | 88.1% |
dewiki | 2543038 | 506616 | 19.9% | 1458139 | 42.7% |
frwiki | 2304459 | 333644 | 14.5% | 1217089 | 47.2% |
nlwiki | 2046933 | 1433262 | 70.0% | 537226 | 73.8% |
ruwiki | 1703247 | 263286 | 15.5% | 1014094 | 40.5% |
itwiki | 1677064 | 244919 | 14.6% | 862455 | 48.6% |
eswiki | 1609903 | 202110 | 12.6% | 937099 | 41.8% |
plwiki | 1460488 | 554146 | 37.9% | 702992 | 51.9% |
warwiki | 1265000 | 1041815 | 82.4% | 169816 | 86.6% |
viwiki | 1261985 | 142433 | 11.3% | 186334 | 85.2% |
jawiki | 1256122 | 133662 | 10.6% | 950616 | 24.3% |
arzwiki | 1206509 | 12804 | 1.1% | 20446 | 98.3% |
zhwiki | 1180437 | 413934 | 35.1% | 476579 | 59.6% |
arwiki | 1104446 | 291506 | 26.4% | 324650 | 70.6% |
ukwiki | 1077033 | 165429 | 15.4% | 282264 | 73.8% |
ptwiki | 1058222 | 462339 | 43.7% | 535078 | 49.4% |
fawiki | 771124 | 179654 | 23.3% | 258039 | 66.5% |
cawiki | 672521 | 206156 | 30.7% | 256662 | 61.8% |
srwiki | 643311 | 68081 | 10.6% | 119783 | 81.4% |
idwiki | 563587 | 270466 | 48.0% | 252159 | 55.3% |
nowiki | 551441 | 157286 | 28.5% | 260682 | 52.7% |
kowiki | 534857 | 179061 | 33.5% | 256617 | 52.0% |
fiwiki | 504132 | 158314 | 31.4% | 290441 | 42.4% |
huwiki | 484462 | 50942 | 10.5% | 246044 | 49.2% |
cswiki | 475348 | 88225 | 18.6% | 277824 | 41.6% |
shwiki | 454726 | 68874 | 15.1% | 105740 | 76.7% |
zh_min_nanwiki | 430768 | 144640 | 33.6% | 33152 | 92.3% |
rowiki | 417277 | 139657 | 33.5% | 185909 | 55.4% |
trwiki | 393166 | 131572 | 33.5% | 241732 | 38.5% |
euwiki | 368295 | 72591 | 19.7% | 125974 | 65.8% |
cewiki | 353901 | 9302 | 2.6% | 2622 | 99.3% |
mswiki | 347140 | 137007 | 39.5% | 133066 | 61.7% |
eowiki | 293038 | 129918 | 44.3% | 123296 | 57.9% |
hewiki | 289617 | 40871 | 14.1% | 169349 | 41.5% |
hywiki | 281603 | 59117 | 21.0% | 42161 | 85.0% |
bgwiki | 269533 | 72385 | 26.9% | 104225 | 61.3% |
ttwiki | 265785 | 17596 | 6.6% | 4642 | 98.3% |
dawiki | 265146 | 107347 | 40.5% | 164181 | 38.1% |
azbwiki | 240077 | 134198 | 55.9% | 3836 | 98.4% |
skwiki | 236047 | 84114 | 35.6% | 100897 | 57.3% |
kkwiki | 232242 | 93437 | 40.2% | 39270 | 83.1% |
minwiki | 224563 | 164103 | 73.1% | 4189 | 98.1% |
etwiki | 216990 | 97603 | 45.0% | 124040 | 42.8% |
hrwiki | 210879 | 58018 | 27.5% | 122595 | 41.9% |
bewiki | 201783 | 105828 | 52.4% | 17996 | 91.1% |
ltwiki | 199105 | 74960 | 37.6% | 92822 | 53.4% |
elwiki | 189017 | 45425 | 24.0% | 101323 | 46.4% |
simplewiki | 183418 | 104358 | 56.9% | 109879 | 40.1% |
azwiki | 178772 | 52397 | 29.3% | 76909 | 57.0% |
glwiki | 171663 | 27154 | 15.8% | 79110 | 53.9% |
slwiki | 171521 | 47722 | 27.8% | 90149 | 47.4% |
urwiki | 163782 | 28825 | 17.6% | 18564 | 88.7% |
nnwiki | 157495 | 77725 | 49.4% | 87517 | 44.4% |
hiwiki | 149556 | 61822 | 41.3% | 59043 | 60.5% |
kawiki | 149466 | 62127 | 41.6% | 43914 | 70.6% |
thwiki | 142601 | 41208 | 28.9% | 85489 | 40.1% |
uzwiki | 139916 | 56106 | 40.1% | 36729 | 73.7% |
tawiki | 139574 | 39806 | 28.5% | 37328 | 73.3% |
lawiki | 134936 | 67404 | 50.0% | 77598 | 42.5% |
cywiki | 132611 | 30375 | 22.9% | 48164 | 63.7% |
vowiki | 126354 | 93610 | 74.1% | 6974 | 94.5% |
mkwiki | 113258 | 32523 | 28.7% | 24308 | 78.5% |
astwiki | 108296 | 15181 | 14.0% | 46400 | 57.2% |
zh_yuewiki | 107931 | 60798 | 56.3% | 26888 | 75.1% |
lvwiki | 106263 | 44890 | 42.2% | 52135 | 50.9% |
bnwiki | 104447 | 13559 | 13.0% | 43933 | 57.9% |
mywiki | 102874 | 85981 | 83.6% | 7827 | 92.4% |
tgwiki | 102867 | 11598 | 11.3% | 5135 | 95.0% |
afwiki | 96861 | 28018 | 28.9% | 49721 | 48.7% |
mgwiki | 93801 | 17708 | 18.9% | 30390 | 67.6% |
sqwiki | 91160 | 38834 | 42.6% | 40729 | 55.3% |
ocwiki | 86657 | 33630 | 38.8% | 47572 | 45.1% |
bswiki | 85100 | 13695 | 16.1% | 55816 | 34.4% |
ndswiki | 82479 | 60219 | 73.0% | 19069 | 76.9% |
kywiki | 80802 | 49606 | 61.4% | 14627 | 81.9% |
be_x_oldwiki | 73484 | 22216 | 30.2% | 9149 | 87.5% |
mlwiki | 73178 | 20322 | 27.8% | 33226 | 54.6% |
newwiki | 73046 | 3826 | 5.2% | 1665 | 97.7% |
tewiki | 70773 | 10542 | 14.9% | 17096 | 75.8% |
mrwiki | 70768 | 37390 | 52.8% | 17686 | 75.0% |
brwiki | 69389 | 32649 | 47.1% | 41683 | 39.9% |
vecwiki | 67315 | 4369 | 6.5% | 32251 | 52.1% |
pmswiki | 65780 | 47440 | 72.1% | 17459 | 73.5% |
jvwiki | 62818 | 30576 | 48.7% | 19213 | 69.4% |
htwiki | 62483 | 5800 | 9.3% | 7349 | 88.2% |
pnbwiki | 61321 | 33155 | 54.1% | 3590 | 94.1% |
swwiki | 60857 | 26677 | 43.8% | 43955 | 27.8% |
suwiki | 60788 | 35255 | 58.0% | 10561 | 82.6% |
lbwiki | 59368 | 29882 | 50.3% | 35752 | 39.8% |
tlwiki | 58487 | 34939 | 59.7% | 29815 | 49.0% |
bawiki | 55679 | 2588 | 4.6% | 3240 | 94.2% |
gawiki | 54763 | 38718 | 70.7% | 20972 | 61.7% |
szlwiki | 53097 | 52309 | 98.5% | 2574 | 95.2% |
iswiki | 52077 | 29732 | 57.1% | 24769 | 52.4% |
cvwiki | 45779 | 4986 | 10.9% | 4364 | 90.5% |
lmowiki | 45566 | 21120 | 46.4% | 26104 | 42.7% |
fywiki | 45319 | 21386 | 47.2% | 27829 | 38.6% |
scowiki | 42582 | 24951 | 58.6% | 26928 | 36.8% |
wuuwiki | 41464 | 35595 | 85.8% | 4388 | 89.4% |
diqwiki | 39948 | 17912 | 44.8% | 18924 | 52.6% |
anwiki | 39551 | 17099 | 43.2% | 23671 | 40.2% |
kuwiki | 38575 | 23626 | 61.2% | 8938 | 76.8% |
pawiki | 37310 | 16190 | 43.4% | 5685 | 84.8% |
yowiki | 33614 | 29061 | 86.5% | 11883 | 64.6% |
newiki | 32167 | 10675 | 33.2% | 4549 | 85.9% |
barwiki | 31631 | 11080 | 35.0% | 13479 | 57.4% |
iowiki | 30521 | 19069 | 62.5% | 17278 | 43.4% |
guwiki | 29677 | 21720 | 73.2% | 5415 | 81.8% |
ckbwiki | 29093 | 11498 | 39.5% | 5752 | 80.2% |
alswiki | 27698 | 4584 | 16.5% | 20105 | 27.4% |
knwiki | 27608 | 10384 | 37.6% | 10200 | 63.1% |
nostalgiawiki | 27375 | 26722 | 97.6% | -- | -- |
scnwiki | 26421 | 19952 | 75.5% | 17500 | 33.8% |
bpywiki | 25249 | 1102 | 4.4% | 1519 | 94.0% |
iawiki | 23127 | 19862 | 85.9% | 12681 | 45.2% |
quwiki | 23031 | 5977 | 26.0% | 14600 | 36.6% |
mnwiki | 22141 | 9844 | 44.5% | 7952 | 64.1% |
siwiki | 20628 | 12400 | 60.1% | 7297 | 64.6% |
bat_smgwiki | 16997 | 15024 | 88.4% | 2163 | 87.3% |
nvwiki | 16651 | 16523 | 99.2% | 408 | 97.5% |
sdwiki | 15765 | 8937 | 56.7% | 1437 | 90.9% |
xmfwiki | 15685 | 9463 | 60.3% | 5143 | 67.2% |
orwiki | 15637 | 1938 | 12.4% | 1787 | 88.6% |
cdowiki | 15513 | 9887 | 63.7% | 1915 | 87.7% |
amwiki | 15398 | 13087 | 85.0% | 3738 | 75.7% |
ilowiki | 15390 | 6596 | 42.9% | 11993 | 22.1% |
gdwiki | 15332 | 6817 | 44.5% | 9965 | 35.0% |
yiwiki | 15223 | 7727 | 50.8% | 3009 | 80.2% |
napwiki | 14736 | 10049 | 68.2% | 11939 | 19.0% |
sahwiki | 14565 | 7779 | 53.4% | 2566 | 82.4% |
maiwiki | 14485 | 2365 | 16.3% | 1095 | 92.4% |
bugwiki | 14191 | 14134 | 99.6% | 586 | 95.9% |
wawiki | 13891 | 7597 | 54.7% | 5393 | 61.2% |
map_bmswiki | 13781 | 11396 | 82.7% | 3785 | 72.5% |
hsbwiki | 13765 | 3784 | 27.5% | 4516 | 67.2% |
pswiki | 13671 | 7809 | 57.1% | 1647 | 88.0% |
mznwiki | 13562 | 3174 | 23.4% | 1276 | 90.6% |
fowiki | 13559 | 5445 | 40.2% | 7966 | 41.2% |
liwiki | 13209 | 6776 | 51.3% | 10107 | 23.5% |
oswiki | 12942 | 8190 | 63.3% | 1774 | 86.3% |
frrwiki | 12675 | 5644 | 44.5% | 7680 | 39.4% |
emlwiki | 12656 | 9406 | 74.3% | 4950 | 60.9% |
avkwiki | 12420 | 1819 | 14.6% | 1917 | 84.6% |
acewiki | 12348 | 11932 | 96.6% | 1629 | 86.8% |
gorwiki | 11864 | 11106 | 93.6% | 1030 | 91.3% |
bowiki | 11726 | 8284 | 70.6% | 1320 | 88.7% |
sawiki | 11643 | 4929 | 42.3% | 1536 | 86.8% |
bclwiki | 11011 | 4751 | 43.1% | 5617 | 49.0% |
zh_classicalwiki | 10666 | 6548 | 61.4% | 2477 | 76.8% |
mrjwiki | 10527 | 8466 | 80.4% | 650 | 93.8% |
mhrwiki | 10321 | 3320 | 32.2% | 3032 | 70.6% |
hifwiki | 10125 | 7494 | 74.0% | 3552 | 64.9% |
kmwiki | 10107 | 5302 | 52.5% | 4362 | 56.8% |
hakwiki | 9525 | 4257 | 44.7% | 1672 | 82.4% |
roa_tarawiki | 9314 | 8242 | 88.5% | 7838 | 15.8% |
testwiki | 9227 | 6000 | 65.0% | 56 | 99.4% |
pamwiki | 8985 | 3852 | 42.9% | 3440 | 61.7% |
crhwiki | 8895 | 6505 | 73.1% | 1929 | 78.3% |
hywwiki | 8853 | 1615 | 18.2% | 1029 | 88.4% |
shnwiki | 8798 | 1937 | 22.0% | 567 | 93.6% |
nsowiki | 8356 | 5866 | 70.2% | 4287 | 48.7% |
aswiki | 8164 | 1192 | 14.6% | 3129 | 61.7% |
ruewiki | 8073 | 2815 | 34.9% | 3376 | 58.2% |
sewiki | 7954 | 4833 | 60.8% | 3457 | 56.5% |
zuwiki | 7659 | 7193 | 93.9% | 2980 | 61.1% |
hawiki | 7616 | 4994 | 65.6% | 3192 | 58.1% |
lijwiki | 7608 | 1632 | 21.5% | 5084 | 33.2% |
ugwiki | 7606 | 4748 | 62.4% | 1317 | 82.7% |
bhwiki | 7437 | 3655 | 49.1% | 2405 | 67.7% |
vlswiki | 7384 | 3452 | 46.7% | 5954 | 19.4% |
tkwiki | 7308 | 4145 | 56.7% | 4258 | 41.7% |
miwiki | 7205 | 3773 | 52.4% | 1917 | 73.4% |
nds_nlwiki | 7203 | 3092 | 42.9% | 5672 | 21.3% |
nahwiki | 7170 | 3567 | 49.7% | 4165 | 41.9% |
sowiki | 7137 | 4847 | 67.9% | 4617 | 35.3% |
scwiki | 7085 | 5444 | 76.8% | 3469 | 51.0% |
snwiki | 7074 | 5341 | 75.5% | 3624 | 48.8% |
vepwiki | 6658 | 2261 | 34.0% | 2701 | 59.4% |
ganwiki | 6505 | 3115 | 47.9% | 1333 | 79.5% |
banwiki | 6475 | 1575 | 24.3% | 2631 | 59.4% |
glkwiki | 6455 | 5107 | 79.1% | 644 | 90.0% |
myvwiki | 6408 | 2185 | 34.1% | 843 | 86.8% |
abwiki | 6237 | 1380 | 22.1% | 2586 | 58.5% |
kabwiki | 6115 | 4101 | 67.1% | 2907 | 52.5% |
cowiki | 5973 | 2341 | 39.2% | 4271 | 28.5% |
satwiki | 5862 | 744 | 12.7% | 472 | 91.9% |
fiu_vrowiki | 5786 | 3991 | 69.0% | 2298 | 60.3% |
iewiki | 5548 | 4083 | 73.6% | 2558 | 53.9% |
kvwiki | 5522 | 2677 | 48.5% | 709 | 87.2% |
csbwiki | 5404 | 3600 | 66.6% | 1511 | 72.0% |
pcdwiki | 5172 | 2010 | 38.9% | 1439 | 72.2% |
aywiki | 5139 | 1066 | 20.7% | 1003 | 80.5% |
udmwiki | 5050 | 3885 | 76.9% | 740 | 85.3% |
gvwiki | 5043 | 3258 | 64.6% | 3594 | 28.7% |
pagwiki | 4946 | 2890 | 58.4% | 1183 | 76.1% |
zeawiki | 4774 | 1967 | 41.2% | 2306 | 51.7% |
lfnwiki | 4677 | 3798 | 81.2% | 3095 | 33.8% |
frpwiki | 4613 | 2352 | 51.0% | 1750 | 62.1% |
lowiki | 4584 | 3014 | 65.8% | 956 | 79.1% |
nrmwiki | 4581 | 2221 | 48.5% | 2721 | 40.6% |
kwwiki | 4539 | 4000 | 88.1% | 2202 | 51.5% |
dvwiki | 4314 | 2835 | 65.7% | 946 | 78.1% |
lezwiki | 4198 | 1122 | 26.7% | 723 | 82.8% |
gomwiki | 4195 | 1582 | 37.7% | 618 | 85.3% |
gnwiki | 4134 | 2396 | 58.0% | 1981 | 52.1% |
mwlwiki | 4111 | 2038 | 49.6% | 1093 | 73.4% |
stqwiki | 4107 | 2694 | 65.6% | 2846 | 30.7% |
olowiki | 3903 | 2244 | 57.5% | 1812 | 53.6% |
szywiki | 3858 | 1250 | 32.4% | 488 | 87.4% |
mtwiki | 3772 | 1094 | 29.0% | 2257 | 40.2% |
rmwiki | 3762 | 2249 | 59.8% | 2542 | 32.4% |
awawiki | 3710 | 3314 | 89.3% | 263 | 92.9% |
dtywiki | 3604 | 1474 | 40.9% | 730 | 79.7% |
ladwiki | 3586 | 1580 | 44.1% | 1968 | 45.1% |
bjnwiki | 3584 | 1925 | 53.7% | 1581 | 55.9% |
arywiki | 3571 | 1088 | 30.5% | 908 | 74.6% |
furwiki | 3556 | 2029 | 57.1% | 2326 | 34.6% |
koiwiki | 3505 | 1953 | 55.7% | 367 | 89.5% |
extwiki | 3420 | 2315 | 67.7% | 1469 | 57.0% |
angwiki | 3374 | 2857 | 84.7% | 1348 | 60.0% |
dsbwiki | 3311 | 1654 | 50.0% | 1489 | 55.0% |
lnwiki | 3304 | 2081 | 63.0% | 1059 | 67.9% |
cbk_zamwiki | 3243 | 1037 | 32.0% | 1796 | 44.6% |
piwiki | 3216 | 1034 | 32.2% | 782 | 75.7% |
tyvwiki | 3180 | 1186 | 37.3% | 381 | 88.0% |
kshwiki | 2905 | 1571 | 54.1% | 1885 | 35.1% |
gagwiki | 2888 | 1507 | 52.2% | 1168 | 59.6% |
pflwiki | 2716 | 244 | 9.0% | 920 | 66.1% |
avwiki | 2587 | 1141 | 44.1% | 677 | 73.8% |
hawwiki | 2429 | 2038 | 83.9% | 551 | 77.3% |
lgwiki | 2425 | 2215 | 91.3% | 418 | 82.8% |
gcrwiki | 2378 | 382 | 16.1% | 802 | 66.3% |
xalwiki | 2321 | 1334 | 57.5% | 802 | 65.4% |
rwwiki | 2219 | 1828 | 82.4% | 863 | 61.1% |
igwiki | 2214 | 923 | 41.7% | 1000 | 54.8% |
bxrwiki | 2198 | 1231 | 56.0% | 732 | 66.7% |
papwiki | 2193 | 1752 | 79.9% | 1270 | 42.1% |
zawiki | 2116 | 1679 | 79.3% | 1017 | 51.9% |
pdcwiki | 2103 | 1930 | 91.8% | 866 | 58.8% |
krcwiki | 2074 | 717 | 34.6% | 508 | 75.5% |
test2wiki | 2041 | 1316 | 64.5% | 32 | 98.4% |
kaawiki | 2040 | 1679 | 82.3% | 890 | 56.4% |
kbpwiki | 1916 | 1763 | 92.0% | 474 | 75.3% |
arcwiki | 1811 | 1706 | 94.2% | 678 | 62.6% |
novwiki | 1801 | 1180 | 65.5% | 1222 | 32.1% |
towiki | 1753 | 1348 | 76.9% | 592 | 66.2% |
inhwiki | 1722 | 1177 | 68.4% | 417 | 75.8% |
jamwiki | 1720 | 1651 | 96.0% | 1089 | 36.7% |
tcywiki | 1691 | 362 | 21.4% | 461 | 72.7% |
wowiki | 1671 | 1156 | 69.2% | 880 | 47.3% |
tpiwiki | 1664 | 1565 | 94.1% | 999 | 40.0% |
kbdwiki | 1612 | 759 | 47.1% | 269 | 83.3% |
kiwiki | 1612 | 1543 | 95.7% | 441 | 72.6% |
tetwiki | 1586 | 664 | 41.9% | 1176 | 25.9% |
nawiki | 1580 | 1378 | 87.2% | 882 | 44.2% |
akwiki | 1571 | 1303 | 82.9% | 449 | 71.4% |
atjwiki | 1470 | 763 | 51.9% | 497 | 66.2% |
xhwiki | 1415 | 1103 | 78.0% | 591 | 58.2% |
lldwiki | 1414 | 824 | 58.3% | 918 | 35.1% |
biwiki | 1407 | 1376 | 97.8% | 466 | 66.9% |
mdfwiki | 1355 | 1133 | 83.6% | 211 | 84.4% |
mnwwiki | 1343 | 415 | 30.9% | 233 | 82.7% |
jbowiki | 1334 | 863 | 64.7% | 935 | 29.9% |
tywiki | 1332 | 1107 | 83.1% | 329 | 75.3% |
roa_rupwiki | 1279 | 964 | 75.4% | 650 | 49.2% |
kgwiki | 1271 | 1203 | 94.6% | 746 | 41.3% |
lbewiki | 1257 | 1044 | 83.1% | 220 | 82.5% |
omwiki | 1197 | 966 | 80.7% | 821 | 31.4% |
srnwiki | 1188 | 830 | 69.9% | 374 | 68.5% |
fjwiki | 1156 | 1115 | 96.5% | 551 | 52.3% |
smwiki | 1038 | 790 | 76.1% | 739 | 28.8% |
ltgwiki | 1008 | 619 | 61.4% | 438 | 56.5% |
nqowiki | 992 | 579 | 58.4% | 194 | 80.4% |
chrwiki | 972 | 428 | 44.0% | 501 | 48.5% |
stwiki | 959 | 887 | 92.5% | 413 | 56.9% |
gotwiki | 957 | 891 | 93.1% | 436 | 54.4% |
klwiki | 869 | 460 | 52.9% | 721 | 17.0% |
pihwiki | 850 | 774 | 91.1% | 627 | 26.2% |
tnwiki | 844 | 510 | 60.4% | 288 | 65.9% |
nywiki | 830 | 606 | 73.0% | 369 | 55.5% |
twwiki | 791 | 697 | 88.1% | 183 | 76.9% |
chywiki | 783 | 752 | 96.0% | 190 | 75.7% |
cuwiki | 780 | 537 | 68.8% | 448 | 42.6% |
bmwiki | 759 | 576 | 75.9% | 368 | 51.5% |
tswiki | 729 | 492 | 67.5% | 526 | 27.8% |
tumwiki | 723 | 704 | 97.4% | 305 | 57.8% |
rmywiki | 716 | 575 | 80.3% | 368 | 48.6% |
rnwiki | 715 | 692 | 96.8% | 168 | 76.5% |
ikwiki | 674 | 650 | 96.4% | 299 | 55.6% |
iuwiki | 634 | 592 | 93.4% | 366 | 42.3% |
kswiki | 569 | 539 | 94.7% | 266 | 53.3% |
adywiki | 566 | 350 | 61.8% | 334 | 41.0% |
sswiki | 560 | 181 | 32.3% | 393 | 29.8% |
chwiki | 547 | 499 | 91.2% | 242 | 55.8% |
pntwiki | 523 | 265 | 50.7% | 383 | 26.8% |
vewiki | 451 | 436 | 96.7% | 264 | 41.5% |
eewiki | 388 | 327 | 84.3% | 231 | 40.5% |
tiwiki | 373 | 349 | 93.6% | 182 | 51.2% |
ffwiki | 368 | 281 | 76.4% | 226 | 38.6% |
dinwiki | 305 | 285 | 93.4% | 131 | 57.0% |
sgwiki | 295 | 274 | 92.9% | 129 | 56.3% |
dzwiki | 295 | 226 | 76.6% | 203 | 31.2% |
crwiki | 175 | 167 | 95.4% | 97 | 44.6% |
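The two percentage columns are not produced by the query itself. They appear to be derived from its three counts, which is consistent with the rows above: the stub percentage is stubs over articles, and the daily-unseen percentage is one minus the share of articles with at least one pageview that day. A small sketch, with a hypothetical function name:

```python
def stub_and_unseen_percentages(num_pages, num_stubs, num_pages_with_pvs):
    """Derive (% stubs, % daily unseen) from the three counts returned by the query."""
    pct_stubs = 100 * num_stubs / num_pages
    pct_unseen = 100 * (1 - num_pages_with_pvs / num_pages)
    return round(pct_stubs, 1), round(pct_unseen, 1)


# Counts taken from the enwiki and cebwiki rows of the table.
enwiki = stub_and_unseen_percentages(6260556, 1501117, 4427190)
cebwiki = stub_and_unseen_percentages(5546111, 3282539, 319163)
```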
Referral Data
See Research:Referrer.
Reverts (Patrolling and Vandalism)
Revert(ed) edits on the Wikimedia projects can comprise between 5 and 20% of non-bot edits and thus have substantial impact on many edit-related analyses. Accounting for them -- either via filtering or just separating out -- can be simple, but no method is perfect and the best approach is multi-faceted as described below.
A few notes:
- I use reverts as a general term here but any revert actually has two components: the edit(s) that were reverted and the edit that did the revert. Any approach that identifies reverts should simultaneously give both of these though so further distinguishing is not necessary.
- While reverts are often associated with vandalism (and subsequent patrolling), there are actually many reasons why editors might revert an edit that have nothing to do with those activities, such as self-reverts or just standard collaboration.[5] Various additional heuristics can be employed to deal with these -- e.g., limiting the time between edits to be counted as reverts, excluding self-reverts by requiring at least two different editors be involved in a revert, potentially adding a filter for self-revert in the edit summary.
There are several, complementary approaches to detecting reverts.[6] Generally I combine the first two methods (identity and edit tags) -- i.e. a revert is any set of edits identified as a revert by either shasums or edit tags:
- Identity-based reverts: a common form of reverting is to return the page to the exact state it previously was in. All edits have a shasum pre-calculated that appears in various data sources, so this is often very cheap to detect.[7] This approach is language-agnostic so will work for any Wikipedia language edition.
- Tool-based reverts: Wikipedians can use various tools / buttons within the UI to make reverts. When this happens, the tool also adds an edit tag. The tags can be found in a special dump where they're called change tags -- e.g., enwiki-latest-change_tag.sql.gz for English Wikipedia dumps -- and cross-compared with revision IDs from the dumps.[8] This approach is actually language-agnostic too once you identify the right edit tag(s). These generally should be mw-rollback, mw-undo, and mw-manual-revert for the edits that do the reverting of past edits, while the edits that were reverted would be under mw-reverted. You can manually look for these tags by going to https://<langcode>.wikipedia.org/wiki/Special:Tags where e.g., "en" would replace "<langcode>" for English etc.
- Self-declared reverts: editors (or tools) often indicate when they are reverting an edit within the edit summary. While imperfect, this approach can capture edits where someone manually reverted a past edit by making a new edit but also made other changes, which the other approaches would miss. This approach is not language-agnostic though and could easily lead to false positives.
- Model-based: research has explored more probabilistic approaches for identifying reverts, which could be useful depending on the application.[9]
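The identity-based approach can be sketched in a few lines: walk a page's revisions in order and, whenever a revision's checksum matches that of a recent earlier revision, treat everything in between as reverted. This is a simplified illustration rather than a reimplementation of mwreverts. The input format (tuples of revision ID, sha1, user), the function name, and the look-back radius of 15 revisions are all assumptions made for the example. It also flags self-reverts so they can be excluded, following the heuristic mentioned in the notes above.

```python
def detect_identity_reverts(revisions, radius=15):
    """Find identity reverts in a chronologically ordered list of (rev_id, sha1, user)."""
    reverts = []
    for i, (rev_id, sha1, user) in enumerate(revisions):
        window_start = max(0, i - radius)
        # Look backwards for the most recent earlier revision with the same checksum.
        for j in range(i - 1, window_start - 1, -1):
            if revisions[j][1] == sha1:
                # A match on the immediate predecessor is a no-op edit, not a revert.
                if j < i - 1:
                    between = revisions[j + 1:i]
                    reverts.append({
                        "reverting": rev_id,
                        "reverted_to": revisions[j][0],
                        "reverted": [r[0] for r in between],
                        # Self-revert: the reverting editor made every reverted edit.
                        "self_revert": all(r[2] == user for r in between),
                    })
                break
    return reverts


# Hypothetical page history: rev 3 restores rev 1; rev 6 is Carol undoing her own rev 5.
history = [
    (1, "aaa", "Alice"),
    (2, "bbb", "Mallory"),
    (3, "aaa", "Bob"),
    (4, "ccc", "Carol"),
    (5, "ddd", "Carol"),
    (6, "ccc", "Carol"),
]
found = detect_identity_reverts(history)
```

Combining this with the edit-tag approach then amounts to taking the union of the revision IDs flagged by either method.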
References
- Hill, Benjamin Mako; Shaw, Aaron (August 2014). "Consider the Redirect: A Missing Dimension of Wikipedia Research" (PDF). OpenSym '14. doi:10.1145/2641580.2641616. Retrieved 19 May 2020.
- Hill, Benjamin Mako; Shaw, Aaron (August 2015). "Page protection: another missing dimension of Wikipedia research" (PDF). OpenSym '15. doi:10.1145/2788993.2789846. Retrieved 19 May 2020.
- Mitrevski, Blagoj; Piccardi, Tiziano; West, Robert (21 April 2020). "WikiHist.html: English Wikipedia's Full Revision History in HTML Format" (PDF). ICWSM 2020. Retrieved 19 May 2020.
- Lin, Yilun; Yu, Bowen; Hall, Andrew; Hecht, Brent (February 2017). "Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia" (PDF). CSCW '17. doi:10.1145/2998181.2998274. Retrieved 19 May 2020.
- Geiger, R. Stuart; Halfaker, Aaron (2017-12-06). "Operationalizing Conflict and Cooperation between Automated Software Agents in Wikipedia: A Replication and Expansion of 'Even Good Bots Fight'". Proceedings of the ACM on Human-Computer Interaction 1 (CSCW): 49:1–49:33. doi:10.1145/3134684.
- Details on their overlap (for English Wikipedia) can be found at task T266374.
- The Python library mwreverts can be used to do this for you and this notebook has examples.
- For analyzing, see the mwsql Python library and this example notebook.
- Flöck, Fabian; Vrandečić, Denny; Simperl, Elena (2012-06-25). "Revisiting reverts: accurate revert detection in wikipedia". Proceedings of the 23rd ACM conference on Hypertext and social media. HT '12 (New York, NY, USA: Association for Computing Machinery): 3–12. ISBN 978-1-4503-1335-3. doi:10.1145/2309996.2310000.