User:Nettrom/Work logs/2018-03-22

From Meta, a Wikimedia project coordination wiki

Today I'll be investigating whether there are errors in page deletion timestamps in the Data Lake. I am unsure whether the page deletion timestamp reflects the deletion event in the logging table, or is a different event.

Gather two datasets of articles, one from initial revisions, one from the page table[edit]

During our analysis of deletions for ACTRIAL, I discovered that the log_page column in the logging table was unreliable for older data (ref the Jan 29 work log). From late June 2014, it appears to be correctly set. Because I worked with article data for ACTRIAL, I'll also do that here, meaning that queries are restricted to namespace 0.

I want to get a dataset that contains pages from various time periods, but only on or after July 1, 2014. For simplicity, I'll work with one month spans. I'll chose one month from each quarter, because the English Wikipedia tends to have some seasonal effects (e.g. July is generally a quiet month). I end up choosing July 1-31 2014, February 1-28 2015, October 1-31 2016, and May 1-31 2017. This gives me the following HQL queries for creations:

-- Creations from the full history table.
USE wmf;
CREATE TABLE nettrom_articlecreations.creations_from_revision AS
SELECT page_id, event_timestamp, revision_deleted_timestamp
FROM mediawiki_history
WHERE wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND ((page_namespace = 0
        AND revision_parent_id = 0)
       OR
       (page_namespace_historical IS NULL
        AND page_namespace = 0
        AND (revision_parent_id IS NULL
             OR revision_parent_id = 0)))
  AND page_revision_count = 1
  AND (event_comment IS NULL
       OR (event_comment IS NOT NULL
           AND LCASE(event_comment) NOT REGEXP 'redir'
           AND LCASE(event_comment) NOT REGEXP '^rd'
           AND event_comment NOT REGEXP '\\{\\{R from '
           AND event_comment NOT REGEXP '.*moved .*\\[\\[([^\]]+)\\]\\] to \\[\\[([^\]]+)\\]\\].*'))
  AND event_user_id > 0
  AND ((TO_DATE(event_timestamp) >= '2014-07-01'
        AND TO_DATE(event_timestamp) < '2014-08-01')
       OR
       (TO_DATE(event_timestamp) >= '2015-02-01'
        AND TO_DATE(event_timestamp) < '2015-03-01')
       OR
       (TO_DATE(event_timestamp) >= '2016-10-01'
        AND TO_DATE(event_timestamp) < '2016-11-01')
       OR
       (TO_DATE(event_timestamp) >= '2017-05-01'
        AND TO_DATE(event_timestamp) < '2017-06-01'))
  AND snapshot = '2018-02';
-- Creations from the page table.
USE wmf;
CREATE TABLE nettrom_articlecreations.creations_from_page AS
SELECT page_id, page_creation_timestamp, end_timestamp
FROM mediawiki_page_history
WHERE wiki_db = 'enwiki'
AND page_namespace = 0
AND caused_by_user_id > 0
AND ((TO_DATE(page_creation_timestamp) >= '2014-07-01'
      AND TO_DATE(page_creation_timestamp) < '2014-08-01')
     OR
     (TO_DATE(page_creation_timestamp) >= '2015-02-01'
      AND TO_DATE(page_creation_timestamp) < '2015-03-01')
     OR
     (TO_DATE(page_creation_timestamp) >= '2016-10-01'
      AND TO_DATE(page_creation_timestamp) < '2016-11-01')
     OR
     (TO_DATE(page_creation_timestamp) >= '2017-05-01'
      AND TO_DATE(page_creation_timestamp) < '2017-06-01'))
AND snapshot = '2018-02';

Compare the creation timestamps and deletion timestamps of both datasets, see if they agree[edit]

-- Query to join the two creation tables and compare deletion timestamps.
USE nettrom_articlecreations;
set hive.auto.convert.join.noconditionaltask=false;
SELECT *
FROM creations_from_page
JOIN creations_from_revision
ON creations_from_page.page_id=creations_from_revision.page_id
WHERE end_timestamp IS NOT NULL
AND revision_deleted_timestamp IS NOT NULL
AND end_timestamp != revision_deleted_timestamp;

From spot-checking five of the pages, it seems like the deletion timestamp is correct for the page table as it corresponds to a deletion event in the logging table. I am unsure what the timestamp of the denormalized edit history table refers to.

Question: How many pages are in both datasets, and what proportion of those have disagreements between the deletion timestamps?

USE nettrom_articlecreations;
set hive.auto.convert.join.noconditionaltask=false;
SELECT count(*) AS num_pages
FROM creations_from_page
JOIN creations_from_revision
ON creations_from_page.page_id=creations_from_revision.page_id;
SELECT count(*) AS num_null_in_page
FROM creations_from_page
JOIN creations_from_revision
ON creations_from_page.page_id=creations_from_revision.page_id
WHERE end_timestamp IS NULL
AND revision_deleted_timestamp IS NOT NULL;
SELECT count(*) AS num_null_in_revision
FROM creations_from_page
JOIN creations_from_revision
ON creations_from_page.page_id=creations_from_revision.page_id
WHERE end_timestamp IS NOT NULL
AND revision_deleted_timestamp IS NULL;
SELECT count(*) AS num_deletions
FROM creations_from_page
JOIN creations_from_revision
ON creations_from_page.page_id=creations_from_revision.page_id
WHERE end_timestamp IS NOT NULL
AND revision_deleted_timestamp IS NOT NULL;
SELECT count(*) AS num_disagreements
FROM creations_from_page
JOIN creations_from_revision
ON creations_from_page.page_id=creations_from_revision.page_id
WHERE end_timestamp IS NOT NULL
AND revision_deleted_timestamp IS NOT NULL
AND end_timestamp != revision_deleted_timestamp;

I find that there are 128,764 pages in both datasets. Of those, 7 have a deletion timestamp in mediawiki_history but not in mediawiki_page_history. There are 19,451 (15.1%) that match the opposite case, they do not have a deletion timestamp in mediawiki_history, but do have one in mediawiki_page_history. There are 8,122 pages (6.3%) that have deletion timestamps in both datasets, of which 8,115 pages (99.9%) have disagreements.

Compare the deletion timestamps against the logging table and the revision table in order to identify what event matches it.[edit]

I exported two TSVs of the 8,122 matching pages with deletion timestamps in both tables, one TSV for each of the two datasets. Next, I wrote a bit of Python to read those TSVs and check whether the deletion timestamp for each page matches a revision or a logging event. This script uses page_id as the authoritative identifier, which should work given the previous findings about deletion events mentioned earlier. It queries both the revision and archive table for revision events, seeking a match on rev_page and rev_timestamp (or the archive table equivalents). For the logging table it seeks a match on log_page and log_timestamp. The resulting data is stored in the staging database on analytics-store, so we can query it.

Perhaps the most interesting question is what the revision_deleted_timestamp reflects. How many of the 8,122 entries have a match an entry in the logging table?

SELECT count(*) AS num_log_matches
FROM nettrom_creations_from_revision_sources
WHERE log_timestamp IS NOT NULL;

I find that 881 (10.8%) have a matching entry, meaning that almost 90% of the timestamps do not correspond to a logging event. For those that do, a challenge with this dataset is that for some of the pages (157, to be exact) we find multiple log entries. One example is page ID 54134725, which has four log entries at timestamp 20170524175733, but none of them are a page deletion event. Ignoring that problem (the Python script just uses the last row from the logging table in that case), we get the following overview of matching logging events:

SELECT log_type, log_action, count(*) AS num_events
FROM nettrom_creations_from_revision_sources
GROUP BY log_type, log_action;
log_type log_action num_events
delete delete 4
move move 331
move move_redir 25
pagetriage-curation reviewed 11
pagetriage-curation tag 5
pagetriage-deletion delete 147
patrol autopatrol 8
patrol patrol 341
protect protect 9

We can also investigate the end_timestamp from the mediawiki_page_history table. There, we find that 8,114 pages (99.9%) have a matching logging event, and the overview of those go as follows:

log_type log_action num_events
delete delete 8,037
delete restore 13
move move 52
move move_redir 12

Summary[edit]

The revision_deleted_timestamp in the mediawiki_history table does not reflect page deletion events, as it should. I've found that about 15% of our sample dataset does not have the timestamp set when it should. When the timestamp is set, I found almost 90% of the cases did not have a matching event in the logging table, and for those that did, it would generally not match a page deletion event.