Schema talk:EditorJourney

From Meta, a Wikimedia project coordination wiki
Maintainer:Kosta Harlan
Team:Growth
Project:Understanding first day usage
Status:inactive
Purge:After 90 days, auto-purge PII contained in the event capsule, as well as the user ID; keep the rest indefinitely

Sampling

We will record all events, sampling at 100%.

Business rules

The code that logs to the EditorJourney schema is contained in the extension's PageViews class.

The following business rules control when we log events to the EditorJourney schema:

  1. $wgWMEUnderstandingFirstDay is set to true in the MediaWiki configuration repository. Currently, this is set for the Korean and English beta labs. The plan is to enable it on the Korean and Czech Wikipedias.
  2. The user has to be in the cohort we are interested in logging data for. The cohort is users who have created their accounts within the last 24 hours.
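
The two rules above can be sketched as a single check. This is a minimal illustration in Python, not the extension's actual PHP code; the function name `should_log`, its parameters, and the treatment of the exact 24-hour boundary are assumptions.

```python
from datetime import datetime, timedelta, timezone

# The cohort is accounts created within the last 24 hours (from the rules above).
COHORT_WINDOW = timedelta(hours=24)


def should_log(feature_enabled: bool, registered_at: datetime, now: datetime) -> bool:
    """Return True only when the feature flag is on and the account is in the cohort.

    `feature_enabled` stands in for $wgWMEUnderstandingFirstDay. Treating an
    account that is exactly 24 hours old as outside the cohort is an assumption.
    """
    if not feature_enabled:
        return False
    age = now - registered_at
    return timedelta(0) <= age < COHORT_WINDOW


if __name__ == "__main__":
    now = datetime(2019, 1, 2, 12, 0, tzinfo=timezone.utc)
    print(should_log(True, now - timedelta(hours=3), now))
    print(should_log(True, now - timedelta(days=2), now))
```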

For users in the cohort, on each page view, redirected page view, login, or logout event, we construct an event object containing the following:

  1. The title text, e.g. "Help desk" if the user is on /wiki/Wikipedia:Help_desk.
  2. The page ID, e.g. 564696.
  3. The HTTP request method: either GET or POST, depending on whether the user is reading or writing data.
  4. The action associated with the page view. For example, if the user is on /w/index.php?title=Wikipedia:Help_desk&action=history, then the action is "history".
  5. If the action is "edit", a comma-separated list of the permission errors associated with the edit action for this page, if any, e.g. "protectedpagetext,editprotected,edit" or "badaccess-group0". This helps us understand whether the user is running into permission problems, for example when attempting to edit a protected article.
  6. Whether the page view is associated with the mobile front end.
  7. The namespace associated with the event, e.g. 1, 2, etc.
  8. The path associated with the page view. For example, if the URL is https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info, then the path is /w/index.php.
  9. The query parameters associated with the page view. For example, if the URL is https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info, then the query parameters are ?title=Wikipedia:Help_desk&action=info.
  10. The user ID associated with the page view.
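
As a rough illustration of how such an event object might be assembled from a request, here is a Python sketch. The function `build_event` and all dictionary keys are hypothetical and do not come from the schema definition, and falling back to "view" when no action parameter is present is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def build_event(url, method, title_text, page_id, namespace, user_id,
                is_mobile=False, permission_errors=()):
    """Assemble an event dict from a page view (all key names are illustrative)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Assumption: treat a missing "action" parameter as a plain page view.
    action = params.get("action", ["view"])[0]
    return {
        "title": title_text,
        "page_id": page_id,
        "request_method": method,
        "action": action,
        # Permission errors are only recorded for the "edit" action.
        "permission_errors": ",".join(permission_errors) if action == "edit" else "",
        "is_mobile": is_mobile,
        "namespace": namespace,
        "path": parts.path,
        "query": "?" + parts.query if parts.query else "",
        "user_id": user_id,
    }


if __name__ == "__main__":
    event = build_event(
        "https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info",
        method="GET", title_text="Help desk", page_id=564696,
        namespace=4, user_id=12345,  # namespace and user ID are made-up values
    )
    print(event["path"], event["query"], event["action"])
```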

Redacting sensitive information

Once we have the event object prepared, we redact sensitive data from it.

  1. First, we hash[1] or redact sensitive query parameters. The search, return, and returnto parameters are hashed, while token is simply replaced with the string "redacted".
  2. Then, we check to see if the event is in a sensitive namespace. The namespaces are defined in the MediaWiki-config repository, and vary by wiki.
  3. If the event's title is not in a sensitive namespace, then we log the event data and are done.
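
The query-parameter step could look roughly like the following Python sketch. The helper names `hash_value` and `redact_query` are hypothetical, and SHA-256 is used here as a stand-in for the whirlpool algorithm mentioned in the footnote, because whirlpool is not guaranteed to be available in Python's hashlib. Note that `urlencode` may percent-encode values differently from the original URL.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

HASHED_PARAMS = {"search", "return", "returnto"}  # hashed, per the rules above
REDACTED_PARAMS = {"token"}                       # replaced with "redacted"


def hash_value(value: str, secret: bytes) -> str:
    """Keyed hash of a value (SHA-256 stand-in; the footnote names whirlpool)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_query(query: str, secret: bytes) -> str:
    """Hash or redact sensitive parameters, leaving all others untouched."""
    cleaned = []
    for name, value in parse_qsl(query.lstrip("?"), keep_blank_values=True):
        if name in HASHED_PARAMS:
            value = hash_value(value, secret)
        elif name in REDACTED_PARAMS:
            value = "redacted"
        cleaned.append((name, value))
    return "?" + urlencode(cleaned) if cleaned else ""


if __name__ == "__main__":
    print(redact_query("?search=cats&token=tok-XYZ&action=view", b"example-secret"))
```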

If it is in a sensitive namespace, then we have to hash several fields.

  1. Path: We replace all instances of the title db_key (e.g. Main_Page) in the path with the hashed value of the title db_key.
  2. Query: We replace all instances of the title db_key (e.g. Main_Page) in the query with the hashed value of the title db_key.
  3. Title: We replace the title with the hashed value of the title.
  4. Page title: We replace any instances of the title db_key or title text with the hashed value of the title db_key.

With the above steps done, we send the hashed data via the EditorJourney schema.
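
The sensitive-namespace steps above might be sketched as follows. This is an illustration only: the function and key names are hypothetical, the set of sensitive namespaces passed in is a made-up example (the real list lives in the configuration repository and varies by wiki), and SHA-256 again stands in for whirlpool.

```python
import hashlib
import hmac
import re


def hash_value(value: str, secret: bytes) -> str:
    """Keyed hash of a value (SHA-256 stand-in; the footnote names whirlpool)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def replace_all(text: str, needles, replacement: str) -> str:
    """Replace every needle in one pass so earlier replacements are not re-scanned."""
    ordered = sorted(set(needles), key=len, reverse=True)
    pattern = "|".join(re.escape(n) for n in ordered)
    return re.sub(pattern, lambda _match: replacement, text)


def hash_sensitive_fields(event, db_key, title_text, secret, sensitive_namespaces):
    """Return a copy of the event with title-derived fields hashed if needed."""
    if event["namespace"] not in sensitive_namespaces:
        return dict(event)  # not sensitive: log as-is
    hashed_key = hash_value(db_key, secret)
    out = dict(event)
    out["path"] = replace_all(event["path"], [db_key], hashed_key)
    out["query"] = replace_all(event["query"], [db_key], hashed_key)
    out["title"] = hash_value(event["title"], secret)
    out["page_title"] = replace_all(event["page_title"], [db_key, title_text], hashed_key)
    return out


if __name__ == "__main__":
    event = {"namespace": 2, "path": "/wiki/User:Jane_Doe",
             "query": "?title=User:Jane_Doe&action=edit",
             "title": "Jane Doe", "page_title": "User:Jane Doe"}
    print(hash_sensitive_fields(event, "Jane_Doe", "Jane Doe", b"example-secret", {2, 3}))
```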


Footnotes

  1. We are using PHP's hash_hmac function with the whirlpool algorithm. The hash secret is generated once per user and is stored in Redis with a 24-hour TTL. The key used to look up the secret is generated by hashing the user ID and the user account registration timestamp. The end result is that we obfuscate the URLs a user visits to protect their privacy: we can still see patterns in page views, but we do not know which particular pages are being viewed.
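
The per-user secret scheme in the footnote can be illustrated with the Python sketch below. It is not the extension's code: an in-memory dictionary stands in for Redis, SHA-256 stands in for whirlpool, and the names `SecretStore`, `secret_lookup_key`, and `hash_for_user`, the way the user ID and timestamp are combined, and the behaviour after the TTL expires are all assumptions.

```python
import hashlib
import hmac
import secrets
import time

SECRET_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, per the footnote


class SecretStore:
    """In-memory stand-in for Redis: maps lookup key -> (secret, expiry time)."""

    def __init__(self):
        self._data = {}

    def get_or_create(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        # Assumption: a fresh secret is generated if none exists or it expired.
        secret = secrets.token_bytes(32)
        self._data[key] = (secret, now + SECRET_TTL_SECONDS)
        return secret


def secret_lookup_key(user_id: int, registration_ts: str) -> str:
    """Derive the store key by hashing the user ID and registration timestamp."""
    return hashlib.sha256(f"{user_id}:{registration_ts}".encode("utf-8")).hexdigest()


def hash_for_user(value, user_id, registration_ts, store, now=None):
    """HMAC a value with the user's secret (SHA-256 stand-in for whirlpool)."""
    secret = store.get_or_create(secret_lookup_key(user_id, registration_ts), now)
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    store = SecretStore()
    first = hash_for_user("Main_Page", 42, "20190101120000", store)
    second = hash_for_user("Main_Page", 42, "20190101120000", store)
    print(first == second)
```

Because the secret is stable for a given user while it lives in the store, repeated views of the same page produce the same hash, which is what preserves the page-view patterns without revealing the page itself.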