User:MPopov (WMF)/Notes/Counting unique app users

From Meta, a Wikimedia project coordination wiki

When unique device identifiers are not available (which is the case across all our sites, and in the apps when a user has not opted in to in-app analytics[1]), we want to be able to count the number of unique devices. Analytics Engineering has done a lot of work in this area and has come up with a great solution for counting unique devices.[2] In cases where we can't use the last-access "timeslip" and need to estimate the number of unique devices from webrequests, I thought it would be helpful to compare different combinations of fields.

Methods

Below are 11 different combinations of fields that were hashed and used to count unique devices:

Method | Fields hashed
1 | User agent
2 | IP address
3 | User agent, IP address
4 | User agent, Accept-Language header
5 | IP address, Accept-Language header
6 | User agent, City (IP-geolocated)
7 | User agent, Timezone (IP-geolocated)
8 | User agent, Accept-Language header, IP address
9 | User agent, Accept-Language header, City
10 | User agent, Accept-Language header, Timezone
11 | User agent, Accept-Language header, Geographical Subdivision

Querying Hive

I restricted my query to pageviews made through the Wikipedia Android & iOS apps on mobile devices (which are not known spiders/bots, although with apps that's not really an issue) by users who have opted in to in-app analytics, since those requests include an app install ID that we can count to get the true number of unique devices. Each method's accuracy is assessed by calculating the relative error between the true count and the estimated count.
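The relative error used throughout can be sketched as a small R function (the function name is mine; the formula matches the one used in the analysis code below):

```r
# Relative error between the true unique-device count (from app install IDs)
# and an estimate derived from hashed request fields.
relative_error <- function(n_true, n_estimated) {
  abs(n_true - n_estimated) / n_true
}

relative_error(1000, 1200)  # 0.2, i.e. a 20% over-count
```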

Results

Accuracy by OS

Multiple users connected to the same wireless network (e.g. a school campus) – and thus sharing a public IP address – are a problem, but not as big a problem as mobility. Specifically, we can see requests from the same device from multiple IP addresses throughout the day as the owner travels. Every new location is potentially a new IP address. Even so, using the IP address by itself or in combination with the User-Agent (a popular technique) yields the best results, with a consistent over-counting error of 20% on both operating systems. After seeing the results, I added a twelfth method that takes the count from method 3 and multiplies it by a scaling constant (1/1.2 ≈ 0.83). Even though using an IP address leads to over-counting, none of the other methods came close.
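The twelfth method is just a rescaling of method 3's count; a minimal sketch (using the rounded 0.83 constant, as in the appendix code):

```r
# Method 12: correct the Method 3 (UA+IP) estimate for its consistent
# ~20% over-count by scaling it down. The prose derives 1/1.2 ≈ 0.83;
# the appendix code rounds this to 0.83, which is what we use here.
n_users_est12 <- function(n_users_est03) {
  n_users_est03 * 0.83
}

n_users_est12(1200)  # 996, close to a true count of 1000
```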

Accuracy by country, grouped by userbase size

Curious about the accuracy per-country and how the results vary by how many users we see from each country, I calculated the daily error for each method-country combination and took the median. I also calculated the median number of users per day, which I used to group the countries into 5 categories of users per day: very few (0-100), few (100-500), decent (500-1K), many (1K-5K), and A LOT (more than 5K). Note: these categories were chosen sensibly (albeit arbitrarily), not scientifically. Unsurprisingly, the more actual users there were in a country, the larger the relative error was for every method that did not rely on the IP address.
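The five userbase-size categories can be produced with base R's cut() over the bucket boundaries above (the wrapper function name is mine; the full version appears in the visualization code below):

```r
# Bucket countries by their median number of daily users into the five
# categories described above. cut() uses right-closed intervals by default,
# so e.g. 100 falls into "Very few (0-100)".
bucket_users <- function(avg_daily_users) {
  cut(
    avg_daily_users,
    breaks = c(0, 100, 500, 1000, 5000, Inf),
    labels = c("Very few (0-100)", "Few (100-500)", "Decent (500-1K)",
               "Many (1K-5K)", "A LOT (5K+)")
  )
}

as.character(bucket_users(c(50, 2500)))  # "Very few (0-100)" "Many (1K-5K)"
```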

Conclusion

Device identification through hashing the concatenation of the User-Agent string with the IP address remains the best option in the absence of an actual device identifier or a last-access timeslip. Whenever possible, I recommend scaling by a pre-calculated constant.

Appendix: Code

Querying

library(zeallot) # provides %<-% for multiple assignment

query <- "SET mapred.job.queue.name=nice;
WITH android_app_pageviews AS (
  SELECT DISTINCT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
    user_agent_map['os_family'] AS operating_system,
    x_analytics_map['wmfuuid'] AS app_install_id,
    accept_language,
    user_agent,
    client_ip AS ip_address,
    geocoded_data['city'] AS city,
    geocoded_data['timezone'] AS timezone,
    geocoded_data['subdivision'] AS subdivision,
    geocoded_data['country'] AS country
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = ${year} AND month = ${month} AND day = ${day}
    AND user_agent LIKE('WikipediaApp%')
    AND (
      (PARSE_URL(CONCAT('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
      OR (uri_path LIKE '/api/rest_v1%' AND uri_query == '')
    )
    AND COALESCE(x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
)
SELECT
  date, country, operating_system,
  COUNT(DISTINCT app_install_id) AS n_users_true,
  COUNT(DISTINCT MD5(user_agent)) AS n_users_est01,
  COUNT(DISTINCT MD5(ip_address)) AS n_users_est02,
  COUNT(DISTINCT MD5(CONCAT(user_agent, ip_address))) AS n_users_est03,
  COUNT(DISTINCT MD5(CONCAT(user_agent, accept_language))) AS n_users_est04,
  COUNT(DISTINCT MD5(CONCAT(ip_address, accept_language))) AS n_users_est05,
  COUNT(DISTINCT MD5(CONCAT(user_agent, city))) AS n_users_est06,
  COUNT(DISTINCT MD5(CONCAT(user_agent, timezone))) AS n_users_est07,
  COUNT(DISTINCT MD5(CONCAT(user_agent, accept_language, ip_address))) AS n_users_est08,
  COUNT(DISTINCT MD5(CONCAT(user_agent, accept_language, city))) AS n_users_est09,
  COUNT(DISTINCT MD5(CONCAT(user_agent, accept_language, timezone))) AS n_users_est10,
  COUNT(DISTINCT MD5(CONCAT(user_agent, accept_language, subdivision))) AS n_users_est11
FROM android_app_pageviews
GROUP BY date, country, operating_system
HAVING n_users_true > 1
ORDER BY date, country, operating_system
LIMIT 10000;"

get_components <- function(date) {
  year <- lubridate::year(date)
  month <- lubridate::month(date)
  day <- lubridate::mday(date)
  return(list(year, month, day))
}

start_date <- as.Date("2018-02-01")
end_date <- as.Date("2018-02-28")
results <- do.call(rbind, lapply(seq(start_date, end_date, by = "day"), function(date) {
  message("Fetching data from ", format(date))
  c(year, month, day) %<-% get_components(date)
  query <- glue::glue(query, .open = "${", .close = "}")
  result <- wmf::query_hive(query)
  return(result)
}))
# Packages
library(magrittr)
library(ggplot2)
library(ggrepel)

# Utility Functions
aggregate_error <- function(grouped_input) {
  collapsed_output <- grouped_input %>%
    dplyr::summarize(
      n_users_true = sum(n_users_true),
      n_users_estimated = sum(n_users_estimated)
    ) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(
      relative_error = abs(n_users_true - n_users_estimated) / n_users_true,
      direction = dplyr::if_else(n_users_estimated > n_users_true, "overcounted", "undercounted")
    )
  return(collapsed_output)
}
expand_method <- function(input) {
  methods <- tibble::tibble(
    method = paste("Method", 1:12),
    fields = c("UA", "IP", "UA+IP", "UA+AL", "IP+AL", "UA+City", "UA+TZ", "UA+AL+IP", "UA+AL+City", "UA+AL+TZ", "UA+AL+Subdiv", "83% Method 3")
  )
  output <- input %>%
    dplyr::left_join(methods, by = "method") %>%
    dplyr::mutate(
      label = paste0(method, " (", fields, ")"),
      label = factor(label, levels = unique(label))
    )
  return(output)
}

Importing Data

unique_counts <- readr::read_csv("data/unique-counts_2018-02.csv") %>%
  dplyr::mutate(
    country = stringi::stri_trans_general(country, "Latin-ASCII"),
    n_users_est12 = n_users_est03 * 0.83
  ) %>%
  tidyr::gather(method, n_users_estimated, -c(date, country, operating_system, n_users_true)) %>%
  dplyr::mutate(
    method = factor(as.numeric(sub("n_users_est([0-9]{2})", "\\1", method)), 1:12, paste("Method", 1:12)),
    relative_error = abs(n_users_true - n_users_estimated) / n_users_true,
    direction = dplyr::if_else(n_users_estimated > n_users_true, "overcounted", "undercounted")
  )

Data Visualization

Here is the R code I used to visualize accuracy of unique user counting methods:

unique_counts %>%
  dplyr::group_by(date, operating_system, method) %>%
  aggregate_error %>%
  expand_method %>%
  ggplot(aes(x = date, y = relative_error, color = operating_system)) +
  geom_point(aes(shape = direction)) + geom_line() +
  scale_shape_manual(values = c(24, 25)) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~ label) +
  labs(
    x = "Date", y = "Relative error",
    color = "Mobile device operating system", shape = "Direction of error",
    title = "Accuracy of unique user counting methods",
    subtitle = "Each estimate was compared to the true count of unique app install IDs"
  ) +
  wmf::theme_facet(14)

Here is the R code I used to visualize accuracy of unique user counting methods within countries by userbase size:

# Aggregates
per_country <- unique_counts %>%
  dplyr::group_by(method, country, date) %>%
  aggregate_error %>%
  dplyr::group_by(method, country) %>%
  dplyr::summarize(
    avg_relative_error = median(relative_error),
    avg_daily_users = median(n_users_true)
  ) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    daily_users_short = factor(
      as.numeric(cut(avg_daily_users, c(0, 100, 500, 1000, 5000, Inf))), 1:5,
      c("0-100", "100-500", "500-1K", "1K-5K", "5K+")
    ),
    daily_users_long = factor(
      as.numeric(cut(avg_daily_users, c(0, 100, 500, 1000, 5000, Inf))), 1:5,
      c("Very few (0-100)", "Few (100-500)", "Decent (500-1K)", "Many (1K-5K)", "A LOT (5K+)")
    )
  ) %>%
  expand_method

# Outliers
per_country_outliers <- per_country %>%
  dplyr::group_by(method, label, daily_users_short, daily_users_long) %>%
  dplyr::mutate(
    Q1 = quantile(avg_relative_error, 0.25), Q3 = quantile(avg_relative_error, 0.75),
    IQR = Q3 - Q1, lower = Q1 - 1.5 * IQR, upper = Q3 + 1.5 * IQR,
    outlier = avg_relative_error >= upper | avg_relative_error <= lower
  ) %>%
  dplyr::top_n(2, avg_relative_error) %>%
  dplyr::ungroup() %>%
  dplyr::filter(outlier)

# Plot
ggplot(per_country, aes(x = daily_users_short, y = avg_relative_error, color = daily_users_long)) +
  geom_boxplot(outlier.shape = NA, position = "dodge") +
  geom_point(alpha = 0.01) +
  geom_point(data = per_country_outliers) +
  geom_label_repel(data = per_country_outliers, aes(label = country), show.legend = FALSE) +
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(0, 1, 0.2)) +
  coord_cartesian(ylim = c(0, 1)) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~ label) +
  labs(
    x = "Users/day", y = "Relative error",
    color = "Number of app users/day (median across 2018-02)",
    title = "Accuracy of unique user counting methods within countries, by userbase size",
    subtitle = "Each estimate was compared to the true count of unique app install IDs"
  ) +
  wmf::theme_facet(14)

References

  1. X-Analytics#Keys
  2. Introducing the unique devices dataset: a new way to estimate reach on Wikimedia projects (blog.wikimedia.org)