User:MPopov (WMF)/Notes/Querying JSON-containing data

From Meta, a Wikimedia project coordination wiki
event.edit_tasks in different useragent.wmf_app_version
2.7.50309-r-2020-01-08 2.7.50322-r-2020-06-08
{
	"add-caption" : {
		"cancels" : 0,
		"clicks" : 0,
		"failures" : 0,
		"impressions" : 0,
		"successes" : 0,
		"suggestions_clicked" : 0
	},
	"add-description" : {
		"cancels" : 0,
		"clicks" : 1,
		"failures" : 0,
		"impressions" : 0,
		"successes" : 1,
		"suggestions_clicked" : 1
	},
	"translate-caption" : {
		"cancels" : 0,
		"clicks" : 0,
		"failures" : 0,
		"impressions" : 0,
		"successes" : 0,
		"suggestions_clicked" : 0
	}
}
{
	"a-c" : {
		"cxl" : 0,
		"clk" : 1,
		"fl" : 0,
		"imp" : 7,
		"suc" : 1,
		"sg" : 1
	},
	"a-d" : {
		"cxl" : 2,
		"clk" : 2,
		"fl" : 0,
		"imp" : 7,
		"suc" : 0,
		"sg" : 1
	},
	"t-c" : {
		"cxl" : 0,
		"clk" : 0,
		"fl" : 0,
		"imp" : 9,
		"suc" : 1,
		"sg" : 0
	}
}

These are my notes for querying JSON-containing event data with Presto, which can be easily used in Superset's SQL Lab. HiveQL's library of UDFs has very limited support[1] for JSON data while Presto has a lot of functions.[2]

Data[edit]

For this example we will be working with the schema MobileWikiAppSuggestedEdits which has a field edit_tasks that the Suggested Edits instrument in the Wikipedia Android app populates with a JSON string containing counts of suggestions seen ("impressions"), clicked ("unique clicks"), and edited ("successes") split by edit task – the type of suggestions (image captions to add/translate, article descriptions to add/translate). Two examples of this data can be found in the table on the right.

Note: because this field is specified as a free-form string in the legacy EventLogging schema, the instrument can put whatever in there – including changing the names of the fields between versions without changing the schema's revision. In a modern Event Platform schema, that's certainly possible but it would be better to define the edit_tasks field as an object with sub-fields, so that the event data can be automatically made into a map in the table.[3]

Query in Hive[edit]

First, the JSON string is processed via a user-defined table generating function (UDTF) JSON_TUPLE and then it needs to be used in conjunction with Lateral View, which turns the fields into tables and makes those fields' sub-fields available as columns:

SELECT
  SUBSTR(dt, 1, 10) AS `date`, event.source,
  SUM(COALESCE(caption_addition.successes, 0L)) AS captions_added,
  SUM(COALESCE(description_addition.successes, 0L)) AS descriptions_added
FROM mobilewikiappsuggestededits
LATERAL VIEW
  JSON_TUPLE(event.edit_tasks, 'a-d', 't-d', 'a-c', 't-c', 'i-t') edit_tasks
  AS add_desc, translate_desc, add_cap, translate_cap, image_tags
LATERAL VIEW
  JSON_TUPLE(edit_tasks.add_desc, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') description_addition
  AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
LATERAL VIEW
  JSON_TUPLE(edit_tasks.add_cap, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') caption_addition
  AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
  AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08'
GROUP BY SUBSTR(dt, 1, 10), event.source
ORDER BY `date`
LIMIT 1000;

Note: While there is a GET_JSON_OBJECT to retrieve a single key's value, this solution is more efficient than calling GET_JSON_OBJECT multiple times.

The results in 112.011 seconds
date source captions_added descriptions_added
2020-06-15 suggestedEdits 9 87
2020-06-15 feed 1 22
2020-06-16 feed 5 37
2020-06-16 suggestedEdits 16 100
2020-06-17 feed 3 47
2020-06-17 suggestedEdits 22 119
2020-06-18 feed 1 26
2020-06-18 suggestedEdits 17 166
2020-06-19 feed 0 39
2020-06-19 suggestedEdits 14 148
2020-06-20 feed 3 30
2020-06-20 suggestedEdits 20 74
2020-06-21 feed 2 24
2020-06-21 suggestedEdits 26 86
2020-06-22 feed 7 28
2020-06-22 suggestedEdits 16 69
2020-06-23 feed 4 30
2020-06-23 suggestedEdits 19 88

Query in Presto[edit]

The process is vaguely similar to Hive. First, we need to process the JSON string and only then can we operate on it. But unlike Hive's UDTF which generates a table and requires LATERAL VIEW, we can cast the JSON data into a map of maps. Presto has extensive set of functions for working with maps.[4]

WITH parsed_json AS (
    SELECT
        SUBSTR(dt, 1, 10) AS "date", event.source,
        CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
    FROM mobilewikiappsuggestededits 
    WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
      AND event.edit_tasks != '{}'
      AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
)
SELECT
    "date", source,
    SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) AS captions_added,
    SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) AS descriptions_added
FROM parsed_json
GROUP BY "date", source
ORDER BY "date", source;

Note: had the instrument not omitted the fields and we could rely on the data in edit_tasks to be complete (even if the values were all 0s) and use underscores as separators, we would be able to use the JSON_EXTRACT function in Presto with a JSONPath like '$.a_c.suc' to retrieve the exact values from the parsed JSON without converting it to a map.

The results in 3.47 seconds
date source captions_added descriptions_added
2020-06-15 feed 1 22
2020-06-15 suggestedEdits 9 87
2020-06-16 feed 5 37
2020-06-16 suggestedEdits 16 100
2020-06-17 feed 3 47
2020-06-17 suggestedEdits 22 119
2020-06-18 feed 1 26
2020-06-18 suggestedEdits 17 166
2020-06-19 feed 0 39
2020-06-19 suggestedEdits 14 148
2020-06-20 feed 3 30
2020-06-20 suggestedEdits 20 74
2020-06-21 feed 2 24
2020-06-21 suggestedEdits 26 86
2020-06-22 feed 7 28
2020-06-22 suggestedEdits 16 69
2020-06-23 feed 4 30
2020-06-23 suggestedEdits 19 88

Reducing maps and arrays[edit]

Suppose we're interested in counting the total number of interactions. The edit_tasks map looks like

{"i-t": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 5}, "a-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "a-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 3}, "t-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "t-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 6}}

And we want to sum across all of those to yield a single number: 18. Here's how we can accomplish this in Presto:

WITH parsed_json AS (
    SELECT
        SUBSTR(dt, 1, 10) AS "date",
        CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
    FROM mobilewikiappsuggestededits 
    WHERE year = 2020 AND month = 9 AND day = 24 AND hour = 12
      AND event.edit_tasks != '{}'
      AND useragent.wmf_app_version >= '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
    LIMIT 100
)
SELECT
  date, edit_tasks,
  -- REDUCE(MAP_VALUES(TRANSFORM_VALUES(edit_tasks, (k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s))), 0, (s, x) -> s + x,s -> s) AS total_actions
  -- ^ Deconstructed:
  REDUCE(
    MAP_VALUES( -- Second: Now that we have a map like { "i-t": 4, ..., "a-c": 0 }, let's extract those counts into an array [4, ..., 0]
      TRANSFORM_VALUES( -- First: We're going to aggregate within each type of task (i-t, a-c, etc.)
        edit_tasks,
        (k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s)) -- this sums across suc, clk, sg, cxl, fl, imp
      ),
    0, (s, x) -> s + x, s -> s -- Third: Reduce/sum that array (e.g. [4, ..., 0]) to give us the total number of actions performed
  ) AS total_actions
FROM parsed_json;

The above query uses map functions map_values() and transform_values() to operate on the map and extract values to arrays so that the array function reduce() can be used.

References[edit]