Quite often you have to deal with data sets that contain JSON objects in some columns only. For example, consider the following event stream data containing JSON in payload
column:
id | timestamp | type | payload |
---|---|---|---|
1 | 2016-11-09 11:11:13.235 | A | {“a”:”1″,”b”:”2″,”c”:”3″} |
2 | 2016-11-09 13:18:34.000 | A | {“a”:”1″,”b”:”8″,”c”:”0″} |
3 | 2016-11-09 15:20:31.568 | A | {“a”:”3″,”b”:”7″,”c”:”6″} |
Parsing Simple JSON Objects
If JSON data contains just key-value pairs i.e. {"key":"value","key2":"value2",...}
you can use JsonStringToMap
UDF to extract the required values by key. For example,
-- JsonStringToMap is not built-in UDF, so you have to register it register '/usr/lib/pig/lib/*.jar'; define JsonStringToMap com.twitter.elephantbird.pig.piggybank.JsonStringToMap(); -- Load data d = load 'data.txt' using PigStorage() as (id:chararray, ts:chararray, type:chararray, payload:chararray); -- Transform JSON object to Pig MAP d1 = foreach d generate JsonStringToMap(payload) as payload; -- Now you can work with JSON object as a Pig MAP and extract values d2 = foreach d1 generate payload#'a', payload#'b', payload#'c'; dump d2;
The last statement outputs values for a, b
and c
keys for every row:
(1,2,3) (1,8,0) (3,7,6)
Parsing JSON Arrays
Unfortunately, JsonStringToMap
UDF does not work with JSON arrays, i.e. data in the following format:
[ {"key":"value","key2":"value2",...}, {"key":"valueA","key2":"value2A",...}, ... ]
There are a dozen of custom UDFs written to work with JSON arrays, but I would like to show you how to write your own Python UDF to iterate JSON arrays and extract any required data.
Let’s assume that payload
column now contains the following data:
[{"a":"1"},{"b":"2"},{"c":"3"}] [{"a":"1"},{"a":"8"}] [{"a":"3","b":"7","c":"6"}]
Here we have 3 rows, every row contains a JSON array. Note that the first array contains 3 JSON objects, the second array contains 2 objects, and the third array contains just one JSON object (with 3 key-value pairs).
Let’s write a Pig UDF in Python that returns the number elements in array, and the last value for a
key in each array:
import com.xhaus.jyson.JysonCodec as json @outputSchema('result:(cnt:int, a:int)') def getItems(data): cnt = int(0) val = int(0) jdata = json.loads(data) # iterate items in JSON array, each item is a JSON object for i in jdata: cnt += 1 # check whether key "a" exists in the current JSON object if "a" in i: val = i["a"] return (cnt, val)
Now we can invoke our UDF as follows:
register './getItems.py' USING jython as gi; d1 = foreach d generate gi.getItems(payload); dump d1;
The last statement outputs the number of items in the array and the last value for a
for every row:
(3,1) (2,8) (1,3)
You can also extend the Python UDF to parse JSON structure of any complexity.