Map-side Aggregation in Hive

In MapReduce, you usually use map tasks to filter out records and reduce tasks to perform aggregations. But when you need to aggregate a large number of rows, relying on reducers alone can lead to very high network I/O. Fortunately, Hive is capable of doing map-side aggregations whenever possible.

Consider a query to calculate the total number of orders per store and product:

SELECT store, product, count(*) AS cnt 
FROM orders
GROUP BY store, product;

Without map-side aggregation, all rows have to be sent to reducers, causing high network I/O. Hive has the hive.map.aggr option, set to true by default, that specifies whether to use map-side aggregation in GROUP BY.
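If you want to compare the two execution plans yourself, the behavior can be toggled per session (hive.map.aggr is the standard Hive setting that controls it):

```sql
-- Map-side aggregation for GROUP BY is on by default
SET hive.map.aggr = true;

-- Disabling it forces every input row to be shuffled to the reducers
SET hive.map.aggr = false;
```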

Here are some execution statistics for this query on a sample orders table.


The query does not contain a WHERE clause, and you can see that while the input number of rows is 1,075,477,972 (Map input records), the number of map output rows is just 7,370,737. That’s quite a good result as map tasks were able to reduce the number of rows from 1 billion to 7.3 million.

Only 7,370,737 rows had to be sent to the reducers, which finally reduced them to 1,657,744 rows.

To perform map-side aggregation, Hive uses an in-memory hash table to hold the aggregate values. In addition, Hive offers the following options to configure map-side aggregation:

  • hive.map.aggr.hash.force.flush.memory.threshold (Default: 0.9) – When the size of the hash table exceeds this threshold, it is flushed. This means that for some keys a map task can produce multiple local counts.
  • hive.map.aggr.hash.percentmemory (Default: 0.5) – Percent of total map task memory that can be used for the hash table.
  • hive.map.aggr.hash.min.reduction (Default: 0.5) – Ratio between hash table size and input rows above which map-side aggregation is turned off. If input keys are unique (or highly selective) within each map task, there is no value in map-side aggregation.
  • hive.groupby.mapaggr.checkinterval (Default: 100,000) – After this number of rows, Hive checks the number of items in the hash table to decide whether map-side aggregation is worthwhile.
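These knobs can be adjusted per session. As a sketch (the values below are purely illustrative, not tuning recommendations):

```sql
-- Illustrative tuning for a job with many distinct GROUP BY keys:
-- give the map-side hash table a larger share of task memory
SET hive.map.aggr.hash.percentmemory = 0.6;

-- and check the hash table's effectiveness less frequently
SET hive.groupby.mapaggr.checkinterval = 200000;
```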

SELECT COUNT(*) on SequenceFiles and Text

What happens when you run a SELECT COUNT(*) statement for a table stored in the SequenceFile or text file format in Hive? How many resources does this operation require?
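For reference, the statement in question is simply (using the same orders table as above):

```sql
SELECT COUNT(*) FROM orders;
```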


Let’s run the query on a table containing about 1 billion rows (1,075,477,972, to be exact) to investigate the details. There were 242 mappers and 1 reducer launched in my example, and the job result is as follows:

Job 0: Map: 242  Reduce: 1   Cumulative CPU: 2763.69 sec   HDFS Read: 80211919808 HDFS Write: 11 SUCCESS
Total MapReduce CPU Time Spent: 46 minutes 3 seconds 690 msec
Time taken: 70.251 seconds, Fetched: 1 row(s)

First, you can see that mappers had to read all table data from HDFS (80 GB), and only the final count was written to HDFS (11 bytes).


The good news is that each mapper calculated its own local row count, so the reducer just had to sum 242 counts to get the final row count.


So when calculating the number of rows in a table stored in SequenceFile or text format, remember that map tasks will read all table rows from HDFS, and this operation can be quite expensive depending on the table size.

At the same time, the reducer does not need to process all rows: it just receives the local counts from each map task, so network I/O between the mappers and the reducer is very low, and the reduce step is inexpensive no matter how many rows the table contains.
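If all you need is the row count and the table statistics are kept up to date, Hive can often skip the scan entirely and answer COUNT(*) from the metastore. As a sketch, assuming statistics collection is enabled in your deployment:

```sql
-- Gather table-level statistics (row count, raw data size) once
ANALYZE TABLE orders COMPUTE STATISTICS;

-- Allow Hive to answer simple aggregates like COUNT(*) from stored stats
SET hive.compute.query.using.stats = true;

-- Can now be served from the metastore without a full table scan
SELECT COUNT(*) FROM orders;
```

Note that this shortcut applies only when the statistics are accurate and the query has no predicates; as soon as you add a WHERE clause, Hive is back to scanning the data.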

Note that if your query includes a WHERE condition, it still requires the same I/O resources to execute the statement:

SELECT COUNT(*) FROM orders WHERE code = 'A1';

Mappers still need to read all data from HDFS, and the code = 'A1' condition just affects their local row counts (plus some additional time to compare values). The reducer still needs to read the local row counts from 242 mappers and sum them.

Reduce = 99% or Skewed Joins in Hive

Often, when running an HQL query, you may notice that it progresses to the 99% reduce stage quite fast and then gets stuck:

2014-10-07 08:46:01,149 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 6905.85 sec
2014-10-07 08:47:01,361 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 6999.45 sec
2014-10-07 08:48:01,441 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 7065.59 sec
2014-10-07 08:49:01,670 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 7125.26 sec
2014-10-07 08:50:01,808 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 7188.12 sec

The problem is that Hive estimates the progress based on the number of completed reducers, and this is not always relevant to the actual execution progress. It is possible for a query to reach 99% in 1 minute and then spend 1 hour executing the remaining 1%.

The most typical reason for this behavior is skewed data. For example, assume that you have a table that tracks all visits to specific sites, and SITE.COM has 100M rows while a dozen other sites such as SITE.ORG, SITE.NET and so on have just 10K visitors each.

Then, when you join this table with another table by site name, one reducer has to process 100M rows while the other reducers process just 10K rows each.
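For illustration, assume a hypothetical visits table joined to a sites dimension table (both table and column names here are made up for the example):

```sql
-- During the shuffle, every VISITS row with site_name = 'SITE.COM'
-- is routed to the same reducer, which therefore receives 100M rows
SELECT v.*, s.owner
FROM visits v
JOIN sites s ON v.site_name = s.site_name;
```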

So if you have 99 sites with 10K visitors each, a single site with 100M visitors, and specify 100 reducers, then 99% of the reducers will finish their work very quickly, and you have to wait a long time for the last reducer to terminate.
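Hive has a built-in mitigation for skewed join keys; as a sketch (these are standard Hive settings, with their defaults noted in comments):

```sql
-- Process heavily skewed join keys in a follow-up map join
-- instead of funneling them through a single reducer
SET hive.optimize.skewjoin = true;   -- default: false

-- A key is treated as skewed once it exceeds this many rows
SET hive.skewjoin.key = 100000;      -- default: 100000
```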

Not only joins

Data skew issues can arise not only in joins. For example, if you perform a GROUP BY SITE_NAME in our example, a single reducer has to deal with 100M rows while the others process a much smaller number of rows.
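For skewed GROUP BY keys, Hive offers a two-stage aggregation controlled by a standard setting; a sketch:

```sql
-- Run GROUP BY as two MR jobs: the first distributes rows randomly
-- across reducers to partially aggregate them, the second merges the
-- partial results to produce the final value per key
SET hive.groupby.skewindata = true;  -- default: false
```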