SELECT COUNT(*) on SequenceFiles and Text

What happens when you run a SELECT COUNT(*) statement on a table stored in the SequenceFile or text file format in Hive? How many resources does this operation require?

SELECT COUNT(*) FROM orders;

Let’s run the query on a table containing about 1 billion rows (1,075,477,972 rows, to be exact) and look at the details. In my example 242 mappers and 1 reducer were launched, and the job result is as follows:

Job 0: Map: 242  Reduce: 1   Cumulative CPU: 2763.69 sec   HDFS Read: 80211919808 HDFS Write: 11 SUCCESS
Total MapReduce CPU Time Spent: 46 minutes 3 seconds 690 msec
OK
1075477972
Time taken: 70.251 seconds, Fetched: 1 row(s)

First, you can see that the mappers had to read all the table data from HDFS (80 GB), while only the final count was written to HDFS (11 bytes).
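
The number of map tasks is driven by the amount of input data and the split size: 80 GB spread across 242 mappers is roughly 330 MB of input per map task. If you want to influence the number of mappers, the maximum split size can be changed before running the query (a hedged sketch; the value is arbitrary, and on newer Hadoop versions the property is named mapreduce.input.fileinputformat.split.maxsize):

-- Illustrative only: cap each input split at ~512 MB so fewer map tasks are launched
SET mapred.max.split.size=536870912;
SELECT COUNT(*) FROM orders;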

The good news is that each mapper calculated its own local row count, so the reducer just had to sum 242 counts to get the final row count.

So when calculating the number of rows in a table stored in the SequenceFile or text format, remember that the map tasks have to read all table rows from HDFS, and this operation can be quite expensive depending on the table size.

At the same time, the reducer does not need to process all the rows; it just receives a local count from each mapper, so network I/O between the mappers and the reducer is very low, and the reduce operation is cheap no matter how many rows are in the table.
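
The reason each mapper can emit a single partial count is Hive's map-side aggregation, which is enabled by default; the sketch below just makes the setting explicit (reusing the orders table from above):

-- Map-side aggregation: each mapper computes a partial COUNT in memory
-- and ships only that partial result to the reducer (on by default)
SET hive.map.aggr=true;
SELECT COUNT(*) FROM orders;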

Note that if your query includes a WHERE condition, it still requires the same I/O resources to execute:

SELECT COUNT(*) FROM orders WHERE code = 'A1';

The mappers still need to read all the data from HDFS, and the code = 'A1' condition just affects their local row counts (and adds a little time to compare values). The reducer still needs to read the local row counts from 242 mappers and sum them.
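
As a side note, for the unfiltered COUNT(*) some Hive versions can skip the scan entirely and answer the query from table statistics stored in the metastore; a hedged sketch, assuming statistics have been gathered and are up to date (a filtered count such as WHERE code = 'A1' still has to read the data):

-- Gather basic table statistics (including the row count) into the metastore
ANALYZE TABLE orders COMPUTE STATISTICS;

-- Let Hive answer simple aggregates like COUNT(*) from those statistics
SET hive.compute.query.using.stats=true;

SELECT COUNT(*) FROM orders;  -- can be served from the metastore if stats are current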

Reduce = 99% or Skewed Joins in Hive

Often when running an HQL query you may notice that it progresses to the 99% reduce stage quite fast and then gets stuck:

...
2014-10-07 08:46:01,149 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 6905.85 sec
2014-10-07 08:47:01,361 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 6999.45 sec
2014-10-07 08:48:01,441 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 7065.59 sec
2014-10-07 08:49:01,670 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 7125.26 sec
2014-10-07 08:50:01,808 Stage-8 map = 100%,  reduce = 99%, Cumulative CPU 7188.12 sec

The problem is that Hive estimates progress based on the number of reducers completed, and this is not always representative of the actual execution progress. It is possible for a query to reach 99% in 1 minute and then spend 1 hour executing the remaining 1%.

The most typical reason for this behavior is skewed data. For example, assume you have a table that tracks all visits to specific sites, and SITE.COM has 100M rows while a dozen other sites (SITE.ORG, SITE.NET, etc.) have just 10K visitors each.

Then when you join this table with another by site name, one reducer has to process 100M rows while other reducers process just 10K rows each.

So if you have 99 sites with 10K visitors each, a single site with 100M visitors, and specify 100 reducers, then 99% of the reducers will finish their work very quickly, and you will have to wait a long time until the last reducer completes.
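
If the skew shows up in a join, Hive's skew join optimization can help: keys that exceed a row threshold are set aside and processed by an additional map join job instead of a single overloaded reducer. A sketch with hypothetical visits and sites tables (the table and column names, and the threshold value, are made up for this example):

-- Handle skewed join keys at runtime
SET hive.optimize.skewjoin=true;
-- A key is considered skewed once it has more than this many rows
SET hive.skewjoin.key=100000;

SELECT v.site_name, s.owner
FROM visits v
JOIN sites s ON v.site_name = s.site_name;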

Not only joins

The data skew issue can arise not only in joins. For example, if you perform a GROUP BY SITE_NAME in our example, then a single reducer has to deal with 100M rows while the others process a much smaller number of rows.
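
For aggregations the corresponding option is hive.groupby.skewindata, which rewrites the aggregation into two MapReduce jobs: the first spreads rows of the hot key randomly across reducers and produces partial aggregates, and the second combines them into the final result. A sketch using the same hypothetical visits table:

-- Two-stage aggregation to spread the hot key across reducers
SET hive.groupby.skewindata=true;

SELECT site_name, COUNT(*) AS visit_cnt
FROM visits            -- hypothetical table from the example above
GROUP BY site_name;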

Hive on MapReduce Can Launch Multiple ApplicationMasters per Single Query

Hive on MapReduce converts an HQL statement into one or more MapReduce jobs. For example, if you have 2 subqueries joined together, Hive launches multiple jobs to execute the query:

SELECT * FROM
(
  SELECT DISTINCT SOURCE_NAME
  FROM LZ_OUT
  WHERE QDATE = '1900-01-01'
) w LEFT OUTER JOIN
(
  SELECT DISTINCT SOURCE_NAME
  FROM LZ_OUT
  WHERE QDATE != '1900-01-01'
) d ON w.SOURCE_NAME = d.SOURCE_NAME; 

Excerpt from Hive log:

Launching Job 1 out of 4
Submitted application application_1412375486094_61806
The url to track the job: http://chsxedw:8088/proxy/application_1412375486094_61806/
Stage-1: number of mappers: 67; number of reducers: 1
...
Launching Job 2 out of 4
Submitted application application_1412375486094_61842
The url to track the job: http://chsxedw:8088/proxy/application_1412375486094_61842/
...
Launching Job 4 out of 4
Submitted application application_1412375486094_61863
The url to track the job: http://chsxedw:8088/proxy/application_1412375486094_61863/

For each job, Hive asks the YARN ResourceManager to launch a separate ApplicationMaster, and each ApplicationMaster has to allocate new containers for its job from scratch. This approach adds overhead to query execution in Hive.

But for long-running batch queries this approach ensures that if one of the jobs fails, it can be safely restarted, and there is no need to start the query execution from the beginning.
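
When the generated jobs do not depend on each other (like the two DISTINCT subqueries above, which only meet at the join), you can at least ask Hive to submit such independent stages concurrently; each job still gets its own ApplicationMaster, but the total wall-clock time may improve. A minimal sketch (the thread count is just an example value):

-- Run independent MapReduce stages of the same query in parallel
SET hive.exec.parallel=true;
-- Maximum number of stages submitted at the same time
SET hive.exec.parallel.thread.number=8;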