SELECT COUNT(*) on SequenceFiles and Text

What happens when you run SELECT COUNT(*) statement for a table stored in the SequenceFile or text file format in Hive? How many resources does this operation require?

SELECT COUNT(*) FROM orders;

Let’s run the query on a table containing 1 billion rows (1,075,477,972 rows) to investigate details. There were 242 mappers and 1 reducer launched in my example, and the job result is as follows:

Job 0: Map: 242  Reduce: 1   Cumulative CPU: 2763.69 sec   HDFS Read: 80211919808 HDFS Write: 11 SUCCESS
Total MapReduce CPU Time Spent: 46 minutes 3 seconds 690 msec
OK
1075477972
Time taken: 70.251 seconds, Fetched: 1 row(s)

First, you can see that mappers had to read all table data from HDFS (80 GB), and only the final count was written to HDFS (11 bytes):

selcount_seq_20141013

The good news is that each mapper calculated its own local row count, so the reducer just had to sum 242 counts to get the final row count:

selcount_seq_20141013_2

So when calculating the number of rows in a table stored in SequenceFile or text format you should remember that map tasks will read all table rows from HDFS and this operation can be quite expensive depending on the table size.

At the same time the reducer does not need to process all rows, it just receives local counts from each map, so network I/O between mappers and the reducer is very low, and the reduce operation is not expensive no matter how many rows are in the table.

Note that if your query includes a WHERE condition, it still requires the same I/O resources to execute the statement:

SELECT COUNT(*) FROM orders WHERE code = 'A1';

Mappers still need to read all data from HDFS, and code = 'A1' condition just affects on their local rows counts (and additional time to compare values). The reducer still needs to read local row counts from 242 mappers and summarize them.

Leave a Reply