Let’s evaluate the performance of a table scan using various file formats available in Hive. I have a table containing 32M rows, and I need to execute a simple query that returns only 43 rows:
SELECT * FROM channels WHERE code = 'NOT APPLICABLE';
The size of the table when stored in different formats with different compression codecs is as follows:
| Format | Compression | Size |
|---|---|---|
| Text | Uncompressed | 13.1 GB |
| SequenceFile | Uncompressed | 12.3 GB |
| SequenceFile | Snappy | 1.3 GB |
| ORCFile | Snappy | 561 MB |
| ORCFile | ZLIB | 341 MB |
You can see how compression, and especially the ORCFile format, reduces the storage size. Now let’s run the query and see its performance.
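As a quick sanity check on the figures in the table, we can work out how much smaller each format is relative to uncompressed text:

```python
# Storage sizes taken from the table above, normalized to MB.
GB = 1024  # MB per GB

sizes_mb = {
    "Text / Uncompressed": 13.1 * GB,
    "SequenceFile / Uncompressed": 12.3 * GB,
    "SequenceFile / Snappy": 1.3 * GB,
    "ORCFile / Snappy": 561,
    "ORCFile / ZLIB": 341,
}

baseline = sizes_mb["Text / Uncompressed"]
for fmt, mb in sizes_mb.items():
    # Ratio of the uncompressed text size to this format's size.
    print(f"{fmt:28s} {mb:8.0f} MB  ({baseline / mb:4.1f}x smaller than text)")
```

ORCFile with ZLIB ends up roughly 39x smaller than the uncompressed text table.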
SequenceFile Snappy
First let’s query the table containing data in SequenceFile format with Snappy compression:
SELECT * FROM channels WHERE code = 'NOT APPLICABLE';
Data in HDFS:
9 map tasks were executed, each completing within 15-30 seconds:
Looking at the log of the fastest mapper, we can see that it read and processed 3M rows:
The job read 1.2 GB of data from HDFS, which is close to the full SequenceFile size:
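A SequenceFile carries no column statistics, so each mapper has to decompress its entire split and evaluate the predicate against every row. A minimal sketch of that behavior (plain Python standing in for a mapper, with made-up rows and counts scaled down from the real ~3M rows per split):

```python
# Sketch of a full scan: with no per-block statistics, the predicate
# must be evaluated against every row in the split.
NUM_ROWS = 100_000  # scaled down; one real mapper processed ~3M rows
codes = ["NOT APPLICABLE" if i % 25_000 == 0 else f"CH-{i}"
         for i in range(NUM_ROWS)]

rows_read = 0
matches = 0
for code in codes:               # every row is decoded and tested
    rows_read += 1
    if code == "NOT APPLICABLE":
        matches += 1

print(f"read {rows_read} rows, matched {matches}")
# → read 100000 rows, matched 4
```

All rows are read to find a handful of matches, which is why the job had to pull essentially the whole file from HDFS.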
ORCFile ZLIB
Now let’s query the table containing data in ORCFile format with ZLIB compression:
SELECT * FROM channels WHERE code = 'NOT APPLICABLE';
Data in HDFS:
A single map task was launched and completed in 12 seconds:
The mapper log shows that it processed only 1,000 rows:
Note that although the table size is 341 MB, only 86 MB were read from HDFS:
The ORCFile format not only reduces the storage size, but also helps avoid full table scans. Its inline indexes (min/max statistics stored per block of rows) determine which blocks of data to read and which to skip. As you can see, ORCFile can dramatically reduce the resources required to perform a table scan.
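The effect of those inline indexes can be sketched as follows: if each block of rows carries min/max statistics for a column, any block whose range cannot contain the searched value is skipped without reading its rows. This is a simplified model with invented block boundaries, not the actual ORC reader:

```python
# Simplified model of ORC-style block skipping: each block stores
# min/max statistics for a column; a block's rows are read only if
# the searched value could fall inside that range.
blocks = [
    {"min": "AAA", "max": "CHZ", "rows": ["ABC", "CHN"]},
    {"min": "DAA", "max": "MZZ", "rows": ["DEF", "MTV"]},
    {"min": "NAA", "max": "NZZ", "rows": ["NBC", "NOT APPLICABLE"]},
    {"min": "OAA", "max": "ZZZ", "rows": ["PBS", "TNT"]},
]

target = "NOT APPLICABLE"
blocks_read = 0
matches = []
for block in blocks:
    # Index check: cheap metadata comparison, no row I/O.
    if block["min"] <= target <= block["max"]:
        blocks_read += 1
        matches.extend(r for r in block["rows"] if r == target)

print(blocks_read, matches)
# → 1 ['NOT APPLICABLE']
```

Only one of four blocks is touched, which mirrors why the ORCFile query read 86 MB instead of the full 341 MB.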