ORCFile Table Scan Performance

Let’s evaluate the performance of a table scan using the various file formats available in Hive. I have a table containing 32 million rows, and I need to execute a simple query that returns only 43 rows:

SELECT * FROM channels WHERE code = 'NOT APPLICABLE';

The size of the table when stored in different formats with different compression codecs is as follows:

Format         Compression    Size
------------   ------------   -------
Text           Uncompressed   13.1 GB
SequenceFile   Uncompressed   12.3 GB
SequenceFile   Snappy         1.3 GB
ORCFile        Snappy         561 MB
ORCFile        ZLIB           341 MB
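For reference, an ORC copy of such a table can be produced with a statement along these lines (a sketch; the target table name `channels_orc` is an assumption, and the source table is the `channels` table queried above):

```sql
-- Sketch: create an ORC-format copy of the table with ZLIB compression
-- (the target table name is assumed for illustration)
CREATE TABLE channels_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB")   -- or "SNAPPY"
AS SELECT * FROM channels;
```

The SequenceFile variants can be built the same way with `STORED AS SEQUENCEFILE` and the appropriate compression settings.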

You can see how compression, and especially the ORCFile format, reduces the storage size. Now let’s run the query and see how it performs.

SequenceFile Snappy

First let’s query the table containing data in SequenceFile format with Snappy compression:

SELECT * FROM channels WHERE code = 'NOT APPLICABLE';

Data in HDFS:

[screenshot: orcfile_20141010]

9 map tasks were executed, each completing within 15-30 seconds:

[screenshot: orcfile_20141010_1]

Looking at the log of the fastest mapper, we can see that it read and processed 3 million rows:

[screenshot: orcfile_20141010_2]

The job read 1.2 GB of data from HDFS, which equals the full size of the SequenceFile table; since SequenceFile carries no indexes, the query requires a full scan:

[screenshot: orcfile_20141010_3]

ORCFile ZLIB

Now let’s query the table containing data in ORCFile format with ZLIB compression:

SELECT * FROM channels WHERE code = 'NOT APPLICABLE';

Data in HDFS:

[screenshot: orcfile_20141010_4]
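The same information can be pulled from the Hive CLI directly; a sketch (the warehouse path below is hypothetical, the actual location depends on your configuration):

```sql
-- Show the storage format, SerDe, and HDFS location of the table
DESCRIBE FORMATTED channels;

-- Summarize the size of the table's directory in HDFS
-- (the path below is hypothetical; take it from the output above)
dfs -du -s -h /user/hive/warehouse/channels;
```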

A single map task was launched, completing in 12 seconds:

[screenshot: orcfile_20141010_5]

The mapper log shows that it processed only 1,000 rows:

[screenshot: orcfile_20141010_6]

Note that although the table size is 341 MB, only 86 MB were read from HDFS:

[screenshot: orcfile_20141010_7]

The ORCFile format not only reduces storage size, it also helps avoid a full table scan. Its built-in indexes determine which blocks of data to read and which to skip. As you can see, ORCFile can dramatically reduce the resources required to perform a table scan.
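Whether these indexes are actually consulted depends on predicate pushdown being enabled; a minimal sketch (the property name is standard Hive, but verify the default in your version):

```sql
-- Enable ORC predicate pushdown so row-group indexes are used
-- to skip blocks that cannot match the WHERE clause
SET hive.optimize.index.filter=true;
```

The index granularity itself is an ORC table property (`orc.row.index.stride`, 10,000 rows by default) fixed when the table is written, which is consistent with the mapper above processing 1,000-row batches rather than the whole table.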