Implementation Limitations of MapJoin in Hive 0.13 on MR

When you need to join a large table (fact) with a small table (dimension) Hive can perform a map side join. You may assume that multiple map tasks is started to read the large table and each mapper will read its own full copy of the small table and perform the join locally.

In Hive 0.13 on MapReduce engine it is implemented in a slightly different and unfortunately not always optimal way:

Step 1 – Download Side-table to the Hive Client machine

First, the data file of the side table is downloaded to the local disk of the Hive client machine which typically is not a Data Node.

You can see this from log:

Starting to launch local task to process map join;
Dump the side-table into file: file:/tmp/v-dtolpeko/hive_2014-10-01 ...
End of local task; Time Taken: 2.013 sec.
Execution completed successfully
MapredLocal task succeeded

Step 2 – Create HashTable and archive the file

Then Hive transforms the side-table data to HashTable and creates .gz archive file:

Archive 1 hash table files to file:/tmp/v-dtolpeko/hive_2014-10-01.../Stage-4.tar.gz

Step 3 – Upload HashTable to HDFS

Hive uploads the archive .gz file to HDFS and add it to the Distributed Cache.

Upload 1 archive file  from file:/tmp/v-dtolpeko/hive_2014-10-01.../Stage-4.tar.gz to 
Add 1 archive file to distributed cache. Archive file: hdfs://chsxe... 


As you can see the side-table is downloaded from HDFS, transformed into Hash Table and archived, and finally written back to HDFS.

You should take into account this behavior when optimizing Hive queries using MapJoin.

Leave a Reply