Map Join Limitations – Out of Memory in Local Task

When Hive performs a map join, it first starts a local task that reads the side table (the "small" table in the join) directly from HDFS, without launching a MapReduce job, and builds a hash table from it (for more details, see MapJoin Implementation).
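As a sketch, a join like the following can be converted to a map join automatically when auto conversion is enabled (the table and column names here are hypothetical):

set hive.auto.convert.join=true;

-- Hive treats the smaller side (countries) as the in-memory hash table
-- and streams the larger table (sales) through the mappers.
SELECT s.order_id, c.country_name
FROM sales s
JOIN countries c
  ON s.country_code = c.country_code;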

Hive builds this hash table in memory, and it imposes significant overhead. Another factor is compression: the table may look quite small on disk, but its size can grow 10x when the table is decompressed.

For example, I have a table that contains 32 million rows and takes 1.3 GB in HDFS (a SequenceFile with the Snappy compression codec), and I wanted to use it in a map join. Stored as an uncompressed text file, the same table takes about 12 GB.

By default, the maximum size of a table that can be used as the small table in a map join is 1,000,000,000 bytes (about 1 GB), so I had to increase it for my table:

set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=2000000000;
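For reference, a couple of related settings also affect map join conversion; the exact defaults vary across Hive versions and distributions:

set hive.auto.convert.join=true;                 -- enable automatic conversion to map join
set hive.mapjoin.smalltable.filesize=25000000;   -- small-table threshold used on the conditional-task path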

Then when my table is used in a join, Hive uses a map join and launches a local task:

Starting to launch local task to process map join;      maximum memory = 4261937152
Processing rows:   200000  Hashtable size: 199999  Memory usage:  137524984    percentage:  0.032
Processing rows:   300000  Hashtable size: 299999  Memory usage:  183819200    percentage:  0.043
....

You can see that the maximum memory is about 4 GB, and the percentage column shows what fraction of the process memory the in-memory hash table occupies (0.032 means 3.2%). The hash table can grow until this fraction reaches hive.mapjoin.localtask.max.memory.usage, which is 0.9 by default.
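If needed, this threshold can be raised (a sketch; pushing it close to 1.0 increases the risk of a hard OutOfMemoryError in the local task instead of a clean failure):

set hive.mapjoin.localtask.max.memory.usage=0.95;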

Eventually, my local task failed:

Processing rows:   7800000 Hashtable size: 7799999  Memory usage:  3726270024   percentage:  0.874
Processing rows:   7900000 Hashtable size: 7899999  Memory usage:  3784069496   percentage:  0.888
Processing rows:   8000000 Hashtable size: 7999999  Memory usage:  3832235792   percentage:  0.899
Processing rows:   8100000 Hashtable size: 8099999  Memory usage:  3890035464   percentage:  0.913
Execution failed with exit status: 3

The local task reached its memory limit at 0.913 (91.3% of process memory) after processing only 8.1 million of 32 million rows. This example shows how memory-expensive a map join local task is: almost 4 GB was consumed to load just 25% of the rows.
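The per-row cost can be estimated from the log above with a quick back-of-envelope query (the constants come from the last log lines; this assumes memory grows roughly linearly with row count):

SELECT 3890035464 / 8100000                  AS bytes_per_row,  -- ~480 bytes per row
       3890035464 / 8100000 * 32000000 / 1e9 AS estimated_gb;   -- ~15.4 GB for all 32M rows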

You should take this into account, especially when tables are stored compressed. The table may not look large on disk, but once decompressed it can grow 10x or more, and Hive will not be able to build the in-memory hash table needed for the map join.
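When the small table will clearly not fit, one option (a sketch) is to disable the automatic conversion for that query and let Hive fall back to a regular shuffle join:

set hive.auto.convert.join=false;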
