Hive on Tez – 2 Seconds (The Fastest Query, October 2014)

I want to estimate the best query latency you can achieve in a live environment with Hive on Tez. In other words, what is the overhead of launching tasks on Tez?

Environment (Live production cluster):
cluster_20141009

Let’s query dual, a table with a single row and a single column. Hive settings:

set hive.execution.engine=tez;
set hive.prewarm.enabled=true;

Hive on Tez does not automatically allocate a session and containers; you have to run a query first to warm up Tez. For this reason, I did not count the first execution of the query. After that first run, the best attempt is as follows:

select 1 from dual where 1 != 0;
Query ID = v-dtolpeko_20141009085353_1303a726-cddb-421c-bdc7-d47db1678fa4
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1412375486094_125195)

Map 1: -/-
Map 1: 0/1
Map 1: 1/1
Status: Finished successfully
OK
1
Time taken: 2.05 seconds, Fetched: 1 row(s)

An attempt in a less busy environment:

cluster_20141009_2

select 1 from dual where 1 != 0;
Query ID = v-dtolpeko_20141009085555_10aa1761-0ccd-422b-b147-b3391b5f512f
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1412375486094_125195)

Map 1: -/-
Map 1: 0/1
Map 1: 1/1
Status: Finished successfully
OK
1
Time taken: 1.791 seconds, Fetched: 1 row(s)

Just for comparison, let’s run on MapReduce:

set hive.execution.engine=mr;

select 1 from dual where 1 != 0;
Query ID = v-dtolpeko_20141009090707_43ff36eb-7eb5-46d3-a53a-1c383ce558b1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1412375486094_125419, Tracking URL = http://chsxedw:8088/proxy/application_1412375486094_125419/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1412375486094_125419
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2014-10-09 09:07:45,439 Stage-1 map = 0%,  reduce = 0%
2014-10-09 09:08:04,474 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1412375486094_125419
MapReduce Jobs Launched:
Job 0:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1
Time taken: 31.837 seconds, Fetched: 1 row(s)

You can see that Tez reduces the query startup overhead to about 2 seconds, compared with roughly 32 seconds on MapReduce, but it still cannot reach the 0.01-0.1 second range.
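
To put the three timings side by side, here is a minimal Python sketch that parses the "Time taken" summary lines from the runs above and computes how much faster each Tez run was than the MapReduce run. The function name time_taken and the labels tez_busy, tez_quiet and mr are my own and are not part of Hive; the regular expression assumes the summary line format shown in the transcripts.

```python
import re

# Matches the summary line the Hive CLI prints after each query,
# e.g. "Time taken: 2.05 seconds, Fetched: 1 row(s)".
TIME_TAKEN = re.compile(r"Time taken: ([0-9.]+) seconds")


def time_taken(console_output: str) -> float:
    """Return the elapsed seconds reported in a Hive CLI transcript."""
    match = TIME_TAKEN.search(console_output)
    if match is None:
        raise ValueError("no 'Time taken' line found")
    return float(match.group(1))


# Summary lines copied from the three runs above (labels are hypothetical).
runs = {
    "tez_busy": "Time taken: 2.05 seconds, Fetched: 1 row(s)",
    "tez_quiet": "Time taken: 1.791 seconds, Fetched: 1 row(s)",
    "mr": "Time taken: 31.837 seconds, Fetched: 1 row(s)",
}
seconds = {name: time_taken(line) for name, line in runs.items()}

for name in ("tez_busy", "tez_quiet"):
    ratio = seconds["mr"] / seconds[name]
    print(f"{name}: {seconds[name]:.3f} s, {ratio:.1f}x faster than MapReduce")
```

For this trivial query the Tez runs come out roughly 15 to 18 times faster than MapReduce. Since the query does almost no work, nearly all of the elapsed time in each case is startup overhead.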

Tez Sessions – Using the Same ApplicationMaster for Queries

The Hive on Tez execution engine supports sessions, which means that consecutive queries can be submitted to the same ApplicationMaster.

Let’s start Hive and run the first query:

set hive.execution.engine=tez;

select count(*) from tab1;

Query ID = v-dtolpeko_20141008091515_ccd90ab1-4771-4515-beb7-6696be762263
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1412375486094_99119)

Map 1: -/-      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK

Then let’s start the second query:

select count(*) from tab2;

Query ID = v-dtolpeko_20141008091616_e8f7f0ed-7f64-41b7-ba6e-c0fbe49191d7
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1412375486094_99119)

Map 1: -/-      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK

You can see that both queries use the same application ID, application_1412375486094_99119, which means they share the same ApplicationMaster. In contrast to MapReduce, which has to launch a new ApplicationMaster for every query (and even multiple ApplicationMasters for a single query), Tez sessions reduce the job startup time.
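
If you capture the Hive console output to a file, the same check can be automated. The sketch below is only an illustration: the helper names application_ids and same_session are hypothetical, and the pattern assumes the "Status: Running (application id: ...)" line format shown in the transcripts above.

```python
import re
from typing import List

# YARN application ids have the form application_<cluster timestamp>_<sequence>.
APP_ID = re.compile(r"application id: (application_\d+_\d+)")


def application_ids(console_output: str) -> List[str]:
    """Return every application id reported in 'Status: Running' lines."""
    return APP_ID.findall(console_output)


def same_session(console_output: str) -> bool:
    """True if at least one query ran and all queries shared one application id."""
    ids = application_ids(console_output)
    return len(ids) > 0 and len(set(ids)) == 1


# Status lines copied from the two queries above.
transcript = """
Status: Running (application id: application_1412375486094_99119)
Status: Running (application id: application_1412375486094_99119)
"""
print(same_session(transcript))
```

A result of False would indicate that a new ApplicationMaster was launched between queries, for example because the Hive session was restarted.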

Note that the ApplicationMaster does not execute the query itself; it has to launch one or more containers to do the work. To have containers launched ahead of time and kept ready for reuse, consider also specifying:

set hive.prewarm.enabled=true;