Hive on Tez – 2 Seconds (The Fastest Query, October 2014)

I am trying to estimate the fastest query time you can achieve in a live environment using Hive on Tez. In other words, what is the overhead of launching tasks on Tez?

Environment (Live production cluster):
cluster_20141009

Let’s query dual, a table with a single row and a single column. Hive settings:

set hive.execution.engine=tez;
set hive.prewarm.enabled=true;

Hive on Tez does not automatically allocate a session and containers; you have to run a query first to warm up Tez. For this reason I did not count the first execution of the query. After that first run, the best result was as follows:

select 1 from dual where 1 != 0;
Query ID = v-dtolpeko_20141009085353_1303a726-cddb-421c-bdc7-d47db1678fa4
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1412375486094_125195)

Map 1: -/-
Map 1: 0/1
Map 1: 1/1
Status: Finished successfully
OK
1
Time taken: 2.05 seconds, Fetched: 1 row(s)
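
The measurement approach used above (discard the first, cold run, then keep the best of the warm runs) can be sketched in a few lines of Python. This is only an illustration: the function name best_warm_time is hypothetical, and the sample timings other than 2.05 are made-up placeholders, not measurements from this cluster.

```python
def best_warm_time(timings_seconds):
    """Return the fastest timing, ignoring the first (cold) run.

    The first element is assumed to be the warm-up execution that
    allocates the Tez session and containers, so it is skipped.
    """
    if len(timings_seconds) < 2:
        raise ValueError("need at least one warm run after the cold run")
    warm_runs = timings_seconds[1:]
    return min(warm_runs)


# Placeholder timings in seconds: only 2.05 comes from the run above,
# the other values are invented for the example.
sample_timings = [10.0, 2.31, 2.05, 2.2]
print(best_warm_time(sample_timings))
```

Taking the minimum rather than the average is a deliberate choice here, since the goal is to estimate the lowest achievable start-up overhead, not typical latency.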

An attempt in a less busy environment:

cluster_20141009_2

select 1 from dual where 1 != 0;
Query ID = v-dtolpeko_20141009085555_10aa1761-0ccd-422b-b147-b3391b5f512f
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1412375486094_125195)

Map 1: -/-
Map 1: 0/1
Map 1: 1/1
Status: Finished successfully
OK
1
Time taken: 1.791 seconds, Fetched: 1 row(s)

Just for comparison, let’s run on MapReduce:

set hive.execution.engine=mr;

select 1 from dual where 1 != 0;
Query ID = v-dtolpeko_20141009090707_43ff36eb-7eb5-46d3-a53a-1c383ce558b1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1412375486094_125419, Tracking URL = http://chsxedw:8088/proxy/application_1412375486094_125419/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1412375486094_125419
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2014-10-09 09:07:45,439 Stage-1 map = 0%,  reduce = 0%
2014-10-09 09:08:04,474 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1412375486094_125419
MapReduce Jobs Launched:
Job 0:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1
Time taken: 31.837 seconds, Fetched: 1 row(s)

You can see that Tez cuts the query start-up overhead from about 32 seconds on MapReduce to about 2 seconds, but that is still well short of the 0.01-0.1 second range.
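
To put a number on the difference, here is a small Python sketch that pulls the "Time taken" value out of the Hive CLI output lines shown in this post and computes the MapReduce-to-Tez ratio. The helper name time_taken and the regular expression are my own assumptions about the output format, based only on the three lines quoted here.

```python
import re

# Matches lines such as "Time taken: 2.05 seconds, Fetched: 1 row(s)"
TIME_TAKEN = re.compile(r"Time taken: ([\d.]+) seconds")


def time_taken(cli_output):
    """Extract the elapsed seconds from a Hive CLI 'Time taken' line."""
    match = TIME_TAKEN.search(cli_output)
    if match is None:
        raise ValueError("no 'Time taken' line found")
    return float(match.group(1))


# The three timings reported in this post.
tez_busy = time_taken("Time taken: 2.05 seconds, Fetched: 1 row(s)")
tez_quiet = time_taken("Time taken: 1.791 seconds, Fetched: 1 row(s)")
mr = time_taken("Time taken: 31.837 seconds, Fetched: 1 row(s)")

print(f"MapReduce / Tez (busy cluster):  {mr / tez_busy:.1f}x")
print(f"MapReduce / Tez (quiet cluster): {mr / tez_quiet:.1f}x")
```

On the numbers above this works out to roughly 15.5x and 17.8x. These are single-run figures for a trivial query, so they describe start-up overhead only, not general query performance.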
