PL/HQL 0.03 Released

PL/HQL 0.03, an open-source (Apache License 2.0) procedural SQL-on-Hadoop tool, was released today. Below are the most interesting new features:

Managed Temporary Tables

PL/HQL allows you to use the DECLARE TEMPORARY TABLE statement even with Hive 0.13 and earlier:

SET plhql.temp.tables = managed;
 
DECLARE TEMPORARY TABLE orders_tmp
(
   id INT,
   name STRING
);
 
INSERT INTO orders_tmp SELECT id, name FROM orders;
SELECT * FROM orders_tmp;

PL/HQL automatically adds a UUID to the table name to prevent name conflicts between multiple sessions, and drops the table at the end of the session.
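
Under the hood this works roughly as follows; the sketch below is illustrative only, and the exact name pattern generated by PL/HQL is an assumption, not the documented format:

-- PL/HQL rewrites the temporary table to a uniquely named physical Hive table,
-- so concurrent sessions never collide on orders_tmp
-- (the suffix shown here is just an example UUID):
CREATE TABLE orders_tmp_3e8a1f04c27b4d59a6c2f8e90b1d7a3c
(
   id INT,
   name STRING
);
 
-- Later references to orders_tmp in the session are redirected to the generated
-- name, and the table is dropped automatically when the session ends:
DROP TABLE orders_tmp_3e8a1f04c27b4d59a6c2f8e90b1d7a3c;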

FOR CURSOR Loop

Now you can iterate through the query results using the FOR statement:

FOR item IN (
    SELECT dname, loc as location
    FROM dept
    WHERE dname LIKE '%A%'
    AND deptno > 10
    ORDER BY location)
LOOP
  DBMS_OUTPUT.PUT_LINE('Name = ' || item.dname || ', Location = ' || item.location);
END LOOP;

PL/HQL executes the query and runs the specified block of code for each row. You can print results, add conditional statements, nested loops, and so on.
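
For instance, conditional logic can be combined with the loop like this (a minimal sketch; the emp table and its columns are hypothetical):

FOR rec IN (SELECT ename, sal FROM emp ORDER BY sal DESC)
LOOP
  -- Branch on the current row's value before printing it
  IF rec.sal > 3000 THEN
    DBMS_OUTPUT.PUT_LINE(rec.ename || ' (high salary)');
  ELSE
    DBMS_OUTPUT.PUT_LINE(rec.ename);
  END IF;
END LOOP;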

On-the-fly SQL Conversion Introduced

PL/HQL introduces a feature to convert SQL statements on the fly, so you spend less time making your SQL statements compatible with Hive and other SQL-on-Hadoop tools.

For example, if you execute the following CREATE TABLE statement (written in Oracle syntax) through PL/HQL against Hive:

CREATE TABLE dept 
  (
    deptno NUMBER(2,0),
    dname  NUMBER(14),
    loc    VARCHAR2(13),
    CONSTRAINT pk_dept PRIMARY KEY (deptno)
  );

It is converted to the following Hive-compatible statement before execution:

CREATE TABLE dept 
(
  deptno DECIMAL(2,0),
  dname  DECIMAL(14),
  loc    STRING
);

There are many more new features; for more details, see PL/HQL 0.03 What’s New.

Architecture of a Modern Data Warehouse

The BI and Big Data worlds are converging, and this affects how we build enterprise data warehouses (EDW) today. In the traditional approach, a single RDBMS product such as Teradata, Oracle, DB2, SQL Server, Netezza, Greenplum, or Vertica serves all the needs of BI users. But now that is no longer enough.

Why? First, we have to deal with volumes of raw data that are too large to store economically in an RDBMS at roughly $10,000 per terabyte. Second, we need to provide near real-time analytics today.

So the modern approach involves three layers, which some analysts call cold, warm, and hot (see Forrester’s definition).

Cold Layer

You use Hadoop to store all enterprise data in the format in which it arrives from the source systems (the Data Lake concept). You do not perform any data transformations and do not create any models. It is a fault-tolerant archive, and you have access to all data for any historical period at any time. You no longer have to ask system administrators to restore data from a backup; the data is always available.

The archive is active: you can run MapReduce, Spark, and other data processing frameworks to query very large volumes of raw data quite effectively. You can also move a significant share of the ETL workload (70% and more) to Hadoop to clean data and build data marts.
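
For example, a small aggregated data mart can be pre-built directly in Hadoop with HiveQL before anything reaches the warm layer (a sketch; the raw_sales and mart_sales_daily tables are hypothetical):

-- Clean and aggregate raw landed data entirely in the cold layer
CREATE TABLE mart_sales_daily AS
SELECT sale_date,
       store_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS transactions
FROM raw_sales
WHERE amount IS NOT NULL   -- basic cleansing pushed down to Hadoop
GROUP BY sale_date, store_id;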

The cold layer can also be called the landing zone or staging area.

Warm Layer

This is where our traditional data warehouse RDBMS lives. It now mostly uses Hadoop as the data source, but typically it still gets some data directly from the transactional (OLTP) databases. Some data marts may already be pre-built in Hadoop, so the ETL workload at this layer is not so large.
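
As a rough sketch, the warm layer load is then often a plain INSERT ... SELECT from a staging table populated from Hadoop; the actual transfer mechanism (external tables, a bulk loader, etc.) is vendor-specific, and the table names below are hypothetical:

-- Load a mart that was pre-built in Hadoop into the data warehouse schema
-- (how stg.sales_mart_from_hadoop is populated depends on the RDBMS)
INSERT INTO dw.sales_mart (sale_date, store_id, total_amount, transactions)
SELECT sale_date, store_id, total_amount, transactions
FROM stg.sales_mart_from_hadoop;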

We still need a mature RDBMS for better query performance, data governance, transactional support, MDM, legacy business logic, and so on.

Hot Layer

The hot layer helps us provide extremely low-latency query execution and near real-time analytics. Depending on the requirements, we can use an in-memory database, a NoSQL database, or a messaging system (MQ).

In a Single System?

You can see that the data warehouse architecture has evolved to deal with increased data volumes and requirements for interactive data exploration. Can this cold-warm-hot architecture be implemented with a single product? Not yet:

  • Hadoop

    Spark data processing may be significantly faster than MapReduce, but it is still slower than an RDBMS when you query relatively small data marts.

  • RDBMS

    RDBMS products should offer a cheap storage option for huge amounts of data and should simplify the loading of unstructured data. Additionally, some sort of schema-on-read SQL support is required to compete with Hadoop (see the sketch after this list).

  • In-memory Databases

    You simply cannot use them to store all of the raw data.

  • NoSQL Databases

    NoSQL databases are mostly key-value stores, and they do not offer as rich a set of batch analytics features as Hadoop and RDBMS products.
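
For reference, schema-on-read in Hive looks like this: the table definition is just a projection over files that already sit in HDFS, so no load step is required (the location and columns below are hypothetical):

-- Project a table structure onto raw files already stored in HDFS;
-- the data is parsed at query time, not at load time
CREATE EXTERNAL TABLE web_logs_raw
(
  event_time STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/web_logs';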

Big Data Forum 2015 in Moscow, Russia

Big Data Forum 2015 in Moscow was a great event. As usual, I talked about Hadoop for Data Warehousing.

[Photo: Big Data Forum 2015 in Moscow]

It is interesting to see that Big Data is not only about Hadoop; many big vendors presented their solutions:

  • EMC Pivotal and Isilon
  • HP Vertica
  • IBM Watson Analytics
  • Microsoft Azure
  • Teradata Aster
  • Informatica

The Oracle team was at the conference, but for some reason they did not present anything. Also, Surgutneftegas (oil and gas) said they are quite happy with SAP HANA. Among the customers there were many banks, telecoms, oil and gas companies, and other big Russian enterprises.

It is even more interesting that neither Cloudera nor Hortonworks participated in this quite large conference. Is Big Data only for Big Business and Big Vendors in Russia?

Computerworld article about the event: Big Data Forum 2015 (in Russian).