Hadoop vs MPP – Joining 2 Large Tables – Optimization Using Bucket Map Join

In a data warehouse, you often need to join several large tables. Let’s assume you have ORDERS and ORDER_ITEMS tables defined as follows:

    id INT,
    created DATE,
    customer VARCHAR(100)

  CREATE TABLE order_items
    id INT,
    order_id INT,
    product_id INT,
    qty INT,
    price DECIMAL(9,2)

You can use the following query to get all order items:

  FROM order o, order_items i
  WHERE o.id = i.order_id

Both tables can be very large, containing hundreds of millions rows each.

Join in Traditional MPP Systems

First, let’s see how this problem is solved in traditional MPP systems such as Teradata, DB2 DPF, Greenplum and others.

Continue reading