Adding Columns to an Existing Table in Hive

Let’s see what happens with existing data if you add new columns and then load new data into a table in Hive.

First we will create a table and load an initial data set as follows:

CREATE TABLE airfact
(
  origin STRING,
  dest STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; 

LOAD DATA LOCAL INPATH 'airfact1.txt' INTO TABLE airfact;

The sample airfact1.txt data file content (TAB-delimited file):

SFO     JFK
ORD     LAX

LOAD DATA command just copies the specified file into the table directory, it does not perform any changes or validations of the file.

Now let’s add 2 new columns to the table and load a file containing data in 4 columns:

ALTER TABLE airfact ADD COLUMNS (flight STRING, time STRING);

LOAD DATA LOCAL INPATH 'airfact2.txt' INTO TABLE airfact;

The sample airfact2.txt data file content:

SFO     JFK     Delta 2240              9:15p
ORD     LAX     Virgin America 241      5:35p

Now if you retrieve the data from airfact table, Hive returns NULL values for flight and time columns in old rows:

SELECT * FROM airfact;

airfact

If you browse the HDFS directory of the table, you can see the two original files that we loaded before:

airfact2

So adding new columns into a table is a relatively cheap metadata-only operation as Hive does not modify the existing data files. Then when you retrieve data from the table Hive sets NULL values for columns that do not exist in old data files.

Leave a Reply