Compression Codecs – Splittable, Not Splittable Confusion

When you learn that compression codecs such as Snappy, ZLIB and LZO are not splittable, i.e. mappers cannot divide the work of processing a single large compressed file in parallel, it can be confusing, because you know that all these codecs are widely used in Hadoop.

The key is that these codecs are used with SequenceFile, ORCFile and other file formats that have a block structure, i.e. compression is applied at the block level, not to the entire file. As a result, mappers can read a single large file concurrently, each decompressing individual blocks.

On the other hand, the text file format has no block concept, so you can only compress the whole file, and as a result you cannot use multiple mappers to read the compressed file concurrently.
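The difference can be illustrated outside Hadoop with plain gzip. This is only an analogy under stated assumptions: the file names, the 5000-line block size and the use of split are made up for the demonstration, and SequenceFile and ORC use their own on-disk layouts. The sketch shows that a reader cannot start decompressing in the middle of a single compressed stream, while independently compressed blocks can each be read on their own.

```shell
# Work in a throwaway directory (all names below are illustrative)
dir=$(mktemp -d)
cd "$dir" || exit 1

# Sample data: 20000 lines of text
seq 1 20000 > data.txt

# 1) Whole-file compression: one gzip stream for the entire file
gzip -c data.txt > whole.gz

# A reader that starts in the middle of the stream cannot decompress it
size=$(wc -c < whole.gz)
if tail -c "$((size / 2))" whole.gz | gzip -dc > /dev/null 2>&1; then
    whole_mid="readable"
else
    whole_mid="not readable"
fi
echo "second half of whole.gz: $whole_mid"

# 2) Block-level compression: split into 5000-line blocks,
#    then compress each block separately
split -l 5000 data.txt block_
for b in block_*; do gzip "$b"; done

# Any block can be decompressed on its own, e.g. the third one
gzip -dc block_ac.gz | head -n 1
```

In the block-level case, each worker could be handed one of the block_*.gz pieces, which is roughly the role a block plays for a mapper in a block-structured file format.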

HDFS – File Copy in Progress – Suffix _COPYING_ Added to Filename

Let’s assume you copy a large file to HDFS and need to launch a MapReduce job when the copy finishes. How do you make sure that the file is fully copied?

Fortunately, while the copy is still in progress, the hadoop fs -put command writes the data to a temporary file that has the ._COPYING_ suffix appended to the target name, and renames it to the final name when the operation is complete.

Let’s emulate a long-running copy by writing to an HDFS file from a slow STDIN:

echo `sleep 60` | hadoop fs -put - /user/v-dtolpeko/copy.txt

This command writes STDIN to the HDFS file /user/v-dtolpeko/copy.txt. Only one byte (0x0A – end of line) is written to the file, but the sleep command keeps STDIN open for 60 seconds, so we have time to see what happens in HDFS.

Open another session to the Hadoop cluster and run:

hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'

This command lists the contents of the /user/v-dtolpeko/ directory in HDFS and outputs only the file names:

[dtolpeko ~]$hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'
/user/v-dtolpeko/.Trash
/user/v-dtolpeko/.staging
/user/v-dtolpeko/copy.txt._COPYING_
/user/v-dtolpeko/identity.pl

You can see that the ._COPYING_ suffix was added to copy.txt. When the first session completes the copy and you rerun the ls command, you can see that the suffix has been removed:

[dtolpeko ~]$hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'
/user/v-dtolpeko/.Trash
/user/v-dtolpeko/.staging
/user/v-dtolpeko/copy.txt
/user/v-dtolpeko/identity.pl
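
The manual check above can be turned into a small polling helper. This is a minimal sketch, assuming the copy is performed with hadoop fs -put as shown earlier: wait_for_copy is a hypothetical function name (not part of Hadoop), and the attempt count and delay defaults are arbitrary. It relies only on hadoop fs -test -e, which returns exit code 0 when the path exists.

```shell
# wait_for_copy TARGET [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: returns 0 once TARGET exists in HDFS and the
# temporary TARGET._COPYING_ file is gone, or 1 if it gives up.
wait_for_copy() {
    target="$1"
    attempts="${2:-60}"   # assumed default: 60 checks
    delay="${3:-10}"      # assumed default: 10 seconds between checks
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Complete when the final name exists and the temporary name does not
        if hadoop fs -test -e "$target" &&
           ! hadoop fs -test -e "${target}._COPYING_"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1   # the copy did not finish within ATTEMPTS checks
}

# Usage (myjob.jar and MyJob are placeholders for your own job):
# wait_for_copy /user/v-dtolpeko/copy.txt && hadoop jar myjob.jar MyJob
```

Note that the helper keeps waiting if neither the final nor the temporary file exists yet, so it can be started before the copy begins.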