hflush – HDFS for Near Real-Time Applications

HDFS is a fault-tolerant and distributed file system, but how can you pass data from a producer to a consumer of information in near real-time?

Let’s create a simple Java program that writes the string “Hello, world!” to a file in HDFS and check what is required for readers to see the data immediately:

import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;

public class Write {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());

    // Step 1: create the file
    FSDataOutputStream out = fs.create(path, true /*overwrite*/);
    System.out.println("File created, check if metadata exists at NameNode.");
    System.in.read();  // Press Enter to continue

    // Step 2: write data (writeChars emits 2 bytes per character)
    out.writeChars("Hello, world!");
    System.out.println("Data written, check if you can see it in another session.");
    System.in.read();  // Press Enter to continue

    // Step 3: flush the client buffer so that readers can see the data
    out.hflush();
    System.out.println("Buffer flushed, check if you can see data now.");
    System.in.read();  // Press Enter to continue

    out.close();
    fs.close();
  }
}

You can compile the code using the following command (adjust the path to match the version of your Hadoop jar):

javac -classpath /usr/lib/hadoop/hadoop-common- Write.java

This sample program uses the default Hadoop configuration properties and requires you to specify an HDFS file name as the input parameter. You can run it as follows:

hadoop Write /user/v-dtolpeko/abc.txt 

The program executes a step and then waits until you press Enter before executing the next step. Now let’s check what happens in HDFS during each operation.

Visibility of File Creation in HDFS

After creating the file, the program waits for user input. You can open a second session and see that the newly created empty file is already visible to other sessions:

[v-dtolpeko ~]$ hadoop fs -ls /user/v-dtolpeko/abc.txt
-rw-rw-r-- 3 v-dtolpeko hdp_ewwdev  0 2014-11-11 00:59 /user/v-dtolpeko/abc.txt

Now press Enter so that the application writes data to the file.

Unflushed Data Is Not Visible Yet

The application has just written the “Hello, world!” string to the file, but you still cannot see it in other sessions:

[v-dtolpeko ~]$ hadoop fs -cat /user/v-dtolpeko/abc.txt
[v-dtolpeko ~]$

Flushed Data Becomes Visible

When the application calls hflush(), the data is pushed out of the client buffer to all DataNodes that hold a replica of the current block. Now any other application that opens the file can see the data:

[v-dtolpeko ~]$ hadoop fs -cat /user/v-dtolpeko/abc.txt
Hello, world!
[v-dtolpeko ~]$

Note that when an HDFS block is completely filled, it is flushed automatically.
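
The buffer-then-flush behavior described above can be reproduced locally with plain java.io classes. This is only an analogy and not HDFS code: BufferedOutputStream stands in for the client buffer, ByteArrayOutputStream stands in for the storage that readers see, and the class name FlushDemo and the 16-byte buffer size are arbitrary choices made for this sketch.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FlushDemo {
  // Returns the number of bytes visible in the sink after each step
  static int[] run() throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();      // what "readers" see
    BufferedOutputStream out = new BufferedOutputStream(sink, 16); // writer-side buffer
    byte[] msg = "Hello, world!".getBytes(StandardCharsets.US_ASCII); // 13 bytes

    out.write(msg);
    int afterWrite = sink.size();     // data is still in the buffer

    out.flush();
    int afterFlush = sink.size();     // explicit flush makes it visible

    out.write(msg);                   // fits into the empty 16-byte buffer
    out.write(msg);                   // does not fit, so the buffer is flushed first
    int afterOverflow = sink.size();

    out.close();                      // close flushes whatever is left
    int afterClose = sink.size();

    return new int[] {afterWrite, afterFlush, afterOverflow, afterClose};
  }

  public static void main(String[] args) throws IOException {
    System.out.println(Arrays.toString(run()));  // prints [0, 13, 26, 39]
  }
}
```

The first number is the case of written but unflushed data, the second is the explicit flush, and the third shows data becoming visible without an explicit flush because the buffer filled up.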

So you can use HDFS hflush() to let consumers immediately read each new portion of data written to a file, and you can use this feature to build near real-time applications that use HDFS as the storage.
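
On the consumer side, one possible approach is to poll the file: re-open it, skip the bytes that were already consumed, and read whatever is new. The sketch below shows only that offset-tracking logic against a plain java.io.InputStream, with an in-memory byte array standing in for the HDFS file. The names TailReader and Opener are hypothetical. With HDFS, the opener would presumably be something like () -> fs.open(path), since FSDataInputStream is an InputStream, but that wiring is an assumption and is not exercised here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TailReader {
  /** Hypothetical hook that opens a fresh stream over the file being tailed. */
  interface Opener {
    InputStream open() throws IOException;
  }

  private final Opener opener;
  private long offset = 0;  // bytes already handed to the caller

  TailReader(Opener opener) {
    this.opener = opener;
  }

  /** Re-opens the source, skips the consumed prefix, and returns only new bytes. */
  byte[] poll() throws IOException {
    ByteArrayOutputStream fresh = new ByteArrayOutputStream();
    try (InputStream in = opener.open()) {
      long toSkip = offset;
      while (toSkip > 0) {
        long skipped = in.skip(toSkip);
        if (skipped <= 0) {
          return new byte[0];  // source is shorter than the saved offset
        }
        toSkip -= skipped;
      }
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        fresh.write(buf, 0, n);
      }
    }
    offset += fresh.size();
    return fresh.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // In-memory stand-in for a file that a producer keeps appending to
    ByteArrayOutputStream file = new ByteArrayOutputStream();
    TailReader reader =
        new TailReader(() -> new ByteArrayInputStream(file.toByteArray()));

    byte[] first = "Hello".getBytes(StandardCharsets.US_ASCII);
    file.write(first, 0, first.length);
    System.out.println(new String(reader.poll(), StandardCharsets.US_ASCII));

    byte[] second = ", world!".getBytes(StandardCharsets.US_ASCII);
    file.write(second, 0, second.length);
    System.out.println(new String(reader.poll(), StandardCharsets.US_ASCII));
  }
}
```

Re-opening the stream on every poll is a deliberately conservative choice in this sketch: a reader that keeps one stream open may not notice data that was flushed after it opened the file.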