
File streams
We have modified the Scala-based code example from the last section to monitor an HDFS-based directory by calling the Spark Streaming Context method textFileStream. Given how small this change is, we will not display all of the code. The application class is now called stream3, and it takes a single parameter: the HDFS directory to monitor. The directory path could also be on another storage system (all the code samples will be available with this book):
val rawDstream = ssc.textFileStream(directory)
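For reference, a minimal sketch of what the stream3 class might look like follows, assuming a ten-second batch interval and the same word-count logic as the previous example; the structure and names (other than rawDstream) are illustrative rather than the book's exact listing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object stream3 {
  def main(args: Array[String]): Unit = {
    // The single application parameter is the directory to monitor
    val directory = args(0)

    val conf = new SparkConf().setAppName("Stream example 3")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Monitor the directory for newly created text files
    val rawDstream = ssc.textFileStream(directory)

    // Split each line into words and count word occurrences per batch
    val wordCounts = rawDstream
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Swap to (count, word), sort descending, and print the top-ten list
    wordCounts.map(_.swap)
      .transform(rdd => rdd.sortByKey(ascending = false))
      .foreachRDD { rdd =>
        rdd.take(10).foreach(item => println("List : " + item))
      }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that Spark expects new files to appear in the monitored directory atomically, so in practice they should be written elsewhere and then moved or put into the directory.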
The stream processing is the same as before: the stream is split into words, and the top-ten word list is printed. The only difference this time is that the data must be placed in the HDFS directory while the application is running, because textFileStream only processes files that appear in the monitored directory after the stream has started. This is done with the HDFS filesystem put command:
[root@hc2nn log]# hdfs dfs -put ./anaconda.storage.log /data/spark/stream
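If the target directory does not already exist in HDFS, it can be created beforehand; for instance:

hdfs dfs -mkdir -p /data/spark/stream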
As you can see, the HDFS directory used is /data/spark/stream, and the text-based source log file is anaconda.storage.log (taken from /var/log/). As expected, the same word list and counts are printed:
List : (17104,)
List : (2333,=)
...
List : (564,True)
List : (495,False)
List : (411,None)
List : (356,at)
List : (335,object)
These are simple streaming methods based on TCP and filesystem data, but what if we want to use some of the built-in streaming functionality in Spark Streaming? This will be examined next, using the Spark Streaming Flume library as an example.