Apache Spark 2: Data Processing and Real-Time Analytics

Summary

This chapter has given you an overview of some of the functionality available in the Apache Spark MLlib module. It has also shown functionality that will soon be available in the form of artificial neural networks (ANNs), and you may have been impressed by how well ANNs work. Time and space do not allow this chapter to cover every area of MLlib. In addition, the next chapter concentrates on the SparkML library, which speeds up machine learning by supporting DataFrames and the underlying Catalyst and Tungsten optimizations.

We saw how to develop Scala-based examples for Naive Bayes classification, K-Means clustering, and ANNs, and you learned how to prepare test data for these Spark MLlib routines. You also saw the LabeledPoint structure, which pairs a label with a feature vector and is the input that supervised routines such as Naive Bayes accept.
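As a brief reminder of what a LabeledPoint looks like, the following is a minimal sketch rather than the chapter's own code. It assumes a hypothetical comma-separated input line with the label in the first column and numeric features in the remaining columns; the object name `LabeledPointSketch` and the helper `parseLine` are invented for illustration.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LabeledPointSketch {

  // Assumed layout (hypothetical): "label, feature1, feature2, ..."
  def parseLine(line: String): LabeledPoint = {
    val fields = line.split(',').map(_.trim.toDouble)
    // The first value is the label; the rest form a dense feature vector.
    LabeledPoint(fields.head, Vectors.dense(fields.tail))
  }

  def main(args: Array[String]): Unit = {
    val point = parseLine("1, 0, 2, 5")
    println(s"label = ${point.label}")
    println(s"features = ${point.features}")
  }
}
```

Building the structure needs no SparkContext, so a parsing helper like this can be checked locally before it is mapped over an RDD of input lines.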

Additionally, each approach involves a training step and a prediction step, so that a model is trained and then tested using different datasets. Using the approach shown in this chapter, you can now investigate the remaining functionality in the MLlib library. You can refer to http://spark.apache.org/ for details; make sure that you check the documentation for the Spark version you are actually using.
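The train-then-predict pattern can be sketched as follows, using Naive Bayes as the example. This is an illustration under stated assumptions, not the chapter's listing: the toy data is made up, the object name `TrainPredictSketch` and the helper `run` are hypothetical, and the master is set to `local[*]` so the sketch runs without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object TrainPredictSketch {

  // Trains on one dataset, predicts on another, and returns the
  // fraction of test points whose predicted label matches the true one.
  def run(sc: SparkContext): Double = {
    // Made-up training data: Naive Bayes needs non-negative features.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(5.0, 0.0, 1.0)),
      LabeledPoint(0.0, Vectors.dense(4.0, 1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 5.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 4.0, 0.0))
    ))

    // A separate, equally made-up dataset used only for testing.
    val test = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(6.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 6.0, 0.0))
    ))

    // Training step: the second argument is the additive smoothing value.
    val model = NaiveBayes.train(training, 1.0)

    // Prediction step: pair each predicted label with the true label.
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))

    val correct = predictionAndLabel.filter { case (pred, label) => pred == label }.count()
    correct.toDouble / test.count()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TrainPredictSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"Test accuracy: ${run(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the test points out of the training set is what makes the accuracy figure meaningful; with real data, the two sets would typically come from separate files or from a random split of one RDD.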

Having examined the Apache Spark MLlib machine learning library in this chapter, it is now time to consider Apache Spark's SparkML. The next chapter will examine machine learning on top of DataFrames.