
Hyperparameter tuning
CrossValidation is often used in conjunction with so-called (hyper)parameter tuning. What are hyperparameters? These are the various knobs that you can tweak on your machine learning algorithm. For example, these are some parameters of the Random Forest classifier:
- Number of trees
- Feature subset strategy
- Impurity
- Maximal number of bins
- Maximal tree depth
Setting these parameters can have a significant influence on the performance of the trained classifier. Often, there is no clear recipe for choosing them--of course, experience helps--and hyperparameter tuning is sometimes considered black magic. Can't we just try many different parameter combinations and test the prediction performance? Of course, we can. This feature is built into Apache SparkML as well. The only thing to consider is that such an exhaustive search can be computationally expensive. Luckily, Apache Spark is a linearly scalable infrastructure, so we can test many models quickly.
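Just to make the preceding list concrete, here is a minimal, purely illustrative sketch of how these knobs map to setter methods on the RandomForestClassifier (the rf instance used in the rest of this chapter was already created earlier; the values shown here are arbitrary examples):
import org.apache.spark.ml.classification.RandomForestClassifier
// Purely illustrative: every knob from the list above corresponds to a setter
val rfExample = new RandomForestClassifier()
.setNumTrees(10)                  // number of trees
.setFeatureSubsetStrategy("auto") // feature subset strategy
.setImpurity("gini")              // impurity
.setMaxBins(5)                    // maximal number of bins
.setMaxDepth(5)                   // maximal tree depth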
All of this is completely integrated and standardized in Apache SparkML; isn't that great? Let's take a look at the following code:
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
var paramGrid = new ParamGridBuilder()
.addGrid(rf.numTrees, 3 :: 5 :: 10 :: Nil)
.addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: Nil)
.addGrid(rf.impurity, "gini" :: "entropy" :: Nil)
.addGrid(rf.maxBins, 2 :: 5 :: Nil)
.addGrid(rf.maxDepth, 3 :: 5 :: Nil)
.build()
In order to perform such a grid search over the hyperparameter space, we need to define it first, and this is exactly what the preceding code does. Here, the functional programming properties of Scala are quite handy because we just add the parameter references and the respective lists of candidate values to the parameter grid:
var crossValidator = new CrossValidator()
.setEstimator(new Pipeline().setStages(transformers :+ rf))
.setEstimatorParamMaps(paramGrid)
.setNumFolds(5)
.setEvaluator(evaluator)
Then we create a CrossValidator. Note that in the setEstimator method of the CrossValidator object, we set our existing Pipeline. We are able to do so because a Pipeline is itself an Estimator, as it extends from that class. In the setEstimatorParamMaps method, we set our parameter grid. Finally, we define the number of folds used for CrossValidation, pass an instance of our BinaryClassificationEvaluator, and we are done:
var crossValidatorModel = crossValidator.fit(df)
Although there is a lot going on behind the scenes, the interface to our CrossValidator object stays slim and familiar, since CrossValidator also extends Estimator and supports the fit method. This means that, after calling fit, the complete predefined Pipeline, including all feature preprocessing and the RandomForest classifier, is executed multiple times--each time with a different hyperparameter vector.
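To make these type relationships explicit, here is a small, purely illustrative sketch (it reuses the transformers, rf, and crossValidator values defined earlier); both assignments compile because Pipeline and CrossValidator extend Estimator:
import org.apache.spark.ml.{Estimator, Pipeline, PipelineModel}
import org.apache.spark.ml.tuning.CrossValidatorModel
// A Pipeline is an Estimator that produces a PipelineModel...
val pipelineAsEstimator: Estimator[PipelineModel] = new Pipeline().setStages(transformers :+ rf)
// ...and a CrossValidator is an Estimator that produces a CrossValidatorModel,
// which is why both support the fit method
val cvAsEstimator: Estimator[CrossValidatorModel] = crossValidator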
So let's do some math. How many RandomForest models are trained once this code has run? Basically, this number grows multiplicatively with the number of parameters to be evaluated and the number of candidate values for each parameter. In this case, we have five parameters: one with three candidate values and four with two candidate values each. So the math is as simple as this: 3 * 2 * 2 * 2 * 2 = 48 hyperparameter combinations. Since we use five-fold CrossValidation, each combination is trained and evaluated five times, which means 240 model fits (plus one final fit of the winning combination on the complete dataset). Every additional parameter or parameter value multiplies this number further, so here we are really happy to run on a linearly scalable infrastructure!
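If you want to verify these numbers programmatically, note that the parameter grid built earlier is just an array of ParamMap instances, so a small sketch like this will print the counts (the factor of 5 is our numFolds setting):
// Sanity check: grid size times the number of folds gives the number of model fits
val combinations = paramGrid.length // 3 * 2 * 2 * 2 * 2 = 48
val totalFits = combinations * 5    // five folds => 240 fits
println(s"$combinations combinations, $totalFits model fits")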
So let's evaluate the result:
var newPredictions = crossValidatorModel.transform(df)
As CrossValidator is an Estimator returning a model of the CrossValidatorModel type, we can use it as an ordinary Apache SparkML model by just calling transform on it in order to obtain predictions. The CrossValidatorModel automatically uses the best-performing hyperparameter combination found for the underlying model (in this case, the RandomForestClassifier) to do the prediction. In order to check how well we are doing, we can run our evaluator again:
evaluator.evaluate(newPredictions, evaluatorParamMap)
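The evaluate call gives us a single number for the best model. If we are curious how every hyperparameter combination performed on average across the folds, the CrossValidatorModel also exposes avgMetrics; here is a small sketch (reusing the paramGrid value from above) that prints the three best combinations according to our BinaryClassificationEvaluator's metric:
// avgMetrics holds the average evaluator metric for each ParamMap,
// in the same order as the entries of paramGrid
crossValidatorModel.avgMetrics
.zip(paramGrid)
.sortBy { case (metric, _) => -metric } // best metric first
.take(3)
.foreach { case (metric, params) => println(f"$metric%.4f -> $params") }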
In case we are curious and want to know the optimal parameters, we can pull the stages from the Pipeline and check on the parameters used:
import org.apache.spark.ml.PipelineModel
var bestModel = crossValidatorModel.bestModel
var bestPipelineModel = bestModel.asInstanceOf[PipelineModel]
var stages = bestPipelineModel.stages
Then we pull the RandomForestClassificationModel from the last stage of the best Pipeline and check on the parameters:
import org.apache.spark.ml.classification.RandomForestClassificationModel
val rfStage = stages(stages.length-1).asInstanceOf[RandomForestClassificationModel]
rfStage.getNumTrees
rfStage.getFeatureSubsetStrategy
rfStage.getImpurity
rfStage.getMaxBins
rfStage.getMaxDepth
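If we prefer to see everything at once instead of calling individual getters, we can also dump the complete parameter map of the winning forest; this is just the standard extractParamMap method available on every SparkML component:
// Prints all parameters of the best RandomForestClassificationModel at once
println(rfStage.extractParamMap())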
That's enough theory for now. It is impossible to cover all the transformers, estimators, and helper functions of Apache SparkML, but we think this is a very good start. So let's conclude this chapter with a practical example.
The following image is a good example of the pipeline we want to implement:
