Apache Spark 2: Data Processing and Real-Time Analytics
Model evaluation
Without evaluation, a model is worth little, because we don't know how accurately it performs. We will therefore use the built-in BinaryClassificationEvaluator to assess prediction performance, together with a widely used measure called areaUnderROC (going into detail here is beyond the scope of this book):
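import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap

// Evaluator for models that predict one of two classes
val evaluator = new BinaryClassificationEvaluator()
// Select the metric to compute through a parameter map
val evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC")
// result holds the predictions produced earlier
val aucTraining = evaluator.evaluate(result, evaluatorParamMap)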
As we can see, there is a built-in class called org.apache.spark.ml.evaluation.BinaryClassificationEvaluator, and there are other classes for other prediction use cases, such as RegressionEvaluator or MulticlassClassificationEvaluator. The evaluator takes a parameter map (in this case, we are telling it to use the areaUnderROC metric), and finally the evaluate method evaluates the result:
[Screenshot: output of the evaluate call, showing the areaUnderROC value]
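As a side note, the metric can also be set directly on the evaluator instead of through a parameter map. The following is a minimal sketch under stated assumptions, not the approach used in this chapter: the toy DataFrame, its scores, and the names toy, toyEvaluator, and toyAuc are made up for illustration, and the code assumes a spark-shell or notebook style session (in spark-shell, the spark value already exists):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.SparkSession

// Assumption: a local session is good enough for this toy example
val spark = SparkSession.builder().master("local[*]").appName("auc-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: (label, score); both positives score higher than both negatives
val toy = Seq((1.0, 0.9), (1.0, 0.8), (0.0, 0.3), (0.0, 0.1))
  .toDF("label", "rawPrediction")

// Configure columns and metric through setters rather than a ParamMap
val toyEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

// evaluate without a ParamMap uses the values set above
val toyAuc = toyEvaluator.evaluate(toy)
```

Because the toy scores separate the two classes perfectly, toyAuc should come out at (or very close to) one. Both styles should give the same result for the same settings; the parameter map is convenient when we want to reuse one evaluator with different settings.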
As we can see, areaUnderROC is 0.5424418446501833. An ideal classifier would return a score of one, whereas random guessing yields about 0.5. So we are only doing a bit better than random guesses but, as already stated, the number of features that we are looking at is fairly limited.
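To make the scale of this number a little more tangible, one common reading of areaUnderROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following plain Scala sketch illustrates that reading on made-up scores; pairwiseAuc and the sample values are hypothetical helpers for illustration only, and this is not how Spark computes the metric internally:

```scala
// Pairwise estimate of the area under the ROC curve.
// Input: (score, label) pairs, where label is 1.0 (positive) or 0.0 (negative).
// A tie between a positive and a negative score counts as half a win.
def pairwiseAuc(scoresAndLabels: Seq[(Double, Double)]): Double = {
  val positives = scoresAndLabels.collect { case (score, label) if label == 1.0 => score }
  val negatives = scoresAndLabels.collect { case (score, label) if label == 0.0 => score }
  require(positives.nonEmpty && negatives.nonEmpty, "need at least one example of each class")
  val wins = for (p <- positives; n <- negatives) yield {
    if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  }
  wins.sum / (positives.size.toLong * negatives.size)
}

// Every positive outranks every negative: the ideal case
val perfect = pairwiseAuc(Seq((0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)))

// All scores identical: the scores carry no information about the label
val uninformative = pairwiseAuc(Seq((0.5, 1.0), (0.5, 1.0), (0.5, 0.0), (0.5, 0.0)))
```

Here perfect evaluates to 1.0 and uninformative to 0.5, which are the two reference points against which our value of roughly 0.54 should be read.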
This areaUnderROC is, in fact, a very bad value. In the next section, let's see whether choosing better parameters for our RandomForest model improves it a bit.