Apache Spark 2: Data Processing and Real-Time Analytics

Pipelines

Before we dive into estimators--we've already used one in StringIndexer--let's first understand the concept of pipelines. As you might have noticed, each transformer adds only a single output column to a DataFrame and passes all other columns through unchanged; on its own, each stage performs just one step of the processing. To form a complete data analysis process, individual transformers (and estimators) are glued together with org.apache.spark.ml.Pipeline. So let's do this for our pipeline stages:

val transformers = Array(indexer, encoder, vectorAssembler)
val pipeline = new Pipeline().setStages(transformers).fit(df)
val transformed = pipeline.transform(df)

The resulting DataFrame, called transformed, contains all the original columns plus the columns added by the indexer and encoder stages. This is the way in which Apache SparkML data processing jobs are defined.
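To make the flow above concrete, here is a self-contained sketch that builds the same kind of three-stage pipeline on a toy DataFrame. The column names ("category", "value") and the toy data are assumptions for illustration, not from the original text; the stage classes (StringIndexer, OneHotEncoder, VectorAssembler, Pipeline) are the standard Spark 2.x ML API.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PipelineSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A toy DataFrame standing in for the book's `df` (hypothetical columns)
val df = Seq(
  ("a", 1.0),
  ("b", 2.0),
  ("a", 3.0)
).toDF("category", "value")

// Stage 1: map the string category to a numeric index
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// Stage 2: one-hot encode the index into a sparse vector
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

// Stage 3: assemble all feature columns into a single vector column
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "value"))
  .setOutputCol("features")

// Glue the stages together, fit the estimator stages, and transform
val transformers = Array(indexer, encoder, vectorAssembler)
val pipeline = new Pipeline().setStages(transformers).fit(df)
val transformed = pipeline.transform(df)

transformed.show(false)
spark.stop()
```

Note that fit returns a PipelineModel: fitting trains any estimator stages (here StringIndexer), while transform then applies every stage in order, appending one new column per stage.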