Apache Spark 2: Data Processing and Real-Time Analytics

Pipelines

Before we dive into estimators--we've already used one in StringIndexer--let's first understand the concept of pipelines. As you might have noticed, each transformer adds only a single output column to a DataFrame and passes all other columns through unchanged; on its own, each stage performs just one step of the processing. To form a complete data analysis process, individual transformers (and estimators) are glued together with org.apache.spark.ml.Pipeline. So let's do this for our pipeline stages:

val transformers = Array(indexer, encoder, vectorAssembler)
val pipeline = new Pipeline().setStages(transformers).fit(df)
val transformed = pipeline.transform(df)

The resulting DataFrame, called transformed, contains all the original columns plus the columns added by the indexer and encoder stages. This is the way in which Apache SparkML data processing jobs are defined.
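To make the flow above concrete, here is a self-contained sketch that builds the same kind of three-stage pipeline on a toy DataFrame. The column names ("category", "value") and the toy data are assumptions for illustration, not from the original text; the stage classes (StringIndexer, OneHotEncoder, VectorAssembler, Pipeline) are the standard Spark 2.x ML API.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PipelineSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A toy DataFrame standing in for the book's `df` (hypothetical columns)
val df = Seq(
  ("a", 1.0),
  ("b", 2.0),
  ("a", 3.0)
).toDF("category", "value")

// Stage 1: map the string category to a numeric index
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// Stage 2: one-hot encode the index into a sparse vector
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

// Stage 3: assemble all feature columns into a single vector column
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "value"))
  .setOutputCol("features")

// Glue the stages together, fit the estimator stages, and transform
val transformers = Array(indexer, encoder, vectorAssembler)
val pipeline = new Pipeline().setStages(transformers).fit(df)
val transformed = pipeline.transform(df)

transformed.show(false)
spark.stop()
```

Note that fit returns a PipelineModel: fitting trains any estimator stages (here StringIndexer), while transform then applies every stage in order, appending one new column per stage.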