data:image/s3,"s3://crabby-images/9488a/9488a8a3634c8256db5fb170c4a76f2bd987cf9a" alt="Apache Spark 2:Data Processing and Real-Time Analytics"
Feature engineering
Now it is time to run the first feature engineering stage, StringIndexer. It has to learn a mapping table between strings and indexes from the data before it can apply it, so it is an estimator rather than a transformer. Calling fit returns the fitted model, which then performs the actual transformation:
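import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
// Estimator: learns the string-to-index mapping for one categorical column
val indexer = new StringIndexer()
  .setHandleInvalid("skip") // drop rows whose value has no entry in the mapping
  .setInputCol("L0_S22_F545")
  .setOutputCol("L0_S22_F545Index")
// fit builds the mapping, transform appends the index column
val indexed = indexer.fit(df_notnull).transform(df_notnull)
indexed.printSchema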
As we can see clearly in the following image, an additional column called L0_S22_F545Index has been created:
data:image/s3,"s3://crabby-images/2e0d8/2e0d8b0078a0352a1501bdb5f4760c7bfe1a8b18" alt=""
Finally, let's examine some content of the newly created column and compare it with the source column.
We can see how each category string is mapped to a numeric index, stored as a double. By default, the most frequent category receives index 0.0:
data:image/s3,"s3://crabby-images/18f27/18f27329fbb1406188b8304eedcbcb93ea814ac3" alt=""
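The mapping that an indexer learns can be mimicked in plain Scala, which may help to show why a fit step is needed before transform. This is only a minimal sketch, not Spark's implementation: the object StringIndexSketch, its methods, and the sample values are hypothetical, labels are assumed to be ordered by descending frequency with ties broken alphabetically, and unknown values are dropped to mimic the "skip" setting:

```scala
// Plain-Scala sketch of a string indexer (hypothetical names, not Spark code).
object StringIndexSketch {
  // "fit": build the label -> index table from the observed values.
  // Assumption: most frequent label first, ties broken alphabetically.
  def fit(values: Seq[String]): Map[String, Double] =
    values
      .groupBy(identity)
      .toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, idx) => label -> idx.toDouble }
      .toMap

  // "transform": look up each value; values missing from the table are
  // skipped, similar in spirit to setHandleInvalid("skip").
  def transform(mapping: Map[String, Double],
                values: Seq[String]): Seq[(String, Double)] =
    values.flatMap(v => mapping.get(v).map(idx => (v, idx)))
}
```

For the sample input Seq("T1", "T1", "T2", "T1", "T3", "T2"), fit assigns 0.0 to T1, 1.0 to T2, and 2.0 to T3. A later transform of Seq("T2", "T9", "T1") keeps only T2 and T1, because T9 was never seen during fit.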
Now we want to apply OneHotEncoder, which in Spark 2 is a transformer, to turn each index into a binary vector. This keeps our machine learning model from reading an ordering into the arbitrary index values:
// Transformer: turns each category index into a one-hot vector
val encoder = new OneHotEncoder()
  .setInputCol("L0_S22_F545Index")
  .setOutputCol("L0_S22_F545Vec")
val encoded = encoder.transform(indexed)
As you can see in the following figure, the newly created column L0_S22_F545Vec contains org.apache.spark.ml.linalg.SparseVector objects, a compressed representation that stores only the vector size and the non-zero entries:
data:image/s3,"s3://crabby-images/cf492/cf4923ad25610458792c5bd8cec9a3f710c66820" alt=""
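To make the sparse one-hot representation more concrete, here is a minimal plain-Scala sketch. SparseVec and OneHotSketch are hypothetical names introduced for illustration, not Spark classes, and the sketch assumes that the last category is dropped (encoded as an all-zero vector), which is what the dropLast default of OneHotEncoder does:

```scala
// Hypothetical stand-in for a sparse vector: size plus non-zero entries only.
final case class SparseVec(size: Int, indices: Seq[Int], values: Seq[Double]) {
  // Expand to the dense form: zeros everywhere except at the stored indices.
  def toDense: Seq[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense.toList
  }
}

object OneHotSketch {
  // Encode a category index as a one-hot vector.
  // Assumption: the last category is dropped, so the vector has
  // numCategories - 1 slots and the last category becomes all zeros.
  def encode(index: Double, numCategories: Int): SparseVec = {
    val size = numCategories - 1
    val i = index.toInt
    if (i < size) SparseVec(size, Seq(i), Seq(1.0))
    else SparseVec(size, Seq.empty, Seq.empty)
  }
}
```

With a hypothetical count of four categories, OneHotSketch.encode(2.0, 4) stores a single entry (index 2, value 1.0) in a vector of size 3, and its toDense form is 0.0, 0.0, 1.0. The index 3.0, being the last category, is encoded without any stored entries.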
Now that the individual features are prepared, we want to combine all the columns needed by our machine learning algorithm into one overall sparse vector. This is done using VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
// Concatenates the one-hot vector and three numeric columns, in this order
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("L0_S22_F545Vec", "L0_S0_F0", "L0_S0_F2", "L0_S0_F4"))
  .setOutputCol("features")
val assembled = vectorAssembler.transform(encoded)
We basically just define a list of input column names and an output column, and the rest is done for us:
data:image/s3,"s3://crabby-images/f021c/f021c17d10250005e4bf94b31748eb09b01210f6" alt=""
As the view of the features column got a bit squashed, let's inspect one instance of the feature field in more detail:
data:image/s3,"s3://crabby-images/cb7d5/cb7d50377569f557e1942c76e4a7b000ebde81fd" alt=""
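We can see that we are dealing with a sparse vector of length 16 where positions 0, 13, 14, and 15 are non-zero and contain the values 1.0, 0.03, -0.034, and -0.197. Positions 0 to 12 hold the one-hot encoded category, and positions 13 to 15 hold the three numeric columns, in the order given to setInputCols. Done! Let's create a Pipeline out of these components.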
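Before moving on, the way such an assembled vector comes about can be sketched in plain Scala. AssembleSketch is a hypothetical helper written for illustration, not part of Spark. It assumes that the numeric columns are appended after the one-hot block, that zero values are not stored, and that the result always stays in sparse form:

```scala
// Plain-Scala sketch of vector assembly (hypothetical names, not Spark code).
object AssembleSketch {
  // A sparse vector written as (size, indices of non-zeros, non-zero values).
  type Sparse = (Int, Seq[Int], Seq[Double])

  // Append numeric columns after the one-hot block; zeros are not stored.
  def assemble(oneHot: Sparse, numeric: Seq[Double]): Sparse = {
    val (size, indices, values) = oneHot
    val extra = numeric.zipWithIndex.collect {
      case (v, offset) if v != 0.0 => (size + offset, v)
    }
    (size + numeric.size, indices ++ extra.map(_._1), values ++ extra.map(_._2))
  }

  // Expand the sparse form into a dense sequence of doubles.
  def toDense(v: Sparse): Seq[Double] = {
    val (size, indices, values) = v
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, x) => dense(i) = x }
    dense.toList
  }
}
```

Calling AssembleSketch.assemble((13, Seq(0), Seq(1.0)), Seq(0.03, -0.034, -0.197)) gives a vector of size 16 with indices 0, 13, 14, 15 and values 1.0, 0.03, -0.034, -0.197, which matches the instance inspected above. Passing the result to toDense yields 16 values, of which only those four positions are non-zero.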