
Repeatability and automation
In this section, we will discuss some methods of organizing dataset preprocessing into workflows, and then use the Apache Spark pipeline API to represent and implement these workflows. We will then review some data preprocessing automation solutions.
After this section, we will be able to use Spark pipelines to represent and implement dataset preprocessing workflows and understand some of the automation solutions made available by Apache Spark.
Dataset preprocessing workflows
Our data preparation work, from Data cleaning to Identity matching to Data reorganization to Feature extraction, was organized to reflect a step-by-step, orderly process of preparing datasets for machine learning. In other words, all the data preparation work can be organized into a workflow.
Organizing data cleaning and preprocessing into workflows helps achieve repeatability and, possibly, automation. This is often extremely valuable, as ML professionals and data scientists frequently spend around 80% of their time on data cleaning and preprocessing.
For most ML projects, including the ones to be discussed in later chapters, data scientists need to split their data into training, testing, and validation sets, and the preprocessing applied to the training set needs to be repeated on the testing and validation sets. For this reason alone, using workflows to repeat these steps will save ML professionals a lot of time and help avoid many mistakes.
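As a minimal sketch of this idea (the DataFrame df and the preprocessing pipeline prep_pipeline are hypothetical placeholders, not the book's exact code), the split can be done once and the fitted preprocessing reused on every split:
# Split the data once, fit the preprocessing stages on the training set only,
# and reuse the fitted model so the held-out sets get exactly the same treatment
train, test, valid = df.randomSplit([0.6, 0.2, 0.2], seed=42)
prep_model = prep_pipeline.fit(train)
train_ready = prep_model.transform(train)
test_ready = prep_model.transform(test)
valid_ready = prep_model.transform(valid)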
Using Spark to represent and implement data preprocessing workflows has special advantages, which include:
- Seamless data flow integration between different sources.
This is the first, but a very important, point: Spark can bring data from many different sources into a single workflow.
- Availability of the data processing libraries MLlib and GraphX.
As we noted in the previous sections, the libraries built on MLlib and GraphX make data cleaning easier.
- Avoiding slow offline table joins.
Spark SQL performs joins in memory across the cluster, which is typically much faster than running offline joins in an external SQL database; see the sketch after this list.
- Significantly quicker execution of operations that can be naturally parallelized.
Parallelized computation is what Apache Spark naturally offers, and its query optimization is another advantage.
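As a quick, hedged illustration of the join point above (the DataFrames users_df and events_df, their columns, and the SQLContext named sqlContext are hypothetical), a join can run entirely inside Spark SQL:
# Register the DataFrames as temporary tables and join them in memory with Spark SQL
users_df.registerTempTable("users")
events_df.registerTempTable("events")
joined = sqlContext.sql(
    "SELECT u.userId, u.age, e.action FROM users u JOIN events e ON u.userId = e.userId")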
The Spark pipeline API makes it especially easy to develop and deploy data cleaning and data preprocessing workflows.
Spark pipelines for dataset preprocessing
As an example, SampleClean was used as one of the systems for data preprocessing, specifically for the work of data cleaning and entity analytics.
For learning purposes, we encourage users to combine SampleClean with the R notebook and then utilize the Apache Spark pipeline API to organize workflows.
As discussed in previous sections, to complete data preprocessing and make the data available, we need at least the following steps:
- Data cleaning to deal with missing cases.
- Entity analytics to resolve entity problems.
- Reorganizing data to cover subsetting and aggregating data.
- Joining some data together.
- Developing new features from the existing features.
For some of the most basic preprocessing, we may be able to organize the workflow with a few lines of R code, such as the following mean imputation of missing values:
# Replace missing values in df$var with the column mean (ignoring NAs)
df$var[is.na(df$var)] <- mean(df$var, na.rm = TRUE)
We will then use the R functions subset, aggregate, and merge to reorganize and join datasets; a PySpark sketch of the same steps follows.
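If we prefer to keep these steps inside Spark rather than R, the same basic operations can be expressed on Spark DataFrames. The following is a minimal sketch rather than the book's exact workflow; the DataFrames df and other_df and the column names var, id, and group are hypothetical:
from pyspark.sql import functions as F

# Mean imputation: compute the column mean (nulls are ignored) and fill missing values
mean_var = df.agg(F.mean("var")).first()[0]
df = df.na.fill({"var": mean_var})

# Subsetting: keep only the rows and columns we need
subset_df = df.filter(F.col("var") > 0).select("id", "var", "group")

# Aggregating: summarize by a grouping column
agg_df = subset_df.groupBy("group").agg(F.avg("var").alias("avg_var"))

# Joining: merge with another DataFrame on a shared key
joined_df = subset_df.join(other_df, on="id", how="left")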
The preceding R work in the R notebook, in combination with SampleClean and feature development, should complete our workflow.
However, in reality, preprocessing workflows can be a lot more complicated and may involve feedback loops as well.
Dataset preprocessing automation
Spark's new pipeline API is well suited to representing such workflows.
Once all the data preprocessing steps get organized into workflows, automation becomes easy.
Databricks provides an end-to-end solution that makes building a data pipeline easier, from ingestion to production. The same concept applies to R notebooks as well: you can schedule your R notebooks to run as jobs on existing or new Spark clusters. The results of each job run, including visualizations, are available to browse, which makes it much simpler and faster to turn the work of data scientists into production.

An important point here is that the data preparation steps produce their outputs as DataFrames, which can then easily be combined with machine learning pipelines so that everything is automated together.
For example, the most common advanced analytic tasks can be specified using the new pipeline API in MLlib. The following code creates a simple text classification pipeline consisting of a tokenizer, a hashing term frequency feature extractor, and logistic regression:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
Once the pipeline is set up, we can use it to train on a DataFrame directly:
# Load the training data as a DataFrame (it should contain 'text' and 'label' columns)
df = context.load("/path/to/data")
model = pipeline.fit(df)
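As a hedged follow-up sketch (the path and the selected columns are illustrative), the fitted pipeline model can then be applied to new data in the same way:
# Score new data with the fitted pipeline; transform() appends prediction columns
test_df = context.load("/path/to/new/data")
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()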
We will discuss the preceding code in more detail in later chapters.
As we may recall, in the section Data cleaning made easy, we had four tables for illustration purposes, as follows:
Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
Events(userId INT, action INT, Default)
WebLog(userId, webAction)
Demographic(memberId, age, edu, income)
For this group of datasets, we performed:
- Data cleaning.
- Identity matching.
- Dataset reorganization.
- Dataset joining.
- Feature extraction, then data joining, and then feature selection.
To implement the preceding steps, we can use an R notebook to organize them into a workflow for automation, with the Spark pipeline API to help.
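The following is a hypothetical sketch of just the join step for the four tables above. It assumes the tables have already been loaded as DataFrames named users, events, and demographic, and that identity matching has already reconciled Demographic.memberId with Users.userId:
# Join Users with Events on the shared userId, then attach Demographic via the matched ID
joined = users.join(events, on="userId", how="left")
joined = joined.join(demographic, joined.userId == demographic.memberId, "left")

# Keep a few columns as candidate features (the column choice is illustrative)
features = joined.select("userId", "subscribed", "action", "edu", "income")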
With all the preceding completed, we are now ready for machine learning.