A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

- A feature transformer might take a dataset, read a column (e.g., text), convert it into a new column (e.g., feature vectors), append the new column to the dataset, and output the updated dataset.
- A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, append the labels as a new column, and output the updated dataset.
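As a minimal sketch of the first case, here is how a feature transformer's transform() call appends a column (spark-shell style, assuming a local SparkSession; the example data and column names are made up for illustration):

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("TransformerSketch").getOrCreate()
import spark.implicits._

// A tiny dataset with a "text" column to transform.
val df = Seq(
  (0L, "spark ml pipelines"),
  (1L, "transformers append columns")
).toDF("id", "text")

// Tokenizer is a feature transformer: it reads "text" and
// appends a "words" column containing arrays of tokens.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(df) // original columns plus the new "words" column
tokenized.show(false)
```

Note that transform() does not modify the input DataFrame; it returns a new one with the extra column appended.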
An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Transformer
.

Cross-Validation
An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also called tuning. Pipelines facilitate model selection by making it easy to tune an entire Pipeline at once, rather than tuning each element in the Pipeline separately.
Currently, spark.ml supports model selection using the CrossValidator class, which takes an Estimator, a set of ParamMaps, and an Evaluator. CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets.

Code Sample (this assumes tokenizer, hashingTF, lr, and the training/test datasets are defined earlier, as in the Spark ML guide):
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// We treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2) // Use 3+ in practice

// Run cross-validation, and choose the best set of parameters.
val cvModel = crossval.fit(training.toDF)

cvModel.transform(test.toDF)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
Reference:
https://spark.apache.org/docs/latest/ml-guide.html