A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

- A feature transformer might take a dataset, read a column (e.g., text), convert it into a new column (e.g., feature vectors), append the new column to the dataset, and output the updated dataset.
- A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, append the labels as a new column, and output the updated dataset.
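As a minimal sketch of the first case, here is how a feature transformer's transform() call appends a column (spark-shell style, assuming a local SparkSession; the example data and column names are made up for illustration):

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("TransformerSketch").getOrCreate()
import spark.implicits._

// A tiny dataset with a "text" column to transform.
val df = Seq(
  (0L, "spark ml pipelines"),
  (1L, "transformers append columns")
).toDF("id", "text")

// Tokenizer is a feature transformer: it reads "text" and
// appends a "words" column containing arrays of tokens.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(df) // original columns plus the new "words" column
tokenized.show(false)
```

Note that transform() does not modify the input DataFrame; it returns a new one with the extra column appended.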
An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Transformer
.

Cross-Validation
An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also called tuning. Pipelines facilitate model selection by making it easy to tune an entire Pipeline at once, rather than tuning each element in the Pipeline separately.
Currently, spark.ml supports model selection using the CrossValidator class, which takes an Estimator, a set of ParamMaps, and an Evaluator. CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets.

Code Sample (this assumes tokenizer, hashingTF, lr, and the training/test datasets are defined earlier, as in the Spark ML guide):
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// We treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2) // Use 3+ in practice

// Run cross-validation, and choose the best set of parameters.
val cvModel = crossval.fit(training.toDF)

cvModel.transform(test.toDF)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
Reference:
https://spark.apache.org/docs/latest/ml-guide.html