Friday, 31 July 2015

Spark Persist ML Models

1. Save MLlib Model

Some MLlib models support the save() and load() methods.

import org.apache.spark.mllib.tree.model.RandomForestModel

// Save the trained model to the given path, then load it back
model.save(sc, "myModelPath")
val sameModel = RandomForestModel.load(sc, "myModelPath")
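For context, the `model` above could come from training a random forest classifier. A minimal sketch, assuming a running SparkContext `sc` and LIBSVM-format training data (the path and parameters are illustrative):

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format (illustrative path)
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Train a small random forest for binary classification
val model = RandomForest.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
  featureSubsetStrategy = "auto", impurity = "gini",
  maxDepth = 4, maxBins = 32)
```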

The following models have PMML export support; see:

https://spark.apache.org/docs/latest/mllib-pmml-model-export.html

For example,
// Export the model to a String in PMML format
clusters.toPMML
// Export the model to a local file in PMML format
clusters.toPMML("/tmp/kmeans.xml")
// Export the model to a directory on a distributed file system in PMML format
clusters.toPMML(sc, "/tmp/kmeans")
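The `clusters` object above could come from training a k-means model. A minimal sketch, assuming a running SparkContext `sc` and a whitespace-separated numeric input file (the path and parameters are illustrative):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense vector (illustrative input path)
val data = sc.textFile("data/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Train a k-means model with 2 clusters
val clusters = KMeans.train(data, k = 2, maxIterations = 20)

// toPMML() with no arguments returns the PMML document as an XML String,
// so it can also be inspected or logged directly
println(clusters.toPMML())
```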

2. Save ML Model

Model export/import for ML Pipelines is not yet supported. There is a JIRA ticket tracking it: https://issues.apache.org/jira/browse/SPARK-6725

As a workaround, we can use a general approach: save the model as a serialized Java object using RDD.saveAsObjectFile, then load it back with SparkContext.objectFile.

For example,

// Save the trained model as a single-element RDD of serialized Java objects
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")
// Load it back and take the single element
val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/linReg.model").first()

RDD.saveAsObjectFile and SparkContext.objectFile save and load an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to persist any RDD.
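The pattern above can be wrapped in a pair of hypothetical helper functions (names and signatures are my own, not Spark API) that work for any serializable model object, given a running SparkContext:

```scala
import org.apache.spark.SparkContext
import scala.reflect.ClassTag

// Save any serializable object as a single-element object file
def saveModel[T: ClassTag](sc: SparkContext, model: T, path: String): Unit =
  sc.parallelize(Seq(model), 1).saveAsObjectFile(path)

// Load it back and return the single element
def loadModel[T: ClassTag](sc: SparkContext, path: String): T =
  sc.objectFile[T](path).first()
```

Note that this relies on Java serialization, so the model class must be serializable and the same Spark/class versions should be used when loading.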

Reference:

https://phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
