LabeledPoint, which consists of a Spark MLlib Vector of features and a target value, here called the label. LabeledPoint is only for numeric features. It can be used with categorical features, with appropriate encoding. One such encoding is one-hot or 1-of-n encoding, in which one categorical feature that takes on N distinct values becomes N numeric features, each taking on the value 0 or 1. Exactly one of the N values has value 1, and the others are 0.

Below is an approach for 1-of-n encoding.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")

// The known category values; each one becomes a separate 0/1 feature.
val categories: List[String] = List("Food and Beverage", "Health and Medical",
  "Shopping and Classifieds", "Entertainment", "Computers and Internet",
  "Business and Finance")

val data = rawData.map { line =>
  val values = line.split('\t')
  val id = values(0)
  val url = values(1)
  // The category is the last column: 1.0 at the matching position, 0.0 elsewhere.
  val catList: Array[Double] =
    categories.map(x => if (x == values.last) 1.0 else 0.0).toArray
  val vector = Vectors.dense(catList)
  (id, url, vector)
}
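The encoding step above can also be tried on its own, without a Spark cluster. The following is a minimal sketch in plain Scala; the object name `OneHotSketch`, the helper `oneHot`, and the shortened category list are assumptions made for illustration and do not come from the referenced answer.

```scala
object OneHotSketch {
  // Hypothetical helper: encode `value` against a fixed, ordered list of categories.
  // The result has one slot per category, 1.0 where the category matches, 0.0 elsewhere.
  def oneHot(categories: Seq[String], value: String): Array[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0).toArray

  def main(args: Array[String]): Unit = {
    // Shortened, illustrative category list.
    val categories = Seq("Food and Beverage", "Health and Medical", "Entertainment")

    val encoded = oneHot(categories, "Health and Medical")
    println(encoded.mkString(","))

    // A value that is not in the list matches no slot, so every entry stays 0.0.
    val unknown = oneHot(categories, "Sports")
    println(unknown.mkString(","))
  }
}
```

In a Spark job, the resulting array could be passed to Vectors.dense and paired with a numeric label as LabeledPoint(label, features). Note that a value outside the category list silently encodes to all zeros, so it may be worth checking for that case if the input data is not clean.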
Reference:
http://stackoverflow.com/questions/26829762/convert-rdd-of-vector-in-labeledpoint-using-scala-mllib-in-apache-spark