Wednesday 4 February 2015

Categorical Feature Vector Encoding

In Spark MLlib, a training example is represented as a LabeledPoint, which consists of an MLlib Vector of features and a target value, called the label.

A LabeledPoint holds only numeric features, but categorical features can be used once they are suitably encoded. One such encoding is one-hot (or 1-of-n) encoding, in which a categorical feature that takes on N distinct values becomes N numeric features, each with the value 0 or 1. For any given example, exactly one of the N features is 1 and the others are 0.
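To make the encoding concrete, here is a minimal sketch in plain Scala with no Spark dependency. The helper name `oneHot` and the colour categories are made up for illustration only.

```scala
// Hypothetical helper: one-hot encode `value` against a fixed, ordered category list.
// The position of a category in the list determines which slot is set to 1.0.
def oneHot(categories: Seq[String], value: String): Array[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0).toArray

val colours = Seq("red", "green", "blue")   // illustrative categories only
val encoded = oneHot(colours, "green")
println(encoded.mkString("[", ", ", "]"))   // prints [0.0, 1.0, 0.0]
```

Note that a value missing from the category list encodes to all zeros, so unknown categories may need to be handled separately.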

Below is one approach to 1-of-n encoding in Spark, where the last tab-separated field of each input line holds the category.


import org.apache.spark.mllib.linalg.Vectors

// Each input line is expected to be tab-separated: id, url, ..., category (last field).
val rawData = sc.textFile("hdfs:///user/ds/covtype.data")

// Fixed, ordered list of category values; a category's position is its index in the vector.
val categories: List[String] = List("Food and Beverage", "Health and Medical", "Shopping and Classifieds", "Entertainment", "Computers and Internet", "Business and Finance")

val data = rawData.map { line =>
  val values = line.split('\t')
  val id = values(0)
  val url = values(1)
  val category = values.last

  // 1.0 at the position of the matching category, 0.0 everywhere else.
  val encoded = categories.map(c => if (c == category) 1.0 else 0.0).toArray
  (id, url, Vectors.dense(encoded))
}
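The snippet above yields (id, url, vector) tuples, not LabeledPoints. As a rough sketch, a one-hot vector could be wrapped in a LabeledPoint as follows, assuming a numeric label is available for each record. The input described here has no label field, so the `label` value below is purely hypothetical. Because a one-hot vector has a single non-zero entry, the sketch uses `Vectors.sparse` rather than `Vectors.dense`.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val numCategories = 6   // size of the categories list above
val categoryIndex = 3   // position of "Entertainment" in that list
val label = 1.0         // hypothetical target value; not present in the source data

// Sparse one-hot vector: only the index of the matching category is stored.
val features = Vectors.sparse(numCategories, Array(categoryIndex), Array(1.0))
val point = LabeledPoint(label, features)
```

With many categories the sparse form saves memory, since all of the zero entries are left implicit.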



Reference:
http://stackoverflow.com/questions/26829762/convert-rdd-of-vector-in-labeledpoint-using-scala-mllib-in-apache-spark
