Saturday 18 June 2016

Parsing XML in Spark

1. Dependency issue with Spark-xml lib

After adding the spark-xml library, I encountered the following error:

com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
 at [Source: {"id":"0","name":"newAPIHadoopFile"}; line: 1, column: 1]
    at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
    at com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533)
Solution:
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
This is caused by the classpath providing a different version of Jackson than the one Spark expects, which is 2.4.4. The dependencyOverrides setting forces sbt to resolve jackson-databind to that version.

With the dependency fixed, the XML file can be loaded into a DataFrame:

val xmlDf = sqlContext.read
  .format("xml")
  .option("rowTag", "Rec")
  .option("attributePrefix", "")
  .option("valueTag", "value")
  .load(filePath)
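For reference, the options above map onto an input shaped roughly like the hypothetical sample below (the element and attribute names Rec, Ref, AcctID, Comm, and Type are assumptions for illustration). rowTag="Rec" makes each <Rec> element a row, attributePrefix="" keeps attribute names unprefixed in the inferred schema, and valueTag="value" names the column that holds the text of an element which also carries attributes.

```xml
<Recs>
  <Rec>
    <Ref AcctID="A-100"/>
    <Comm Type="Email">alice@example.com</Comm>
    <Comm Type="Phone">555-0100</Comm>
  </Rec>
</Recs>
```

Because <Comm> repeats inside a <Rec>, spark-xml infers it as an array of structs, which is what section 2 deals with.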

2. Xml with an array struct and nested elements

explode(Column e) creates a new row for each element in the given array or map column.

import org.apache.spark.sql.functions.explode

xmlDf.select(
    xmlDf.col("Ref"),
    explode(xmlDf.col("Comm")).as("Comm")
  ).select("Comm.Type", "Ref.AcctID")
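Assuming Comm is inferred as an array of structs carrying a Type attribute, and Ref as a struct with an AcctID attribute (assumed names, not from the original post), the explode turns one row holding an array into one row per array element, roughly like this illustrative sketch:

```
before: Comm is an array column
 |-- Ref: struct
 |    |-- AcctID: string
 |-- Comm: array
 |    |-- element: struct
 |    |    |-- Type: string
 |    |    |-- value: string

after explode + select("Comm.Type", "Ref.AcctID"):
one row per Comm element, with columns Type and AcctID
```

The nested fields are then reachable with dotted paths such as "Comm.Type" in the final select.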



