Tuesday 17 March 2015

Spark Overwrite a CSV file

Goal: write a DataFrame out as a CSV file with a header row, and always overwrite the output path if it already exists.

import com.databricks.spark.csv._ // spark-csv provides saveAsCsvFile on DataFrame

val conf = new SparkConf().set("spark.hadoop.validateOutputSpecs", "false") // skip the "output path already exists" check
union_df.saveAsCsvFile(output_path, Map("delimiter" -> ",", "header" -> "true"))
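
The conf above only takes effect if it is used to construct the SparkContext. A minimal end-to-end sketch, assuming Spark 1.x with the spark-csv package on the classpath (the app name, input file, and output path are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val conf = new SparkConf()
  .setAppName("csv-overwrite") // hypothetical app name
  .set("spark.hadoop.validateOutputSpecs", "false") // allow writing over an existing output path
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val union_df = sqlContext.csvFile("input/foo.csv") // spark-csv reader; header row assumed
union_df.saveAsCsvFile("output/foo_csv", Map("delimiter" -> ",", "header" -> "true"))

Note that disabling validateOutputSpecs only skips the existence check; it does not clear the directory first, so stale part files from a previous run with more partitions can survive next to the new ones.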

Since Spark writes output through the Hadoop FileSystem API, this behaviour is more or less inevitable.
If you call rdd.saveAsTextFile("foo") (or dataframe.rdd.saveAsTextFile("foo"); a DataFrame itself has no saveAsTextFile method),
the result is a directory "foo" containing one part-XXXXX file per partition of the RDD. Each partition is written to a separate file so that failed tasks can be retried independently, which is what gives the write its fault tolerance.
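
For example, a minimal sketch (the partition count and path are hypothetical):

val rdd = sc.parallelize(1 to 100, 4) // an RDD with 4 partitions
rdd.saveAsTextFile("foo")             // writes foo/part-00000 through foo/part-00003, plus a foo/_SUCCESS marker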

A couple of options to merge the output into a single file:
1. Coalesce to a single partition before saving: dataframe.rdd.coalesce(1, true).saveAsTextFile(...) (see the sketch after this list).

2. Save normally, then use the HDFS getmerge command to concatenate the part files into one local file:
$ hdfs dfs -getmerge <src-directory> <dst-file>
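
A sketch of option 1, reusing union_df from above. The map step is an assumption: Row's default toString is not CSV, so each row is rendered as comma-separated text here (no quoting or escaping is handled). Also note that coalesce(1, true) shuffles all the data into a single partition, so this only suits output small enough to fit on one executor:

union_df.rdd
  .map(_.mkString(","))                // render each Row as one comma-separated line
  .coalesce(1, shuffle = true)         // shuffle everything into a single partition
  .saveAsTextFile("output/single_csv") // hypothetical path; writes a single part-00000 file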

References:
http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/package.scala
http://stackoverflow.com/questions/24371259/how-to-make-saveastextfile-not-split-output-into-multiple-file

