// spark.hadoop.validateOutputSpecs=false lets the job overwrite an existing output path
val conf = new SparkConf().set("spark.hadoop.validateOutputSpecs", "false")
// saveAsCsvFile comes from the spark-csv package (see the references below)
union_df.saveAsCsvFile(output_path, Map("delimiter" -> ",", "header" -> "true"))
Since Spark uses the Hadoop FileSystem API to write data to files, this is more or less inevitable.
If you do rdd.saveAsTextFile() or dataframe.saveAsCsvFile() (the spark-csv call used above),
the output is saved as "foo/part-XXXXX", with one part-* file per partition of the RDD being written. Each partition is written to a separate file by its own task, which is what makes the write fault-tolerant: a failed task can be retried without redoing the rest of the job.
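For instance, here is a minimal sketch of that behavior (the path "foo" and the toy data are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("part-files-demo"))

// An RDD split into 4 partitions...
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// ...is written as foo/part-00000 through foo/part-00003 (one file per
// partition), plus a _SUCCESS marker from the Hadoop output committer.
rdd.saveAsTextFile("foo")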
A couple of options to merge the output into a single file:
1. dataframe.rdd.coalesce(1, true).saveAsTextFile(): collapse everything into one partition before saving (see the sketch after this list).
2. Save the output as usual and then use the HDFS getmerge command to concatenate the part files into one file (note that the merged file is written to the local filesystem):
$ hdfs dfs -getmerge <src-directory> <dst-file>
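A rough sketch of option 1, assuming a DataFrame named union_df and an illustrative output path; coalesce(1, shuffle = true) behaves like repartition(1), so only one task (and therefore one part file) does the write:

// Option 1: collapse to a single partition before saving.
// All rows are shuffled to one partition, so exactly one part-00000 file is produced.
union_df.rdd
  .map(_.mkString(","))              // turn each Row into a comma-delimited line
  .coalesce(1, shuffle = true)
  .saveAsTextFile("merged_output")   // hypothetical output path

This funnels the whole dataset through a single task, so it is only practical when the final output is small enough for one executor to handle; for larger outputs, option 2 leaves the distributed write alone and merges the part files afterwards.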
References:
http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/package.scala
http://stackoverflow.com/questions/24371259/how-to-make-saveastextfile-not-split-output-into-multiple-file