Thursday 11 September 2014

DistCp: Copy Files from S3 to HDFS


Hadoop provides two filesystems that use S3.
S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.
S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just as they are in HDFS, which permits an efficient implementation of renames. This filesystem requires a dedicated bucket: you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
In short: s3n:// refers to a regular file, readable by the outside world, while s3:// refers to an HDFS-style filesystem mapped onto an S3 bucket sitting on AWS storage. s3n is the native filesystem implementation (i.e., regular files); s3 imposes an HDFS block structure on the files, so you can't really read them without going through the Hadoop libraries.
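
As a quick check of the two schemes, you can list a path through each one (bucket names below are placeholders; the credentials are assumed to already be set in core-site.xml via fs.s3n.awsAccessKeyId/fs.s3n.awsSecretAccessKey and their fs.s3.* counterparts):

$ hadoop fs -ls s3n://mybucket/somepath/            # regular S3 objects, visible to other tools
$ hadoop fs -ls s3://my-dedicated-bucket/somepath/  # HDFS-style blocks, opaque to other S3 tools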


To copy a list of files in one job, pass the AWS credentials as -D properties and point -f at a file of source URIs:

$ hadoop distcp -D fs.s3n.awsAccessKeyId=$ACCESSKEYID \
    -D fs.s3n.awsSecretAccessKey=$SECRETACCESSKEY \
    -f file_list/part-00001 \
    file_download/
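
Here -f names a file containing the source URIs, one per line, and file_download/ is the destination directory on HDFS. A file_list/part-00001 would look something like this (bucket and object names are made up for illustration):

s3n://mybucket/input/2014-09-01.csv
s3n://mybucket/input/2014-09-02.csv
s3n://mybucket/input/2014-09-03.csv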

$ hadoop distcp s3n://bucketname/directoryname/test.csv /user/myuser/mydirectory/
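
The same form works for a whole directory; DistCp's -update flag skips files that already exist at the destination with the same size (bucket and paths below are placeholders):

$ hadoop distcp -update s3n://bucketname/directoryname/ /user/myuser/mydirectory/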

A bug in CDH4.5:

The connection pool size is 20, but connections do not appear to be getting returned to the pool; once all 20 have been issued, no free connections remain:
14/09/08 10:43:50 DEBUG tsccm.ConnPoolByRoute: [{s}->https://s3bucket.s3.amazonaws.com:443] total kept alive: 0, total issued: 20, total allocated: 20 out of 20
14/09/08 10:43:50 DEBUG tsccm.ConnPoolByRoute: No free connections [{s}->https://s3bucket.s3.amazonaws.com:443][null]
14/09/08 10:43:50 DEBUG tsccm.ConnPoolByRoute: Available capacity: 0 out of 20 [{s}->https://s3bucket.s3.amazonaws.com:443][null]
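
Assuming the s3n filesystem in this CDH release is backed by JetS3t (whose HTTP connection pool also defaults to 20, matching the numbers in the log above), one workaround worth trying is to raise the pool size with a jets3t.properties file on the Hadoop classpath. The property name is JetS3t's; the value, and whether this actually avoids the leak, are untested assumptions:

# jets3t.properties, placed in the Hadoop conf directory (on the classpath)
# Raise the HTTP connection pool from the default of 20:
httpclient.max-connections=100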


Reference:
https://wiki.apache.org/hadoop/AmazonS3
