Alvin's Big Data Notebook : Snapshot for HDFS

A snapshot is a point-in-time image of the entire filesystem or a subtree of a filesystem.

Common uses for snapshots:
Protection against user errors
Backup
Experimental/test setups

Only the NameNode is aware of snapshots. DataNodes do not know that some of their blocks are owned by snapshots of the original file.
A snapshot records only metadata (the block list and file size), not a copy of the data itself.
A snapshot of a folder /user/me/mydata can be browsed under that folder's .snapshot directory:

hadoop fs -ls /user/me/mydata/.snapshot/<snapshotname>
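
Because a snapshot shows up as a read-only directory, getting a file back is an ordinary copy out of that path. The sketch below is a minimal illustration of that idea, not an official procedure. The helper names snapshot_path and restore_file are hypothetical, as are the snapshot name s20150108 and the file report.csv. It assumes the hdfs client is on the PATH and that /user/me/mydata is already snapshottable.

```shell
# Build the read-only path under which a snapshot exposes a directory's contents.
# Arguments: snapshottable directory, snapshot name, optional path relative to it.
snapshot_path() {
  local dir="${1%/}"
  local name="$2"
  local rel="${3:-}"
  if [ -n "$rel" ]; then
    printf '%s/.snapshot/%s/%s\n' "$dir" "$name" "$rel"
  else
    printf '%s/.snapshot/%s\n' "$dir" "$name"
  fi
}

# Copy one file from a snapshot back into the live directory.
# -ptopax asks cp to preserve timestamps, ownership, permission, ACLs and XAttrs.
# The copy fails if the destination file still exists (hypothetical helper, no overwrite).
restore_file() {
  local dir="${1%/}"
  local name="$2"
  local rel="$3"
  hdfs dfs -cp -ptopax "$(snapshot_path "$dir" "$name" "$rel")" "$dir/$rel"
}

# Example (hypothetical snapshot and file names):
#   hdfs dfs -ls "$(snapshot_path /user/me/mydata s20150108)"
#   restore_file /user/me/mydata s20150108 report.csv
```

Nothing runs when the functions are defined, so the file can be sourced safely and the helpers called once the cluster is reachable.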

Note: To restore from a snapshot, the filesystem itself must be available.
If the cluster is disabled by something like hardware failure, fire, or an earthquake, there is no way to restore from a snapshot, even if you keep the snapshot on another storage device.

You can set the number of snapshots to keep in a snapshot policy.
Once this number is exceeded, the oldest snapshot is deleted.
In this way, we can restore to any point captured by the retained snapshots.
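
The keep-N rule can also be sketched by hand on the command line. This is not how Cloudera Manager implements its policies, only an illustration of the same idea. The function names snapshots_to_expire and prune_snapshots are hypothetical, and the sketch assumes snapshot names sort chronologically (for example s20150108) and contain no spaces.

```shell
# Given snapshot names on stdin (one per line) and a number to keep,
# print the names that fall outside the retention window, oldest first.
# Assumes the names sort chronologically, e.g. s20150108.
snapshots_to_expire() {
  local keep="$1"
  LC_ALL=C sort | awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}

# Delete every snapshot of a directory except the newest <keep>.
# Arguments: snapshottable directory, number of snapshots to keep.
prune_snapshots() {
  local dir="${1%/}"
  local keep="$2"
  # The last field of each listing line is the snapshot path; keep its final component.
  hdfs dfs -ls "$dir/.snapshot" \
    | awk '$NF ~ /\/\.snapshot\// { n = split($NF, parts, "/"); print parts[n] }' \
    | snapshots_to_expire "$keep" \
    | while IFS= read -r name; do
        hdfs dfs -deleteSnapshot "$dir" "$name"
      done
}

# Example: keep the 7 most recent snapshots of /user/me/mydata
#   prune_snapshots /user/me/mydata 7
```

Separating the selection step from the deletion step makes it possible to check which names would be removed before touching the cluster.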

Managing snapshots is easy from either Cloudera Manager (CM) or the command line.
1. Enable HDFS Snapshots.
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_bdr_managing_hdfs_snapshots.html#concept_dgv_dxg_yl_unique_1__section_ymr_gxg_yl_unique_1

2. Create CM Snapshot Policies
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_bdr_snapshots.html

3. Manage Snapshots from the Command Line
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
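
For the command-line route, the basic lifecycle might look like the sketch below. The hdfs subcommands are the ones described in the HDFS snapshots guide linked above. The run wrapper, the DRY_RUN variable, the function name snapshot_walkthrough, and the snapshot names are assumptions made for illustration. Allowing snapshots on a directory needs HDFS superuser rights.

```shell
# Print the command instead of running it when DRY_RUN=1 (handy without a cluster).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Walk through the basic snapshot commands for one directory.
# Arguments: directory, name of a first snapshot, name of a second snapshot.
snapshot_walkthrough() {
  local dir="$1"
  local from="$2"
  local to="$3"
  run hdfs dfsadmin -allowSnapshot "$dir"        # admin: mark the directory snapshottable
  run hdfs lsSnapshottableDir                    # list directories that allow snapshots
  run hdfs dfs -createSnapshot "$dir" "$from"    # take a named snapshot
  # ... files under $dir change here ...
  run hdfs dfs -createSnapshot "$dir" "$to"      # take a second snapshot later
  run hdfs snapshotDiff "$dir" "$from" "$to"     # report what changed between the two
  run hdfs dfs -deleteSnapshot "$dir" "$from"    # drop a snapshot that is no longer needed
}

# Example (hypothetical snapshot names); prints the commands without running them:
#   DRY_RUN=1; snapshot_walkthrough /user/me/mydata before after
```

With DRY_RUN left unset the same function executes the commands for real, so it is worth printing them first.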

Thursday, 8 January 2015
