Thursday 21 August 2014

Features of Apache Avro

Apache Avro is a data serialization system with rich data structures and a compact, fast, binary data format. With Avro, you can define a data schema and then read and write data in that format using a number of different programming languages.

1. Schema:

Avro needs a schema to serialize and de-serialize data. Because the schema is always present, each datum can be written with no per-value overhead, making serialization both fast and compact. If we add or remove fields in the schema, Avro can handle that without code changes, which makes it dynamic.
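For instance, a minimal schema for a hypothetical User record might be defined in a file such as user.avsc (the record name, field names and file name here are purely illustrative):

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}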
When Avro data is stored in a file, its schema is stored with it, so that the file can be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
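To illustrate, here is a rough Java sketch (assuming the Avro Java library and the user.avsc schema above; the class and file names are placeholders, not a definitive implementation) that writes a record to an Avro data file and reads it back. Notice that the reader does not need to be handed a schema, because the schema is stored in the file header:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroFileDemo {
    public static void main(String[] args) throws Exception {
        // Parse the schema defined in user.avsc (illustrative file name)
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));

        // Build a record that conforms to the schema
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Write the record; the schema is stored once in the file header
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; no schema is supplied, it comes from the file itself
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}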
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since the client and server each have the other's full schema, correspondence between same-named fields, missing fields, extra fields, and so on can all be easily resolved.
2. Splittable:
Avro files can be split across HDFS data blocks and still be processed in parallel. In Apache Hadoop HDFS, all files are cut into data blocks that are 64 MB in size by default (or 128 MB if so configured). So a 1 GB data file is cut into 16 blocks at the default size (8 blocks at 128 MB). As this is a distributed system, these blocks can reside on any of the nodes in the cluster.
What “splittable” means is that each split can be processed independently: records in an Avro file are grouped into blocks separated by sync markers, and the schema stored in the file header makes the data self-describing. So when a MapReduce job or a Hive/Pig query reads a split, it can easily deduce the fields of the records it is reading, as the sketch below shows.
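As a rough sketch of how a MapReduce job typically consumes such a file (assuming the avro-mapred dependency and the user.avsc schema above; the class names and HDFS paths are just placeholders), AvroKeyInputFormat hands each map task the already-deserialized records from its own split:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvroInputDriver {

    // Each map task reads only the records in its own split of the Avro file
    public static class UserMapper
            extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, IntWritable> {
        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            GenericRecord user = key.datum();
            context.write(new Text(user.get("name").toString()), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));

        Job job = Job.getInstance(new Configuration(), "avro input demo");
        job.setJarByClass(AvroInputDriver.class);

        // Read Avro records directly; the schema travels with the file
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, schema);

        job.setMapperClass(UserMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/users.avro"));
        FileOutputFormat.setOutputPath(job, new Path("/data/users-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}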
3. Schema evolution:
When the schema inevitably changes, Avro applies schema evolution rules that make it easy to work with files written under both older and newer versions of the schema: default values are substituted for missing fields, unexpected fields are ignored, and data processing can proceed uninterrupted through upgrades.
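As a small Java sketch of this (again assuming the users.avro file written earlier; the added email field and its default value are illustrative), a newer reader schema that adds a field with a default can read files written with the old schema, and Avro fills in the default during schema resolution:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolutionDemo {
    public static void main(String[] args) throws Exception {
        // Newer version of the schema: an "email" field with a default was added
        String newSchemaJson =
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"},"
            + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}";
        Schema readerSchema = new Schema.Parser().parse(newSchemaJson);

        // The writer schema is taken from the file header; Avro resolves it
        // against the reader schema, so "email" comes back as "unknown".
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(readerSchema);
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("users.avro"), datumReader)) {
            for (GenericRecord rec : fileReader) {
                System.out.println(rec.get("name") + " / " + rec.get("email"));
            }
        }
    }
}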
Reference:
http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html
