Monday 1 August 2016

Hive Loads Avro Data with Schema Evolution

BACKWARD Compatibility:

If a schema is evolved in a backward-compatible way, we can always use the latest schema to query all the data uniformly. For example, removing a field is a backward-compatible change, since when we encounter records written with the old schema that still contain that field, we can simply ignore it. Adding a field with a default value is also backward compatible.

Let's say we have two versions of the Employee schema, as below.

Schema v1:

{
  "type": "record",
  "name": "Employee",
  "fields": [
      {"name": "email", "type": "string"},
      {"name": "name", "type": "string"},
      {"name": "age", "type": "int"}
  ]
}


Schema v2:


{
  "type": "record",
  "name": "Employee",
  "fields": [
      {"name": "email", "type": "string"},
      {"name": "name", "type": "string"},
      {"name": "yrs", "type": "int", "aliases": ["age"]},
      {"name": "gender", "type": ["null", "string"], "default": null}
  ]
}


We will use the latest schema to create a Hive table that can load data written with either version of the schema.

Please note
1. The "name" fields in the two schemas need to be the same (including the namespace, if any). Otherwise, although the data can be loaded into the Hive table, it cannot be retrieved successfully; queries fail with an exception such as:

Failed with exception java.io.IOException: org.apache.avro.AvroTypeException:
Found Employee, expecting Employee

2. A default value is needed for the optional fields in the latest schema. Specifying "null" as the default of a union only works if "null" is specified as the first type in the union.
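The resolution rules above can be sketched in plain Python. This is not the real Avro library, just a simplified illustration of how a reader schema (v2) projects a record written with the writer schema (v1): aliases rename fields, missing optional fields fall back to their defaults, and unknown writer fields are ignored. The record values are made up for the example.

```python
# Reader-side view of schema v2: (name, aliases, has_default, default).
# A simplified sketch, not the actual Avro schema-resolution implementation.
READER_FIELDS = [
    ("email", [], False, None),
    ("name", [], False, None),
    ("yrs", ["age"], False, None),          # alias lets v1's "age" match
    ("gender", [], True, None),             # union ["null","string"], default null
]

def resolve(record, reader_fields):
    """Project a decoded record onto the reader schema.

    - A reader field matches a writer field by name or by alias.
    - An unmatched reader field falls back to its default value.
    - Writer fields unknown to the reader are simply dropped.
    """
    out = {}
    for name, aliases, has_default, default in reader_fields:
        for key in [name] + aliases:
            if key in record:
                out[name] = record[key]
                break
        else:
            if not has_default:
                raise ValueError(f"no value and no default for field {name!r}")
            out[name] = default
    return out

# A record written with schema v1 (has "age", no "gender"):
v1_record = {"email": "a@b.com", "name": "Alice", "age": 30}
print(resolve(v1_record, READER_FIELDS))
# {'email': 'a@b.com', 'name': 'Alice', 'yrs': 30, 'gender': None}
```

Note how "age" surfaces as "yrs" thanks to the alias, and "gender" is filled with its null default, which is exactly what lets the v2 Hive table read v1 files.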


CREATE TABLE Avro_table
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///root/avro_schema/Employee2.avsc')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

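With the table in place, Avro files written with either schema version can be loaded into it and queried through the v2 schema. A sketch of the usage (the HDFS paths below are illustrative, not from the original post):

```sql
-- Load Avro data files written with schema v1 and v2 into the same table.
LOAD DATA INPATH '/user/root/employee_v1.avro' INTO TABLE Avro_table;
LOAD DATA INPATH '/user/root/employee_v2.avro' INTO TABLE Avro_table;

-- Records written with v1 surface through the v2 schema:
-- "age" appears as "yrs" (via the alias) and "gender" is NULL.
SELECT email, name, yrs, gender FROM Avro_table;
```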