Friday, 27 June 2014

Parquet Schema


To store in a columnar format we first need to describe the data structures using aschema. This is done using a model similar to Protocol buffers. This model is minimalistic in that it represents nesting using groups of fields and repetition using repeated fields. There is no need for any other complex types like Maps, List or Sets as they all can be mapped to a combination of repeated fields and groups.
The root of the schema is a group of fields called a message. Each field has three attributes: a repetition, a type and a name. The type of a field is either a group or a primitive type (e.g., int, float, boolean, string) and the repetition can be one of the three following cases:
  • required: exactly one occurrence
  • optional: 0 or 1 occurrence
  • repeated: 0 or more occurrences
For example, here’s a schema one might use for an address book:
message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}


Reference:
https://blog.twitter.com/2013/dremel-made-simple-with-parquet

No comments:

Post a Comment