Friday, 31 October 2014

Python Avro Validator

Invalid Avro files cause exceptions like the following when they are read with AvroStorage() in Pig:

1. WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: org.apache.avro.SchemaParseException: Undefined name: "ınt"
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:266)

2. WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.EOFException
at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:357)

3. Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -28
at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)


The default Avro library for Python is much slower than Java's.
fastavro brings read speed close to Java's, so the script below uses it to check a file before handing it to Pig.

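import fastavro as avro

file_name = 'test.avro'
with open(file_name, 'rb') as fo:
    try:
        # Creating the reader parses the file header and the embedded schema.
        reader = avro.reader(fo)
        schema = reader.schema
        # Decode every record; a corrupt or truncated block raises here.
        for record in reader:
            pass
    except Exception:
        print("Invalid Format: " + file_name)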
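To check a whole batch of files before loading them in Pig, the same idea can be wrapped in a small helper. This is only a sketch: the names is_valid_avro and find_invalid and the open_reader parameter are made up for illustration, and a file that cannot be opened at all is simply reported as invalid.

```python
def is_valid_avro(path, open_reader=None):
    """Return True if every record in the file at `path` can be decoded.

    `open_reader` is a hypothetical hook: any callable that takes a binary
    file object and returns an iterable of records. It defaults to
    fastavro.reader.
    """
    if open_reader is None:
        # Imported lazily so a stand-in reader can be used without fastavro.
        import fastavro
        open_reader = fastavro.reader
    try:
        with open(path, "rb") as fo:
            # Iterating forces every block to be decoded.
            for _ in open_reader(fo):
                pass
    except Exception:
        # Bad schema, truncated data, or an unreadable/missing file.
        return False
    return True


def find_invalid(paths, open_reader=None):
    """Return the subset of `paths` that fail validation, in input order."""
    return [p for p in paths if not is_valid_avro(p, open_reader)]
```

For example, find_invalid(glob.glob('data/*.avro')) would return the list of files to exclude or repair before running the Pig job (the data/ path is just a placeholder).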


The settings below are not available in the latest Piggybank.
The default behavior does not change: on an error, the job dies. However, there are now two keys that can
be set:
set pig.piggybank.storage.avro.bad.record.threshold 0.01
set pig.piggybank.storage.avro.bad.record.min 100
The former sets the acceptable ratio of bad records. The latter sets the minimum number of errors before the job can error out.
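How the two keys interact can be sketched in a few lines of Python. This is my reading of the description above, not the Piggybank implementation: the function name should_abort is made up, and the exact comparisons (strictly greater than the ratio, at least the minimum count) are assumptions.

```python
def should_abort(bad_records, total_records, threshold=0.01, min_errors=100):
    """Decide whether a job should fail, given the bad-record counts so far.

    Assumed semantics: never fail before `min_errors` bad records have been
    seen; after that, fail once the bad-record ratio exceeds `threshold`.
    """
    if bad_records < min_errors:
        # Too few errors to judge, whatever the ratio is.
        return False
    if total_records == 0:
        # Nothing read yet, so there is no ratio to compare.
        return False
    return bad_records / total_records > threshold
```

With the defaults shown, 50 bad records out of 100 are tolerated because the minimum has not been reached, while 100 bad records out of 1,000 (a ratio of 0.1) would stop the job.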


Reference:
https://bitbucket.org/tebeka/fastavro
https://github.com/rjurney/Collecting-Data/blob/master/src/pig/fixavro.pig
https://issues.apache.org/jira/browse/PIG-2614

