1. WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: org.apache.avro.SchemaParseException: Undefined name: "ınt"
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:266)
2. WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.EOFException
at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:357)
3. Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -28
at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
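The default Avro library for Python is much slower than Java's.
fastavro brings it up to roughly the speed of Java's. The snippet below uses it to check whether an Avro file can be read from start to end.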
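import fastavro as avro

file_name = 'test.avro'
with open(file_name, 'rb') as fo:
    try:
        # Parsing the header fails here if the schema is invalid.
        reader = avro.reader(fo)
        schema = reader.schema
        # Iterate over every record so that truncated or corrupt blocks raise.
        for record in reader:
            pass
    except Exception:
        print("Invalid Format: " + file_name)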
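Before doing a full read, a cheap pre-check can weed out files that are clearly not Avro at all. This is only a sketch: it relies on the fact that an Avro object container file starts with the four bytes `Obj` followed by `0x01`, and the helper name `has_avro_magic` is made up for illustration. It will not catch truncated files or bad schemas, so it complements the full read rather than replacing it.

```python
# First four bytes of an Avro object container file: 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def has_avro_magic(path):
    """Return True if the file at `path` starts with the Avro magic bytes.

    Hypothetical helper for illustration; it only inspects the header and
    says nothing about whether the rest of the file is readable.
    """
    with open(path, "rb") as fo:
        return fo.read(len(AVRO_MAGIC)) == AVRO_MAGIC
```

A file that fails this check can be reported as invalid right away; a file that passes still needs the full read to rule out errors such as the EOFException above.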
The following is not available in the latest Piggybank at the time of writing.
The default behavior does not change: on an error, the job dies. However, there are now two keys that can be set:
set pig.piggybank.storage.avro.bad.record.threshold 0.01
set pig.piggybank.storage.avro.bad.record.min 100
The former sets the acceptable ratio threshold. The latter sets the minimum number of errors before the job can error out.
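To make the interplay of the two settings concrete, here is a small sketch of one possible reading of that description: the job aborts only once at least the minimum number of bad records has been seen and the bad-record ratio exceeds the threshold. The function name, its parameters, and the exact comparison operators are assumptions for illustration, not Piggybank's actual code.

```python
def should_fail(bad_records, total_records, threshold=0.01, min_errors=100):
    """Return True if a load should abort under the assumed semantics.

    Assumption: abort only when at least `min_errors` bad records were seen
    AND the ratio of bad records to all records exceeds `threshold`.
    """
    if total_records == 0 or bad_records < min_errors:
        return False
    return bad_records / total_records > threshold


# 50 bad records is below the minimum of 100, so the job keeps going.
print(should_fail(50, 1000))
# 150 bad out of 10,000 is a ratio of 0.015, above 0.01, so the job aborts.
print(should_fail(150, 10000))
# 150 bad out of 1,000,000 is well under the threshold, so the job keeps going.
print(should_fail(150, 1000000))
```

Under this reading, the minimum keeps a handful of bad records in a small input from killing the job, while the ratio keeps a large input from failing on a tiny fraction of corrupt data.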
Reference:
https://bitbucket.org/tebeka/fastavro
https://github.com/rjurney/Collecting-Data/blob/master/src/pig/fixavro.pig
https://issues.apache.org/jira/browse/PIG-2614