Tuesday, 5 August 2014

Pig: Task attempt failed to report status for 600 seconds. Killing!

Pig Script Error: Task attempt failed to report status for 600 seconds. Killing!
"VM Thread" prio=10 tid=0x00007f4988076800 nid=0x355d runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f4988024000 nid=0x3555 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f4988026000 nid=0x3556 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f4988027800 nid=0x3557 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f4988029800 nid=0x3558 runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007f498802b000 nid=0x3559 runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007f498802d000 nid=0x355a runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007f498802f000 nid=0x355b runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007f4988030800 nid=0x355c runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f49880c3800 nid=0x3564 waiting on condition

JNI global references: 959

Heap
PSYoungGen      total 282176K, used 577K [0x00000000eaab0000, 0x00000000fd640000, 0x0000000100000000)
eden space 281728K, 0% used [0x00000000eaab0000,0x00000000eaad64f0,0x00000000fbdd0000)
from space 448K, 94% used [0x00000000fd580000,0x00000000fd5ea158,0x00000000fd5f0000)
to   space 320K, 0% used [0x00000000fd5f0000,0x00000000fd5f0000,0x00000000fd640000)
PSOldGen        total 342400K, used 237567K [0x00000000c0000000, 0x00000000d4e60000, 0x00000000eaab0000)
object space 342400K, 69% used [0x00000000c0000000,0x00000000ce7ffc08,0x00000000d4e60000)
PSPermGen       total 31616K, used 31448K [0x00000000bae00000, 0x00000000bcce0000, 0x00000000c0000000)
object space 31616K, 99% used [0x00000000bae00000,0x00000000bccb60f8,0x00000000bcce0000)
This issue was caused by our Pig UDF, which loaded a large XML object with a DOM parser.
After we partitioned the large XML into smaller pieces, the problem was solved.
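
As an illustration of handling XML one piece at a time instead of building a whole DOM tree, here is a minimal sketch using the JDK's StAX API (javax.xml.stream). It is not the code from our UDF: the class name XmlPieces, the element name "record" and the sample document are hypothetical, and it assumes each piece contains only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlPieces {
    /**
     * Streams through the document and returns the text of every element
     * named elementName. Only one element is held by the parser at a time.
     * A real UDF would process each piece and discard it rather than
     * collect everything in a list.
     */
    static List<String> textOfEach(String xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> pieces = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Assumes the element holds text only (no child elements).
                    pieces.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return pieces;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample input
        String xml = "<records><record>a</record><record>b</record></records>";
        System.out.println(textOfEach(xml, "record"));
    }
}
```
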

Some approaches for investigating the issue:


To check GC activity in the task, re-run the task and execute the following:
# Grab the PID of the process:
ps -ef | grep attempt_201408070906
# Run jstat against the process to see GC metrics (one sample per second, 120 samples)
jstat -gcutil <PID of process> 1s 120
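
If attaching jstat from outside is not convenient, similar counters can be read from inside the JVM through the standard java.lang.management API. This is only a rough sketch, not something we ran in the failing job; the class name GcSnapshot is hypothetical, and the collector names it prints depend on the JVM and GC in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {
    /** Returns one line per garbage collector with its cumulative count and time. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(snapshot());
    }
}
```

Calling snapshot() twice, some seconds apart, and comparing the numbers shows how much GC work happened in between.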
Once the job has finished, the JobTracker (JT) logs don't show the file input split. To retrieve it you need to:
Grab the task attempt ID, for example attempt_201408070906_0225_m_001837_0.
Log into the TaskTracker (TT) host and grep the task's log files to find the input split file.


Some terms from the thread dump and heap summary above.
1. JNI stands for Java Native Interface. Essentially, it allows communication between Java and native operating system libraries written in other languages.
JNI global references are prone to memory leaks, as they are not automatically garbage collected, and the programmer must explicitly free them. If you are not writing any JNI code yourself, it is possible that the library you are using has a memory leak.
A JNI global reference is a reference from "native" code to a Java object managed by the Java garbage collector. Its purpose is to prevent collection of an object that is still in use by native code, but doesn't appear to have any live references in the Java code.

A leak of this kind is usually caused by native (non-Java) objects that the application has not released, rather than by Java objects on the heap.

2. Memory in the Java HotSpot virtual machine is organized into three generations: a young generation, an old generation, and a permanent generation. 
Most objects are initially allocated in the young generation. The old generation contains objects that have survived some number of young generation collections, as well as some large objects that may be allocated directly in the old generation. The permanent generation holds objects that the JVM finds convenient to have the garbage collector manage, such as objects describing classes and methods, as well as the classes and methods themselves.

The New or Young Generation - This is where objects created by an application first go. At this point there are "references" to the object from the application, and each garbage collection checks whether objects are still referenced. An object that survives enough young generation collections is moved to the next generation, called the Old or Tenured generation. The number of collections it must survive is the tenuring threshold, which HotSpot adjusts at run time up to the limit set by -XX:MaxTenuringThreshold (15 or lower on current HotSpot releases). Objects can also be moved earlier if the survivor spaces fill up.
The Old or Tenured Generation - As mentioned above, objects that survive enough collections in the Young generation are moved into the Old or Tenured generation. This generation is effectively only collected when there is a Full Garbage Collection (Full GC). Full GCs in the current family of Sun JVMs are "stop-the-world" events: for the duration of the Full GC the JVM stops doing everything else. Too many Full GCs, or Full GCs that run too long, therefore hurt performance, and that is one of the things we always look for.
The Permanent Generation/PermGen - The thing that we want to avoid here is this generation permanently sitting at 100% with no room to grow.  In a lot of cases, we have seen out of memory errors caused by the permanent generation running at 100%.
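
To put the 99% PermGen figure from the dump above in context, it helps to compare the committed size with the reserved size. The sketch below assumes the three addresses in each heap line are [start, committed end, reserved end), which is how HotSpot prints these generations as far as I know; the class name HeapLine is hypothetical.

```java
public class HeapLine {
    /**
     * Parses the "[start, committedEnd, reservedEnd)" part of a heap summary
     * line and returns {committed KB, reserved KB}.
     */
    static long[] sizesKb(String line) {
        int open = line.indexOf('[');
        int close = line.indexOf(')', open);
        String[] parts = line.substring(open + 1, close).split(",");
        long start = Long.decode(parts[0].trim());
        long committedEnd = Long.decode(parts[1].trim());
        long reservedEnd = Long.decode(parts[2].trim());
        return new long[] {
            (committedEnd - start) / 1024,
            (reservedEnd - start) / 1024
        };
    }

    public static void main(String[] args) {
        // PSPermGen line copied from the dump above
        String permGen = "PSPermGen       total 31616K, used 31448K "
                + "[0x00000000bae00000, 0x00000000bcce0000, 0x00000000c0000000)";
        long[] kb = sizesKb(permGen);
        // prints: committed=31616K reserved=83968K
        System.out.println("committed=" + kb[0] + "K reserved=" + kb[1] + "K");
    }
}
```

Under that assumption, the 99% is measured against the 31616K committed so far, while about 82 MB is reserved, so in this particular dump PermGen still had room to grow.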



http://journals.ecs.soton.ac.uk/java/tutorial/native1.1/implementing/refs.html
