Alvin's Big Data Notebook : Set Reducer in Map-only Pig Job

Tuesday, 15 July 2014

Set Reducer in Map-only Pig Job

If your job is map-only, the processing has no reduce phase, so each mapper writes its records to its own file and you end up with one output file per mapper.

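This goes away when you use an ORDER BY because ORDER BY triggers a reduce phase, at which point the job's default parallelism (20 in this case) comes into play and the number of output files is determined by the number of reducers rather than the number of mappers.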

ranked_data = ORDER mydata BY rank_score DESC;
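
As a rough sketch of where that parallelism can come from, the script below sets it explicitly with `SET default_parallel`. The input path, schema, and output path are hypothetical and only there to make the script complete; the value 20 simply mirrors the figure mentioned above.

```pig
-- Hypothetical paths and schema, for illustration only.
SET default_parallel 20;  -- default reducer count for operators that need a reduce phase

mydata = LOAD 'input/scores' USING PigStorage('\t')
         AS (id:chararray, rank_score:double);

-- ORDER BY triggers a reduce phase, so the number of output files
-- follows the reducer count instead of the mapper count.
ranked_data = ORDER mydata BY rank_score DESC;

STORE ranked_data INTO 'output/ranked' USING PigStorage('\t');
```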

If you were in a situation where you weren't already doing a join or another operation that triggers a reduce phase, you could force one using a do-nothing GROUP BY, like so:

reduced = FOREACH (GROUP some_data BY RANDOM()) GENERATE FLATTEN(some_data);
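
To control how many files the forced reduce phase writes, one option is the `PARALLEL` clause on the GROUP. This is a minimal sketch, assuming hypothetical input and output paths and an arbitrary reducer count of 10:

```pig
-- Hypothetical paths; the reducer count of 10 is an arbitrary example value.
some_data = LOAD 'input/events' USING PigStorage('\t');

-- Grouping on RANDOM() spreads records across reducers; PARALLEL sets how many reducers run.
grouped = GROUP some_data BY RANDOM() PARALLEL 10;

-- FLATTEN unpacks each group's bag back into the original records, dropping the random key.
reduced = FOREACH grouped GENERATE FLATTEN(some_data);

STORE reduced INTO 'output/compacted' USING PigStorage('\t');
```

Keep in mind that this adds a full shuffle to a job that otherwise would not need one, so fewer, larger output files come at the cost of extra run time.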

References:

http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-instead-of-thousands-of-ti

http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
