Your job is map-only. Nothing in your processing needs a reduce phase, so each mapper writes its records to its own file, and you end up with one file per mapper.
This goes away when you use an ORDER BY because that triggers a reduce phase, at which point the default parallelism of 20 comes into play:

    ranked_data = ORDER mydata BY rank_score DESC;
If you were in a situation where you weren't doing a join, you could force a reduce phase using a do-nothing GROUP BY, like so:

    reduced = FOREACH (GROUP some_data BY RANDOM()) GENERATE FLATTEN(some_data);
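If the goal is a particular number of output files, the same trick can be combined with Pig's PARALLEL clause, which sets the number of reducers for the operator it is attached to. The following is a minimal sketch, not a tested script: the input and output paths, the schema, and the reducer count of 5 are all placeholders chosen for illustration.

```pig
-- Hypothetical input; replace the path and schema with your own.
some_data = LOAD 'input_path' AS (id:chararray, rank_score:double);

-- Grouping on a random key forces a reduce phase.
-- PARALLEL sets the number of reducers used for this GROUP.
grouped = GROUP some_data BY RANDOM() PARALLEL 5;

-- Unpack the bags so the records come out unchanged.
reduced = FOREACH grouped GENERATE FLATTEN(some_data);

-- Hypothetical output location; each reducer writes its own part file.
STORE reduced INTO 'output_path';
```

Since each reducer writes its own part file, this should leave five part files in the output directory instead of one per mapper. Alternatively, a line such as `SET default_parallel 5;` at the top of the script sets the reducer count for every reduce-side operator that has no PARALLEL clause of its own.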