Friday 19 June 2015

Count(distinct ) Performance in Hive

select count(distinct user)
from some_table;
This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.
select count(*) from
(select user
 from some_table
 group by user) q;
This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.
So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.

Reference:
http://stackoverflow.com/questions/19311193/why-is-countdistinct-slower-than-group-by-in-hive
http://docs.treasuredata.com/articles/performance-tuning

No comments:

Post a Comment