Friday 2 January 2015

Find User Activities in a Time Window in Hive Table

Suppose there is a website tracking user activities to prevent robotic 
attack on the Internet. Here are field definition and example:
User ID    TimeStamp    Activity Count
123    9:45am    10
234    9:46am    12
234    9:50am    20
456    9:53am    100
123    9:55am    33
456    9:56am    312
123    10:03am    110
123    10:16am    312
234    10:20am    201
456    10:23am    180
123    10:25am    393
456    10:27am    312

Please design an algorithm to identify user IDs that have more than 500 
activities within any given 10 minutes.

1. Hive Solution:


SELECT distinct a.id FROM 
( 
 SELECT a.id, a.time FROM table a 
 JOIN table b ON a.id=b.id 
 WHERE TIMEDIFF(a.time, b.time)<=10 AND TIMEDIFF(a.time, b.time)>=0 
 GROUP BY a.id, a.time 
 HAVING SUM(b.count)>500
)


2. Another approach to select data is a time window(the last 30 days) is:

where to_date(timestamp) < from_unixtime(unix_timestamp(), 'yyyy-MM-dd')
and to_date(timestamp) >= date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),30)

Here, to_date converts string to yyyy-MM-dd format. unix_timestamp() gets the current sys timestamp in seconds.


No comments:

Post a Comment