Tuesday 29 July 2014

Append Only Yesterday's Data to Sqoop and Pig

Append only the new data from yesterday(from 00:00:00 to 23:59:59) to the data pipeline.

1. Sqoop

--query
"Select * from SQLTable
WHERE ReceivedDate > CONVERT(VARCHAR(10),GETDATE()-1,110) and ReceivedDate < CONVERT(VARCHAR(10),GETDATE(),110)"


2. Pig

date.txt
2014-07-28 00:00:00.107
2014-07-27 23:59:00.157
2014-07-28 05:00:00.467
2014-07-28 09:00:00.62
2014-07-26 08:00:01.167
2014-07-29 00:00:01.14
2014-07-28 23:59:59.48
2014-07-28 12:30:01.483
---------------------------------
A = load 'date.txt' as (in:chararray);
C = filter A by in < ToString(CurrentTime(), 'yyyy-MM-dd') and in > ToString(AddDuration(CurrentTime(),'P-1D'), 'yyyy-MM-dd');
dump C;
---------------------------------
(2014-07-28 00:00:00.107)
(2014-07-28 05:00:00.467)
(2014-07-28 09:00:00.62)
(2014-07-28 23:59:59.48)
(2014-07-28 12:30:01.483)

3. Set Current Date Variable in Pig

%declare CURR_DATE `date +%Y-%m-%d`;
STORE pv into 'sessions/$CURR_DATE' USING PigStorage(',');


Reference:
1. SQL Conver http://www.w3schools.com/sql/func_convert.asp
2. http://stackoverflow.com/questions/17764761/comparing-datetime-in-pig
3. AddDuration http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/AddDuration.html
4. ToString http://stackoverflow.com/questions/18656852/pig-udf-for-iso-to-yyyy-mm-dd-hhmmss-000
5. Incremental Updates http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive
6. http://stackoverflow.com/questions/16749851/formatting-date-in-generate-statement



No comments:

Post a Comment