Friday 3 October 2014

DataFu Hourglass & UDFs

How to build DataFu.

1. Download Gradle.
2. Bootstrap the Gradle wrapper by running the following in the project folder. After that, the regular gradlew commands are available:
./bin/gradle -b bootstrap.gradle
3. Generate the Eclipse project and classpath files:
./gradlew eclipse
4. Build and test the datafu-pig subproject:
./gradlew :datafu-pig:build
./gradlew :datafu-pig:test

1. Hourglass is a framework to incrementally process data with Hadoop MapReduce.

The input data is partitioned by time, and the range of input to process is adjusted as new data arrives. Because a previous output already exists, Hourglass only needs to consume that output and the new day of input, merging the two instead of reprocessing the whole range.
It supports both fixed-length use cases (a sliding window of constant size) and fixed-start use cases (a window that grows from a fixed start date).
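
The idea can be illustrated with a small sketch in plain Python. This is not the Hourglass API, only a toy model of the reuse step, assuming the job is a simple per-key count. The function names (merge_new_day, slide_window) and the sample data are made up for illustration.

```python
from collections import Counter

def merge_new_day(previous_output, new_day):
    """Fixed-start: add the new day's counts to the previous output."""
    result = Counter(previous_output)
    result.update(new_day)
    return result

def slide_window(previous_output, new_day, oldest_day):
    """Fixed-length: add the new day, subtract the day that left the window."""
    result = Counter(previous_output)
    result.update(new_day)
    result.subtract(oldest_day)
    # Drop keys whose count fell to zero so they vanish from the output.
    return Counter({key: count for key, count in result.items() if count > 0})

# Hypothetical daily input: events per member, partitioned by day.
days = {
    "2014-10-01": Counter({"alice": 2, "bob": 1}),
    "2014-10-02": Counter({"alice": 1}),
    "2014-10-03": Counter({"bob": 3, "carol": 1}),
}

# Fixed-start: each run reads only the previous output plus one new day.
total = Counter()
for day in sorted(days):
    total = merge_new_day(total, days[day])

# Fixed-length (2-day window): start with 10-01..10-02, then slide to 10-02..10-03.
window = merge_new_day(days["2014-10-01"], days["2014-10-02"])
window = slide_window(window, days["2014-10-03"], days["2014-10-01"])

print(dict(total))
print(dict(window))
```

In both cases each run touches one or two days of input plus the previous result, never the full history. The subtraction trick only works when the aggregate can be "un-merged" (counts and sums can, a maximum cannot).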


2. List of DataFu analysis UDFs.

DataFu 1.2.0

Packages
datafu.pig.bags: A collection of general purpose UDFs for operating on bags.
datafu.pig.geo: UDFs for geographic computations.
datafu.pig.hash: UDFs for computing hashes from data.
datafu.pig.linkanalysis: UDFs for performing link analysis, such as PageRank.
datafu.pig.random: UDFs dealing with randomness.
datafu.pig.sampling: Sampling UDFs, including weighted sample, reservoir sampling, sampling by key, etc.
datafu.pig.sessions: UDFs for sessionizing web log data.
datafu.pig.sets: UDFs for set operations such as intersect and union.
datafu.pig.stats: Statistics UDFs for computing median, quantiles, variance, confidence intervals, etc.
datafu.pig.urls: UDFs for processing URLs.
datafu.pig.util: Other useful utilities.


3. The Bacon project contains a URL-parsing UDF. It requires Java 7.
https://github.com/aaronbinns/bacon

References:
https://github.com/apache/incubator-datafu
http://datafu.incubator.apache.org/docs/datafu/1.2.0/
http://datafu.incubator.apache.org/docs/hourglass/getting-started.html
http://datafu.incubator.apache.org/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html



