Wednesday, 20 August 2014

Flume Eliminates Small Files

Drawbacks of Flume:
  • Specialized for HDFS import, making it difficult to use with other systems
  • The producer library is Java, so it would require running a JVM on production servers
  • Push-only: if the sink doesn't keep up with the sources, the producer client can block


A netcat example:

Edit the "myagent.conf" file:
myagent.sources = netcat-collect
myagent.sinks = hdfs-write
myagent.channels = memory-channel

myagent.sources.netcat-collect.type = netcat
myagent.sources.netcat-collect.bind = 127.0.0.1
myagent.sources.netcat-collect.port = 22222

myagent.sinks.hdfs-write.type = hdfs
myagent.sinks.hdfs-write.hdfs.path = hdfs://localhost.localdomain:8020/user/cloudera/flume_test

# Number of seconds to wait before rolling the current file
# (0 = never roll based on time interval; the default is 30 seconds)
myagent.sinks.hdfs-write.hdfs.rollInterval = 0
myagent.sinks.hdfs-write.hdfs.writeFormat = Text

# Number of events written to the file before it is flushed to HDFS
myagent.sinks.hdfs-write.hdfs.batchSize = 10000

# File size to trigger a roll, in bytes (0 = never roll based on file size);
# 26843545 bytes is roughly 25.6 MB
myagent.sinks.hdfs-write.hdfs.rollSize = 26843545

myagent.sinks.hdfs-write.hdfs.fileType = DataStream

# Number of events written to the file before it is rolled (0 = never roll based on event count)
myagent.sinks.hdfs-write.hdfs.rollCount = 0

myagent.channels.memory-channel.type = memory
myagent.channels.memory-channel.capacity = 10000
myagent.channels.memory-channel.transactionCapacity = 1000

myagent.sources.netcat-collect.channels = memory-channel
myagent.sinks.hdfs-write.channel = memory-channel
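The most common mistake in a config like the one above is forgetting to bind a source or sink to its channel. A quick sanity check can catch this before starting the agent; the sketch below is a hypothetical helper (not part of Flume) that parses the properties and verifies the wiring:

```python
# Sanity-check the source/sink/channel wiring of a Flume properties file.
# Hypothetical helper -- not part of Flume itself.

def parse_props(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def check_wiring(props, agent):
    """Return a list of wiring problems (empty list means the config looks sane)."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        bound = set(props.get(f"{agent}.sources.{source}.channels", "").split())
        if not bound & channels:
            problems.append(f"source {source} is not bound to a declared channel")
    for sink in props.get(f"{agent}.sinks", "").split():
        if props.get(f"{agent}.sinks.{sink}.channel") not in channels:
            problems.append(f"sink {sink} is not bound to a declared channel")
    return problems

conf = """
myagent.sources = netcat-collect
myagent.sinks = hdfs-write
myagent.channels = memory-channel
myagent.sources.netcat-collect.channels = memory-channel
myagent.sinks.hdfs-write.channel = memory-channel
"""
print(check_wiring(parse_props(conf), "myagent"))  # -> []
```

Deleting either of the last two binding lines would make the check report the missing binding instead of an empty list.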


$ flume-ng agent -f myagent.conf -n myagent

$ while true; do echo test; done | nc 127.0.0.1 22222
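The nc loop above can also be reproduced from code, since the netcat source just reads newline-delimited events from a TCP socket. A minimal self-contained sketch: the in-process server thread below merely stands in for the Flume netcat source (so the example runs without an agent), and the port is ephemeral rather than the 22222 used in the config.

```python
import socket
import threading

received = []

def fake_netcat_source(server):
    """Stand-in for the Flume netcat source: accept one client, read its lines."""
    conn, _ = server.accept()
    with conn:
        buf = b""
        while True:
            chunk = conn.recv(1024)
            if not chunk:
                break
            buf += chunk
    received.extend(buf.decode().splitlines())

# Listen on an ephemeral port (the real source in this post binds 127.0.0.1:22222).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=fake_netcat_source, args=(server,))
t.start()

# Producer side: the equivalent of `while true; do echo test; done | nc`.
with socket.create_connection(("127.0.0.1", port)) as client:
    for _ in range(5):
        client.sendall(b"test\n")

t.join()
server.close()
print(received)  # -> ['test', 'test', 'test', 'test', 'test']
```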


If you are using the VM for testing, you need to set the replication factor to 1: edit /etc/hadoop/conf/hdfs-site.xml and set the dfs.replication property.
After restarting the HDFS service, fix the under-replicated blocks with:

$ hadoop fs -setrep -R 1 /
This command sets the replication factor to 1 for files that were created with the default factor of 3, so HDFS will not waste resources trying to replicate them.

Reference:

http://www.drdobbs.com/database/acquiring-big-data-using-apache-flume/240155029?pgno=1
http://stackoverflow.com/questions/14141287/using-an-hdfs-sink-and-rollinterval-in-flume-ng-to-batch-up-90-seconds-of-log-in
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
