- Specialized for HDFS import, making it difficult to use with other systems
- The producer library is Java-based: it would require running a JVM on production servers
- Push-only, meaning that if the sink doesn't keep up with the sources, the producer client can block
A netcat example:

Edit the "myagent.conf" file:
myagent.sources = netcat-collect
myagent.sinks = hdfs-write
myagent.channels = memory-channel

myagent.sources.netcat-collect.type = netcat
myagent.sources.netcat-collect.bind = 127.0.0.1
myagent.sources.netcat-collect.port = 22222

myagent.sinks.hdfs-write.type = hdfs
myagent.sinks.hdfs-write.hdfs.path = hdfs://localhost.localdomain:8020/user/cloudera/flume_test
# Number of seconds to wait before rolling the current file
# (0 = never roll based on time interval; the default is 30 seconds)
myagent.sinks.hdfs-write.hdfs.rollInterval = 0
myagent.sinks.hdfs-write.hdfs.writeFormat = Text
# Number of events written to the file before it is flushed to HDFS
myagent.sinks.hdfs-write.hdfs.batchSize = 10000
# File size to trigger a roll, in bytes (0 = never roll based on file size)
myagent.sinks.hdfs-write.hdfs.rollSize = 26843545
myagent.sinks.hdfs-write.hdfs.fileType = DataStream
# Number of events written to the file before it is rolled (0 = never roll based on number of events)
myagent.sinks.hdfs-write.hdfs.rollCount = 0

myagent.channels.memory-channel.type = memory
myagent.channels.memory-channel.capacity = 10000
myagent.channels.memory-channel.transactionCapacity = 1000

myagent.sources.netcat-collect.channels = memory-channel
myagent.sinks.hdfs-write.channel = memory-channel
Start the agent, then pipe test events to the port the netcat source is listening on:

$ flume-ng agent -f myagent.conf -n myagent
$ while true; do echo test; done | nc 127.0.0.1 22222
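As an alternative to `nc`, events can be sent from any language, since the netcat source simply reads newline-delimited lines from a TCP connection. A minimal Python sketch, assuming the host and port from the config above (`send_events` is an illustrative helper name, not part of any Flume API):

```python
import socket

def send_events(events, host="127.0.0.1", port=22222):
    """Send newline-delimited events to a Flume netcat source over TCP."""
    with socket.create_connection((host, port)) as sock:
        for event in events:
            # Each line becomes one Flume event in the netcat source.
            sock.sendall(event.encode("utf-8") + b"\n")

# Example (requires the Flume agent above to be running):
# send_events("test event %d" % i for i in range(10))
```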
If you are using the VM for testing, you need to set the replication factor to 1 via the dfs.replication property in /etc/hadoop/conf/hdfs-site.xml.
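The corresponding entry in hdfs-site.xml would look like:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```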
After restarting the HDFS service, fix the under-replicated blocks with:
$ hadoop fs -setrep -R 1 /
This command sets the replication factor to 1 for files that were created with the default factor of 3, so HDFS will not spend resources trying to replicate them.