Tuesday, 14 October 2014

Flume's Avro Source vs. HTTP Source

Avro Source:
The Avro Source is designed to be a highly scalable RPC server that accepts data into a Flume agent, from another Flume agent’s Avro Sink or from a client application that uses Flume’s SDK to send data.
If the data is ingested via an RPC client into an RPC source, the client application converts the data into Flume events before sending.

Flume’s Avro Source uses the Netty-Avro inter-process communication (IPC) protocol to communicate. So, it is possible to send data to the Avro Source from Java or any other JVM language.
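As a sketch, a minimal agent configuration exposing an Avro source might look like the following (the agent name, channel name, bind address, and port are all illustrative, not prescribed):

```properties
# Illustrative agent configuration: an Avro source feeding a memory channel.
# An Avro Sink on another agent, or an SDK RpcClient, would connect to port 41414.
agent1.sources = avroSrc
agent1.channels = memCh

agent1.sources.avroSrc.type = avro
agent1.sources.avroSrc.bind = 0.0.0.0
agent1.sources.avroSrc.port = 41414
agent1.sources.avroSrc.channels = memCh

agent1.channels.memCh.type = memory
```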

HTTP Source:
Flume comes bundled with an HTTP Source that can receive events via HTTP POST (with a JSON body by default).
For application environments where it might not be possible to deploy the Flume SDK and its dependencies, or in cases where the client code prefers to send data over HTTP rather than over Flume’s RPC, the HTTP Source can be used to receive data into Flume.
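A sketch of an agent configuration using the HTTP Source instead (again, agent, channel, and port names are illustrative):

```properties
# Illustrative agent configuration: an HTTP source accepting POSTs on port 5140,
# using the default JSON handler to turn request bodies into Flume events.
agent1.sources = httpSrc
agent1.channels = memCh

agent1.sources.httpSrc.type = http
agent1.sources.httpSrc.port = 5140
agent1.sources.httpSrc.channels = memCh

agent1.channels.memCh.type = memory
```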

A pluggable handler allows the HTTP Source to accept data from clients in any format the handler can process. The handler converts the input data from the HttpServletRequest into Flume events.

Events are the basic form of representation of data in Flume. Each Flume event contains a map of headers and a body, which is the payload represented as a byte array.
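To make this concrete, the default JSON handler expects the POST body to be a JSON array of events, each with a headers map and a body string. A minimal sketch of building such a payload (the header names, body contents, and endpoint are illustrative, not required by Flume):

```python
import json

def make_flume_payload(events):
    """Build the JSON array of events accepted by Flume's default
    HTTP JSON handler: each event is an object with a 'headers'
    map and a 'body' string."""
    return json.dumps([
        {"headers": headers, "body": body}
        for headers, body in events
    ])

payload = make_flume_payload([
    ({"host": "web-01", "timestamp": "1413244800000"}, "GET /index.html 200"),
    ({"host": "web-02", "timestamp": "1413244801000"}, "GET /login 302"),
])

# This payload could then be POSTed to the HTTP Source, e.g.
# (endpoint is illustrative):
#   curl -X POST -d "$payload" http://flume-host:5140
print(payload)
```

Note the asymmetry with the event model itself: in a Flume event the body is a byte array, but in this JSON representation it travels as a string and is decoded by the handler, which is part of the encoding/decoding overhead discussed below.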

Compare:
HTTP Source:
The application sends data in an HTTP-friendly format (JSON, if the default handler is used). The issue with this is that it is less efficient than it needs to be, due to the additional HTTP and encoding/decoding overhead.

Avro Source:
Since the format of Flume events is fixed, the best way to send data to Flume is via RPC calls in Flume’s supported RPC format: Avro.


There is an initial conversion from the original format to a Flume event at the source, and a second conversion to the eventual destination format at the destination sink.
If an event comes in as Avro, it can simply be encoded into a byte array using the Avro API and set as the event’s body. It can be written to HDFS using AvroEventSerializer, which simply writes the data as is, thus making the data available in the original Avro format in an Avro container file on HDFS.
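A sketch of an HDFS sink configured this way (the path and channel name are illustrative; the fully qualified serializer class name follows Flume's HDFS-sink Avro serializer convention, and should be checked against the version in use):

```properties
# Illustrative HDFS sink: writes event bodies out as-is into an
# Avro container file, preserving the original Avro format.
agent1.sinks = hdfsSink
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
agent1.sinks.hdfsSink.channel = memCh
```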



