Wednesday, 14 September 2016

Apache NiFi ListSFTP->FetchSFTP Dataflow in Cluster Mode

Nodes in a cluster work independently from one another and do not know about each other. That
is accurate. Each node in a cluster runs the same flow.

The way to use the List + Fetch processors in a cluster is to 
schedule the List processor to run on primary node only, and connect it to 
a Remote Process Group, which use Site-to-Site to distribute
that listing to all nodes in the cluster.

If your NiFi instance is clustered, it will store the information in ZooKeeper. If not clustered,
it will store the state in a local file. This is done because in a cluster, you typically want
to run your List*** Processors on Primary Node only, and this allows another node to pick up where the
previous one left off if the Primary Node changes. Of course, storing all of the files that have been
listed can become very verbose so it stores only a small amount of data -- the timestamp of the latest file
discovered and the timestamp of the latest file process/listed. It can then use this information to determine if files are new or modified without storing much info.

The Remote Process Group points back to the same cluster and connects to an 
Input Port port, this is essentially distributing FlowFiles, where each FlowFile contains one file name across your slave nodes. 
The Input Port then connects to FetchFile so all slaves nodes fetch in 
parallel, but fetch different files. In this case the primary node is also one of the target NiFi instances.

When you add a Remote Process Group (RPG) to your graph and supply it
with a URL for the "target" NiFi system, the sending system has no idea
whether it is talking to another standalone NiFi instance or the NCM of a
cluster.  So in the scenario you have where it is talking to a cluster NCM,
once the connection is established it will try and send data to
that NCM. Basically, the NCM will respond that it does not want your data
while at the same time sending the source instance the URL, site-to-site ports and
stats for every currently connected slave node.  The sender will then use that
supplied information to smartly load-balance the delivery of its queued
data to all of those Nodes.
The NiFi Cluster Manager (NCM) is in charge of splitting the incoming FlowFiles among the nodes. It knows the current load of each and splits up the files accordingly.

Site-to-Site protocol has to been enable in Every node, including NCM.

# Site to Site properties = //ip of this node
nifi.remote.input.socket.port =9090 //must be set to use RPG =false

Otherwise, you will see below error:

17:53:20 UTCWARNINGfa1224d3-cf5b-4fc7-8951-55bee660691e
worker-ip:8080 EndpointConnectionPool[Cluster URL=http://ncm-ip:8080/nifi] Unable to refresh Remote Group's peers due to Remote instance of NiFi is not configured to allow site-to-site communications


No comments:

Post a Comment