Alvin's Big Data Notebook : Apache NiFi ListSFTP->FetchSFTP Dataflow in Cluster Mode

Nodes in a cluster work independently from one another and do not know about each other. That

is accurate. Each node in a cluster runs the same flow.

The way to use the List + Fetch processors in a cluster is to

schedule the List processor to run on primary node only, and connect it to

a Remote Process Group, which use Site-to-Site to distribute

that listing to all nodes in the cluster.

If your NiFi instance is clustered, it will store the information in ZooKeeper. If not clustered,

it will store the state in a local file. This is done because in a cluster, you typically want

to run your List*** Processors on Primary Node only, and this allows another node to pick up where the

previous one left off if the Primary Node changes. Of course, storing all of the files that have been

listed can become very verbose so it stores only a small amount of data -- the timestamp of the latest file

discovered and the timestamp of the latest file process/listed. It can then use this information to determine if files are new or modified without storing much info.

The Remote Process Group points back to the same cluster and connects to an

Input Port port, this is essentially distributing FlowFiles, where each FlowFile contains one file name across your slave nodes.

The Input Port then connects to FetchFile so all slaves nodes fetch in

parallel, but fetch different files. In this case the primary node is also one of the target NiFi instances.

When you add a Remote Process Group (RPG) to your graph and supply it
with a URL for the "target" NiFi system, the sending system has no idea
whether it is talking to another standalone NiFi instance or the NCM of a
cluster. So in the scenario you have where it is talking to a cluster NCM,

once the connection is established it will try and send data to

that NCM. Basically, the NCM will respond that it does not want your data

while at the same time sending the source instance the URL, site-to-site ports and

stats for every currently connected slave node. The sender will then use that

supplied information to smartly load-balance the delivery of its queued

data to all of those Nodes.

The NiFi Cluster Manager (NCM) is in charge of splitting the incoming FlowFiles among the nodes. It knows the current load of each and splits up the files accordingly.

Site-to-Site protocol has to been enable in Every node, including NCM.

# Site to Site properties

nifi.remote.input.socket.host =172.31.48.155 //ip of this node

nifi.remote.input.socket.port =9090 //must be set to use RPG

nifi.remote.input.secure =false

Otherwise, you will see below error:

17:53:20 UTCWARNINGfa1224d3-cf5b-4fc7-8951-55bee660691e

worker-ip:8080 EndpointConnectionPool[Cluster URL=http://ncm-ip:8080/nifi] Unable to refresh Remote Group's peers due to java.io.IOException: Remote instance of NiFi is not configured to allow site-to-site communications

Reference:

https://community.hortonworks.com/questions/50627/listsftp-works-but-fetchsftp-doesnt-work-in-cluste.html

https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_UserGuide/content/site-to-site.html

https://community.hortonworks.com/articles/16461/nifi-understanding-how-to-use-process-groups-and-r.html

https://mail-archives.apache.org/mod_mbox/nifi-users/201510.mbox/%3CCAC9dF2f3tENy8PpQOnS-MjNYTiyn6_ohe2m0Cm1pjNhJBScrTA@mail.gmail.com%3E

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html