Tuesday 29 November 2016

"Akka in Action" Chapter 13-15

Akka Streams

A stream of data is a sequence of elements that could have no end. How do you buffer data without running out of memory? How can a producer know whether a consumer can keep up? Akka-stream provides a way to handle unbounded streams with bounded buffers.

Akka-stream involves two steps:
1. Define a blueprint: A graph of stream-processing components.
2. Execute the blueprint: Run the graph on an ActorSystem.

The graph can be run as many times as you like, and every run is executed by a new set of actors.
A graph is said to be materialized once it is run. The materializer eventually creates the actors that execute the RunnableGraph.
A RunnableGraph can only be created if all inputs and outputs in the graph are connected, and the materializer checks that this is the case.
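
As a minimal sketch of both steps (assuming the Akka 2.4-era scaladsl API, where an ActorMaterializer turns the blueprint into actors):

    import akka.NotUsed
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{RunnableGraph, Sink, Source}

    object BlueprintExample extends App {
      implicit val system = ActorSystem("streams")
      implicit val materializer = ActorMaterializer()

      // Step 1: define the blueprint; nothing runs yet.
      val blueprint: RunnableGraph[NotUsed] =
        Source(1 to 5)
          .map(_ * 2)
          .to(Sink.foreach(println))

      // Step 2: execute the blueprint. Each run is backed by a fresh
      // set of actors, so the same blueprint can be run any number of times.
      blueprint.run()
      blueprint.run()
    }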

Nonblocking Back-Pressure

The subscriber and publisher asynchronously send each other messages about supply and demand, expressed as a fixed number of elements.

Akka-stream uses buffers internally to optimize throughput: internally, batches of elements are requested and published. The default setting is max-input-buffer-size = 16.
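
For example, a sketch of overriding the buffer for a single stage via Attributes.inputBuffer (the materializer default comes from configuration):

    import akka.stream.Attributes
    import akka.stream.scaladsl.{Sink, Source}

    object BufferTuning {
      // Shrink the internal buffer of the map stage to a single element;
      // the default is akka.stream.materializer.max-input-buffer-size = 16.
      val oneAtATime = Source(1 to 100)
        .map(_ * 2)
        .withAttributes(Attributes.inputBuffer(initial = 1, max = 1))
        .to(Sink.ignore)
    }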


Actor Persistence

A persistent actor makes it easy for an actor to record its state as events and to recover from the events after a crash or restart.
An actor whose lifetime spans more than one message and that accumulates valuable information over time often requires persistent state. Instead of storing only the last result in one record, we'll capture every successful operation in a journal as an event. Commands turn into events if the commands are valid and if they affect the state of the actor.
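
As a sketch of that distinction (a hypothetical Calculator domain, not an example taken from the book):

    // Commands express intent and may still be rejected by validation.
    sealed trait Command
    case class Add(amount: Int) extends Command

    // Events are facts: they are appended to the journal only after
    // the command has been validated and has affected the state.
    sealed trait Event
    case class Added(amount: Int) extends Event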

One of the biggest benefits of event sourcing is that writing to and reading from the database are separated into two distinct phases. Reading from the journal only happens when recovering the state of a persistent actor. The actor can simply process messages and keep state in memory, as long as it makes sure to persist events.

A journal only needs to support appending serialized events and reading deserialized events from a position in the journal. Because the events in the journal are effectively immutable, concurrent access is far simpler.

An obvious drawback is the increase in required storage space. Creating snapshots of the actor state can reduce the required storage space and speed up recovery of the current state. Snapshots are stored in a separate SnapshotStore.
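
A sketch of both snapshot hooks, reusing the hypothetical Added event from above (the every-1000-events threshold is purely illustrative):

    import akka.persistence.{PersistentActor, SnapshotOffer}

    class SnapshottingCalculator extends PersistentActor {
      override def persistenceId = "calculator-1"
      var total = 0

      override def receiveCommand: Receive = {
        case Add(amount) =>
          persist(Added(amount)) { event =>
            total += event.amount
            // Bound recovery time by snapshotting periodically.
            if (lastSequenceNr % 1000 == 0) saveSnapshot(total)
          }
      }

      override def receiveRecover: Receive = {
        // Recovery starts from the latest snapshot, then replays only newer events.
        case SnapshotOffer(_, snapshot: Int) => total = snapshot
        case Added(amount)                   => total += amount
      }
    }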

Event sourcing requires some form of serialization of events. It really only provides a way to recover state from events; it's not a solution for ad hoc queries.

A persistent actor works in two modes: it recovers from events or it processes commands. Commands are messages that are sent to the actor to execute some logic; events provide evidence that the actor has executed the logic correctly. Every persistent actor requires a persistenceId, which is used to uniquely identify the events in the journal for that actor. The ID is automatically passed to the journal when a persistent actor persists an event.

The function to handle the persisted event is called asynchronously, but Akka Persistence makes sure that the next command isn't handled before this function has completed, so it's safe to refer to sender() from this function. This does come with some performance overhead, since messages have to be stashed in the meantime.

The receiveRecover handler will be called with all the events that have occurred when the actor starts or restarts. It has to execute exactly the same state-changing logic that was used when the commands were originally processed.
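
Putting the two modes together for the hypothetical Calculator, a minimal sketch of the classic akka.persistence API:

    import akka.persistence.PersistentActor

    class Calculator extends PersistentActor {
      // Uniquely identifies this actor's events in the journal.
      override def persistenceId = "calculator-1"

      var total = 0
      def updateState(event: Added): Unit = total += event.amount

      // Command mode: validate, persist the event, update state in the callback.
      override def receiveCommand: Receive = {
        case Add(amount) if amount > 0 =>
          persist(Added(amount)) { event =>
            updateState(event)
            sender() ! total // safe: the next command is stashed until this runs
          }
      }

      // Recovery mode: replay the journal with exactly the same state logic.
      override def receiveRecover: Receive = {
        case event: Added => updateState(event)
      }
    }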

Persistence Query

This is not a tool for ad hoc queries like SQL. The best use case for Persistence Query is to continuously read events out of band from the persistent actors and write them to another database in a shape more suitable for querying. It supports getting all events, as well as getting the events for a particular persistenceId.
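
A sketch of the read side (assuming the LevelDB journal plugin from akka-persistence-query; other journal plugins expose the same query interface):

    import akka.actor.ActorSystem
    import akka.persistence.query.PersistenceQuery
    import akka.persistence.query.journal.leveldb.scaladsl.LeveldbReadJournal
    import akka.stream.ActorMaterializer

    object QuerySide extends App {
      implicit val system = ActorSystem("query")
      implicit val materializer = ActorMaterializer()

      val readJournal = PersistenceQuery(system)
        .readJournalFor[LeveldbReadJournal](LeveldbReadJournal.Identifier)

      // A live, never-ending stream of events for one persistenceId; each
      // envelope could be written out of band to a store shaped for querying.
      readJournal
        .eventsByPersistenceId("calculator-1", 0L, Long.MaxValue)
        .runForeach(envelope => println(envelope.event))
    }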


Clustering

A cluster is a dynamic group of nodes. On each node is an actor system that listens on the network. Clustering takes location transparency to the next level: an actor might exist locally or remotely and could reside anywhere in the cluster.

Seed nodes are the starting point for the cluster, and they serve as the first point of contact for other nodes. Nodes join the cluster by sending a join message that contains the unique address of the node.
The first node in the seed list starts up, automatically joins itself, and forms the cluster. The first seed node needs to be up before the other seed nodes can join the cluster.
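
A sketch of joining programmatically (a hypothetical "words" cluster with made-up host/ports; the same is usually declared in application.conf via akka.cluster.seed-nodes):

    import akka.actor.{ActorSystem, Address}
    import akka.cluster.Cluster

    object JoinCluster extends App {
      val system = ActorSystem("words")

      // Addresses of the configured seed nodes (hypothetical values).
      val seedNodes = List(
        Address("akka.tcp", "words", "127.0.0.1", 2551),
        Address("akka.tcp", "words", "127.0.0.1", 2552))

      // Sends a join message to the first reachable seed node.
      Cluster(system).joinSeedNodes(seedNodes)
    }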

Clustered Job Processing

The JobReceptionist and JobMaster actors will run on a master role node. The JobWorkers will run on worker role nodes.

Whenever a JobReceptionist receives a JobRequest, it spawns a JobMaster for the job and tells it to start work on the job. The JobMaster creates many JobWorkers, supervises them, and tells them to start. The JobMaster needs to first create the JobWorkers and then broadcast the Work message to them.

The JobWorker receives the Work message and sends a message back to the JobMaster that it wants to enlist itself for work. It also immediately sends the NextTask message to ask for the first task to process. The JobMaster watches all the JobWorkers that enlist, in case one or more of them crashes, and stops all the JobWorkers once the job is finished.

The JobWorker receives a Task, processes it, sends back a TaskResult, and asks for the NextTask. The JobMaster receives TaskResult messages in the working state and merges the results. The JobReceptionist finally receives the merged results and kills the JobMaster, completing the process.
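
A sketch of the message protocol this flow implies (the names follow the description above; the payload types are assumptions):

    import akka.actor.ActorRef

    case class JobRequest(name: String, input: List[String])     // to the JobReceptionist
    case class Work(jobName: String, master: ActorRef)           // broadcast to JobWorkers
    case object Enlist                                           // JobWorker -> JobMaster
    case object NextTask                                         // JobWorker pulls a task
    case class Task(input: List[String], master: ActorRef)       // JobMaster -> JobWorker
    case class TaskResult(counts: Map[String, Int])              // JobWorker -> JobMaster
    case class JobSuccess(name: String, counts: Map[String, Int]) // merged final result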


