Sunday 19 April 2015

Introduction to In-memory Data Grid Hazelcast

We can think of Hazelcast as a library technology: a JAR file that we bring onto our application's classpath and integrate with in order to harness its data distribution capabilities.
In addition to distributed data storage, Hazelcast also provides us with the ability to share out computational power, in the form of a distributed executor.

Do not think of Hazelcast as purely a cache; it is much more powerful than that. It is an in-memory data grid that supports a number of distributed collections and features.
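
As a quick illustration, the following is a minimal sketch of embedding Hazelcast in a plain Java application; the map name "capitals" and the sample data are purely illustrative.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    import java.util.Map;

    public class HelloHazelcast {
        public static void main(String[] args) {
            // Starting an instance makes this JVM a full member of the cluster
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // A named distributed map, visible to every other node in the cluster
            Map<String, String> capitals = hz.getMap("capitals");
            capitals.put("UK", "London");
            System.out.println(capitals.get("UK"));

            hz.shutdown();
        }
    }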

Feature 1: Masterless nature
Each node is configured to be functionally the same. The oldest node in the cluster is the de facto leader and manages the membership, automatically deciding which node is responsible for which data. In this way, as new nodes join or drop out, the process is repeated and the cluster rebalances accordingly.

Feature 2: Persist data entirely in-memory
This makes it incredibly fast, but that speed comes at a price: when a node is shut down, all the data that was held on it is lost. We combat this risk to resilience through replication, holding enough copies of each piece of data across multiple nodes so that, in the event of a failure, the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience.

In terms of scalability, each node is the owner of a number of partitions of the overall data, so the load will be fairly spread across the cluster. Hence, any saturation would be at the cluster level rather than any individual node. We can address this issue simply by adding more nodes.

In terms of consistency, by default the backup copies of the data are internal to Hazelcast and not directly used, so we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is remains an internal operational detail that can vary over time, and we as developers never actually need to know.


Distributed Collections

Queues are great for providing a single pipeline for work distribution. Items can be concurrently offered onto the queue before being taken off in parallel by workers, with Hazelcast ensuring that each item is reliably delivered to exactly one worker while providing us with distribution, resilience, and scalability.
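
As a rough sketch of this in practice, the example below offers items onto a distributed queue and takes one off again; the queue name "work" and the items are illustrative.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IQueue;

    public class QueueExample {
        public static void main(String[] args) throws InterruptedException {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            IQueue<String> work = hz.getQueue("work");

            // Producers offer items onto the shared queue...
            work.offer("task-1");
            work.offer("task-2");

            // ...and workers, potentially on other nodes, take them off in parallel;
            // each item is delivered to exactly one consumer
            String task = work.take();
            System.out.println("Processing " + task);

            hz.shutdown();
        }
    }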


A MultiMap creates a key/list-of-values map.
Hazelcast always returns a cloned copy of the data rather than the instance actually held, so modifying the returned object does not actually update the persisted value.
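
A minimal sketch of a MultiMap follows; the map name "skills" and the values are illustrative. Note that get() returns a copy of the collection held against the key.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.MultiMap;

    public class MultiMapExample {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            MultiMap<String, String> skills = hz.getMultiMap("skills");
            skills.put("alice", "java");
            skills.put("alice", "sql");

            // Returns a cloned copy of the values held for the key; mutating this
            // collection does not change the data held in the cluster
            System.out.println(skills.get("alice"));

            hz.shutdown();
        }
    }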

In order to access the additional indexing functionality, we have to use the Hazelcast-specific IMap interface. To search the map, we use SqlPredicate, which provides us with a SQL-like syntax to describe the filter.
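
The following sketch shows this style of querying, assuming the Hazelcast 3.x API; the "users" map, the User class, and its fields are made-up examples rather than anything defined earlier in this post.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.query.SqlPredicate;

    import java.io.Serializable;
    import java.util.Collection;

    public class QueryExample {
        public static class User implements Serializable {
            private final String name;
            private final int age;
            public User(String name, int age) { this.name = name; this.age = age; }
            public String getName() { return name; }
            public int getAge() { return age; }
        }

        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            IMap<String, User> users = hz.getMap("users");
            users.addIndex("age", true); // optional ordered index to speed up range queries
            users.put("u1", new User("alice", 34));
            users.put("u2", new User("bob", 19));

            // SQL-like filter evaluated across the cluster
            Collection<User> adults = users.values(new SqlPredicate("age >= 21"));
            System.out.println(adults.size()); // 1

            hz.shutdown();
        }
    }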

The distributed map collection provided by Hazelcast is defined by its own IMap interface. This actually extends ConcurrentMap, which provides us with additional atomic operations such as putIfAbsent(key, value) and replace(key, oldValue, newValue). These capabilities go some way towards preventing concurrent modification, as we are able to detect when a change has occurred and handle it appropriately within the application layer.
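
A short sketch of these atomic operations on an IMap; the map name "counters" and the values are illustrative.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class AtomicOpsExample {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            IMap<String, Integer> counters = hz.getMap("counters");

            // Only succeeds if no value is currently mapped to the key
            counters.putIfAbsent("visits", 0);

            // Only succeeds if the current value still matches the expected one,
            // letting us detect and handle concurrent modification in the application
            boolean updated = counters.replace("visits", 0, 1);
            System.out.println("updated = " + updated);

            hz.shutdown();
        }
    }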

Backup policies:
  • backup-count controls the number of backup copies created synchronously on each change. Increasing this number significantly will have performance implications, as we will have to block waiting on confirmations from this many nodes.
  • async-backup-count controls the number of backup copies that are created asynchronously on a best effort basis. This figure combined with backup-count determines the total number of backup copies to be held.

Distributed Lock

In building a broadly scalable application, one aspect we tend to lose is our ability to restrict and prevent concurrent activity. Within a single JVM we would use a synchronized lock to gatekeep a section of functionality from concurrent execution. Once we move away from a single JVM, this problem becomes a much bigger issue. Traditional approaches would leverage a transactional database to provide a system for locking, in the form of a table row lock or transactional state. However, this approach presents us with a single point of failure and concurrency issues when scaling up our application.

Rather than using a Lock object, IMap provides us with key locking capabilities. Using this, we can acquire a mutex on a specific entry, enabling us to prevent concurrent modifications to a targeted piece of data.
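
Here is a minimal sketch of locking an individual key; the "accounts" map, the key, and the balance update are illustrative.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class KeyLockExample {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            IMap<String, Integer> accounts = hz.getMap("accounts");
            accounts.put("acc-1", 100);

            // Acquire a cluster-wide mutex on just this key; other entries remain unlocked
            accounts.lock("acc-1");
            try {
                Integer balance = accounts.get("acc-1");
                accounts.put("acc-1", balance - 10);
            } finally {
                accounts.unlock("acc-1");
            }

            hz.shutdown();
        }
    }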

Transaction

Hazelcast offers REPEATABLE_READ transaction isolation: once you enter a transaction, Hazelcast will automatically acquire the appropriate key locks for each entry that is interacted with. Any changes we write will be buffered locally until the transaction is complete. If the transaction is successful and committed, all the locally buffered changes are flushed out to the wider cluster and the locks released. If the transaction is rolled back, we simply release our locks without flushing out the local changes.
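
A minimal sketch of this flow, assuming the Hazelcast 3.x TransactionContext API; the map name, key, and amounts are illustrative.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.TransactionalMap;
    import com.hazelcast.transaction.TransactionContext;

    public class TransactionExample {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            TransactionContext context = hz.newTransactionContext();
            context.beginTransaction();
            try {
                TransactionalMap<String, Integer> accounts = context.getMap("accounts");

                // Key locks are acquired as entries are touched; writes are buffered locally
                Integer balance = accounts.getForUpdate("acc-1");
                accounts.put("acc-1", (balance == null ? 0 : balance) - 10);

                // On commit, the buffered changes are flushed out and the locks released
                context.commitTransaction();
            } catch (Exception e) {
                // On rollback, the locks are released and the buffered changes discarded
                context.rollbackTransaction();
            }

            hz.shutdown();
        }
    }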

Placing tasks close to the data
By making our task PartitionAware, we can return a key with which our task is going to interact. From this, Hazelcast establishes which partition the key belongs to, and hence the member node that holds that data. The task is then automatically submitted to execute on that specific node, minimizing the network latency incurred when obtaining or manipulating the data.
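
The sketch below shows one way this can look, assuming the Hazelcast 3.x executor API; the task, the "counters" map, and the key are illustrative.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.HazelcastInstanceAware;
    import com.hazelcast.core.IExecutorService;
    import com.hazelcast.core.IMap;
    import com.hazelcast.core.PartitionAware;

    import java.io.Serializable;
    import java.util.concurrent.Callable;
    import java.util.concurrent.Future;

    public class PartitionAwareExample {

        // Routed to whichever member owns the partition for getPartitionKey()
        public static class ReadTask implements Callable<Integer>, PartitionAware<String>,
                HazelcastInstanceAware, Serializable {
            private final String key;
            private transient HazelcastInstance hz;

            public ReadTask(String key) { this.key = key; }

            @Override
            public void setHazelcastInstance(HazelcastInstance hz) { this.hz = hz; }

            @Override
            public String getPartitionKey() { return key; }

            @Override
            public Integer call() {
                // Runs on the member that owns "key", so this read is local to the data
                IMap<String, Integer> counters = hz.getMap("counters");
                return counters.get(key);
            }
        }

        public static void main(String[] args) throws Exception {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, Integer> counters = hz.getMap("counters");
            counters.put("acc-1", 42);

            IExecutorService executor = hz.getExecutorService("exec");
            Future<Integer> result = executor.submit(new ReadTask("acc-1"));
            System.out.println(result.get()); // 42

            hz.shutdown();
        }
    }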


Replication

backup-count configures the number of backups to keep synchronously up-to-date. This means any manipulation operations (put, delete, and so on) will block until this many configured backups have been notified and have confirmed the change.
async-backup-count specifies the number of backups that will be maintained in the background. Creating these backups will not block when creating or changing data; instead they are replicated out to other nodes asynchronously by Hazelcast on a best effort basis.
The number of copies of data will be governed by these two figures:
number-of-copies = 1 owner + backup-count + async-backup-count
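
As a rough illustration, the two settings can also be applied programmatically; the map name and the counts below are illustrative, not recommendations.

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class BackupConfigExample {
        public static void main(String[] args) {
            Config config = new Config();

            MapConfig mapConfig = new MapConfig("accounts");
            mapConfig.setBackupCount(1);      // synchronous backups: writes block until confirmed
            mapConfig.setAsyncBackupCount(1); // asynchronous backups: best effort, non-blocking
            config.addMapConfig(mapConfig);

            // Total copies held = 1 owner + backup-count + async-backup-count = 3
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            hz.shutdown();
        }
    }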


Lite Member

A lite member is a non-participant member of the cluster: it maintains connections to all the other nodes and talks directly to partition owners, but does not provide any storage or computation to the cluster. This avoids the double hop required by the standard client, but adds the additional complexity and overhead of participating in the cluster. If you need higher levels of performance and throughput, you could consider using a lite member.
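
The sketch below assumes a Hazelcast version in which lite members can be enabled programmatically (Config.setLiteMember, available from 3.6 onwards); the data-holding member is started in the same JVM purely to keep the example self-contained.

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class LiteMemberExample {
        public static void main(String[] args) {
            // A regular data-holding member (in a real deployment these run elsewhere)
            HazelcastInstance dataMember = Hazelcast.newHazelcastInstance();

            // The lite member joins the same cluster but owns no partitions and holds no data
            Config liteConfig = new Config();
            liteConfig.setLiteMember(true);
            HazelcastInstance liteMember = Hazelcast.newHazelcastInstance(liteConfig);

            // Requests from the lite member go directly to the partition owners,
            // avoiding the extra hop a standard client would incur
            liteMember.getMap("capitals").put("FR", "Paris");
            System.out.println(dataMember.getMap("capitals").get("FR"));

            Hazelcast.shutdownAll();
        }
    }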
