Alvin's Big Data Notebook : Data and Stored Procedure Partition in VoltDB

Data Partition

The goal of partitioning is to distribute both the data and the processing.

PARTITION TABLE towns ON COLUMN state_num;

Throughput is maximized by partitioning both the data and the transactions that access that data. So multiple transactions can be processed in parallel. Throughput is easily scaled by increasing either the number of partitions per node or the number of nodes in the cluster.

VoltDB also partitions the work(stored procedures) based on the same column value. Because partitions are independent, multiple queries can be run in parallel.

How many sites per Host

VoltDB defaults to eight sites per node to provide reasonable performance on most modern system configurations. This default does not normally need to be

changed. However, for systems with a large number of available processes (16 or more) or older machines with fewer than 8 processors and limited memory, you may wish to tune the sitesperhost attribute.

The number of sites needed per node is related to the number of processor cores each system has, the optimal number being approximately 3/4 of the number of CPUs reported by the operating system.

it is important to remember that all nodes in the cluster will use the same number of sites. So the best performance is achieved by using a cluster with all nodes having the same physical architecture (i.e. cores)

Stored Procedures

Stored procedures let you define the query once and change the input values when you execute the procedure. Stored procedures have an additional benefit; because they are pre-compiled, the queries do not need to be planned at runtime, reducing the time it takes for each query to execute.

When you partition a stored procedure, you associate it with a specific partition based on the table that it accesses. Now when we invoke the stored procedure, it is executed only in the partition where the State_num column matches the first argument to the procedure, leaving the other partitions free to process other requests.

exec leastpopulated 6; //procedure_name and parameters

One of the most important aspects of VoltDB stored procedures is that each stored procedure is executed as a complete unit, a transaction, that either succeeds or fails as a whole. If any errors occur during the transaction, earlier queries in the transaction are rolled back before a response is returned to the calling application, or any further work is done by the partition.

Once you package the stored procedures into a Jar file, you can then load them into the database using the sqlcmd load classes directive.

load classes storedprocs.jar;

CREATE PROCEDURE

PARTITION ON TABLE people COLUMN state_num FROM CLASS UpdatePeople;

Each partition is single-threaded, running transactions to completion. Since there is no contention within the thread, VoltDB eliminates the need for locking and latching of database records and much of the management overhead that consume the traditional databases.

At run-time, calls to the stored procedures are passed to the appropriate partition. When procedures are "single-partitioned" (meaning they operate on data within a single partition) the server process executes the procedure by itself, freeing the rest of the cluster to handle other requests in parallel.

By using serialized processing, VoltDB ensures transactional consistency without the overhead of locking, latching, and transaction logs, while partitioning lets the database handle multiple requests at a time.

The key to designing the data access for VoltDB applications is that complex or performance sensitive access to the database should be done through stored procedures. It is possible to perform ad hoc queries on a VoltDB database. However, ad hoc queries do not benefit as fully from the performance optimizations VoltDB specializes in and therefore should not be used for frequent, repetitive, or complex transactions. When you define a stored procedure, VoltDB automatically provides ACID transaction guarantees for the stored procedure.

VoltDB does not guarantee a consistent order of results unless you use a tree index to scan the records in a specific order or you specify an ORDER BY clause in the query itself. This is also why use of an ORDER BY clause or a tree index in the WHERE constraint is strongly recommended for all SELECT statements that return multiple rows.

Design Rules for Partitioning Tables

Analyze Data volume and workload

1. There can be only one partition column per table

2. If the table has a primary key, the partitioning column must be included in the primary key.

3. Any integer or string column can identify the partition. The partition column values cannot be null. The column can be empty but can’t be a null value

The following are some additional recommendations:

Choose a column with a reasonable distribution of values so that rows of data will be evenly partitioned.
Choose a column that maximizes use of single-partitioned stored procedures. If one procedure uses column A to lookup data and two procedures use column B to lookup data, partition on column B.
The goal of partitioning is to make the most frequent transactions single-partitioned.
If you partition more than one table on the same column attribute, VoltDB will partition them together.

Reference:

http://voltdb.com/volt-university/tutorials

http://voltdb.com/resources

Alvin's Big Data Notebook

Sunday, 3 May 2015

Data and Stored Procedure Partition in VoltDB

No comments:

Post a Comment