Alvin's Big Data Notebook : Schema Design in VoltDB

What is VoltDB

VoltDB focuses specifically on fast data rather than big data. VoltDB is not the optimal choice for collecting and collating extremely large historical data sets which must be queried across multiple tables.

VoltDB is optimized for throughput over latency.

The latency of any one transaction (the time from when the transaction begins until processing ends) is similar in VoltDB to other databases. However, the number of transactions that can be completed in a second (that is, the throughput) is orders of magnitude higher, because VoltDB reduces the amount of time that requests sit in the queue waiting to be executed.
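To make the distinction concrete, here is a rough back-of-the-envelope sketch. It is a toy queueing model, not a VoltDB measurement: the function names, the 1 ms service time, and the count of 8 executors are all made-up assumptions. Each transaction takes the same time to run in both cases; only the time spent waiting in the queue changes.

```python
import math

def drain_time_ms(num_txns: int, service_ms: float, executors: int) -> float:
    """Time to finish a backlog when each executor runs its share one at a time."""
    per_executor = math.ceil(num_txns / executors)
    return per_executor * service_ms

def throughput_per_sec(num_txns: int, service_ms: float, executors: int) -> float:
    """Completed transactions per second for the whole backlog."""
    return num_txns / (drain_time_ms(num_txns, service_ms, executors) / 1000.0)

# Hypothetical numbers: 8,000 queued transactions, 1 ms of work each.
one_queue = throughput_per_sec(8000, 1.0, executors=1)
eight_queues = throughput_per_sec(8000, 1.0, executors=8)
print(one_queue, eight_queues)  # prints: 1000.0 8000.0
```

The per-transaction latency (1 ms of work) is identical in both runs. The gain comes entirely from requests spending less time queued behind one another.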

To take full advantage of VoltDB's capabilities, it is best to design your schema and your stored procedures to maximize the use of partitioned tables and procedures.

Schema Best Practices

The database schema is a specification that describes the structure of the VoltDB database such as tables and indexes, identifies the stored procedures that access data in the database, and defines the way tables and stored procedures are partitioned for fast data access. When designing client applications to use the database, the schema specifies the details needed about data types, tables, columns, and so on.

Design the schema to maximize the use of single-partition queries.

Use replicated tables for small, primarily read-only data that can serve as lookup tables.
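As a rough illustration of what "single-partition" means, the toy model below routes each row to one partition based on the value of its partitioning column. Everything here is an assumption made for the sketch: the partition count, the helper names, and the use of CRC32 as the hash. VoltDB's real placement logic is internal to the database and is not shown.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical cluster size for this sketch

# One dict per partition, keyed by the partitioning column (last name).
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    """Map a partitioning-column value to a partition (stand-in hash, not VoltDB's)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def insert_customer(last_name: str, row: dict) -> None:
    """Store the row in the single partition that owns this last name."""
    partitions[partition_for(last_name)].setdefault(last_name, []).append(row)

def find_by_last_name(last_name: str):
    """Look up by the partitioning column: only one partition is consulted."""
    touched = [partition_for(last_name)]
    return partitions[touched[0]].get(last_name, []), touched

insert_customer("Smith", {"first_name": "Ann"})
rows, touched = find_by_last_name("Smith")
print(rows, "partitions touched:", len(touched))
```

Because the lookup uses the same column the table is partitioned on, the work lands on exactly one partition and the others stay free to run different transactions.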

One major difference between VoltDB and other databases is that you define the database schema before you create the database. VoltDB can then determine the best way to arrange the partitions when the database is started.

Application Catalog

At the heart of every VoltDB database is the catalog, which contains the database schema, the stored procedures, and the partitioning information. When you compile the catalog, the SQL DDL for creating tables, indexes, and materialized views is compiled into the schema.

The catalog also includes any stored procedures that are defined and the corresponding Java class files.

Each node of the cluster starts from a copy of the catalog. The catalog specifies the logical partitioning of tables. At runtime, the deployment file specifies the size of the cluster and how many partitions are created on each node.

Replicated Table

When a table is relatively small and not updated frequently, it is better to replicate it to all partitions. This way, even if another table is partitioned (such as a customer table partitioned on last name), stored procedures can join the two tables, no matter what partition the procedure executes in.

Tables where all the records appear in all the partitions are called replicated tables.
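The join behaviour described above can be sketched with a toy in-memory model. The table contents, the two-partition layout, and the function name are all hypothetical; the point is only that each partition holds a full copy of the small table, so a join needs no data from any other partition.

```python
# Small, rarely updated lookup data (hypothetical contents).
STATES = {"MA": "Massachusetts", "NY": "New York"}

NUM_PARTITIONS = 2

# Replicated table: every partition holds its own complete copy.
replicated_states = [dict(STATES) for _ in range(NUM_PARTITIONS)]

# Partitioned table: each customer row lives in exactly one partition.
customers = [
    [{"last_name": "Adams", "state": "MA"}],  # partition 0
    [{"last_name": "Young", "state": "NY"}],  # partition 1
]

def join_in_partition(p: int):
    """Join local customer rows with the local copy of the lookup table."""
    return [
        (c["last_name"], replicated_states[p][c["state"]])
        for c in customers[p]
    ]

print(join_in_partition(0))  # [('Adams', 'Massachusetts')]
print(join_in_partition(1))  # [('Young', 'New York')]
```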

One last caveat concerning replicated tables: the benefit of having the data replicated in all partitions is that it can be read from any individual partition. The drawback is that any update or insert to a replicated table must be executed in all partitions at once. This sort of multi-partition procedure reduces the benefits of parallel processing and hurts throughput, which is why you should not replicate tables that are frequently updated.
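A minimal sketch of that write cost, again as a toy model with hypothetical names and data: a write to a replicated table has to reach every copy, while a write to a partitioned table involves only the partition that owns the row.

```python
NUM_PARTITIONS = 4  # hypothetical

# Replicated table: one full copy per partition.
replicated_rates = [{"USD": 1.0} for _ in range(NUM_PARTITIONS)]

# Partitioned table: each partition holds only its own rows.
partitioned_orders = [dict() for _ in range(NUM_PARTITIONS)]

def update_replicated(copies, key, value) -> int:
    """Apply the write to every copy; return how many partitions were involved."""
    for copy in copies:
        copy[key] = value
    return len(copies)

def update_partitioned(parts, owner: int, key, value) -> int:
    """Apply the write only in the owning partition; one partition is involved."""
    parts[owner][key] = value
    return 1

print(update_replicated(replicated_rates, "EUR", 0.9))                   # 4
print(update_partitioned(partitioned_orders, 2, "order-17", "shipped"))  # 1
```

While the replicated write runs, none of the four partitions is available for other work, which is the throughput penalty the caveat refers to.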

Smaller, mostly read-only tables are good candidates for replication. Note also that if a table needs to be accessed frequently by columns other than the partitioning column, the table should be replicated instead because there is no guarantee that a particular partition includes the data that the query seeks.
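The last point can be illustrated with another toy model. The rows, their placement, and the function name are assumptions for the sketch: the table is partitioned on last name, so a search by email cannot be sent to one known partition and has to visit all of them.

```python
# Customer table partitioned on last_name (placement hard-coded for the sketch).
partitions = [
    [{"last_name": "Adams", "email": "a@example.com"}],
    [{"last_name": "Baker", "email": "b@example.com"}],
    [{"last_name": "Clark", "email": "c@example.com"}],
]

def find_by_email(email: str):
    """Email is not the partitioning column, so every partition is searched."""
    visited = 0
    matches = []
    for rows in partitions:
        visited += 1
        matches.extend(r for r in rows if r["email"] == email)
    return matches, visited

matches, visited = find_by_email("b@example.com")
print(matches[0]["last_name"], "found after visiting", visited, "partitions")
```

If the same table were small and replicated, any single partition could answer the email query from its own copy.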

In VoltDB, you do not explicitly state that a table is replicated. If you do not specify a partitioning column in the database schema, the table will by default be replicated.

Modifying Partitioning for Tables and Stored Procedures: You can un-partition a stored procedure or re-partition it on a different column. For tables, you can change a table between partitioned and replicated, or repartition it on a different column.


Sunday, 3 May 2015
