Friday, 28 August 2015

Aggregations in Cassandra 2.2

In Cassandra 2.2, the standard aggregate functions min, max, avg, sum, and count are built-in functions.

Cassandra 2.2 and later allows users to define aggregate functions that can be applied to data stored in a table as part of a query result. 

  • The aggregate must be created before it is used in a SELECT statement, and the query must include only the aggregate function itself, with no columns.
  • The state function is called once for each row, and the value returned by the state function becomes the new state.
  • After all rows are processed, the optional final function is executed with the last state value as its argument.


Aggregation is performed by the coordinator. If your query does not include a partition key, all the results are brought back to the coordinator for your function to be executed. A full table scan for a user-defined function or aggregate (UDF/UDA) will not be fast if your table is huge.

User-defined aggregates work by calling your user-defined function on every row returned from your query. They differ from an ordinary function in that the first argument to the function is state that is passed between rows, much like a fold.
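To make the fold analogy concrete, here is a minimal Java sketch of how an aggregate is evaluated, written outside Cassandra. The class and method names (AggregateFoldSketch, aggregate, average) and the sample amounts are hypothetical; the sketch only mirrors the state function and final function flow described above and is not Cassandra's actual implementation.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

public class AggregateFoldSketch {

    // Hypothetical helper: the state starts at initCond, the state function
    // runs once per row, and the final function turns the last state into
    // the result.
    public static <S, R, T> T aggregate(List<R> rows, S initCond,
                                        BiFunction<S, R, S> stateFn,
                                        Function<S, T> finalFn) {
        S state = initCond;
        for (R row : rows) {
            // The value returned by the state function becomes the new state.
            state = stateFn.apply(state, row);
        }
        return finalFn.apply(state);
    }

    // An average written as an aggregate: the state is {sum, count}.
    public static Double average(List<Integer> amounts) {
        BiFunction<long[], Integer, long[]> stateFn =
            (s, amount) -> new long[] {s[0] + amount, s[1] + 1};
        // Sketch choice: return null when no rows were seen.
        Function<long[], Double> finalFn =
            s -> s[1] == 0 ? null : Double.valueOf((double) s[0] / s[1]);
        return aggregate(amounts, new long[] {0L, 0L}, stateFn, finalFn);
    }

    public static void main(String[] args) {
        System.out.println(average(List.of(10, 20, 60))); // prints 30.0
    }
}
```

The state function never sees the whole result set, only the running state and the current row, which is why the state type has to carry everything the final function needs (here both the sum and the count).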

Creating an aggregate is a two- or three-step process:
  1. Create a state function that takes the state as its first parameter, followed by any number of additional parameters.
  2. (Optionally) Create a final function that is called after the state function has been called on every row.
  3. Refer to these in an aggregate, together with an initial state (INITCOND). With an INITCOND of null, the aggregate returns null for an empty table.

Some examples:

CREATE FUNCTION state_group_and_total( state map<text, int>, type text, amount int )
CALLED ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java AS '
// Add the amount of this row to the running total for its key.
Integer count = (Integer) state.get(type);
if (count == null) count = amount;
else count = count + amount;
state.put(type, count);
return state; ' ;


-- Start from an empty map and fold every row into it.
CREATE OR REPLACE AGGREGATE group_and_total(text, int)
SFUNC state_group_and_total
STYPE map<text, int>
INITCOND {};

SELECT GROUP_AND_TOTAL(customer_id, amount) 
FROM CUSTOMER_PURCHASES;
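As a rough illustration of what the query above computes, the following Java sketch runs the same state function logic over a few in-memory rows. The GroupAndTotalSketch and Purchase classes and the sample rows are invented for this example, and the code runs outside Cassandra, so treat it as an approximation of the behaviour rather than the real execution path.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupAndTotalSketch {

    // Hypothetical stand-in for one (customer_id, amount) row.
    public static final class Purchase {
        public final String customerId;
        public final int amount;

        public Purchase(String customerId, int amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    // Same logic as the body of state_group_and_total.
    public static Map<String, Integer> stateGroupAndTotal(
            Map<String, Integer> state, String type, int amount) {
        Integer count = state.get(type);
        if (count == null) count = amount;
        else count = count + amount;
        state.put(type, count);
        return state;
    }

    // Mimics the aggregate: start from an empty map (INITCOND {}) and
    // apply the state function once per row.
    public static Map<String, Integer> groupAndTotal(List<Purchase> rows) {
        Map<String, Integer> state = new HashMap<>();
        for (Purchase p : rows) {
            state = stateGroupAndTotal(state, p.customerId, p.amount);
        }
        return state;
    }

    public static void main(String[] args) {
        List<Purchase> rows = List.of(
            new Purchase("alice", 10),
            new Purchase("bob", 5),
            new Purchase("alice", 7));
        Map<String, Integer> totals = groupAndTotal(rows);
        System.out.println(totals.get("alice")); // prints 17
        System.out.println(totals.get("bob"));   // prints 5
    }
}
```

The result is a single map with one entry per customer, which matches the shape of the aggregate's return type, map<text, int>.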


Reference:
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDA.html
http://christopher-batey.blogspot.ca/2015/05/cassandra-aggregates-min-max-avg-group.html
