Alvin's Big Data Notebook : Apache Spark vs. Flink

Flink and Spark are both general-purpose data processing platforms.
Both offer extensions for SQL queries (Spark: Spark SQL, Flink: MRQL), graph processing (Spark: GraphX, Flink: Spargel), machine learning (Spark: MLlib, Flink: ???) and streaming (Spark: Spark Streaming, Flink: Flink Streaming).

Both can run in standalone mode, yet many people run them on top of Hadoop (YARN, HDFS). Both perform well, largely because they process data in memory.

Flink:

Flink unifies the processing of real-time and historical data.
Flink has a cost-based optimizer, of the kind found in relational platforms that store regular, structured data, adapted for use with unstructured workloads.
Flink is optimized for cyclic or iterative processing through iterative transformations on collections. It achieves this by optimizing join algorithms, chaining operators, and reusing partitioning and sorting. Flink is also a strong tool for batch processing.
Flink shares many similarities with a relational DBMS. Data is serialized into byte buffers and largely processed in its binary representation, which also allows fine-grained memory control.
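To make the byte-buffer idea concrete, here is a minimal plain-Python sketch. It is not Flink code: the record layout (`RECORD`) and the helper names are assumptions made up for illustration. It packs records into one contiguous buffer and sorts them by comparing raw key bytes, without unpacking the values.

```python
import struct

# Hypothetical fixed-width layout (an assumption for this sketch):
# 4-byte big-endian unsigned key followed by an 8-byte float value.
RECORD = struct.Struct(">Id")
KEY_BYTES = 4


def serialize(records):
    """Pack (key, value) pairs back to back into one contiguous buffer."""
    buf = bytearray(RECORD.size * len(records))
    for i, (key, value) in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, key, value)
    return buf


def sort_binary(buf):
    """Sort records by key using only the raw key bytes."""
    view = memoryview(buf)
    chunks = [
        bytes(view[off:off + RECORD.size])
        for off in range(0, len(buf), RECORD.size)
    ]
    # Big-endian unsigned integers order correctly when compared as bytes,
    # so the values never have to be deserialized for the sort.
    chunks.sort(key=lambda rec: rec[:KEY_BYTES])
    return bytearray(b"".join(chunks))


def deserialize(buf):
    """Unpack the buffer back into (key, value) tuples."""
    return [
        RECORD.unpack_from(buf, off)
        for off in range(0, len(buf), RECORD.size)
    ]
```

Because every record has a known size, the memory used is exactly `RECORD.size * len(records)` bytes, which is the kind of fine-grained control that object-per-record representations make hard.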
Flink uses a pipelined processing model and it has a cost-based optimizer that selects execution strategies and avoids expensive partitioning and sorting steps.
Moreover, Flink features a special kind of iteration (delta iterations) that can significantly reduce the amount of computation as the iterations proceed.
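The effect of delta iterations can be sketched in plain Python. This is a conceptual illustration, not the Flink API: the function names and the `work` counter are assumptions added for this example. Both functions compute connected components by propagating the smallest vertex id, but the delta version only revisits vertices whose neighborhood changed in the previous round.

```python
def build_neighbors(vertices, edges):
    """Build an undirected adjacency map."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def components_bulk(vertices, edges):
    """Bulk iteration: every vertex is recomputed in every round."""
    neighbors = build_neighbors(vertices, edges)
    label = {v: v for v in vertices}
    work = 0  # number of vertex evaluations performed
    changed = True
    while changed:
        changed = False
        new_label = {}
        for v in vertices:
            work += 1
            best = min([label[v]] + [label[n] for n in neighbors[v]])
            new_label[v] = best
            if best != label[v]:
                changed = True
        label = new_label
    return label, work


def components_delta(vertices, edges):
    """Delta iteration: only vertices in the workset are recomputed."""
    neighbors = build_neighbors(vertices, edges)
    label = {v: v for v in vertices}  # the solution set
    workset = set(vertices)           # vertices that may still change
    work = 0
    while workset:
        updates = {}
        for v in workset:
            work += 1
            best = min([label[v]] + [label[n] for n in neighbors[v]])
            if best < label[v]:
                updates[v] = best
        next_workset = set()
        for v, best in updates.items():
            label[v] = best
            # Only neighbors of a changed vertex can be affected next round.
            next_workset.update(neighbors[v])
        workset = next_workset
    return label, work
```

Both versions return the same labels; the delta version gets there with fewer vertex evaluations because the workset shrinks as parts of the graph converge.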

Spark:

Spark, on the other hand, is based on resilient distributed datasets (RDDs). This (mostly) in-memory data structure is what powers Spark's functional programming paradigm. Spark can handle large batch computations by pinning data in memory.
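As a rough illustration of the idea, here is a toy single-machine stand-in for an RDD in plain Python. It is not PySpark: the class `LazyDataset` and its internals are hypothetical, and only the method names echo the style of Spark's functional API. It records how each dataset was derived, evaluates lazily, and keeps a result in memory once `cache()` has been requested.

```python
class LazyDataset:
    """Toy RDD-like dataset: remembers its parent and computes on demand."""

    def __init__(self, compute, parent=None, op="source"):
        self._compute = compute  # zero-argument function returning an iterator
        self.parent = parent     # the dataset this one was derived from
        self.op = op             # name of the transformation that created it
        self._persist = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._iterate()), self, "map")

    def filter(self, pred):
        return LazyDataset(
            lambda: (x for x in self._iterate() if pred(x)), self, "filter"
        )

    def cache(self):
        """Ask for the result to be pinned in memory once it is computed."""
        self._persist = True
        return self

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._persist:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def collect(self):
        """Trigger evaluation and return the elements as a list."""
        return list(self._iterate())

    def lineage(self):
        """List the chain of operations from the source to this dataset."""
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))
```

Nothing runs until `collect()` is called, and a cached dataset is computed only once even when several downstream datasets read from it. The recorded lineage is what would let a lost result be rebuilt from its source.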
Because Spark is fundamentally batch-oriented, Spark Streaming provides a workaround: it accumulates data for a short time window before pushing it down for processing. That still leaves a delay of several seconds until ingestion, a far cry from real-time.
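The source of that delay can be shown with a small plain-Python sketch. It does not use Spark Streaming: the function names, the millisecond timestamps, and the assumption that a batch is handed over exactly when its window closes are all made up for illustration.

```python
def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed windows.

    Yields (emit_time_ms, batch) pairs, where emit_time_ms is the moment the
    window closes and its batch is handed over for processing.
    """
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // interval_ms, []).append((ts, value))
    for window in sorted(windows):
        yield (window + 1) * interval_ms, windows[window]


def worst_latency_ms(events, interval_ms):
    """Longest time any event waits before its batch is handed over."""
    return max(
        emit - ts
        for emit, batch in micro_batches(events, interval_ms)
        for ts, _ in batch
    )
```

An event that arrives just after a window opens waits for almost the whole interval before anything happens to it, so under these assumptions the batch interval puts a floor on the end-to-end latency no matter how fast the processing itself is.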

References:
http://blog.madhukaraphatak.com/introduction-to-flink-for-spark-developers-flink-vs-spark/

http://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink

http://flink.apache.org/blog/index.html
https://medium.com/chasing-buzzwords/is-apache-flink-europes-wild-card-into-the-big-data-race-a189fcf27c4c

Saturday, 21 February 2015