The majority of streaming programs is ‘stateful’: Windowed Aggregations, Sessions, Joins, Complex Event Processing, Tables – they all require to keep some form of state across individual events. With the migration of more and more complex batch jobs or data processing pipelines to streaming applications, some streaming programs need to keep terabytes of state. Apache Flink implements a checkpointing-based recovery mechanism that guarantees exactly-once semantics for state also in the presence of failures. The cost of checkpointing and recovery depends on the size of the program’s state. In this talk, we will discuss the current status of stateful processing in Apache Flink, as well as the ongoing efforts to make Flink’s fault tolerance mechanism scale to very large state sizes, supporting frequent checkpoints and faster recovery of large state, without requiring excessive numbers of machines.

Slides: Stephan Ewen – Scaling Apache FlinkĀ® to very large State.pptx