Apache Flink: Checkpoints and Savepoints
Arguably the most powerful feature of Apache Flink is its ability to perform stateful computations on an unbounded stream of data. Apache Flink is the core of Eventador’s fully managed Enterprise Streaming Platform. To get the most value out of Apache Flink, it’s important to understand the difference between a checkpoint and a savepoint. These two mechanisms for saving state are similar in design but serve two very different purposes.
Checkpoints provide fault tolerance by keeping state in a replayable form that can be used at any time to recover a Flink job if a task manager fails for any reason. Checkpoints are state snapshots taken automatically in the background at a configurable interval; the
state.checkpoints.num-retained setting controls how many of the most recent checkpoints are kept. State is saved for any operator that requires it, whether user-defined or system state.
- Automatically created
- Used for operator recoverability, and position in a stream of data
- Checkpoint capability differs by source/sink type
- A restart strategy can be defined to control restart timing and attempts
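The checkpoint interval and restart strategy mentioned above are configured on the execution environment. Here is a minimal sketch using the Flink DataStream API; the 60-second interval and the three-attempt, 10-second-delay strategy are illustrative values, not recommendations:

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (values are illustrative)
        env.enableCheckpointing(60_000);

        // On failure, restart the job up to 3 times,
        // waiting 10 seconds between attempts
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
    }
}
```

Without an explicit restart strategy, Flink falls back to the cluster-wide default configured in flink-conf.yaml.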
It should be noted that because checkpoints exist to let jobs resume from failures, they are not retained after the job is canceled by default. However, you can set up retained checkpoints, which stick around so you can resume from them later as needed; it’s up to you to remove them when you no longer need them. Retained checkpoints are configured in your code like:
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
Additional in-depth reading can be found in the Flink checkpoints documentation.
A savepoint provides the ability to stop/restart, fork, and update Flink jobs. Savepoints use Flink’s checkpointing mechanism to build a consistent image of the job’s state at a point in time, and users control when savepoints are created and used. A typical usage pattern is to cancel a job with a savepoint, then resume from that savepoint after making a change to the code and pushing a new version out to the cluster.
- Created by the user
- Can create a savepoint when canceling a job
- Can restart a job from a savepoint
Savepoints can be triggered via the Flink command-line client, or on Eventador via the Console application. Using the command line, you can trigger a savepoint, cancel a job with a savepoint, and then resume from a savepoint like:
# Trigger a savepoint
bin/flink savepoint :jobId [:targetDirectory]
# Cancel with savepoint
bin/flink cancel -s [:targetDirectory] :jobId
# Resume from savepoint
bin/flink run -s :savepointPath [:runArgs]
It’s important to assign IDs to the stateful operators in your Flink jobs; this allows jobs to be upgraded in the future. If you don’t specify an ID manually, one is generated automatically. The best practice is to give each operator an ID that’s recognizable in your application context.
DataStream<String> stream = env
  // Stateful source (e.g. Kafka) with ID
  .addSource(new StatefulSource())
  .uid("source-id") // ID for the source operator
  // Stateful mapper with ID
  .map(new StatefulMapper())
  .uid("mapper-id") // ID for the mapper
  // Stateless printing sink
  .print(); // Auto-generated ID
More reading is available in the Flink savepoints documentation.
How checkpoints and savepoints work in Eventador ESP
There are a number of features built into Eventador ESP that make savepoint and checkpoint management easy.
When using Eventador ESP, the state backend and the checkpoint and savepoint destinations are preconfigured:

state.backend: rocksdb
state.checkpoints.num-retained: 3
state.checkpoints.dir: s3p://...
state.savepoints.dir: s3p://...

Of course, this is configurable based on your use case. The net effect is that Flink is fault tolerant because the three most recent checkpoints are retained for recovery.
Using the ESP Console, you can create a savepoint for any job, view a library of savepoints, and restart from them when making changes to your code. This all works seamlessly with Eventador Projects and GitHub: you can commit your job code to GitHub, then run that job starting from a previously created savepoint, all in a few clicks.
There are a number of tunables and configuration options available in Flink. Our support staff is available 24x7x365 on Slack and email to discuss these options and help with tuning, capacity planning, and any questions you have.
If you’d like to learn more about Eventador ESP including how checkpoints work on Eventador, I recently hosted a webinar that you can view on demand to see more on that topic and to discover more ways that Eventador ESP helps simplify Apache Flink-based stream processor creation, deployment, and management.