Connecting Apache Flink and the Pravega Stream Store

Pravega is a storage substrate that we have designed and implemented from the ground up to meet the requirements of modern data analytics applications. Its architecture lets applications store stream data permanently while still processing that data with low latency. Permanent storage of stream data is highly desirable because it enables applications to process such data both in near real time and months later, in the same way and through the same interface.

Writes to Pravega can be transactional, so that an application can make a batch of data visible for reading atomically. This feature is important for guaranteeing exactly-once semantics when writing data into Pravega.

As the volume of writes to a stream can grow and shrink over time, Pravega enables the capacity of a stream to adapt by allowing streams to scale automatically. The auto-scaling feature increases and decreases the number of segments of a stream according to load, where a segment is the unit of parallelism of a Pravega stream.

Pravega by itself does not provide any capability to process data, however; it requires an engine with advanced data analytics features, such as Apache Flink, to process its streams. To connect Pravega and Flink, we have been working with the community to implement a source and a sink. The source uses Pravega checkpoints to save the state of incoming streams and guarantees exactly-once semantics, while not requiring a sticky assignment of source workers to segments. Pravega checkpoints integrate and align well with Flink checkpoints, making this a unique approach compared to existing sources. As scaling is a key feature of Pravega, we also envision scaling signals out of Pravega to the source workers indicating that a job needs to scale up or down, which is a feature currently under development and [...]
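To make the transactional write model concrete, here is a minimal toy sketch in plain Python (this is illustrative only, not the Pravega API): events written inside a transaction are buffered and invisible to readers until the transaction commits, at which point they all become visible in one atomic step.

```python
class TransactionalStream:
    """Toy model of a stream with transactional writes (not the Pravega API)."""

    def __init__(self):
        self._committed = []   # events visible to readers
        self._txns = {}        # open transactions: txn_id -> buffered events
        self._next_txn = 0

    def begin_txn(self):
        txn_id = self._next_txn
        self._next_txn += 1
        self._txns[txn_id] = []
        return txn_id

    def write(self, txn_id, event):
        # Buffered only; readers cannot see the event yet.
        self._txns[txn_id].append(event)

    def commit(self, txn_id):
        # All buffered events become visible in one atomic step.
        self._committed.extend(self._txns.pop(txn_id))

    def abort(self, txn_id):
        # Buffered events are discarded; readers never see them.
        self._txns.pop(txn_id)

    def read_all(self):
        return list(self._committed)


stream = TransactionalStream()
txn = stream.begin_txn()
stream.write(txn, "event-1")
stream.write(txn, "event-2")
assert stream.read_all() == []                       # nothing visible before commit
stream.commit(txn)
assert stream.read_all() == ["event-1", "event-2"]   # visible atomically after commit
```

This all-or-nothing visibility is what lets a writer that retries after a failure avoid exposing partial output, which is the core of the exactly-once guarantee on the write path.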
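The role of segments as the unit of parallelism can also be sketched with a toy model (again illustrative, not Pravega's implementation): routing keys are hashed to a point in a key space, each segment owns a range of that space, and scaling up splits one segment's range into two, doubling the available parallelism for that range.

```python
import hashlib


def key_to_point(routing_key: str) -> float:
    """Hash a routing key to a point in [0, 1)."""
    digest = hashlib.sha256(routing_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyStream:
    """Toy model: each segment owns a half-open range of the key space [0, 1)."""

    def __init__(self, num_segments=2):
        step = 1.0 / num_segments
        # segment id -> (low, high) key-space range
        self.segments = {i: (i * step, (i + 1) * step) for i in range(num_segments)}
        self._next_id = num_segments

    def segment_for(self, routing_key):
        p = key_to_point(routing_key)
        for seg, (lo, hi) in self.segments.items():
            if lo <= p < hi:
                return seg
        raise AssertionError("segment ranges must cover [0, 1)")

    def split(self, seg):
        """Scale up: replace one segment with two successors."""
        lo, hi = self.segments.pop(seg)
        mid = (lo + hi) / 2
        self.segments[self._next_id] = (lo, mid)
        self.segments[self._next_id + 1] = (mid, hi)
        self._next_id += 2


stream = ToyStream(num_segments=2)
assert len(stream.segments) == 2
assert stream.segment_for("user-42") in stream.segments

stream.split(0)                      # load grew: segment 0 becomes two segments
assert len(stream.segments) == 3     # more segments, hence more parallelism
assert stream.segment_for("user-42") in stream.segments
```

Scaling down works the same way in reverse: two adjacent segments are merged into one successor that owns the union of their ranges, shrinking the number of parallel units as load decreases.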
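Finally, the checkpoint mechanism the source relies on can be sketched as follows. This is a simplified, hypothetical model, not Pravega's actual protocol: each reader in a group records its position in the segments it currently owns, the collected positions form the checkpoint, and restoring the checkpoint rewinds every reader to that saved cut so processing can resume without losing or duplicating events.

```python
class Reader:
    """Toy reader: consumes events from the segments it currently owns."""

    def __init__(self, name):
        self.name = name
        self.positions = {}  # segment -> next offset to read

    def advance(self, segment, n=1):
        self.positions[segment] = self.positions.get(segment, 0) + n


class ReaderGroup:
    """Toy reader group: a checkpoint is the set of all readers' positions."""

    def __init__(self, readers):
        self.readers = readers

    def checkpoint(self):
        # Each reader reports where it is in every segment it owns;
        # the collected positions form the checkpoint.
        return {r.name: dict(r.positions) for r in self.readers}

    def restore(self, snapshot):
        # Rewind every reader to its checkpointed positions.
        for r in self.readers:
            r.positions = dict(snapshot.get(r.name, {}))


r1, r2 = Reader("r1"), Reader("r2")
group = ReaderGroup([r1, r2])
r1.advance("segment-0", 3)
r2.advance("segment-1", 5)
snap = group.checkpoint()

r1.advance("segment-0", 2)   # progress made after the checkpoint
group.restore(snap)          # on failure: rewind to the checkpointed positions
assert r1.positions == {"segment-0": 3}
assert r2.positions == {"segment-1": 5}
```

Because the positions are recorded per segment rather than per worker, any reader can resume any segment after recovery, which is why the source does not need a sticky assignment of source workers to segments.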