Apache Flink is a framework for distributed stream processing. At a very high level, a Flink program is made up of the following stages:
Data Source: the input data that Flink reads for processing.
Transformation: the processing stage, where different algorithms may be applied.
Data Sink: the stage where Flink sends the processed data. This could be a Kafka queue, Cassandra, etc.
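The three stages can be sketched in plain Python to make the flow concrete. This is only an illustration under stated assumptions, not Flink code: the function names `data_source`, `transformation` and `data_sink` are hypothetical, the input lines are made up, and a Python list stands in for an external sink such as Kafka or Cassandra.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def data_source() -> Iterator[str]:
    """Data source: yields raw input records (hard-coded sample lines here)."""
    yield "to be or not to be"
    yield "that is the question"


def transformation(lines: Iterable[str]) -> Counter:
    """Transformation: split each line into words and count the occurrences."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def data_sink(counts: Counter, out: List[Tuple[str, int]]) -> None:
    """Data sink: hand the results to an external system (a list stands in here)."""
    for word, count in sorted(counts.items()):
        out.append((word, count))


# Wire the stages together: source -> transformation -> sink.
sink_storage: List[Tuple[str, int]] = []
data_sink(transformation(data_source()), sink_storage)
print(sink_storage)
```

In a real Flink job each stage would run distributed and in parallel; the sketch only shows how data moves from one stage to the next.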
Flink’s ability to compute accurate results on unbounded data sets rests on the following features:
1. Exactly-once semantics for stateful computations. Stateful means that an application can maintain a summary of the data it has already processed. Flink’s checkpointing mechanism ensures exactly-once semantics for that state in the case of a failure. In other words, checkpointing allows Flink to recover both the state and the positions in the stream after a failure.
2. Stream processing with event time semantics. Event time is the time at which each individual event occurred on the device that produced it. Event time semantics make it easy to compute accurate results over streams even when events arrive out of order or with delay. Because every event carries the time at which it occurred, events can be grouped and processed by assigning them to their corresponding window, for example an hourly one. An hourly event time window contains all records whose event timestamp falls into that hour, regardless of when the records arrive and in what order.
3. Flexible windowing. Windows can be defined by time, by session or by count. Apache Flink supports different window types such as tumbling windows, sliding windows, global windows and session windows. A time-based window is created as soon as the first event belonging to it arrives, and it is removed when the time (event time or processing time) passes the window’s end timestamp plus the user-specified allowed lateness.
4. Savepoints. A savepoint is a mechanism for updating the application or reprocessing historic data with minimal downtime. Savepoints are externally stored checkpoints that can be used to update the Flink program. They use Flink’s checkpointing mechanism to create a snapshot of the state of the streaming program and write the checkpoint metadata to an external file system.
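The recovery idea behind checkpointing can be illustrated with a small plain-Python sketch. It is a simplified model and not Flink’s actual implementation: the function `run`, its parameters `fail_at` and `checkpoint_every`, and the sample events are all hypothetical. The sketch assumes a replayable source, snapshots the state together with the stream position, and rolls both back together after a simulated failure, so each event affects the final state exactly once.

```python
from typing import Dict, List, Optional

# A replayable source: events can be re-read from any position.
events: List[str] = ["a", "b", "a", "c", "a", "b"]


def run(source: List[str],
        fail_at: Optional[int] = None,
        checkpoint_every: int = 2) -> Dict[str, int]:
    """Count events per key, checkpointing state and position together."""
    state: Dict[str, int] = {}
    position = 0
    checkpoint = (dict(state), position)  # last consistent snapshot
    failed = False

    while position < len(source):
        if fail_at is not None and position == fail_at and not failed:
            # Simulated crash: discard in-flight progress and restore the
            # state AND the stream position from the last checkpoint.
            failed = True
            state, position = dict(checkpoint[0]), checkpoint[1]
            continue

        key = source[position]
        state[key] = state.get(key, 0) + 1
        position += 1

        if position % checkpoint_every == 0:
            checkpoint = (dict(state), position)

    return state


without_failure = run(events)
with_failure = run(events, fail_at=3)
print(without_failure, with_failure)
```

The event at position 2 is processed twice in the failing run, but because the state is rolled back along with the position, it is counted only once in the result.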
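The hourly event time window and the allowed lateness described in the list above can also be sketched in plain Python. This is a rough model under simplifying assumptions, not Flink’s windowing API: the helper names `window_start` and `assign_to_windows` are hypothetical, the watermark is taken to be simply the largest event time seen so far, and a late record is dropped once the watermark has reached the window’s end plus the allowed lateness.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MINUTE_MS = 60 * 1000
HOUR_MS = 60 * MINUTE_MS

Event = Tuple[int, str]  # (event time in milliseconds, payload)


def window_start(event_time_ms: int, size_ms: int = HOUR_MS) -> int:
    """Start of the tumbling window that contains the given event time."""
    return event_time_ms - (event_time_ms % size_ms)


def assign_to_windows(arrivals: Iterable[Event],
                      allowed_lateness_ms: int = 0,
                      size_ms: int = HOUR_MS):
    """Group events (given in ARRIVAL order) by their event-time window."""
    windows: Dict[int, List[str]] = defaultdict(list)
    dropped: List[Event] = []
    watermark = float("-inf")  # simplification: max event time seen so far

    for event_time, payload in arrivals:
        start = window_start(event_time, size_ms)
        end = start + size_ms
        if watermark >= end + allowed_lateness_ms:
            # The window has already been removed: the record is too late.
            dropped.append((event_time, payload))
        else:
            windows[start].append(payload)
        watermark = max(watermark, event_time)

    return dict(windows), dropped


# Events listed in arrival order; "c" and "e" arrive out of order.
arrivals: List[Event] = [
    (10 * MINUTE_MS, "a"),
    (70 * MINUTE_MS, "b"),
    (50 * MINUTE_MS, "c"),   # late, but within the 15-minute allowed lateness
    (130 * MINUTE_MS, "d"),
    (55 * MINUTE_MS, "e"),   # too late: its window has been removed
]

windows, dropped = assign_to_windows(arrivals, allowed_lateness_ms=15 * MINUTE_MS)
print(windows, dropped)
```

Record "c" lands in the first hourly window even though it arrives after "b", because assignment depends only on the event timestamp; record "e" is dropped because the watermark has already passed the end of its window plus the allowed lateness.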
A simple word count algorithm using the Apache Flink DataSet API can be found in the GitHub project "Apache Flink Git Hub Project".