Understanding the physical plan of a big data application is often crucial for tracking down bottlenecks and faulty behavior. Although Flink and Spark both offer useful Web UI components for monitoring and understanding the logical plan of a job, neither provides a tool for understanding the physical plan produced by the scheduler or for monitoring execution at a very low level, including the communication that occurs between parallel vertex instances. We propose a tool that lets users monitor job executions in real time, and later replay and examine them, on any cluster currently supported by Flink or Spark. The tool also monitors the distribution of keys in a data stream, which could help optimize data partitioning across parallel subtasks in the future.

Slides: Zoltan Zvara & Márton Balassi - Advanced visualization of Flink and Spark jobs

Video on YouTube