Flink Forward is a special time of year not only in Berlin, but also in Mountain View, California. At Google we've long been fans of the Flink project and of advancing stream processing in general. This year, several of us at Google are especially excited to collaborate with data Artisans and other Apache Flink committers on Apache Beam (incubating). We think that, together, Beam and Flink have a very bright (or should I say Beaming?) future.
Where it all began
In 2007 we started a project called MillWheel, which would grow into a new data processing model at Google, described in the 2013 paper of the same name. Meanwhile in Europe, Flink emerged as a well-regarded open source execution engine, described in a 2015 paper. The people behind the two projects met after the Google Dataflow paper was published in 2015.
The Beam Model
The Dataflow paper primarily pushed one idea: a single model for data processing that unified batch and streaming. But this assertion implied an equally significant idea that I'm not sure any of us fully appreciated at the time: a unified model meant that data processing jobs could be made portable across execution engines. As Kostas Tzoumas, CEO of data Artisans, the company behind Flink, puts it:
Our involvement with Beam started very early […] Very quickly, it became apparent to us that the Dataflow model […] is the correct model for stream and batch data processing. We fully subscribed to Google’s vision to unify real-time and historical analytics under one platform […]. Taking a cue from this foundational work, we rewrote Flink’s DataStream API […] to incorporate many of the concepts described in the Dataflow paper […].
As the Dataflow SDK and the Runners were moving to Apache Incubator as Apache Beam, we were asked by Google to bring the Flink runner into the codebase of Beam […]. We decided to go full st(r)eam ahead with this opportunity as we believe that […] the Beam model is the future reference programming model for writing data applications in both stream and batch […].
Job Portability
As powerful as the ideas behind the model are, years from now people may think of Apache Beam mostly for its portability. While data Artisans was working on the Flink runner, Cloudera caught wind of the project and the Spark runner soon followed. In the six months since, Beam has gained multiple SDKs, nearly a half dozen runners in the works, and contributions from the biggest names in data processing.
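To make that portability concrete, here is a minimal sketch of a Beam word-count pipeline in Java. It uses the package names of the current Beam Java SDK, which may differ slightly from the incubating release, and the bucket paths are hypothetical placeholders. The point is that the pipeline code never names an engine; the runner, whether Flink, Spark, Cloud Dataflow, or the local DirectRunner, is selected with a --runner flag at launch time.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortableWordCount {
  public static void main(String[] args) {
    // The runner is chosen at launch time, e.g. --runner=FlinkRunner,
    // --runner=SparkRunner, or --runner=DataflowRunner; the pipeline is unchanged.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*"))   // hypothetical input path
        .apply("SplitWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("\\W+"))))
        .apply("CountWords", Count.perElement())
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to("gs://my-bucket/output/counts"));  // hypothetical output path

    p.run().waitUntilFinish();
  }
}
```

Running the same class with a different --runner flag (and that runner's dependency on the classpath) is all it takes to move the job between engines.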
Google Cloud Dataflow
Our ultimate goal has always been to perfect a cloud-native and cloud-only data processing service we call Cloud Dataflow. We've recently added Dynamic Work Rebalancing and Autoscaling, both of which maintain exactly-once processing, and which Malo Denielou presented at the conference. Yet we haven't let up on our commitment to Beam. We are excited that customers who've chosen Apache Beam as their API and Cloud Dataflow as their execution engine need not worry about lock-in, with an increasingly seamless alternative in Flink for on-premises or self-managed cloud deployments. We look forward, at the close of this Flink Forward, to the next conference and what will come of these projects. Thanks to our mutual communities for all their innovation and contributions.
By Eric Anderson, Product Manager, Google