Efficiently executing R Dataframes on Flink ​

While dataflow engines offer scalability, their programming abstractions are often unfamiliar to data scientists, which are used to Python and R. To provide a more convenient interface, dataflow engines like Spark provide an R-like dataframe abstraction. While operations without user-defined code can be executed efficiently, the execution of UDFs is dominated by serialized data exchange between the dataflow engine and an external R process that evaluates the code. We present a new approach to execute user-defined functions by using the Truffle/Graal compiler infrastructure, which enables efficient execution of dynamic languages on the JVM. Based on fastR, the R language provided by this infrastructure, we exemplify the execution of R scripts directly inside the data pipelines of Flink, without data serialization and inter-process communication. Furthermore, we discuss future opportunities and problems, and compare our approach to native Flink, Spark, and SparkR.

Speakers involved