Distributed computing is a model in which the components of a system are spread across multiple computers to improve efficiency and performance. Andy Grove, a software engineer, has introduced Ballista, a distributed computing platform based on Kubernetes and the Rust implementation of Apache Arrow.
Ballista: A Distributed Computing Platform Based on Kubernetes and Rust
According to a blog post by Grove, he started the DataFusion project around eighteen months ago. DataFusion is an in-memory query engine that uses Apache Arrow as its memory model. His original aim was to build a distributed computing platform in Rust that could compete with Apache Spark, but that proved harder than expected.
“Unsurprisingly, this turned out to be an overly ambitious goal at the time and I fell short of achieving that. However, some very good things came out of this effort. We now have a Rust implementation of Apache Arrow with a growing community of committers, and DataFusion was donated to the Apache Arrow project as an in-memory query execution engine and is now starting to see some early adoption,” said Andy Grove.
He then took a break from DataFusion and Arrow to focus on deliverables at work. Later, Grove started a new proof-of-concept (PoC) project, his second attempt at building a distributed platform in Rust. This time he had DataFusion and Arrow at his disposal.
A Ballista cluster consists of a number of individual pods within a Kubernetes cluster. Ballista applications can be deployed to Kubernetes using the Ballista CLI, and they use Kubernetes service discovery to connect to the cluster.
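To make the service-discovery step concrete, here is a minimal Rust sketch of one common Kubernetes mechanism: resolving a Service through its cluster-internal DNS name. The article does not say which mechanism Ballista uses, so treat this as an illustration only. The service name `ballista`, the namespace `default`, and port `9090` are hypothetical, and `cluster.local` is the default cluster domain, which an administrator can change.

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Build the DNS name Kubernetes gives a Service inside the cluster,
/// assuming the default cluster domain `cluster.local`.
fn service_dns_name(service: &str, namespace: &str) -> String {
    format!("{}.{}.svc.cluster.local", service, namespace)
}

/// Resolve the Service name to socket addresses.
/// This only succeeds when run from a pod inside the cluster.
fn resolve_service(
    service: &str,
    namespace: &str,
    port: u16,
) -> std::io::Result<Vec<SocketAddr>> {
    let host = service_dns_name(service, namespace);
    Ok((host.as_str(), port).to_socket_addrs()?.collect())
}

fn main() {
    // Hypothetical service name, namespace, and port.
    let host = service_dns_name("ballista", "default");
    println!("service host: {}", host);

    // Kubernetes sets KUBERNETES_SERVICE_HOST in every pod, so use it
    // as a cheap check before attempting an in-cluster DNS lookup.
    if std::env::var("KUBERNETES_SERVICE_HOST").is_ok() {
        match resolve_service("ballista", "default", 9090) {
            Ok(addrs) => println!("resolved: {:?}", addrs),
            Err(e) => eprintln!("lookup failed: {}", e),
        }
    } else {
        println!("not running in a pod; skipping DNS lookup");
    }
}
```

Outside a cluster the program only prints the name it would look up, which makes it safe to run locally.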
As of now, there is no distributed query planner, so Ballista applications must manually build the query plans to be executed on the cluster. To make the project practical, Grove listed the following items on the roadmap for v1.0.0:
- Implement a distributed query planner.
- Support all DataFusion logical plans and expressions.
- Support user code as part of distributed query execution.
- Support interactive SQL queries against a cluster over gRPC.
- Support the Arrow Flight protocol and Java bindings.
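Until a distributed query planner exists, an application has to assemble each plan by hand and decide where it runs. The sketch below shows roughly what that involves. It does not use the Ballista or DataFusion APIs: the `Plan` enum, the column names, the file names, and the round-robin assignment are all hypothetical and chosen only for illustration.

```rust
/// Hypothetical, simplified logical plan (not the DataFusion type).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { path: String },
    Filter { predicate: String, input: Box<Plan> },
    Project { columns: Vec<String>, input: Box<Plan> },
}

/// Hand-build the plan for one partition: scan, then filter, then project.
fn plan_for_partition(path: &str) -> Plan {
    Plan::Project {
        columns: vec!["id".to_string(), "amount".to_string()],
        input: Box::new(Plan::Filter {
            predicate: "amount > 100".to_string(),
            input: Box::new(Plan::Scan {
                path: path.to_string(),
            }),
        }),
    }
}

/// Assign one plan per partition to executors in round-robin order.
/// A distributed planner would make this decision automatically.
fn assign_round_robin(partitions: &[&str], executors: &[&str]) -> Vec<(String, Plan)> {
    assert!(!executors.is_empty(), "need at least one executor");
    partitions
        .iter()
        .enumerate()
        .map(|(i, path)| {
            let executor = executors[i % executors.len()].to_string();
            (executor, plan_for_partition(path))
        })
        .collect()
}

fn main() {
    // Hypothetical partition files and executor pod names.
    let partitions = ["part-0.csv", "part-1.csv", "part-2.csv"];
    let executors = ["executor-0", "executor-1"];

    for (executor, plan) in assign_round_robin(&partitions, &executors) {
        println!("{} runs {:?}", executor, plan);
    }
}
```

The point of the sketch is the division of labour: the application author writes both `plan_for_partition` and `assign_round_robin`, which is the work a distributed query planner would take over.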
The project has already led to three DataFusion PRs being merged into the Apache Arrow codebase, and it will help drive the requirements for DataFusion. If you want to know more, check out the official announcement.