Apache Spark 2.3 with Native Kubernetes Support

Tuesday, March 06, 2018

Kubernetes and Big Data

The open source community has been working over the past year to enable first-class support for data processing, data analytics and machine learning workloads in Kubernetes. New extensibility features in Kubernetes, such as

Traditionally, data processing workloads have been run in dedicated setups like the YARN/Hadoop stack. However, unifying the control plane for all workloads on Kubernetes simplifies cluster management and can improve resource utilization.

"Bloomberg has invested heavily in machine learning and NLP to give our clients a competitive edge when it comes to the news and financial information that powers their investment decisions. By building our Data Science Platform on top of Kubernetes, we're making state-of-the-art data science tools like Spark, TensorFlow, and our sizable GPU footprint accessible to the company's 5,000+ software engineers in a consistent, easy-to-use way." - Steven Bower, Team Lead, Search and Data Science Infrastructure at Bloomberg

Introducing Apache Spark + Kubernetes

Apache Spark is an essential tool for data scientists, offering a robust platform for a variety of applications ranging from large scale data transformation to analytics to machine learning. Data scientists are adopting containers en masse to improve their workflows by realizing benefits such as packaging of dependencies and creating reproducible artifacts. Given that Kubernetes is the de facto standard for managing containerized environments, it is a natural fit to have support for Kubernetes APIs within Spark.

Starting with Spark 2.3, users can run Spark workloads in an existing Kubernetes 1.7+ cluster and take advantage of Apache Spark's ability to manage distributed data processing tasks. Apache Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through

Concretely, a native Spark Application in Kubernetes acts as a Istio.

To try this yourself on a Kubernetes cluster, simply download the binaries for the official

$ kubectl cluster-info  

Kubernetes master is running at https://xx.yy.zz.ww

$ bin/spark-submit

   --master k8s://https://xx.yy.zz.ww

   --deploy-mode cluster

   --name spark-pi

   --class org.apache.spark.examples.SparkPi

   --conf spark.executor.instances=5

   --conf spark.kubernetes.container.image=

   --conf spark.kubernetes.driver.pod.name=spark-pi-driver

   local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

To watch Spark resources that are created on the cluster, you can use the following kubectl command in a separate terminal window.

$ kubectl get pods -l 'spark-role in (driver, executor)' -w

NAME              READY STATUS  RESTARTS AGE

spark-pi-driver   1/1 Running  0 14s

spark-pi-da1968a859653d6bab93f8e6503935f2-exec-1   0/1 Pending 0 0s

The results can be streamed during job execution by running:


$ kubectl logs -f spark-pi-driver

When the application completes, you should see the computed value of Pi in the driver logs.

In Spark 2.3, we're starting with support for Spark applications written in Java and Scala with support for resource localization from a variety of data sources including HTTP, GCS, HDFS, and more. We have also paid close attention to failure and recovery semantics for Spark executors to provide a strong foundation to build upon in the future. Get started with

Get Involved

There's lots of exciting work to be done in the near future. We're actively working on features such as dynamic resource allocation, in-cluster staging of dependencies, support for PySpark & SparkR, support for Kerberized HDFS clusters, as well as client-mode and popular notebooks' interactive execution environments. For people who fell in love with the Kubernetes way of managing applications declaratively, we've also been working on a

And we're just getting started! We would love for you to get involved and help us evolve the project further.

Huge thanks to the Apache Spark and Kubernetes contributors spread across multiple organizations who spent many hundreds of hours working on this effort. We look forward to seeing more of you contribute to the project and help it evolve further.

Anirudh Ramanathan and Palak Bhatia
Google

←Previous