Using Spark and Zeppelin to process big data on Kubernetes 1.2
Editor's note: this is the fifth post in a
With big data usage growing exponentially, many Kubernetes customers have expressed interest in running
There are many ways to run Spark outside of Kubernetes: So why would you run Spark on Kubernetes? For this demo, we’ll be using
Note: We’re using n1-standard-4 (which are larger than the default node size) to demonstrate some features of Horizontal Pod Autoscaling. However, Spark works just fine on the default node size of n1-standard-1. After the cluster’s created, you’re ready to launch Spark on Kubernetes using the config files in the Kubernetes GitHub repo: The pods (especially Apache Zeppelin) are somewhat large, so may take some time for Docker to pull the images. Once everything is running, you should see something similar to the following: You can see that Kubernetes is running one instance of Zeppelin, one Spark master and three Spark workers. Next you’ll set up a secure proxy from your local machine to Zeppelin, so you can access the Zeppelin instance running in the cluster from your machine. (Note: You’ll need to change this command to the actual Zeppelin pod that was created on your cluster.) This establishes a secure link to the Kubernetes cluster and pod ( For our example, we’re going to show you how to build a simple movie recommendation model. This is based on the code
Now that the secure proxy is up, visit
Click “Import note,” give it an arbitrary name (e.g. “Movies”), and click “Add from URL.” For a URL, enter: Then click “Import Note.” This will give you a ready-made Zeppelin note for this demo. You should now have a “Movies” notebook (or whatever you named it). If you click that note, you should see a screen similar to this: You can now click the Play button, near the top-right of the PySpark code block, and you’ll create a new, in-memory movie recommendation model! In the Spark application model, Zeppelin acts as a
For this demo, we’ll be using Google Cloud Storage, which will let us store our model data beyond the life of a single pod. Spark for Kubernetes is built with the
If you want, you can change the variables at the top of the note so that the example will actually save and restore a model for the movie recommendation engine — just point those variables at a GCS bucket that you have access to. If you want to create a GCS bucket, you can do something like this on the command line: You’ll need to change this URI to something that is unique for you. This will create a bucket that you can use in the example above. Spark is somewhat elastic to workers coming and going, which means we have an opportunity: we can use use Kubernetes Horizontal Pod Autoscaling to scale-out the Spark worker pool automatically, setting a target CPU threshold for the workers and a minimum/maximum pool size. This obviates the need for having to configure the number of worker replicas manually. Create the Autoscaler like this (note: if you didn’t change the machine type for the cluster, you probably want to limit the --max to something smaller): To see the full effect of autoscaling, wait for the replication controller to settle back to one replica. Use The workload we ran before ran too quickly to be terribly interesting for HPA. To change the workload to actually run long enough to see autoscaling become active, change the “rank = 100” line in the code to “rank = 200.” After you hit play, the Spark worker pool should rapidly increase to 20 pods. It will take up to 5 minutes after the job completes before the worker pool falls back down to one replica. In this article, we showed you how to run Spark and Zeppelin on Kubernetes, as well as how to use Google Cloud Storage to store your Spark model and how to use Horizontal Pod Autoscaling to dynamically size your Spark worker pool. This is the first in a series of articles we’ll be publishing on how to run big data frameworks on Kubernetes — so stay tuned! Please join our community and help us build the future of Kubernetes! There are many ways to participate. If you’re particularly interested in Kubernetes and big data, you’ll be interested in: -- Zach Loafman, Software Engineer, GoogleWhy Zeppelin?
Why Kubernetes?
Launch Spark
$ gcloud container clusters create spark --scopes storage-full
--machine-type n1-standard-4
$ git clone https://github.com/kubernetes/kubernetes.git
$ kubectl create -f kubernetes/examples/spark
‘kubernetes/examples/spark’
is a directory, so this command tells kubectl to create all of the Kubernetes objects defined in all of the YAML files in that directory. You don’t have to clone the entire repository, but it makes the steps of this demo just a little easier.$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-master-controller-v4v4y 1/1 Running 0 21h
spark-worker-controller-7phix 1/1 Running 0 21h
spark-worker-controller-hq9l9 1/1 Running 0 21h
spark-worker-controller-vwei5 1/1 Running 0 21h
zeppelin-controller-t1njl 1/1 Running 0 21h
Set up the Secure Proxy to Zeppelin
$ kubectl port-forward zeppelin-controller-t1njl 8080:8080
zeppelin-controller-t1njl
) and then forwards the port in question (8080) to local port 8080, which will allow you to use Zeppelin safely.Now that I have Zeppelin up and running, what do I do with it?
https://gist.githubusercontent.com/zmerlynn/875fed0f587d12b08ec9/raw/6
eac83e99caf712482a4937800b17bbd2e7b33c4/movies.json
Working with Google Cloud Storage (Optional)
$ gsutil mb gs://my-spark-models
Using Horizontal Pod Autoscaling with Spark (Optional)
$ kubectl autoscale --min=1 --cpu-percent=80 --max=10 \
rc/spark-worker-controller
‘kubectl get rc’
and wait for the “replicas” column on spark-worker-controller to fall back to 1.Conclusion