Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes
Today’s post is by David Aronchick and Jeremy Lewi, a PM and engineer on the Kubeflow project, a new open source GitHub repo dedicated to making machine learning (ML) stacks on Kubernetes easy, fast and extensible to use.
Kubernetes and Machine Learning
Kubernetes has quickly become the hybrid solution for deploying complicated workloads anywhere. While it started with just stateless services, customers have begun to move complex workloads to the platform, taking advantage of the rich APIs, reliability and performance provided by Kubernetes. One of the fastest growing use cases is to use Kubernetes as the deployment platform of choice for machine learning.
Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity to adopting machine learning. Infrastructure engineers will often spend a significant amount of time manually tweaking deployments and hand-rolling solutions before a single model can be tested.
Worse, these deployments are so tied to the clusters they were deployed to that the stacks are effectively immobile: moving a model from a laptop to a highly scalable cloud cluster is impossible without significant re-architecture. All these differences add up to wasted effort and create opportunities to introduce bugs at each transition.
Introducing Kubeflow
To address these concerns, we’re announcing the creation of the Kubeflow project, a new open source GitHub repo dedicated to making ML stacks on Kubernetes easy, fast and extensible to use. This repository contains:
- JupyterHub to create & manage interactive Jupyter notebooks
- A TensorFlow Custom Resource Definition (CRD) that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting
- A TF Serving container

Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!
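To give a rough sense of what the TensorFlow CRD provides, a training job can be described declaratively as a Kubernetes custom resource. The manifest below is an illustrative sketch only - the apiVersion, kind, field names and image are assumptions modeled on early versions of the CRD, not an authoritative schema:

```yaml
# Hypothetical TfJob manifest - field names are illustrative, not authoritative.
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: example-job
spec:
  replicaSpecs:
    - tfReplicaType: MASTER      # one master coordinating the job
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/my-model:latest  # placeholder training image
          restartPolicy: OnFailure
```

The controller watches for resources of this kind and creates the underlying pods, so scaling out to more workers or onto GPUs becomes a matter of editing a few fields rather than rewriting deployment scripts.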
Using Kubeflow
Let's suppose you are working with two different Kubernetes clusters: a local minikube cluster, and a GKE cluster with GPUs; and that you have two kubectl contexts defined, named minikube and gke.

First we need to initialize our ksonnet application and install the Kubeflow packages:

```
ks init my-kubeflow
cd my-kubeflow
ks registry add kubeflow \
  github.com/google/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job
ks generate core kubeflow-core --name=kubeflow-core
```

We can now define environments corresponding to our two clusters:

```
kubectl config use-context minikube
ks env add minikube

kubectl config use-context gke
ks env add gke
```

And we’re done! Now just create the environments on your cluster. First, on minikube:

```
ks apply minikube -c kubeflow-core
```

And to create it on our multi-node GKE cluster for quicker training:

```
ks apply gke -c kubeflow-core
```

By making it easy to deploy the same rich ML stack everywhere, the drift and rewriting between these environments is kept to a minimum.

To access either deployment, you can execute the following command:

```
kubectl port-forward tf-hub-0 8100:8000
```

and then open up http://127.0.0.1:8100 to access JupyterHub. To change the environment used by kubectl, use either of these commands:

```
# To access minikube
kubectl config use-context minikube

# To access GKE
kubectl config use-context gke
```

When you execute apply, you are launching the components described above on Kubernetes: JupyterHub, the TensorFlow CRD, and TF Serving.

Let's suppose you want to submit a training job. Kubeflow provides ksonnet prototypes for this; in this example we'll use the tf-cnn prototype, which runs TensorFlow's CNN benchmark. To submit a training job, you first generate a new job from a prototype:

```
ks generate tf-cnn cnn --name=cnn
```

By default the tf-cnn prototype uses 1 worker and no GPUs, which is perfect for our minikube cluster, so we can just submit it:

```
ks apply minikube -c cnn
```

On GKE, we’ll want to tweak the prototype to take advantage of the multiple nodes and GPUs. First, let’s list all the parameters available:

```
# To see a list of parameters
ks prototype list tf-job
```

Now let’s adjust the parameters to take advantage of GPUs and access to multiple nodes:

```
ks param set --env=gke cnn num_gpus 1
ks param set --env=gke cnn num_workers 1

ks apply gke -c cnn
```

Note how we set those parameters so they are used only when you deploy to GKE; your minikube parameters are unchanged! After training, you can export your model for serving.

Kubeflow also includes a serving package. In a separate example, we trained a standard Inception model, and stored the trained model in a bucket we created called ‘gs://kubeflow-models’ with the path ‘/inception’. To deploy the trained model for serving, execute the following:

```
ks generate tf-serving inception --name=inception \
  --namespace=default --model_path=gs://kubeflow-models/inception
ks apply gke -c inception
```

This highlights one more option in Kubeflow - the ability to pass in inputs based on your deployment. This command creates a tf-serving service on the GKE cluster, and makes it available to your application. For more information about deploying and monitoring TensorFlow training jobs and TensorFlow models, please refer to the Kubeflow documentation.

Kubeflow + ksonnet

One choice we want to call out is the use of the ksonnet project. We think working with multiple environments (dev, test, prod) will be the norm for most Kubeflow users. By making environments a first-class concept, ksonnet makes it easy for Kubeflow users to move their workloads between their different environments. We also want to thank the team at Heptio for expediting features critical to Kubeflow's use of ksonnet.

What’s Next?

We are in the midst of building out a community effort right now, and we would love your help! We’ve already been collaborating with many teams. CoreOS, for example, is already seeing the promise of Kubeflow:

“The Kubeflow project was a needed advancement to make it significantly easier to set up and productionize machine learning workloads on Kubernetes, and we anticipate that it will greatly expand the opportunity for even more enterprises to embrace the platform. We look forward to working with the project members in providing tight integration of Kubeflow with Tectonic, the enterprise Kubernetes platform.” -- Reza Shafii, VP of Product, CoreOS

If you’d like to try out Kubeflow right now, right in your browser, we’ve partnered with Katacoda to make it super easy.

And we’re just getting started! We would love for you to help. How, you might ask? Well…

Jeremy Lewi & David Aronchick
Google