Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes

Today’s post is by David Aronchick and Jeremy Lewi, a PM and an engineer on the Kubeflow project, a new open source GitHub repo dedicated to making machine learning (ML) stacks on Kubernetes easy to use, fast and extensible.

Kubernetes and Machine Learning

Kubernetes has quickly become the hybrid solution for deploying complicated workloads anywhere. While it started with just stateless services, customers have begun to move complex workloads to the platform, taking advantage of rich APIs, reliability and performance provided by Kubernetes. One of the fastest growing use cases is to use Kubernetes as the deployment platform of choice for machine learning.

Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning. Infrastructure engineers will often spend a significant amount of time manually tweaking deployments and hand rolling solutions before a single model can be tested.

Worse, these deployments are so tied to the clusters they have been deployed to that these stacks are immobile, meaning that moving a model from a laptop to a highly scalable cloud cluster is effectively impossible without significant re-architecture. All these differences add up to wasted effort and create opportunities to introduce bugs at each transition.

Introducing Kubeflow

To address these concerns, we’re announcing the creation of the Kubeflow project, a new open source GitHub repo dedicated to making ML stacks on Kubernetes easy to use, fast and extensible. This repository contains:

  • JupyterHub to create & manage interactive Jupyter notebooks
  • A TensorFlow Custom Resource Definition (CRD) that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting
  • A TF Serving container

Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!

Using Kubeflow

Let's suppose you are working with two different Kubernetes clusters: a local minikube cluster and a GKE cluster with GPUs, and that you have two kubectl contexts defined, named minikube and gke.

First we need to initialize our ksonnet application and install the Kubeflow packages:

     ks init my-kubeflow  
     cd my-kubeflow  
     ks registry add kubeflow \  
     github.com/google/kubeflow/tree/master/kubeflow  
     ks pkg install kubeflow/core  
     ks pkg install kubeflow/tf-serving  
     ks pkg install kubeflow/tf-job  
     ks generate core kubeflow-core --name=kubeflow-core
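
The commands above assume that the ks and kubectl binaries are already installed and on your PATH. As a minimal sketch of a preflight check (the missing_tools helper is hypothetical and not part of Kubeflow or ksonnet), something like the following could be run first:

```shell
#!/bin/sh
# Hypothetical helper (not part of Kubeflow or ksonnet):
# print the name of every tool passed as an argument that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call; prints nothing when both tools are installed:
#   missing_tools ks kubectl
```

If the call prints any names, install those tools before running ks init.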

We can now define environments corresponding to our two clusters:

     kubectl config use-context minikube  
     ks env add minikube  

     kubectl config use-context gke  
     ks env add gke  

And we’re done! Now just create the environments on your cluster. First, on minikube:

     ks apply minikube -c kubeflow-core  

And to create it on our multi-node GKE cluster for quicker training:

     ks apply gke -c kubeflow-core  

By making it easy to deploy the same rich ML stack everywhere, the drift and rewriting between these environments is kept to a minimum.
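
Since the same component is applied to each environment, the two apply commands can be sketched as a small loop. The run and apply_everywhere helpers below are hypothetical names introduced for illustration; setting DRY_RUN=1 prints the commands instead of executing them, which is a convenient way to review what would be run.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: apply one ksonnet component to each listed environment.
# Usage: apply_everywhere <component> <env> [<env> ...]
apply_everywhere() {
  component="$1"
  shift
  for ks_env in "$@"; do
    run ks apply "$ks_env" -c "$component"
  done
}

# Example call:
#   apply_everywhere kubeflow-core minikube gke
```

The environment names must match the ones registered earlier with ks env add.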

To access either deployment, you can execute the following command:

     kubectl port-forward tf-hub-0 8100:8000  

and then open up http://127.0.0.1:8100 to access JupyterHub. To change the environment used by kubectl, use either of these commands:

     # To access minikube  
     kubectl config use-context minikube  

     # To access GKE  
     kubectl config use-context gke  

When you execute apply you are launching on K8s

Let's suppose you want to submit a training job. Kubeflow provides a ksonnet prototype, tf-cnn, that runs TensorFlow's CNN benchmark.

To submit a training job, you first generate a new job from a prototype:

     ks generate tf-cnn cnn --name=cnn  

By default, the tf-cnn prototype uses 1 worker and no GPUs, which is perfect for our minikube cluster, so we can just submit it:

     ks apply minikube -c cnn

On GKE, we’ll want to tweak the prototype to take advantage of the multiple nodes and GPUs. First, let’s list all the parameters available:

     # To see a list of parameters  
     ks prototype list tf-job  

Now let’s adjust the parameters to take advantage of GPUs and access to multiple nodes.

     ks param set --env=gke cnn num_gpus 1
     ks param set --env=gke cnn num_workers 1

     ks apply gke -c cnn  

Note how we set those parameters so they are used only when you deploy to GKE. Your minikube parameters are unchanged!
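
When several parameters need to change for one environment, the repeated ks param set calls can be wrapped in a helper. This is only a sketch: set_params and run are hypothetical names, and the name=value argument format is a convention of this example rather than ksonnet syntax. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: set several parameters on one component for one environment.
# Usage: set_params <env> <component> <name>=<value> [<name>=<value> ...]
set_params() {
  ks_env="$1"
  component="$2"
  shift 2
  for pair in "$@"; do
    # Split "name=value" at the first "=".
    run ks param set --env="$ks_env" "$component" "${pair%%=*}" "${pair#*=}"
  done
}

# Example call, equivalent to the two commands shown above:
#   set_params gke cnn num_gpus=1 num_workers=1
```

Because every call passes --env, the overrides stay scoped to that environment, just like the hand-written commands.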

After training, you

Kubeflow also includes a serving package. In a separate example, we trained a standard Inception model, and stored the trained model in a bucket we’ve created called ‘gs://kubeflow-models’ with the path ‘/inception’.

To deploy the trained model for serving, execute the following:

     ks generate tf-serving inception --name=inception \
     --namespace=default --model_path=gs://kubeflow-models/inception
     ks apply gke -c inception  

This highlights one more option in Kubeflow - the ability to pass in inputs based on your deployment. This command creates a tf-serving service on the GKE cluster, and makes it available to your application.
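
To serve a different model, only the component name, model path and target environment change. As a rough sketch, those three inputs can be turned into arguments; deploy_model and run are hypothetical helpers, and the default namespace is simply carried over from the example above. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: generate a tf-serving component and apply it.
# Usage: deploy_model <name> <model_path> <env>
deploy_model() {
  name="$1"
  model_path="$2"
  ks_env="$3"
  run ks generate tf-serving "$name" --name="$name" \
    --namespace=default --model_path="$model_path"
  run ks apply "$ks_env" -c "$name"
}

# Example call, equivalent to the commands shown above:
#   deploy_model inception gs://kubeflow-models/inception gke
```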

For more information about deploying and monitoring TensorFlow training jobs and TensorFlow models, please refer to the .

Kubeflow + ksonnet

One choice we want to call out is the use of the ksonnet project. We think working with multiple environments (dev, test, prod) will be the norm for most Kubeflow users. By making environments a first-class concept, ksonnet makes it easy for Kubeflow users to move their workloads between their different environments.

Particularly now that docs.

We also want to thank the team at

What’s Next?

We are in the midst of building out a community effort right now, and we would love your help! We’ve already been collaborating with many teams -

“The Kubeflow project was a needed advancement to make it significantly easier to set up and productionize machine learning workloads on Kubernetes, and we anticipate that it will greatly expand the opportunity for even more enterprises to embrace the platform. We look forward to working with the project members in providing tight integration of Kubeflow with Tectonic, the enterprise Kubernetes platform.” -- Reza Shafii, VP of product, CoreOS

If you’d like to try out Kubeflow right now right in your browser, we’ve partnered with

And we’re just getting started! We would love for you to help. How, you might ask? Well…

Jeremy Lewi & David Aronchick, Google