You are viewing documentation for Kubeflow 0.4

This is a static snapshot from the time of the Kubeflow 0.4 release.
For up-to-date information, see the latest version.

Overview of Kubeflow Pipelines

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

Quickstart

Run your first pipeline by following the pipelines quickstart guide.

What is Kubeflow Pipelines?

The Kubeflow Pipelines platform consists of:

  • A user interface (UI) for managing and tracking experiments, jobs, and runs.
  • An engine for scheduling multi-step ML workflows.
  • An SDK for defining and manipulating pipelines and components.
  • Notebooks for interacting with the system using the SDK.

The following are the goals of Kubeflow Pipelines:

  • End-to-end orchestration: enabling and simplifying the orchestration of machine learning pipelines.
  • Easy experimentation: making it easy for you to try numerous ideas and techniques and manage your various trials/experiments.
  • Easy re-use: enabling you to re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time.

In Kubeflow v0.1.3 and later, Kubeflow Pipelines is one of the Kubeflow core components and is automatically deployed when you deploy Kubeflow. You can currently try it with a Kubeflow deployment on GKE; see the GKE setup guide.

What is a pipeline?

A pipeline is a description of an ML workflow, including all of the components in the workflow and how they combine in the form of a graph. (See the screenshot below showing an example of a pipeline graph.) The pipeline includes the definition of the inputs (parameters) required to run the pipeline and the inputs and outputs of each component.

A pipeline is the main shareable artifact in the Kubeflow Pipelines platform. After developing your pipeline, you can upload and share it on the Kubeflow Pipelines UI.

A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline. For example, a component can be responsible for data preprocessing, data transformation, model training, and so on.
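
To see how these concepts map to code, here is a minimal sketch (not part of the example below) that uses the Kubeflow Pipelines SDK to define two container-based components and chain them into a pipeline. The container images and the bucket path are hypothetical placeholders:


import kfp.dsl as dsl

@dsl.pipeline(
  name='Minimal example',
  description='A two-step sketch: preprocess data, then train a model.'
)
def minimal_pipeline(raw_data='gs://my-bucket/raw.csv'):  # hypothetical bucket path
  # Each step is a component: user code packaged as a Docker image.
  preprocess_op = dsl.ContainerOp(
      name='preprocess',
      image='gcr.io/my-project/preprocess:latest',  # hypothetical image
      arguments=['--input', raw_data],
      file_outputs={'cleaned': '/output/cleaned.csv'})

  # Consuming an upstream output is what connects the steps into a graph:
  # 'train' runs after 'preprocess' because it uses its 'cleaned' output.
  train_op = dsl.ContainerOp(
      name='train',
      image='gcr.io/my-project/train:latest',  # hypothetical image
      arguments=['--data', preprocess_op.outputs['cleaned']])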

Example of a pipeline

The screenshots and code below show the xgboost-training-cm.py pipeline, which creates an XGBoost model using structured data in CSV format. You can see the source code and other information about the pipeline on GitHub.

The runtime execution graph of the pipeline

The screenshot below shows the example pipeline’s runtime execution graph in the Kubeflow Pipelines UI:

The Python code that represents the pipeline

Below is an extract from the Python code that defines the xgboost-training-cm.py pipeline. You can see the full code on GitHub.


# The full file on GitHub also defines the *Op classes used below
# (CreateClusterOp, AnalyzeOp, TransformOp, TrainerOp, PredictOp,
# ConfusionMatrixOp, RocOp, DeleteClusterOp) and imports the DSL,
# roughly as follows:
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(
  name='XGBoost Trainer',
  description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output,
    project,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
  # The exit handler guarantees that the Cloud Dataproc cluster created below is
  # deleted when the pipeline finishes, whether the steps inside succeed or fail.
  # .apply(gcp.use_gcp_secret('user-gcp-sa')) mounts the GCP service-account
  # secret so that each step can access GCP resources.
  delete_cluster_op = DeleteClusterOp('delete-cluster', project, region).apply(gcp.use_gcp_secret('user-gcp-sa'))
  with dsl.ExitHandler(exit_op=delete_cluster_op):
    create_cluster_op = CreateClusterOp('create-cluster', project, region, output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    analyze_op = AnalyzeOp('analyze', project, region, create_cluster_op.output, schema,
                           train_data, '%s/{{workflow.name}}/analysis' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    transform_op = TransformOp('transform', project, region, create_cluster_op.output,
                               train_data, eval_data, target, analyze_op.output,
                               '%s/{{workflow.name}}/transform' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    train_op = TrainerOp('train', project, region, create_cluster_op.output, transform_op.outputs['train'],
                         transform_op.outputs['eval'], target, analyze_op.output, workers,
                         rounds, '%s/{{workflow.name}}/model' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    predict_op = PredictOp('predict', project, region, create_cluster_op.output, transform_op.outputs['eval'],
                           train_op.output, target, analyze_op.output, '%s/{{workflow.name}}/predict' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    confusion_matrix_op = ConfusionMatrixOp('confusion-matrix', predict_op.output,
                                            '%s/{{workflow.name}}/confusionmatrix' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    roc_op = RocOp('roc', predict_op.output, true_label, '%s/{{workflow.name}}/roc' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))
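
To run a definition like this, you typically compile the pipeline function into a workflow package and upload the package on the Kubeflow Pipelines UI. A minimal sketch, assuming the kfp SDK is installed (the package filename is arbitrary):


import kfp.compiler as compiler

# Compile the pipeline function above into a package that you can upload
# on the Kubeflow Pipelines UI.
compiler.Compiler().compile(xgb_train_pipeline, 'xgboost-training-cm.py.tar.gz')


The SDK also provides a dsl-compile command-line tool that produces the same kind of package from a pipeline file.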

Pipeline data on the Kubeflow Pipelines UI

The screenshot below shows the Kubeflow Pipelines UI for kicking off a run of the pipeline. The pipeline definition in your code determines which parameters appear in the UI form. The pipeline definition can also set default values for these parameters. The arrows on the screenshot indicate the parameters that do not have useful default values in this particular example:
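
If you prefer to start a run from a notebook or script instead of the UI form, you can pass the same parameters through the SDK client. A rough sketch, assuming the Kubeflow Pipelines API is reachable from where the code runs; the experiment name, bucket, and project ID are placeholders:


import kfp

client = kfp.Client()  # assumes an in-cluster or port-forwarded Pipelines endpoint
experiment = client.create_experiment(name='xgboost-demo')  # hypothetical experiment name

# 'output' and 'project' have no default values in the pipeline definition,
# so you must always supply them.
run = client.run_pipeline(
    experiment.id,
    'xgboost-training-run',
    'xgboost-training-cm.py.tar.gz',
    params={'output': 'gs://my-bucket/xgb', 'project': 'my-gcp-project'})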

Outputs from the pipeline

The following screenshots show examples of the pipeline output visible on the Kubeflow Pipelines UI.

Prediction results:

Confusion matrix:

Receiver operating characteristics (ROC) curve:

Next steps