May 19 | 6 min read

Introducing an Operator for Cortex


Simão Reis


A standalone, open source operator for Cortex on Kubernetes

Cortex already goes far beyond what most cloud native projects do. It is horizontally scalable, highly available, and can leverage S3 for long-term retention out of the box. You can get started locally or even in a Kubernetes cluster with a single binary/container. But as is usually the case, things get more complicated at scale.

At Opstrace, we have invested a lot of time into operating Cortex, orchestrating it with code as part of our larger observability distribution. We have a powerful, well-tested CI and production environments that give us confidence in making changes and performing upgrades rapidly.

Now that we have developed this deep expertise, we feel confident in our ideas for running Cortex and want to share that experience back with the community in a smaller, more digestible form than our full distribution. So we are reimplementing our existing TypeScript code as a standalone operator for Cortex: opstrace/cortex-operator. In our experience, smaller pieces of software that do one job well are not only easier to maintain and test, but also easier for others to adopt for their own projects and contribute to. Over time, this should make it easier for newcomers to learn about Cortex and run it safely at scale.

Why use a Cortex operator? Cortex requires more orchestration than Prometheus because it deploys many different resource types to do its job. A common way to run Cortex is the Helm chart. It's an easy way to get started, but for longer-term operations you really want running code. With the operator pattern we can, for example, scale individual Cortex components safely, control rollout logic, and recover from failures that require a non-trivial sequence of operations. The end result should be as easy as running Prometheus with the prometheus-operator.

This initial version reads a spec and then generates the appropriate Kubernetes definitions for Cortex:

[Figure: basic operator flow]

It is built with Go, using the kubebuilder framework. When you create a Cortex custom resource, the controller creates a ConfigMap with some default Cortex configuration and deploys all the required services for you. When you update the resource, the controller reconciles all the necessary Kubernetes resources to match.
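To make that reconcile idea concrete, here is a simplified sketch of the core logic: derive the desired set of workloads from the spec, then ensure each one exists. This is illustrative only, with hypothetical types; the real controller is built on controller-runtime and lives in the opstrace/cortex-operator repo.

```go
package main

import "fmt"

// CortexSpec is a hypothetical, trimmed-down model of the CRD spec.
type CortexSpec struct {
	Image string
}

// Workload names one Kubernetes resource the controller must ensure exists.
type Workload struct {
	Name string // e.g. "distributor"
	Kind string // "Deployment" or "StatefulSet"
}

// desiredWorkloads mirrors what a reconcile loop does: compute desired
// state from the spec, so the controller can create or update each
// resource until the cluster converges on it.
func desiredWorkloads(spec CortexSpec) []Workload {
	return []Workload{
		{"distributor", "Deployment"},
		{"querier", "Deployment"},
		{"query-frontend", "Deployment"},
		{"ingester", "StatefulSet"},
		{"store-gateway", "StatefulSet"},
		{"compactor", "StatefulSet"},
	}
}

func main() {
	for _, w := range desiredWorkloads(CortexSpec{Image: "cortexproject/cortex:v1.9.0"}) {
		fmt.Printf("%s/%s\n", w.Kind, w.Name)
	}
}
```

In the real operator each of these resources is stamped with an owner reference to the Cortex object, so deleting the custom resource garbage-collects everything it created.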

Here’s an example of what you can do today to deploy Cortex on your EKS cluster after you install the operator. We have a guide to scaffold the infrastructure on AWS using Terraform.

First, create a manifest file with the CRD object:

$ cat cortex-sample.yaml
apiVersion: cortex.opstrace.io/v1alpha1
kind: Cortex
metadata:
  name: cortex-sample
spec:
  image: "cortexproject/cortex:v1.9.0"
  blocks_storage:
    backend: s3
    s3:
      bucket_name: cortex-operator-example-209f-data
      endpoint: s3.us-west-2.amazonaws.com
  ruler_storage:
    backend: s3
    s3:
      bucket_name: cortex-operator-example-209f-config
      endpoint: s3.us-west-2.amazonaws.com
  alertmanager_storage:
    backend: s3
    s3:
      bucket_name: cortex-operator-example-209f-config
      endpoint: s3.us-west-2.amazonaws.com

Create it using kubectl with the following command:

$ kubectl create -f cortex-sample.yaml

You can see that all the pods are running with kubectl get pods:

$ kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
compactor-0                      1/1     Running   0          2m59s
compactor-1                      1/1     Running   0          2m46s
distributor-db7f645c7-2bzlj      1/1     Running   0          2m59s
distributor-db7f645c7-2n22k      1/1     Running   0          2m59s
ingester-0                       1/1     Running   0          2m59s
ingester-1                       1/1     Running   0          1m26s
memcached-0                      1/1     Running   0          3m
memcached-index-queries-0        1/1     Running   0          3m
memcached-index-writes-0         1/1     Running   0          3m
memcached-metadata-0             1/1     Running   0          3m
memcached-results-0              1/1     Running   0          3m
querier-7dbd4cb465-66q95         1/1     Running   0          2m59s
querier-7dbd4cb465-frfnj         1/1     Running   1          2m59s
query-frontend-b9f7f97b7-g7lsf   1/1     Running   0          2m59s
query-frontend-b9f7f97b7-tsppd   1/1     Running   0          2m59s
store-gateway-0                  1/1     Running   0          2m59s
store-gateway-1                  1/1     Running   0          2m34s

When all the pods are running you can start sending metrics to Cortex.

Using this sample manifest you can deploy Prometheus node-exporter to the cluster to gather node metrics and scrape them with Grafana Agent. To confirm Cortex is receiving metrics you can set up port-forwarding and then use cortex-tools to read them back.
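For the write path, the Grafana Agent ships samples to Cortex over Prometheus remote_write; distributors accept writes on the /api/v1/push endpoint. A minimal sketch of the relevant Agent config fragment follows — the service name, namespace, and target address are assumptions based on the pod listing above, and top-level field names vary slightly across Agent versions:

```yaml
metrics:
  configs:
    - name: cortex
      scrape_configs:
        - job_name: node-exporter
          static_configs:
            # node-exporter listens on 9100 by default; adjust the
            # service name/namespace to match your deployment.
            - targets: ['node-exporter.default.svc.cluster.local:9100']
      remote_write:
        # Cortex distributors accept Prometheus remote_write here.
        - url: http://distributor.default.svc.cluster.local/api/v1/push
```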

Set up port-forwarding with kubectl:

$ kubectl port-forward svc/query-frontend 8080:80

Query with cortex-tools with:

$ cortextool remote-read dump --address=http://localhost:8080 --remote-read-path=/api/v1/read
INFO[0000] Created remote read client using endpoint 'http://localhost:8080/api/v1/read'
INFO[0000] Querying time from=2021-05-19T12:25:29Z to=2021-05-19T13:25:29Z with selector=up
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430648242
<... snip …>

You can also deploy Grafana and add a data source pointing at the query-frontend service running at the URL http://query-frontend.default.svc.cluster.local/api/v1/read.
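If you provision the data source declaratively, it might look like the sketch below. The URL path prefix is an assumption: by default Cortex exposes its Prometheus-compatible query API under /prometheus (controlled by api.prometheus_http_prefix), so check your Cortex configuration before copying this:

```yaml
apiVersion: 1
datasources:
  - name: Cortex
    type: prometheus
    access: proxy
    # query-frontend service from the deployment above; the /prometheus
    # prefix assumes Cortex's default api.prometheus_http_prefix.
    url: http://query-frontend.default.svc.cluster.local/prometheus
```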

We decided to open the repo as soon as possible so you can follow along from the very beginning stages of development. Currently, the operator is only tested on EKS with the blocks storage engine using the Amazon S3 backend. When we finally integrate this with Opstrace we will leverage our testing infrastructure and production customers to provide even more confidence in the operator.

What’s next for this operator? First and foremost, we want to engage with the Cortex community. We’d like to discuss it with you wherever you like, in our community Slack or the #cortex channel at CNCF. Please join us so we can work together to build something useful for everyone. And there are many more features to be built. On our roadmap are several important areas, including:

  • Custom Cortex configuration and upgrade support (already in progress).
  • Reflect the status of the deployment in the CRD status field.
  • Add Kubernetes validation webhook for Cortex config. (For example, did you set all the required options to set up the S3 storage correctly?)
  • Improve local testing with a tutorial for Kubernetes Kind cluster and minio.
  • Support CRDs to deploy Cortex in a particular topology. Currently, Opstrace deploys N Cortex ingesters, distributors, and queriers, but does not distinguish between instances; our goal is to be able to handle other ingestion and query load patterns in the same cluster.
  • Advanced failure remediation, for example, removing a component from the ring and re-creating it.
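To illustrate the validation webhook item above: the check boils down to refusing a Cortex object whose storage sections are incomplete. Here is a hypothetical sketch of such a check — the types and field names are illustrative, not the operator's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// S3Config is a hypothetical, simplified model of the S3 settings a
// validating admission webhook would inspect before admitting the object.
type S3Config struct {
	BucketName string
	Endpoint   string
}

// validateS3 rejects incomplete S3 storage configuration, answering the
// question: "did you set all the required options for S3 storage?"
func validateS3(c S3Config) error {
	if c.BucketName == "" {
		return errors.New("s3.bucket_name is required")
	}
	if c.Endpoint == "" {
		return errors.New("s3.endpoint is required")
	}
	return nil
}

func main() {
	// Missing endpoint: the webhook would deny the admission request.
	err := validateS3(S3Config{BucketName: "cortex-data"})
	fmt.Println(err) // prints "s3.endpoint is required"
}
```

In the real operator this logic would sit behind a ValidatingWebhookConfiguration, so misconfigured objects are rejected at kubectl apply time instead of failing at runtime.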

Cortex is an amazing, growing project that offers a lot of value. We want to contribute our Cortex expertise back to the community in the form of a Kubernetes operator to make it easier for more people to join the Cortex community. Get started with our quick start here: https://github.com/opstrace/cortex-operator/#quickstart