Mar 25 | 4 min read

Update #2

Chris Lambert

Catch up on recently shipped code, and what we have coming

What's new this week? We have some additional detail following on from last week's cloud metrics collection story and some interesting research in progress regarding rate limits.

But first, we'd like to share our recent appearance on Software Engineering Daily. Our co-founder and CEO, Seb, sat down with Jeff to talk about how our open source platform makes it easy for anyone to scale their metrics and log collection.

We published a new, detailed user guide for collecting cloud vendor metrics (i.e., metrics only available via CloudWatch and Stackdriver). It includes instructions for using our new API to set up the necessary credentials and configure which metrics to collect. For example, metrics such as ELB throughput or S3 API requests can be critical for running your application, and having all of your metrics in one place makes life much easier. Check it out here:

A note about our documentation: all of our documentation lives inside our repo, alongside the code it supports. The Markdown files are rendered on our website and correctly rendered on GitHub (for example, all links work correctly), which means you can preview your work right from your PR. We're also working on a mini-website preview on each PR—more on this in a future blog post.

On Tuesday we published a new blog post about Cortex and Loki, focusing on the features that led us to choose them for Opstrace. If you are interested in these horizontally scalable backends for metrics and logs, check it out and leave us comments on Twitter.

We fixed a bug that prevented some users from re-installing Opstrace with the same name. You can now re-use the same name immediately after a destroy operation. We made the fix in a service we operate that provides our users a custom subdomain under

We did significant work on Looker's metrics mode. Looker is our black-box verification testing tool for Loki and Cortex: it pushes log entries or metric samples into the system and then performs read validation, a process that can be repeated and parameterized in a fine-grained fashion. We use Looker to explore, test, and validate the system's behaviors, such as rate limits. We have an ongoing initiative to understand and set rate limits in the various components of Cortex and Loki and in other components of Opstrace (e.g., ingress NGINX reverse proxies). The goal is to make it a little harder to accidentally overload an Opstrace instance by, for example, pushing too much data per time unit. Example pull request:
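To make the push-then-read idea concrete, here is a minimal sketch. It is not Looker's code and does not talk to Cortex or Loki: `FakeBackend`, `RateLimited`, `make_samples`, and `push_and_validate` are hypothetical names, and the "rate limit" is a simple per-window sample budget that we assume only for illustration.

```python
from collections import defaultdict


class RateLimited(Exception):
    """Raised by the fake backend when a push exceeds the per-window budget."""


class FakeBackend:
    """In-memory stand-in for a metrics backend (hypothetical, not the Cortex API)."""

    def __init__(self, max_samples_per_window):
        self.max_samples_per_window = max_samples_per_window
        self.accepted_in_window = 0
        self.series = defaultdict(list)

    def push(self, series_name, samples):
        # Reject the whole push if it would exceed the budget for this window.
        if self.accepted_in_window + len(samples) > self.max_samples_per_window:
            raise RateLimited(series_name)
        self.accepted_in_window += len(samples)
        self.series[series_name].extend(samples)

    def query(self, series_name):
        return list(self.series[series_name])

    def next_window(self):
        # Simulate the start of a new rate-limit window.
        self.accepted_in_window = 0


def make_samples(start_ts, count, step=1):
    # Deterministic (timestamp, value) pairs so the read side knows what to expect.
    return [(start_ts + i * step, float(i)) for i in range(count)]


def push_and_validate(backend, series_name, samples):
    """Push samples to a fresh series, then read them back and compare.

    Returns "ok", "rate_limited", or "mismatch".
    """
    try:
        backend.push(series_name, samples)
    except RateLimited:
        return "rate_limited"
    return "ok" if backend.query(series_name) == samples else "mismatch"
```

Treating a rate-limit rejection as its own outcome, rather than as a validation failure, is what would let a tool like this probe where the limits sit.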

We initially developed Looker for Loki's logs-oriented API, and the metrics mode was added as an afterthought. The first step is to consolidate Looker's metrics mode. Specifically, the goal is to run Looker with two or three orders of magnitude more concurrent streams (think: time series) than before. Example pull request:
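As a rough illustration of what "many concurrent streams" means for a write-then-validate tool, the sketch below drives a configurable number of streams through a thread pool against an in-memory store. This is an assumption-laden toy, not Looker's implementation: `InMemoryStore`, `expected_samples`, `run_stream`, and `run_streams` are hypothetical names, and a real run would push over the network to Cortex.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryStore:
    """Thread-safe stand-in for a remote metrics store (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def push(self, stream_id, samples):
        with self._lock:
            self._data.setdefault(stream_id, []).extend(samples)

    def read(self, stream_id):
        with self._lock:
            return list(self._data.get(stream_id, []))


def expected_samples(stream_index, count):
    # Values are derived from the stream index, so expected data can be
    # recomputed at validation time instead of being kept around per stream.
    return [(t, float(stream_index * count + t)) for t in range(count)]


def run_stream(store, stream_index, count):
    """Push one stream's samples, read them back, and report whether they match."""
    stream_id = f"stream-{stream_index:06d}"
    samples = expected_samples(stream_index, count)
    store.push(stream_id, samples)
    return store.read(stream_id) == samples


def run_streams(n_streams, samples_per_stream, max_workers=32):
    """Run all streams concurrently; return (validated_count, total_count)."""
    store = InMemoryStore()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda i: run_stream(store, i, samples_per_stream), range(n_streams))
        )
    return sum(results), len(results)
```

For example, `run_streams(1000, 5)` exercises a thousand streams of five samples each and reports how many passed read validation.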

In production, sometimes you need a scalpel: a way to go in and make very targeted manual changes. One example is having the Controller support manual editing of Kubernetes objects by not updating them during reconciliation. To accomplish this, we adopted a simple mark-as-"don't touch" approach that landed here. Now, if a resource that the controller normally manages has had its opstrace annotation removed, the controller avoids making any changes to it and logs that it is leaving it alone. Check the controller logs to see which resources are marked "don't touch." Don't forget to add the opstrace annotation back when done!
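The skip-if-unannotated behavior can be sketched as follows. This is a simplified model, not the Opstrace controller's code: the `Resource` type, the `reconcile` function, and the literal annotation key `"opstrace"` are assumptions made for illustration, since the post does not give the exact key or data structures.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("controller")

# Hypothetical annotation key; the real key is not spelled out in this post.
MANAGED_ANNOTATION = "opstrace"


@dataclass
class Resource:
    kind: str
    name: str
    annotations: dict = field(default_factory=dict)
    spec: dict = field(default_factory=dict)


def is_managed(resource):
    # A resource is managed only while it still carries the annotation.
    return MANAGED_ANNOTATION in resource.annotations


def reconcile(existing, desired_specs):
    """Bring managed resources in line with desired_specs.

    Resources whose annotation was removed are left untouched and logged.
    Returns the names of the skipped resources.
    """
    skipped = []
    for res in existing:
        desired = desired_specs.get(res.name)
        if desired is None:
            continue
        if not is_managed(res):
            logger.info("leaving %s/%s alone: marked don't touch", res.kind, res.name)
            skipped.append(res.name)
            continue
        if res.spec != desired:
            res.spec = dict(desired)
    return skipped
```

In this model, restoring the annotation is all it takes for the next reconciliation pass to start managing the resource again, which is why adding it back matters.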

We've begun work on creating a user interface for the cloud metrics collection feature mentioned earlier. Here you'll be able to set up your AWS and GCP metric gathering credentials and then configure the appropriate exporter. Stay tuned for more details. In the meantime, if you have any questions or suggestions, start a discussion on GitHub or tweet at us @opstrace.