
Update #1

Catch up on recently shipped code, and what we have coming

Mar 18 · 3 min read · Chris Lambert

[Image: tracy waving next to a status seal]

Over the past month, we have shipped several features that remove even more pain from standing up a full observability platform in your own cloud network. Below, you’ll find what we’ve shipped, what we’re working on, and how we’re talking about it.

Shipped

We’ve heard from our community how painful it is to rely on CloudWatch and Stackdriver for logs and metrics, yet some metrics are only available there, for example those from cloud services such as ELB, S3, and Aurora. We recently shipped an integration that scrapes these native metrics, so users get full visibility into their custom metrics and native cloud metrics in one place. See our blog post for more information.

While we strive to automate all operations, we know there are times when manual intervention is necessary. We are building escape hatches for those cases, for example a way to manually modify managed resources without the controller reverting the changes. (#467, #492)

We’ve wrapped the Alertmanager APIs with authentication, so you can now control who is able to manage alert configurations. The API otherwise mirrors the Alertmanager API and Ruler API documented by Cortex. As part of this work, we have also upstreamed a few bug fixes to Cortex for problems we found along the way: a broken API endpoint, an error-handling bug, and uninformative logs. (#430, #441)
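To make the authenticated API concrete, here is a minimal sketch of building a request that uploads an Alertmanager configuration. It assumes the `POST /api/v1/alerts` endpoint and the `alertmanager_config` body key from the Cortex Alertmanager API documentation, and it assumes bearer-token authentication. The host name, the token value, and the helper name `build_set_config_request` are hypothetical placeholders, not Opstrace specifics. The request is only constructed, never sent.

```python
import urllib.request


def build_set_config_request(base_url: str, token: str,
                             alertmanager_yaml: str) -> urllib.request.Request:
    """Build (but do not send) a request that uploads an Alertmanager config."""
    # Per the Cortex API docs, the body is a YAML document that embeds the
    # Alertmanager config as a block string under `alertmanager_config`.
    indented = "\n".join("  " + line for line in alertmanager_yaml.splitlines())
    body = "alertmanager_config: |\n" + indented + "\n"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/alerts",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            # Assumption: the authenticated wrapper accepts a bearer token.
            "Authorization": "Bearer " + token,
            "Content-Type": "application/yaml",
        },
    )


# Hypothetical host and token, for illustration only.
example_config = "route:\n  receiver: default\nreceivers:\n  - name: default\n"
req = build_set_config_request(
    "https://alertmanager.example.invalid", "PLACEHOLDER_TOKEN", example_config
)
```

Passing `req` to `urllib.request.urlopen` would send it. A request without a valid token would be expected to be rejected, which is the point of the authentication layer.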

We’ve witnessed the pain of configuring Alertmanager too many times, so we’ve invested in solving some of the basic UX challenges. We are now beginning to expose configuration options in the UI, starting with Alertmanager. Each tenant now has an "Alerts" section where you can paste in your existing Alertmanager configuration YAML or begin creating a new one. The section also runs basic validation checks that give you immediate feedback. Additional work in this area is planned. (#380, #483)
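As an illustration of what "basic validation checks" on an Alertmanager configuration can look like, the sketch below checks a few rules that Alertmanager itself enforces: a root route must exist and name a receiver, receiver names must be unique, and every route must reference a defined receiver. This is not the validation code used in the Opstrace UI. The function name and the choice to take an already-parsed config as a dictionary are assumptions made for the example.

```python
from typing import Any, Dict, List


def validate_alertmanager_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: List[str] = []

    # Collect receiver names, flagging missing and duplicate names.
    seen = set()
    for receiver in config.get("receivers") or []:
        name = receiver.get("name")
        if not name:
            errors.append("receiver is missing a name")
        elif name in seen:
            errors.append(f"duplicate receiver name: {name}")
        seen.add(name)

    route = config.get("route")
    if not route:
        errors.append("missing top-level route")
        return errors
    if not route.get("receiver"):
        errors.append("root route must name a receiver")

    # Walk the routing tree and check every referenced receiver exists.
    stack = [route]
    while stack:
        node = stack.pop()
        receiver = node.get("receiver")
        if receiver and receiver not in seen:
            errors.append(f"route references undefined receiver: {receiver}")
        stack.extend(node.get("routes") or [])
    return errors


good = {
    "route": {"receiver": "team", "routes": [{"receiver": "pager"}]},
    "receivers": [{"name": "team"}, {"name": "pager"}],
}
bad = {
    "route": {"receiver": "team", "routes": [{"receiver": "missing"}]},
    "receivers": [{"name": "team"}],
}
print(validate_alertmanager_config(good))
print(validate_alertmanager_config(bad))
```

Returning a list of messages rather than raising on the first problem suits a UI, where all issues can be shown to the user at once.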

We've learned a lot while managing Opstrace installations, and as a result, we’ve made various reliability and performance improvements toward building a fully automated observability platform:

  • Tune persistent volume size for Cortex and Loki components. (#413)
  • Improve the debugging experience by adding a dry run mechanism to the controller, so all Kubernetes API writes will be logged but not executed. (#413)
  • Adjust our scaling formula for Loki/Cortex components in larger installations (5+ nodes). (#413)
  • Make CloudSQL instance creation on GCP more resilient. (#397)
  • Upgrade to Cortex 1.8.0-rc1 and Loki 2.2.0. (#453, #472)
  • Upgrade the per-tenant Prometheus instance, Prometheus Operator, and related CRDs to the latest release. (#478, #487)
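The dry run mechanism mentioned in the list above can be pictured as a thin wrapper around whatever performs the writes: in dry run mode it logs the intended write and skips it. The sketch below is an illustration under assumptions, not the Opstrace controller's implementation. The class `DryRunWriter`, the `apply` callback, and the logger name are all hypothetical.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

log = logging.getLogger("controller")


class DryRunWriter:
    """Route every write through one place so it can be logged or skipped."""

    def __init__(self, apply: Callable[[str, Dict[str, Any]], None],
                 dry_run: bool) -> None:
        self._apply = apply      # hypothetical function that does the real write
        self._dry_run = dry_run

    def write(self, verb: str, resource: Dict[str, Any]) -> bool:
        """Log the write; execute it only when not in dry run mode."""
        kind = resource.get("kind", "?")
        name = resource.get("metadata", {}).get("name", "?")
        if self._dry_run:
            log.info("dry run: would %s %s/%s", verb, kind, name)
            return False
        log.info("%s %s/%s", verb, kind, name)
        self._apply(verb, resource)
        return True


# Stand-in for a real API client: record the writes that were executed.
applied: List[Tuple[str, str]] = []


def record(verb: str, resource: Dict[str, Any]) -> None:
    applied.append((verb, resource["metadata"]["name"]))


config_map = {"kind": "ConfigMap", "metadata": {"name": "example"}}
skipped = DryRunWriter(record, dry_run=True).write("create", config_map)
executed = DryRunWriter(record, dry_run=False).write("create", config_map)
```

Funneling all writes through a single method is what makes the guarantee practical: there is one place to check that nothing is executed while dry run is on.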

In the News

This past week @sebp was the featured speaker at the Data on Kubernetes Meetup. The video is now available on YouTube. Don’t miss the traditional post-meetup rap from Bart Farrell.

Work in Progress

Here are just a few of the things that we have in progress right now. Your feedback and participation are encouraged!

  • Tuning global rate limits. (#465)
  • Automated testing of alerts. (#442)
  • Testing for exporters. (#412)

If for some reason you missed our blog post from February shining a light on infrastructure costs, please do take a look here: https://go.opstrace.com/blog/pulling-cost-curtain-back

Questions, comments, or suggestions? Tweet to us @opstrace.

If you like what you see, consider starring our repo: github.com/opstrace/opstrace.