Article

Introducing the Ring Health UI for Cortex and Loki Components

Critical ring information is now centralized and actionable from a variety of disparate sources

Jun 32 min readChris Lambert

New Ring Health View in Opstrace UI

When running Cortex and Loki—as we do with Opstrace—there are many (many) running containers. Several of these services (ingester, store-gateway, compactor, ruler, and alertmanager) use a hashring [1][2] to coordinate work-sharing and discovery.

In the happy path, you never need to look at the state of the ring members because members will join, leave and redistribute tokens automatically as components come and go during scaling or failover. However, when troubleshooting, access to this information is critical to guide your effort in the event that the system is behaving unexpectedly. Our new UI presents all of the services’ ring information to you in a centralized way, collecting the data from various disparate endpoints (on both Loki and Cortex):

Viewing all of the rings

It’s a great starting point to check if components are in good shape via this new UI. It’s possible that some components, like an ingester, might get stuck in a state other than ACTIVE and they need to be forgotten and restarted in order for the work (allocated through token assignment) to be redistributed to healthy ingesters. We allow you to do this manually from the UI. Sometimes, though, components will fail and recover on their own:

Ingester leaving and rejoining the ring

Looking ahead, this manual remediation can be improved through automation. We recently announced a cortex-operator project where we intend to do just that—the identification and remediation of ring issues to eliminate the need for manual intervention. Join our community Slack if you’d like to follow our progress.

[1] Cortex ring architecture and getting started: https://cortexmetrics.io/docs/architecture/#the-hash-ring https://cortexmetrics.io/docs/guides/getting-started-with-gossiped-ring/

[2] Blog post from Grafana Labs: https://grafana.com/blog/2020/03/25/how-were-using-gossip-to-improve-cortex-and-loki-availability/