It's crazy how hard it is to craft good, reliable alerts. Throw in scale beyond one Prometheus instance and the problem gets even harder. This week we released a new feature to enable API- and UI-based configuration of the scalable Cortex Alertmanager. We published a blog post about it, as well as a user guide with detailed instructions. Additional implementation details can be found on GitHub:
Upgrades are hard. Our goal is to deliver in-place upgrades without stopping the system, but not everything goes as planned, and sometimes you need an escape hatch. This is one of them. This feature allows an Opstrace instance to be taken down without deleting any of its data (the S3 buckets and the RDS configuration database) and then brought back up. The tool supports upgrades and can also recover clusters when all else has failed and time is of the essence.
Foundation for managing recording rules from the UI
One can’t build good Prometheus alerts without some custom recording rules. This work lays the foundation for managing rules through our UI in the future. It will let us build not only management interfaces but, more importantly, higher-level tools that help create better alerts, such as SLOs and error budgets.
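To make the link between recording rules and error budgets concrete, here is a minimal Python sketch. It is not Opstrace code: the function names, the metric `http_requests_total`, and the rule name `job:http_requests_errors:ratio_rate5m` are hypothetical and chosen only for illustration. The first function builds a rule group in the standard Prometheus rule-file shape (`name`, `rules`, `record`, `expr`); the second does the error-budget arithmetic for a given SLO target.

```python
def recording_rule_group(name, record, expr):
    """Return a rule group shaped like a Prometheus rule file entry.

    Hypothetical helper: it only builds a plain dict and does not
    talk to Prometheus, Cortex, or Opstrace.
    """
    return {"name": name, "rules": [{"record": record, "expr": expr}]}


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget.

    slo_target is a success ratio such as 0.999. The result is 1.0 when
    nothing has been spent and negative when the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical rule: per-job ratio of 5xx responses over five minutes.
error_ratio_group = recording_rule_group(
    name="slo_example",
    record="job:http_requests_errors:ratio_rate5m",
    expr=(
        'sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))'
        " / sum by (job) (rate(http_requests_total[5m]))"
    ),
)
```

With a 99.9% target and one million requests, the budget is about 1,000 failed requests, so 250 failures leave roughly three quarters of it unspent. An alert built on the recorded ratio can then fire on how fast the budget is burning rather than on a raw error count.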
Cloud metrics UI
A few weeks ago we released the cloud exporters feature. It will, of course, get a proper UI over time. This PR focuses on the credentials needed to scrape metrics from other cloud accounts.