
Collecting Cloud Provider Metrics with Opstrace

Centralize critical AWS/GCP service metrics alongside your application metrics

Mar 16 · 5 min read · Nick Parker


Prometheus is widely used to monitor distributed applications, and for good reasons: it’s amazing, for many more reasons than we have space to go into here. But if you’re using Prometheus, chances are you’ve run into a problem: you have most of the metrics you want in Prometheus, but some other (very critical!) metrics are stranded inside your cloud provider. For example, in AWS you probably have an ELB that you want to monitor, or in GCP you may want to track your GCS usage. Having multiple sources for metrics makes life difficult because you want to query and alert on all metrics in the same way. To solve this problem, people often add AWS CloudWatch and GCP Stackdriver as data sources in Grafana, but then quickly realize they can’t use Prometheus alerting on that data. Others import the data into Prometheus using additional exporters, which means yet more knowledge to acquire and more infrastructure to maintain.

In this blog post, we will show you how we incorporated the open source exporters into Opstrace to automate their configuration and maintenance. We’ll first describe how we implemented this and how it works. Then we will show you how to set it up yourself by providing Opstrace with cloud credentials and configurations for the metrics to collect, and then querying the data back using Grafana. Finally, we will talk about how support could be extended to other exporters as well as service discovery configurations.

Implementation

As you submit credentials and collection configurations, exporters are started within Opstrace to pull the metrics from CloudWatch and Stackdriver. The configurations themselves are a direct passthrough to the underlying exporter implementations and Opstrace deploys stock builds of each respective exporter type. This simplifies debugging the exporters and makes the configurations portable into and out of Opstrace. This structure may also support higher-level GUI-based configuration in the future.

Opstrace stores your submitted credentials and exporter configurations in Postgres, which acts as the backing store for a Hasura GraphQL interface. Note that the credentials are currently stored as plaintext within Postgres, so they must be scoped to allow only metrics collection. That said, the configuration API requires authentication and does not return credential values after they have been submitted. As an additional step, we may in the future introduce a separate secrets store as a replacement for the current Postgres-based storage.
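As an illustration of what "scoped to only allow metrics collection" can look like on the AWS side, here is a minimal sketch of a read-only IAM policy. The action list is an assumption based on the permissions the upstream CloudWatch exporter documents, and the file name and Sid are made up for this example, so check both against your own setup before using it.

```shell
# Write a read-only CloudWatch metrics policy to a local file.
# The file name and Sid are arbitrary placeholders for this sketch.
cat > opstrace-metrics-readonly.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OpstraceCloudWatchReadOnly",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData"
      ],
      "Resource": "*"
    }
  ]
}
EOF
# Assumption: "tag:GetResources" is only needed if the exporter config
# selects resources by tag, which the examples in this post do not.
```

The access key you submit to Opstrace would then belong to an IAM user that has only this policy attached. On GCP, the predefined `roles/monitoring.viewer` role is the usual equivalent for a service account that only reads metrics.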

The credential and exporter configuration changes are detected by the Opstrace Controller as GraphQL updates. As configurations are updated, stock release builds of the CloudWatch and Stackdriver exporters are deployed into each tenant. This structure supports adding further exporter types over time. The deployed exporters serve the configured metrics in Prometheus format to be collected and made available within Opstrace.

Overall architecture of where these components fit

Quick Start

We will configure both AWS CloudWatch and GCP Stackdriver collection to be sent to a tenant named dev. We use the corresponding tenant-api-token-dev auth token file created by the Opstrace CLI during installation. For convenience the configuration APIs accept one or more objects. Multiple credential or exporter types may be included in each request.

First we will submit the AWS and GCP credentials, naming them example-aws-credential and example-gcp-credential respectively:

NOTE: These credentials must be scoped to allow fetching metrics and nothing else.

$ echo '
name: example-aws-credential
type: aws-key
value:
  AWS_ACCESS_KEY_ID: sample-key-id
  AWS_SECRET_ACCESS_KEY: sample-access-key
---
name: example-gcp-credential
type: gcp-service-account
value: |-
  {
    "type": "service_account",
    "project_id": "sample-project-id",
    "private_key_id": "...",
    "private_key": "...",
    "etc": "etc"
  }
' | curl -v -H "Authorization: Bearer $(cat tenant-api-token-dev)" --data-binary @- https://me.opstrace.io/api/v1/credentials/
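If you would rather keep the YAML in a file than paste secrets into an inline `echo` (where they can end up in shell history), the same request can be wrapped in a small helper. This is only a sketch: the function name `opstrace_submit` and the `OPSTRACE_URL`, `TOKEN_FILE`, and `DRY_RUN` variables are invented for illustration, while the endpoint path and token file name are the ones used in this post.

```shell
# Hypothetical helper: POST a YAML file to an Opstrace config endpoint.
# Usage: opstrace_submit <endpoint-name> <yaml-file>
# OPSTRACE_URL, TOKEN_FILE, and DRY_RUN are names made up for this sketch.
opstrace_submit() {
  kind=$1
  file=$2
  url="${OPSTRACE_URL:-https://me.opstrace.io}/api/v1/${kind}/"
  if [ -n "${DRY_RUN:-}" ]; then
    # Print the target instead of sending anything.
    echo "POST ${url} < ${file}"
    return 0
  fi
  # --fail makes curl exit non-zero on HTTP error responses,
  # so a calling script can stop instead of continuing silently.
  curl --fail --silent --show-error \
    -H "Authorization: Bearer $(cat "${TOKEN_FILE:-tenant-api-token-dev}")" \
    --data-binary "@${file}" \
    "${url}"
}

# Dry run: shows where the file would be sent without contacting the cluster.
DRY_RUN=1 opstrace_submit credentials credentials.yaml
# prints: POST https://me.opstrace.io/api/v1/credentials/ < credentials.yaml
```

Here `credentials.yaml` would hold the same YAML documents shown above, ideally with file permissions restricted to your own user.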

Next we submit the CloudWatch and Stackdriver exporter configurations, creatively named example-aws-exporter and example-gcp-exporter. These reference the credentials from the last step. In this example we are retrieving EC2 CPU usage and custom Buildkite CI metrics from CloudWatch, and CPU, disk, and network usage and allocation metrics from Stackdriver. The configuration follows the format used by CloudWatch Exporter and Stackdriver Exporter:

$ echo '
name: example-aws-exporter
type: cloudwatch
credential: example-aws-credential
config:
  region: us-west-2
  metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Sum]
  - aws_namespace: Buildkite
    aws_metric_name: ScheduledJobsCount
    aws_dimensions: [Org, Queue]
    aws_statistics: [Sum]
  - aws_namespace: Buildkite
    aws_metric_name: RunningJobsCount
    aws_dimensions: [Org, Queue]
    aws_statistics: [Sum]
  - aws_namespace: Buildkite
    aws_metric_name: WaitingJobsCount
    aws_dimensions: [Org, Queue]
    aws_statistics: [Sum]
---
name: example-gcp-exporter
type: stackdriver
credential: example-gcp-credential
config:
  monitoring.metrics-type-prefixes: # required
  - compute.googleapis.com/instance/cpu
  - compute.googleapis.com/instance/disk
  - compute.googleapis.com/instance/network
  google.project-id: # optional
  - sample-project-id
  monitoring.metrics-interval: '5m' # optional
  monitoring.metrics-offset: '0s' # optional
' | curl -v -H "Authorization: Bearer $(cat tenant-api-token-dev)" --data-binary @- https://me.opstrace.io/api/v1/exporters/

At this point Opstrace will begin collecting the metrics we've configured in the dev tenant.
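To find the collected CloudWatch series in Grafana, it helps to know roughly what they will be called. The sketch below approximates the naming convention of the upstream CloudWatch exporter (namespace, then the metric name in snake case, then the statistic). The helper `cw_metric_name` is hypothetical and the convention is an assumption on our part, so treat the output as a starting point and confirm the actual names in your tenant.

```shell
# Hypothetical helper: approximate the Prometheus series name that the
# CloudWatch exporter derives from a namespace, metric name, and statistic.
cw_metric_name() {
  namespace=$1
  metric=$2
  statistic=$3
  # Insert "_" at each lowercase-or-digit to uppercase boundary
  # (CamelCase to snake_case); runs of capitals such as "CPUU" stay together.
  snake=$(printf '%s' "$metric" | sed -E 's/([a-z0-9])([A-Z])/\1_\2/g')
  # Join the parts, lowercase everything, and squeeze each run of
  # non-alphanumeric characters (such as "/") into a single "_".
  printf '%s_%s_%s\n' "$namespace" "$snake" "$statistic" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/_/g'
}

cw_metric_name AWS/EC2 CPUUtilization Sum
# prints: aws_ec2_cpuutilization_sum
cw_metric_name Buildkite ScheduledJobsCount Sum
# prints: buildkite_scheduled_jobs_count_sum
```

Under the same assumption, a configured dimension such as `InstanceId` would appear as a label like `instance_id`, so a query along the lines of `sum by (instance_id) (aws_ec2_cpuutilization_sum)` is a reasonable first thing to try in Grafana.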

Metrics dashboard showing stacked AWS/GCP per-instance CPU utilization

Outlook

This process makes it easy to get any metrics we may want out of CloudWatch and Stackdriver, even across multiple AWS and GCP accounts. To maximize flexibility and portability, we use a low-level API for configuring the credentials and metrics to collect. We will supplement this with a GUI-based configuration interface on top of this layer, making it easier to discover and select the metrics to be collected from CloudWatch and Stackdriver. We've also set up a good foundation for supporting other external data sources in the future. Is there anything else in particular you'd like to collect? Let us know in GitHub or by email.