Jul 30 | 8 min read

Breaking Down Infrastructure Cost

Nick Parker, Simão Reis, Chris Lambert

A horizontally scalable, highly reliable, open source observability platform costs less than one dollar per month, per 1K metrics

Back in January, we posted about the cost of running Opstrace on your own cloud infrastructure when ingesting 1M active metrics. Last month, we ran additional scale tests up to 200M active series (without regard for infrastructure cost). Recently, we ran a new scale test with an eye toward cost and more realistic workload behavior. This post explains the details of that test.

Tl;dr: The on-demand infrastructure costs come to $0.80 per 1,000 metric series (with 35M total active series across 6 tenants); with EC2 Reserved Instances this can come down to $0.65 to $0.55.

Summary

Compared with the last test of 200M series in a single tenant, this test used more tenants and a high rate of churn. This effectively simulates a high rate of continuous deployment, which temporarily increases cardinality. Think of a service rollout, where the pod name label value changes; each new label value creates a new time series that requires additional memory (until the old series ages out). The main details of the test include:

  • An 18-host Opstrace instance on AWS (us-west-1, r5a.4xlarge).
  • A total of 6 tenants.
  • One tenant with 30M active metric series written from 5 Prometheus instances.
  • The other 5 tenants had 1M active series each.
  • Active series churned at a rate of 100K series per minute across tenants.
  • Tuned configuration to reduce the amount of memory used.

Definitions & Layout

Series churn means that, at some point in time, a previously active time series is replaced by another; a time series is defined by its label set, for example:

http_server_requests_total{code="200",method="get",path="/",pod_name="server-a2fec23"} 213

All of the data points collected for this metric with that exact set of labels and values are considered one time series. If one of the label values changes, for example pod_name during a deploy, a new time series is effectively created; the old time series is still persisted in long-term storage, but there is a short-term cost in the processing required to create and ingest the new series.
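
To make that concrete, here is a tiny Python sketch (not Prometheus or Cortex code) showing that a series' identity is simply its set of label name/value pairs, so a single changed pod_name yields a brand-new series. The second pod name below is a made-up value for illustration.

    # Illustration only: a series is identified by its full label set, so changing
    # one label value (pod_name during a deploy) produces a different series key.
    def series_key(labels):
        return frozenset(labels.items())

    before = series_key({"__name__": "http_server_requests_total", "code": "200",
                         "method": "get", "path": "/", "pod_name": "server-a2fec23"})
    # Hypothetical pod name after a redeploy:
    after = series_key({"__name__": "http_server_requests_total", "code": "200",
                        "method": "get", "path": "/", "pod_name": "server-b41ce90"})

    print(before == after)  # False: same metric name, but a new time series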

The driver cluster is a basic EKS cluster that hosts our synthetic workload, which attempts to approximate some characteristics of a real-world workload in a repeatable manner. See our original blog post for the basics of how we generate the metrics workload with Avalanche; we describe the change to add churn in the next section. In short, we use 5 Prometheus instances to poll random metric generators; each Prometheus instance remote-writes to Opstrace, which ultimately persists the data in S3.

tests layout diagram

Configuration

Generating Churn

Avalanche pods only allow configuring a period between label resets, not when those resets should occur in terms of clock time. (In practice, this is implemented as a series_id label whose value increments after the configured period.) To test a consistent churn of new series, we wanted these resets staggered across the Avalanche pods. For example, to have 10K series change their labels every minute, we would deploy Avalanche pods with 10K series each and roll them out onto Kubernetes at a rate of one per minute.

This was implemented with a helper script included in the Avalanche pods, which used modulo logic against the current time to schedule when a given Avalanche process would start. This scheduling ensured that the rollout ordering was retained even if the Avalanche pods were restarted or reconfigured.
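
The helper script itself isn't reproduced here, but the modulo idea can be sketched in a few lines of Python (the real implementation is a shell script baked into the Avalanche pods; pod_index and num_pods are illustrative parameters):

    import time

    def wait_for_slot(pod_index, num_pods, slot_seconds=60):
        """Block until the current time slot, computed by modulo, belongs to this pod.

        Because the slot is derived from wall-clock time rather than process start
        time, the staggered ordering survives pod restarts and reconfiguration.
        """
        while True:
            slot = int(time.time() // slot_seconds) % num_pods
            if slot == pod_index:
                return
            time.sleep(1)

    # e.g. pod 3 of 10 only starts its Avalanche process during its own
    # one-minute slot, keeping its periodic label resets offset from the others:
    # wait_for_slot(pod_index=3, num_pods=10)
    # ...then exec the Avalanche process...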

In this test, we used a Prometheus scrape_interval of 60s.

Tuning Memory Usage

Under the hood, Opstrace uses Cortex. The ingester component of Cortex consumes the vast majority of the memory used by the overall system, mainly because each ingester keeps an in-memory copy of everything it has received in the current block period. If many distinct series are being sent, ingester memory usage grows proportionately. On top of that, this in-memory copy is replicated across multiple ingesters, and it comes in addition to the on-disk write-ahead log (WAL) that the ingesters retain for the current block. We believe that reducing this excessive memory use can make Cortex far more cost-effective for larger, churning workloads.
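
A rough back-of-the-envelope shows why this dominates sizing. The per-series byte figure below is an assumption for the sketch (a few kilobytes per in-memory series), and the replication factor of 3 is the Cortex default; neither number is a measurement from this test.

    # Back-of-the-envelope only; bytes_per_series is an assumed ballpark, and
    # replication_factor=3 is the Cortex default, not a measured value.
    def ingester_memory_gb(active_series, bytes_per_series=3_000, replication_factor=3):
        """Rough total in-memory footprint across all ingesters, in GB."""
        return active_series * bytes_per_series * replication_factor / 1e9

    print(round(ingester_memory_gb(35_000_000)))  # ~315 GB across the fleet, before
    # any churn-driven growth -- which is why memory, not CPU, sizes the hosts.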

The ingesters have a max_series configuration option (unlimited by default) that caps the number of distinct series kept in memory. This prevents the ingesters from OOMing, which would cause both incoming data and queries to be rejected. However, it is a hard limit across all tenants, so any one tenant can potentially create a denial-of-service situation in which the ingesters refuse any new series until existing in-memory series have been flushed out. To mitigate this, we must also tune the per-tenant limits.max_series_per_user and limits.max_global_series_per_user values to reduce the likelihood of a single tenant consuming too much of the global quota. These values need to be large enough to support the largest tenant in the system, and they assume that tenants are roughly equal in size. For example, if one tenant has 80% of the overall series, this setting won't do much to catch unexpected series growth in the other 20%.
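
The 80/20 point can be made concrete with a small sketch (plain Python, not Cortex configuration): if the per-tenant cap has to clear the largest tenant, it barely constrains the small ones. The tenant sizes mirror this test; the 20% headroom factor is arbitrary, for illustration only.

    # Illustration of why per-tenant series limits assume roughly equal tenants.
    def max_series_per_tenant(tenant_series, headroom=1.2):
        """The per-tenant cap must clear the biggest tenant (plus some headroom)."""
        return int(max(tenant_series.values()) * headroom)

    tenants = {"big": 30_000_000, **{f"small-{i}": 1_000_000 for i in range(5)}}
    print(max_series_per_tenant(tenants))  # 36000000 -- a 1M-series tenant could
    # grow ~36x before this limit would catch it.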

We configured the Cortex block duration to be 30 minutes, down from the default of two hours, via the blocks_storage.tsdb.block_ranges_period setting. In practice, we found that this did not greatly reduce the ambient memory usage of the ingesters, but it did greatly reduce the peak usage that occurred when blocks were flushed to S3. We believe the more frequent flushes resulted in lower peak memory use at the ingesters because less data needed to be included in each flush. We also decreased blocks_storage.tsdb.retention_period from the default of six hours to one hour, but this did not show a noticeable improvement in memory usage.
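
The intuition for the peak reduction can be sketched with simple arithmetic (assuming the 60-second scrape interval used in this test): each flush has to process everything accumulated during the block window, so a quarter-length window means roughly a quarter of the data per flush.

    # Samples accumulated per series between flushes, for a given block duration.
    def samples_per_flush(block_minutes, scrape_interval_s=60):
        return block_minutes * 60 // scrape_interval_s

    for block_minutes in (120, 30):  # Cortex default vs. the value used in this test
        print(block_minutes, samples_per_flush(block_minutes))
    # 120 -> 120 samples per series per flush (default two-hour blocks)
    #  30 ->  30 samples per series per flush (this test): ~4x less data per flush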

Detailed Cost Breakdown

The total cost per day for this new test is much higher than in our original post, as one would expect given the increased load: $850/day vs. $30/day. However, this new configuration is less expensive on a per-metric basis: $0.80 vs. $0.90 per 1K metric series per month.

daily cost graph

Using on-demand list prices, the 18 EC2 instances (r5a.4xlarge) cost $450 per day. S3, projected out with one-year retention and a nominal query load, would be about $350 per day; this projection is why the actuals in the graph above fall below the $850 mark mentioned earlier. The remaining costs (PVCs, RDS, NAT gateways, ELBs, and so on) together come to less than $100 per day.
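
Putting those rounded figures together reproduces the headline number, give or take rounding (treating the "other" bucket as its $100-per-day upper bound):

    ec2_per_day = 450      # 18x r5a.4xlarge at on-demand list price
    s3_per_day = 350       # projected with 1yr retention and nominal query load
    other_per_day = 100    # PVCs, RDS, NAT gateways, ELBs, ... (upper bound)

    total_per_day = ec2_per_day + s3_per_day + other_per_day
    per_1k_per_month = total_per_day * 30 / 35_000   # 35M active series = 35,000 x 1K

    print(round(per_1k_per_month, 2))  # ~0.77, i.e. roughly the $0.80 quoted above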

Additional Considerations

Of course, your mileage may vary. Other costs can arise from longer data retention settings, data transfer costs, region-specific pricing, or heavy query (S3 read) workloads.

The most important consideration is that these prices ($0.80 per 1K metrics per month) are based on EC2 on-demand pricing; by running Opstrace in your own account, you benefit from any discounts you have negotiated (e.g., Reserved Instances). The maximum list discount for Reserved Instances (3-year term, all upfront) is 62% off on-demand pricing; at list prices, the Reserved Instance range comes to $0.65 to $0.55 per 1K metrics per month.
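
One way to reproduce that range, sketched below, is to apply the Reserved Instance discount to the EC2 share only, leaving S3 and the other services at on-demand rates. The 62% figure is the maximum list discount mentioned above; the 35% figure for a shorter-term commitment is an assumption for illustration.

    ec2_per_day, s3_per_day, other_per_day = 450, 350, 100  # same rounded figures as above

    def per_1k_per_month(ec2_discount):
        """Monthly cost per 1K series when only the EC2 portion is discounted."""
        daily = ec2_per_day * (1 - ec2_discount) + s3_per_day + other_per_day
        return daily * 30 / 35_000

    print(round(per_1k_per_month(0.62), 2))  # ~0.53: maximum 3-year, all-upfront discount
    print(round(per_1k_per_month(0.35), 2))  # ~0.64: an assumed shorter-term discount

Both land close to the $0.55 and $0.65 endpoints quoted above.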

Depending on where your applications live, you may also pay egress fees to collect your data; this would be the same as if you used an external SaaS such as Datadog or Lightstep. Additionally, your spend on Opstrace in your own account counts toward your overall spending commitments with Amazon, rather than going to an external vendor for a SaaS monitoring solution.

Lastly, because we don't make money on the volume of data you ingest, we are incentivized to optimize Opstrace even further to reduce the operational cost. In other words, the numbers in this post should be considered a starting point—in practice, the actual cost can be even lower.

*   *   *

With this setup, we have increased the difficulty of the workload tremendously while reducing the per-metric cost, and there is still room to bring that cost down further.

Anyone who is cost-conscious and produces a lot of data should take notice: compare these numbers to your own monitoring solution and let us know what you find.