Dynamic per-tenant Cortex limits, configurable via UI

How to change rate limits on the fly via UI

Jun 18 min read
Dr. Jan-Philip Gehrcke

plot showing ingestion rate limit change


Cortex has a runtime configuration system that can be used to change configuration parameters on the fly, without restarting individual components. This is a powerful feature: for example, it allows for dynamically changing ingestion rate limits on a per-tenant basis.
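Conceptually, per-tenant runtime configuration boils down to a set of default limits plus per-tenant overrides that take precedence. The following Python sketch models that lookup. The function and variable names are hypothetical and are not part of Cortex or Opstrace, and only a single limit is modeled.

```python
from typing import Any, Dict

# Hypothetical defaults; 100000 mirrors the Opstrace default ingestion_rate
# described in this post. Other limits are omitted for brevity.
DEFAULT_LIMITS: Dict[str, Any] = {"ingestion_rate": 100_000}


def effective_limits(
    defaults: Dict[str, Any],
    overrides: Dict[str, Dict[str, Any]],
    tenant: str,
) -> Dict[str, Any]:
    """Return the limits for `tenant`: per-tenant overrides win over defaults."""
    limits = dict(defaults)  # copy so that the defaults are never mutated
    limits.update(overrides.get(tenant, {}))
    return limits


# Runtime overrides modeled as a dict: tenant ID -> overridden limits.
runtime_overrides = {"default": {"ingestion_rate": 10_000}}

print(effective_limits(DEFAULT_LIMITS, runtime_overrides, "default"))  # -> {'ingestion_rate': 10000}
print(effective_limits(DEFAULT_LIMITS, runtime_overrides, "system"))   # -> {'ingestion_rate': 100000}
```

Changing a limit "on the fly" then amounts to swapping in a new overrides mapping, with no component restart.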

In Opstrace, we have now started to expose the set of Cortex runtime configuration parameters via our graphical user interface.

In this blog post, I am going to briefly demonstrate how to use this new Opstrace capability. We are going to walk through changing a specific rate limit in a running Opstrace system, and we will verify that the change takes effect using a tiny load test performed twice, once before and once after the configuration change.

Along the way, I try to explain some of the relevant concepts.

Step-by-step guide: how to use the new Cortex configuration UI

In Opstrace, the Cortex ingestion rate limit is currently set to 10^5 samples per second per tenant per distributor. Note that this is a more or less arbitrary choice for now, so it makes sense for our users to be able to adapt it to their needs easily.
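Because the limit is enforced per distributor, the most a single tenant can push is the per-distributor limit multiplied by the number of distributors. To our understanding, this corresponds to what Cortex calls the `local` ingestion rate strategy. The following is a trivial sketch of that arithmetic, and the function name is hypothetical.

```python
def max_tenant_push_rate(per_distributor_limit: int, n_distributors: int) -> int:
    """Upper bound on a tenant's accepted samples/s when each distributor
    enforces the limit independently. The bound is only reached if the
    tenant's load is spread evenly across all distributors."""
    return per_distributor_limit * n_distributors


print(max_tenant_push_rate(10**5, 3))  # -> 300000
print(max_tenant_push_rate(10**4, 3))  # -> 30000
```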

In this walkthrough, I will first demonstrate this limit to take effect when sending metric samples at a high rate. Then, I will reduce the limit to make it much more strict and repeat the load test to see the new setting take effect.

1) Observe load response with default rate limits

In the Opstrace UI, I navigated to Configuration -> Cortex to display the current Cortex (runtime) configuration for a specific tenant, named default. The following screenshot shows a tiny part of that config:


The first row, ingestion_rate, shows the above-mentioned ingestion rate limit of 10^5 samples per second per tenant per distributor.

Next up, I pushed metric samples at a high rate into the system (using our in-house testing tool called 'looker', cf. Appendix below for the specific command used). I monitored the system response with our "Tenant Overview" dashboard for Cortex (rather prominently shown in the Opstrace UI). The following screenshot shows some of the panels in this dashboard:


Note that this system comprised three Cortex distributors, yielding a maximum allowed sample push rate of 3*10^5 samples per second. In the top-right panel of the screenshot above, we can see that the one-minute time average of the actual sample push rate during the short load test maxed out at approximately 2*10^5 samples per second. That is, the limit defined by ingestion_rate was not reached continuously. However, on shorter time scales it was reached every now and then: individual distributors sometimes rejected individual push HTTP requests because the limit was exceeded. These cases can be identified by inspecting the HTTP response body. Example:

ingestion rate limit (100000) exceeded while
adding 5000 samples and 0 metadata
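When scanning many responses, such rejections can be picked out programmatically. The following is a minimal sketch that matches the message quoted above with a regular expression. The exact wording may differ between Cortex versions, so treat the pattern as an assumption. The helper name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern for the rejection message quoted above; \s+ tolerates a line break
# between "while" and "adding". The wording is assumed, not guaranteed.
RATE_LIMIT_RE = re.compile(
    r"ingestion rate limit \((\d+)\) exceeded while\s+"
    r"adding (\d+) samples and (\d+) metadata"
)


def parse_rate_limit_rejection(body: str) -> Optional[Tuple[int, int, int]]:
    """Return (limit, samples, metadata) if `body` reports a rate-limit rejection."""
    match = RATE_LIMIT_RE.search(body)
    if match is None:
        return None
    limit, samples, metadata = (int(g) for g in match.groups())
    return limit, samples, metadata


body = "ingestion rate limit (100000) exceeded while adding 5000 samples and 0 metadata"
print(parse_rate_limit_rejection(body))  # -> (100000, 5000, 0)
print(parse_rate_limit_rejection("ok"))  # -> None
```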

Side note: in Opstrace, we translate Cortex' HTTP responses with status code 429 into HTTP responses with status code 503. We do this because, until recently, the Prometheus remote_write system did not retry requests that received a 429 response. Accordingly, the top-left panel in the screenshot above shows a low rate of 503 responses (watch out for the semi-logarithmic scale here).
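The following is a minimal sketch of such a status code translation, as it might look in a proxy in front of Cortex. It is illustrative only: it is not Opstrace's actual implementation, and the function name is hypothetical.

```python
from http import HTTPStatus


def translate_status(cortex_status: int) -> int:
    """Map Cortex's 429 (Too Many Requests) to 503 (Service Unavailable) so
    that clients which retry on 5xx, but not on 429, retry throttled pushes.
    All other status codes pass through unchanged."""
    if cortex_status == HTTPStatus.TOO_MANY_REQUESTS:
        return int(HTTPStatus.SERVICE_UNAVAILABLE)
    return cortex_status


print(translate_status(429))  # -> 503
print(translate_status(200))  # -> 200
```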

The sample receive rate panel shown in the top-right corner of the screenshot above is based on the distributor's perspective. It includes samples before rejection (i.e., before enforcing rate limits).

Let us have a look at the sample receive rate from the ingester system's point of view. This is after rejection, and therefore reflects the actual sample ingest rate after enforcing rate limits.
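The difference between the two perspectives can be illustrated with a toy token-bucket rate limiter. It counts received samples (before the limit is enforced) separately from accepted samples (after rejection). Cortex's distributor limiter also has a burst size (`ingestion_burst_size`). The class, its parameters, and the load pattern below are a simplified, hypothetical model, not Cortex code.

```python
from dataclasses import dataclass


@dataclass
class ToyDistributor:
    """Toy model of one distributor enforcing a per-tenant sample rate limit
    with a token bucket (hypothetical; not Cortex code)."""

    rate: float         # tokens (samples) added to the bucket per second
    burst: float        # bucket capacity, i.e. the largest burst allowed
    tokens: float = 0.0
    last_t: float = 0.0
    received: int = 0   # samples seen, counted before the limit is enforced
    accepted: int = 0   # samples that passed the limiter (after rejection)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def push(self, t: float, n_samples: int) -> int:
        """Handle a push request of n_samples arriving at time t (seconds,
        non-decreasing); return the HTTP status code the client would get."""
        # Refill the bucket for the time elapsed since the previous request.
        self.tokens = min(self.burst, self.tokens + (t - self.last_t) * self.rate)
        self.last_t = t
        self.received += n_samples
        if n_samples > self.tokens:
            return 429  # the whole request is rejected; nothing is accepted
        self.tokens -= n_samples
        self.accepted += n_samples
        return 200


# Offer 40 requests/s of 5000 samples (2*10^5 samples/s) for 10 s against
# a limit of 10^4 samples/s with a burst size of 5*10^4 samples.
d = ToyDistributor(rate=10_000, burst=50_000)
statuses = [d.push(t=i * 0.025, n_samples=5000) for i in range(400)]
print("received:", d.received, "accepted:", d.accepted, "rejected requests:", statuses.count(429))
```

A dashboard fed by `received` would show the full offered load, while one fed by `accepted` would show roughly the limit.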


Note that this metric shows the sample ingest rate summed over all tenants, from the point of view of each individual ingester. In the plot, you can convince yourself that there is a negligible base ingest rate (created by the system tenant of this Opstrace instance). The shape of the peak is dominated by the short load test.

Side note: in this Opstrace instance, the distributor replication factor is set to three. That is, each distributor forwards every accepted sample to three ingesters. This is why the ingest rate summed over all ingesters is effectively three times as large as what the user/tenant pushes.
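The effect of replication can be expressed as simple arithmetic. Every accepted sample is written to `replication_factor` ingesters, so the rate summed over all ingesters is the accepted rate times the replication factor. Assuming series are spread evenly, each ingester sees that total divided by the number of ingesters. The ingester count of three used below is an assumption for illustration.

```python
def per_ingester_rate(accepted_rate: float, replication_factor: int, n_ingesters: int) -> float:
    """Samples/s received by one ingester, assuming an even spread of series."""
    if replication_factor > n_ingesters:
        raise ValueError("replication factor cannot exceed the number of ingesters")
    total_ingester_rate = accepted_rate * replication_factor  # summed over all ingesters
    return total_ingester_rate / n_ingesters


# With replication factor 3 and (assumed) 3 ingesters, every ingester
# receives a copy of every accepted sample.
print(per_ingester_rate(3e4, replication_factor=3, n_ingesters=3))  # -> 30000.0
```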

The plot shows that each of the ingesters saw a maximum sample push rate of slightly below 2*10^5 samples per second. This is a little lower than what the distributors reported before, because some of the HTTP requests were rejected when the rate limit kicked in.

A noteworthy conclusion from these observations concerns the bottleneck for the sample ingest rate in this tiny experiment. It was not always the ingestion_rate limit but mostly lay elsewhere (probably on the client side). The ingestion_rate limit did kick in, but only occasionally.

Next up, I am going to significantly reduce the rate limit. The goal is to make it the definite bottleneck when repeating the short load test.

2) Decrease ingest rate limit via UI

I again navigated to the new Cortex configuration section in the Opstrace UI and changed the ingestion_rate limit for the default tenant:


As you can see, I reduced the ingestion_rate limit by an order of magnitude to 10^4 samples per second per distributor. I pressed Apply Changes and waited a minute for the changes to apply.
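For reference, in the Cortex runtime configuration file, per-tenant limits live under the `overrides:` key. How Opstrace stores and delivers that file is not covered here. Under that caveat, the following sketch shows how a document expressing this change could be produced. It uses `json.dumps` because JSON output is, for practical purposes, valid YAML, which avoids a third-party YAML dependency. The helper name is hypothetical.

```python
import json
from typing import Any, Dict


def render_runtime_overrides(overrides: Dict[str, Dict[str, Any]]) -> str:
    """Render a runtime config document with per-tenant overrides.

    The top-level `overrides` key follows the Cortex runtime config layout;
    the JSON text produced here doubles as a flow-style YAML document."""
    return json.dumps({"overrides": overrides}, indent=2, sort_keys=True)


doc = render_runtime_overrides({"default": {"ingestion_rate": 10_000}})
print(doc)
```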

Note: this change needs to propagate through a small number of caching layers, some of which poll periodically instead of responding to changes instantaneously. For now, the upper bound for how long it takes until such config changes apply in every corner of the system is on the order of one minute.
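One rough way to reason about that upper bound: if a change passes through several polling layers in sequence, the worst-case delay is roughly the sum of their poll periods. On the Cortex side, for instance, the runtime config reload period is configurable via `-runtime-config.reload-period`. The layer names and periods below are hypothetical placeholders, not measured Opstrace values.

```python
from typing import Dict


def worst_case_propagation_s(poll_periods_s: Dict[str, float]) -> float:
    """Worst case: the change lands just after each layer's previous poll, so it
    waits one full period at every layer in the chain (processing time ignored)."""
    return sum(poll_periods_s.values())


# Hypothetical chain of polling layers and their periods in seconds.
layers = {"config store sync": 30.0, "runtime config file reload": 10.0, "cache refresh": 15.0}
print(worst_case_propagation_s(layers))  # -> 55.0
```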

3) Observe load response after configuration change

With the new limit in place, I repeated the short load test.

In order to see how the system behaved during this test, let's first have a look at the Tenant Overview dashboard for Cortex:


Each panel shows data from both experiments: the previous load test as well as the more recent one after the configuration change.

You can see in the top-left panel that the rate of HTTP responses with status code 503 (caused by the new rate limit kicking in) was higher than the rate of HTTP responses with status code 200 (indicating 'data accepted'). That alone shows a strong throttling effect.
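A back-of-the-envelope estimate shows why 503 responses can outnumber 200 responses here. The sketch rests on several illustrative assumptions:

- an offered load of roughly 2*10^5 samples per second (the receive rate seen in the first test);
- requests of 5000 samples each (matching the rejection message and the load command);
- an aggregate limit of 3*10^4 samples per second;
- no client retries.

The function name is hypothetical.

```python
from typing import Tuple


def expected_response_rates(offered_samples_s: float, limit_samples_s: float,
                            samples_per_request: int) -> Tuple[float, float]:
    """Return (accepted, rejected) requests per second in steady state,
    assuming fixed-size requests and no client retries."""
    offered_req_s = offered_samples_s / samples_per_request
    accepted_req_s = min(offered_samples_s, limit_samples_s) / samples_per_request
    return accepted_req_s, offered_req_s - accepted_req_s


ok, rejected = expected_response_rates(2e5, 3e4, 5000)
print(f"200s: {ok:.0f}/s, 503s: {rejected:.0f}/s")  # -> 200s: 6/s, 503s: 34/s
```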

The top-right panel shows almost the same sample receive rate as before. Why is that? Again, this is the sample receive rate before applying rate limits (before "rejection", as reported by the distributors).

Let's have a look at the actual sample ingest rate (after rejection), as reported by the ingesters:


In the small peak on the right-hand side of this graph, we now see a rather stable equilibrium where each ingester accepted approximately 3*10^4 samples per second. This directly corresponds to the new ingestion_rate limit of 10^4 samples per second for each of the three distributors.

Given the observations above, the actual sample ingest rate in this second load test was limited by the ingestion_rate throttling mechanism at all times. As desired, this throttling mechanism became the definite bottleneck after the config change.

Final remarks

In this post, I focussed on explaining and dynamically changing Cortex' ingestion_rate parameter. You may want to explore the other Cortex runtime configuration parameters that can be changed dynamically for each tenant individually. Among them is, for example, an upper limit for the number of active time series.

You will notice that there are still rough edges in what we have built here. Like a lot of what we do, this is very much to be considered a prototype, and the screenshots in this blog post will likely become out of date sooner rather than later.

However, I hope it has become apparent that changing a rate limit for an individual tenant in a running Opstrace system is just a button click away. From a system operations point of view, this is an incredibly powerful feature. We have work underway to help choose sensible rate limits. Still, being able to change and tweak them ad hoc, maybe even playfully, is always going to be invaluable. That is why we wanted to get this feature to our users as early as possible.

Besides the per-tenant configuration, there are also global settings that can be changed in Cortex on the fly, such as global ingester rate limits. The corresponding parameters are not yet exposed via our UI, but will be soon.


Appendix

The command used for generating load:

$ docker run -v $(pwd):/rundir --net=host \
opstrace/looker:17b792dd-ci looker \
https://cortex.default.${CLUSTERNAME} \
--metrics-mode \
--bearer-token-file tenant-api-token-default \
--n-concurrent-streams 5000 \
--n-entries-per-stream-fragment 5000 \
--stream-write-n-seconds 120 \
--metrics-time-increment-ms 1 \
--max-concurrent-writes 200 \
--max-concurrent-reads 200 \
--n-cycles 1