Jul 21 | 10 min read

What is Log Aggregation?

Sébastien Pahl

Co-founder and CEO

Mat Appelman

Co-founder

Chris Lambert

aggregating physical logs

Opstrace is an open source, minimal-index log aggregation distribution that you can search, graph, and alert on

Jump straight to "What Does Opstrace Provide?" below if you like; otherwise, continue reading for background on logging and log aggregation...

Logs are everywhere. Just about every application ever written emits logs in one form or another. So why is it still so hard and costly to manage them? To answer that question, and to explain how log aggregation addresses it, this post covers the following:

  • Why do we care about logs?
  • What is log aggregation?
  • How does Opstrace implement log aggregation?
  • How is Opstrace different from other vendors?

Background: Life with Logs

What do most programmers learn first? print "hello world!". From the very first line of code, log output is part of every software application. As the modern application landscape has expanded, managing logging output has become a challenge in its own right. Even though some tools have become more sophisticated (e.g., logging frameworks), and new ones are being developed (e.g., distributed tracing), logs remain an indispensable resource that various stakeholders inside a company rely on to do their jobs.

As a developer, you add logs to your code and read them back during development and in production. Reading production logs to find a problem that is happening right now is really difficult when you don't know where to turn; you know where to find the logs for the service running your code, but what about the upstream service that yours depends on?

As an IT professional, you are managing many different products from different vendors. You need access to the logs from these products, and may need to cross-reference their logs with those of your own in-house applications; for example, cross-referencing application logs with SSO logs.

As a security analyst, you need access to all logs so you can audit and monitor them for sensitive information that must be purged. And when you find such information, you need to be able to remove the problematic content.

Logs are useful because they offer substantial insight into the behavior of your applications. Typically, all manner of events are captured in logs: from successful actions, such as a completed login, to a background task periodically logging its status, to recoverable (and non-recoverable) errors that may occur.

Example of a good log message:

[timestamp] [error] 'foo is missing, check the bar service', code=404
 ^^^^^^^^^   ^^^^^   ^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^
 UTC         type    description     suggestion              specific detail

Example of a bad log message:

[timestamp (pdt)]   [info]   uh oh, caught an error, stack trace:
           ^^^^^     ^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       error prone misleading    puts all burden on reader
...

Of course, there are other methods of gathering insights about applications, from process debuggers to time series metrics to distributed tracing. These are all useful and should be considered part of any observability toolkit, but they will not soon eliminate the usefulness of logs. Logs remain a tried-and-true method for inspecting applications of all types, regardless of who wrote the software.

In a simple world where you might have only one monolithic application, it's easy to gain value from your logs: you tail them and grep through them to find what you need. Maybe you parse them with awk or jq, and maybe you even do this over a dozen or so instances of your application using ssh, kubectl logs, or something similar. But what if you have many applications, and many instances of those applications, all generating logs of their own? What if the volume of logs outstrips your local infrastructure's ability to store them? Then you need log aggregation.

Log Aggregation

When you have multiple copies of an application (perhaps hundreds or thousands of nearly identical pods), or when you need to cross-reference those distributed logs against those of other distributed applications to diagnose a system failure, you need to aggregate all of your logs into a single system that you can administer and query from a single known location.

"Log Aggregation" is the industry-accepted term that generally means...

Bringing all log messages together, from disparate applications, for easier analysis by an end user and easier management by an administrator.

There is no single definition; SumoLogic, for example, has a similar (longer) one. This is because, as with many business functions, log aggregation can mean different things to different types of people. From our perspective, some key use cases need to be satisfied by any log aggregation system; we will talk about how solutions differentiate themselves a little later on. Those use cases, briefly considered, are:

For Developers

  • Add logging statements to their code in a standard way.
  • Configure collection once (for continual ingestion), then query frequently.
  • Simple interface (web, CLI, or both) for querying, filtering, and tailing the logs (see the example query after this list).
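
For example, using the label-based query syntax we unpack in the Loki section below, a developer might filter a service's logs for errors with a one-liner like this (the label name and value are hypothetical placeholders; a real query would use your own label scheme):

{k8s_container_name="myapp"} |= "error" != "timeout"

The first expression selects log streams by label; the |= and != filters then keep only lines that contain "error" but not "timeout".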

For Administrators, Operators

  • Install in your own cloud account.
  • Upgrade regularly to stay on top of security patches and new features.
  • Manage and monitor configuration as conditions on the ground change.
    • Keep the system up and running; identify bad actors and rogue applications.
    • Plan for additional capacity.
    • Identify and remediate outages.

For Security Analysts

  • Identify, contain, remediate threats across the entire application stack.
  • Enforce compliance regulations, for example, access controls applied to certain logs.
  • Find and purge sensitive data that should not have been logged (see the example query after this list).
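
As a hedged sketch of that last item (the label name, value, and regex are illustrative assumptions, not a prescription), a security analyst could use a LogQL regex line filter to locate lines that look like US Social Security numbers:

{k8s_namespace="prod"} |~ "\\d{3}-\\d{2}-\\d{4}"

Locating the offending data is the first step; removing it is a separate administrative task against the backing store.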

By standardizing log collection and centralizing its storage, all of these user types can benefit from log aggregation. But how do you choose your specific approach to implementing log aggregation? If your company is like many others, you may have tried several approaches already and found them lacking for a variety of reasons ranging from cost to complexity.

How Does Opstrace Implement Log Aggregation?

What Does Opstrace Provide?

The Opstrace open source observability distribution supports both metrics and logs out of the box. It bundles together a variety of open source projects, making them simple to install in your cloud account, upgrade, monitor, and scale. It provides security by default (authenticated APIs and TLS certificates).

For log aggregation, under the hood, Opstrace uses Loki:

Unlike other logging systems, Loki is built around the idea of only indexing metadata about your logs: labels (just like Prometheus labels). Log data itself is then compressed and stored in chunks in object stores such as S3 or GCS, or even locally on the filesystem. A small index and highly compressed chunks simplifies the operation and significantly lowers the cost of Loki.

This minimal index paradigm relies on filtering to subsets of data using labels (which are indexed) before performing a distributed grep on the resulting dataset. Because the data is stored in S3 and because that grep can be distributed horizontally, Loki performs well in most situations without building a large index from all of your message contents. Let’s look at some examples.

The following query will filter for streams whose k8s_container_name label matches the myapp value, before grepping the resulting lines for a 404 response code:

{k8s_container_name="myapp"} |= "404"
 ^^^^^^^^^^^^^^^^^^  ^^^^^       ^^^
 label name          value       search text

Because logs are colocated in the same system as metrics, we can easily pivot between logs and metrics at query time. For example, this query will find the success rate metric for the same pods whose logs we were just querying:

api_success_rate{k8s_container_name="myapp"}

Loki's LogQL supports a variety of log query types. A log query consists of a log stream selector and, optionally, a log pipeline. You can think of these as a label filter followed by a grep on the results of that filter. That's what we did in the example above. We can chain things together to make more complicated, targeted queries, for example (from the above-linked docs):

{container="query-frontend",namespace="loki-dev"}
|= "metrics.go"
| logfmt
| duration > 10s and throughput_mb < 500

The query is composed of:

  • A log stream selector {container="query-frontend",namespace="loki-dev"} which targets the query-frontend container in the loki-dev namespace.
  • A log pipeline |= "metrics.go" | logfmt | duration > 10s and throughput_mb < 500 which keeps only logs that contain the word metrics.go, then parses each log line to extract more labels and filters on them.

Beyond log queries, LogQL supports metrics queries, which extend log queries to create values (numbers) based on the content returned by the log query. These queries are very powerful, enabling you to graph log-based content and subsequently alert on it. Like PromQL, LogQL supports a variety of aggregation operators, for example:

sum(rate({k8s_pod_name=~"compactor-\\w+"} |= "successfully completed blocks cleanup"[1h]))

Can yield a graph like:

graph produced from sum and rate functions
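
Metrics queries can also be combined with arithmetic operators, much as in PromQL. As another illustrative sketch (the myapp label value is again a hypothetical placeholder), the following ratio approximates an error rate that you could graph and alert on:

sum(rate({k8s_container_name="myapp"} |= "error"[5m]))
  /
sum(rate({k8s_container_name="myapp"}[5m]))

The numerator measures the per-second rate of lines containing "error", the denominator measures the rate of all lines from the same streams, and the quotient is the fraction of logged lines that are errors.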

How Does It Compare to Other Log Aggregation Tools?

Opstrace with Loki is reliable and highly scalable. Because it only indexes log labels and stores log data highly compressed in S3 (or GCS), the retention costs are extremely low. You pay your cloud vendor directly according to the pricing you have already negotiated—we will not and cannot bill you by volume of data.

Most other log aggregation vendors create a full-text index across all data in your log messages to make search queries fast, but this comes with a high storage cost (because indexes are necessarily large). An index is not the only way to achieve performant queries, as we have discussed above.

Other vendors support minimal index log aggregation, for example, Humio ("index-free") and ChaosSearch (Chaos Index®). However, even with reduced storage requirements, their pricing models rely on billing you for the volume of your data.

Future

Exemplars

Integrating exemplars into Opstrace will enable advanced cross-cutting queries, moving from metrics to traces to logs and back. Borrowing an example from Grafana: from a particular log line in Loki (on the left), it is easy to open the associated span (on the right) by clicking the blue "Tempo" button:

exemplars example in grafana

Additional Data Sinks

While LogQL is useful for most operational purposes, sometimes there is a need for more specific behavior. For example, you may want some subset of your logs to go to a full-text search engine for deeper ad hoc analysis, or you may want some of your data sent to your data warehouse so you can generate reports on some aspect of the system. Shunting subsets of your log data to additional data sinks is something we have planned. In the meantime, we'd love to hear from you if you find this interesting and how you would use it.

*   *   *

Log aggregation is a necessary tool for all the applications inside your company, and it can help developers, admins, and security analysts make their jobs easier and more efficient. Opstrace's minimal index solution is highly performant and extremely cost-effective. Join our community Slack and try out Opstrace today: go.opstrace.com/quickstart.