How to monitor and troubleshoot Fluentd with Prometheus

Fluentd is an open source data collector widely used for log aggregation in Kubernetes. Monitoring and troubleshooting Fluentd with Prometheus is essential for identifying potential issues that affect your logging and monitoring systems.

In this article, you’ll learn how to start monitoring Fluentd with Prometheus, following the monitoring recommendations from the Fluentd documentation. You’ll also discover the most common Fluentd issues and how to troubleshoot them.

How to install and configure Fluentd to expose Prometheus metrics

You can install Fluentd in different ways, as described in the Fluentd documentation. We recommend deploying Fluentd in Kubernetes by using the Helm chart:

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluentd fluent/fluentd

For Fluentd to expose metrics, you’ll need to enable a few Fluentd Prometheus plugins. Don’t worry, though: they are already activated in the Helm chart, so if you installed Fluentd that way, you’re good to go.
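If you manage the Fluentd configuration yourself, this is a minimal sketch of that setup, loosely following the Fluentd monitoring docs; the port and the metric definition shown are illustrative and may differ from the Helm chart defaults:

# Expose an HTTP /metrics endpoint for Prometheus to scrape
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

# Collect internal Fluentd metrics and output plugin metrics
<source>
  @type prometheus_monitor
</source>
<source>
  @type prometheus_output_monitor
</source>

# Count incoming records per tag, producing fluentd_input_status_num_records_total
<filter **>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>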

Monitoring Fluentd with Prometheus: Inputs

Incoming records

Fluentd can collect data from many sources. Each piece of data (e.g., an entry of a log) is a record, and Fluentd exposes the fluentd_input_status_num_records_total metric, which counts the total number of records collected from each source.

You can use the following PromQL query in your dashboard to display the rate of incoming records in all of your Fluentd instances:

sum(rate(fluentd_input_status_num_records_total[5m]))

Relabelling the ‘tag’ label for better filtering

The fluentd_input_status_num_records_total metric has a label called tag that contains the source of the record, usually the path to a log file inside the node. The problem is that in Kubernetes, these file names concatenate several strings, such as the pod name, the container name, and the namespace. To extract that information into separate labels, let’s see how to use Prometheus metric relabelling.

Here, you can find an example of Fluentd collecting logs from a CoreDNS pod in Kubernetes:

fluentd_input_status_num_records_total{tag="kubernetes.var.log.containers.coredns-56dd667f7c-p9vbx_kube-system_coredns-72d2ba9bae8f73e32b3da0441fbd7015638117e37278d076a6f99c31b289e404.log",hostname="fluentd-lks8v"} 4.0

In the Prometheus job, you can use metric relabelling to create new labels from other ones. In this case, you can use a regular expression to create new labels for the namespace, the pod name, and the container:

metric_relabel_configs:
# Extract the pod name from the 'tag' label into a new 'input_pod' label
- action: replace
  source_labels:
  - __name__
  - tag
  regex: fluentd_input_status_num_records_total;kubernetes\.var\.log\.containers\.([a-zA-Z0-9.-]+)_([a-zA-Z0-9.-]+)_([a-zA-Z0-9.-]+)-[a-zA-Z0-9]+\.log
  target_label: input_pod
  replacement: $1
# Extract the namespace from the 'tag' label into a new 'input_namespace' label
- action: replace
  source_labels:
  - __name__
  - tag
  regex: fluentd_input_status_num_records_total;kubernetes\.var\.log\.containers\.([a-zA-Z0-9.-]+)_([a-zA-Z0-9.-]+)_([a-zA-Z0-9.-]+)-[a-zA-Z0-9]+\.log
  target_label: input_namespace
  replacement: $2
# Extract the container name from the 'tag' label into a new 'input_container' label
- action: replace
  source_labels:
  - __name__
  - tag
  regex: fluentd_input_status_num_records_total;kubernetes\.var\.log\.containers\.([a-zA-Z0-9.-]+)_([a-zA-Z0-9.-]+)_([a-zA-Z0-9.-]+)-[a-zA-Z0-9]+\.log
  target_label: input_container
  replacement: $3

After applying these relabelling rules, the previous CoreDNS example metric would look like this:

fluentd_input_status_num_records_total{input_pod="coredns-56dd667f7c-p9vbx",input_container="coredns",input_namespace="kube-system",tag="kubernetes.var.log.containers.coredns-56dd667f7c-p9vbx_kube-system_coredns-72d2ba9bae8f73e32b3da0441fbd7015638117e37278d076a6f99c31b289e404.log",hostname="fluentd-lks8v"} 4.0

You can use these new labels to filter the rate of incoming records by namespace, pod, or container, which is quite handy when creating dashboards:

sum by (input_namespace, input_pod, input_container)(rate(fluentd_input_status_num_records_total[5m]))
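For context, here is a minimal sketch of where that metric_relabel_configs block lives inside a Prometheus scrape job. The job name, the service discovery role, and the pod label used to keep Fluentd targets are assumptions that depend on your setup:

scrape_configs:
- job_name: fluentd                  # the job name is an assumption
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only the Fluentd pods; adjust the label to whatever your deployment uses
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_label_app_kubernetes_io_name
    regex: fluentd
  metric_relabel_configs:
  # ... the input_pod / input_namespace / input_container rules shown above ...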

Monitoring Fluentd with Prometheus: Outputs

Outgoing records

Fluentd can emit the previously collected records to different destinations by using multiple output plugin types. This operation is also known as a flush. The emitted records are counted by the fluentd_output_status_emit_records metric.

You can use the following PromQL query in your dashboard to display the rate of emitted records in all of your Fluentd instances:

sum(rate(fluentd_output_status_emit_records[5m]))

All Fluentd metrics starting with fluentd_output_ include the following labels:

  • type, for the output plugin type.
  • plugin_id, the custom plugin id in your configuration, or a random id if not specified.
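For example, an emitted-records sample carrying those labels might look like this (the plugin_id, hostname, and value are illustrative assumptions, not taken from a real deployment):

fluentd_output_status_emit_records{type="forward",plugin_id="main_forward",hostname="fluentd-lks8v"} 1234.0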

You can also use these labels for better filtering in your dashboard:

sum by (type, plugin_id)(rate(fluentd_output_status_emit_records[5m]))

Top troubleshooting situations to monitor Fluentd

Troubleshooting errors, retries, and rollbacks in Fluentd

Fluentd can face issues when trying to flush records to its configured output destinations. These issues, whether permanent or temporary, can be caused by networking problems or even by the destination system being down.

Let’s create an alert to trigger if Fluentd has a flush error ratio higher than 5%:

100 * sum by (type, plugin_id)(rate(fluentd_output_status_num_errors[5m])) / sum by (type, plugin_id)(rate(fluentd_output_status_emit_count[5m])) > 5
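If you manage alerts as Prometheus rule files, here is a minimal sketch of how that expression could be wrapped into an alerting rule; the group name, alert name, duration, and severity label are all assumptions:

groups:
- name: fluentd-output
  rules:
  - alert: FluentdHighFlushErrorRatio
    expr: |
      100 * sum by (type, plugin_id)(rate(fluentd_output_status_num_errors[5m]))
          / sum by (type, plugin_id)(rate(fluentd_output_status_emit_count[5m])) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Fluentd output {{ $labels.plugin_id }} has a flush error ratio above 5%"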

You could create similar alerts for retries or rollbacks. Although the error alert should be enough to detect the issue, these additional ones help you identify the cause. They look like this:

100 * sum by (type, plugin_id)(rate(fluentd_output_status_retry_count[5m])) / sum by (type, plugin_id)(rate(fluentd_output_status_emit_count[5m])) > 5
100 * sum by (type, plugin_id)(rate(fluentd_output_status_rollback_count[5m])) / sum by (type, plugin_id)(rate(fluentd_output_status_emit_count[5m])) > 5

When Fluentd fails to flush, it waits a few seconds before retrying, and this wait time grows with each retry. Once the retry limit is reached, Fluentd discards the data that was waiting to be flushed, so it’s lost. Let’s create an alert that triggers if the retry wait time is over 60 seconds:

sum by (type, plugin_id)(max_over_time(fluentd_output_status_retry_wait[5m])) > 60
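The retry behavior itself is tuned in the output’s <buffer> section. Here is a minimal sketch with illustrative values; the output type and every value shown are assumptions, not recommendations from this article:

<match **>
  @type forward                       # example output plugin
  <buffer>
    retry_type exponential_backoff    # wait time grows on every failed flush
    retry_wait 1s                     # initial wait before the first retry
    retry_max_interval 30s            # cap for the growing wait time
    retry_timeout 1h                  # give up and discard the chunk after this long
    retry_forever false
  </buffer>
</match>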

Troubleshooting buffer queue not being flushed in Fluentd

Fluentd stores the collected records in a buffer and then flushes them to its configured output destinations. When Fluentd fails to flush them, the buffer size and queue length increase. Once the buffer becomes full, you will start losing data.
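Buffer capacity is configured in the same <buffer> section as the retry settings shown earlier. A minimal sketch, again with illustrative values and an example file path:

<match **>
  @type forward                       # example output plugin
  <buffer>
    @type file                        # persist chunks on disk instead of in memory
    path /var/log/fluentd-buffers/    # example path
    chunk_limit_size 8MB              # maximum size of a single chunk
    total_limit_size 512MB            # maximum size of the whole buffer
    flush_interval 5s                 # how often staged chunks are enqueued for flushing
    flush_thread_count 2              # parallel flush threads
    overflow_action block             # what to do when the buffer is full
  </buffer>
</match>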

These are the recommended alerts for the buffer capacity:

Low buffer available space

Fluentd buffer available space is lower than 10%:

fluentd_output_status_buffer_available_space_ratio < 10

Buffer queue length increasing

The queued records have increased in the last five minutes:

avg_over_time(fluentd_output_status_buffer_queue_length[5m]) - avg_over_time(fluentd_output_status_buffer_queue_length[5m] offset 5m) > 0

Buffer total bytes increasing

The buffer size has increased in the last five minutes:

avg_over_time(fluentd_output_status_buffer_total_bytes[5m]) - avg_over_time(fluentd_output_status_buffer_total_bytes[5m] offset 5m) > 0
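If you prefer to anticipate a full buffer instead of reacting to its growth, an alternative is to extrapolate the available space with predict_linear; the 30-minute window and one-hour horizon below are arbitrary choices, so tune them to your environment:

predict_linear(fluentd_output_status_buffer_available_space_ratio[30m], 3600) < 0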

Troubleshooting slow flushes in Fluentd

If Fluentd flushes take too long, it can also be an indicator of network issues or problems in the destination systems. Let’s create an alert to trigger if Fluentd has a slow flush ratio over 5%:

100 * sum by (type, plugin_id)(rate(fluentd_output_status_slow_flush_count[5m])) / sum by (type, plugin_id)(rate(fluentd_output_status_emit_count[5m])) > 5

In your dashboard, you can troubleshoot the flush time with the following PromQL:

sum by(kube_cluster_name, type, plugin_id)(rate(fluentd_output_status_flush_time_count[5m]))

You can also see if the number of flushes per second has increased or decreased by using the following PromQL:

sum by(kube_cluster_name, type, plugin_id)(rate(fluentd_output_status_emit_count[5m]))
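If your fluent-plugin-prometheus version also exposes fluentd_output_status_write_count (the number of write attempts) and fluentd_output_status_flush_time_count accumulates total flush time in milliseconds, a rough estimate of the average time per flush is the ratio of the two; both of those assumptions depend on your plugin version:

sum by(kube_cluster_name, type, plugin_id)(rate(fluentd_output_status_flush_time_count[5m]))
/
sum by(kube_cluster_name, type, plugin_id)(rate(fluentd_output_status_write_count[5m]))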

Troubleshooting when Fluentd stops flushing

This alert triggers if Fluentd didn’t flush any records in the last five minutes:

rate(fluentd_output_status_emit_records[5m]) == 0

Monitor Fluentd with Prometheus using these dashboards

Don’t miss these open source dashboards to monitor your Fluentd application. They are already set up, so you can use them right away.
