Circuit Breakers in Solr and Elasticsearch

What is a circuit breaker?

Search products such as Solr and Elasticsearch include circuit breakers to prevent operations that could cause an OutOfMemoryError or bring a node down. The premise is to ensure a higher quality of service by accepting only the request load that the current resource configuration can serve.

Circuit breakers in Elasticsearch

Circuit breakers have been part of Elasticsearch for a long time and have evolved in response to reported issues and suggested improvements. As of the 7.x versions, Elasticsearch has six circuit breakers:

  1. Parent circuit breaker: A parent circuit breaker exception is raised when the total memory in use across all the circuit breakers exceeds the limit. When use_real_memory is true, the parent breaker takes actual heap usage into account rather than only the memory reserved by the child breakers. Settings: indices.breaker.total.use_real_memory (default true) and indices.breaker.total.limit (default 95% of JVM heap when use_real_memory is true, otherwise 70%).
  2. Field data circuit breaker: A field data circuit breaker exception is raised when the total memory used by field data in your indices exceeds the threshold. Field data is disabled by default on text fields, but it is used if it is enabled in your mappings with "fielddata": true. Setting: indices.breaker.fielddata.limit (default 40% of JVM heap).
  3. Request circuit breaker: A request circuit breaker exception is raised when a single request to Elasticsearch tries to use more memory than the specified threshold. Setting: indices.breaker.request.limit (default 60% of JVM heap).
  4. In-flight requests circuit breaker: An in-flight requests circuit breaker exception is raised when the memory used by all currently active incoming requests at the transport or HTTP level exceeds a certain amount on a node. Setting: network.breaker.inflight_requests.limit (default 100% of JVM heap).
  5. Accounting requests circuit breaker: An accounting circuit breaker exception is raised when the memory used by things that are held in memory and not released when a request completes (for example, Lucene segment memory) exceeds the limit. Setting: indices.breaker.accounting.limit (default 100% of JVM heap).
  6. Script compilation circuit breaker: A script compilation circuit breaker exception is raised when the number of inline script compilations within a period of time exceeds the threshold. The default is 75/5m, meaning 75 compilations every 5 minutes.
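
To make the percentages in the list above concrete, here is a minimal Python sketch that turns them into approximate byte limits for a given heap size. The function name and the heap value are hypothetical, chosen only for illustration; the heap size was picked so that the parent limit lines up with the 18.8gb limit shown in the exception example later in this article. Elasticsearch's own rounding may differ slightly.

```python
# Default breaker limits quoted in the list above, as integer percentages of the JVM heap.
DEFAULT_LIMIT_PERCENT = {
    "parent": 95,               # indices.breaker.total.limit (use_real_memory=true)
    "fielddata": 40,            # indices.breaker.fielddata.limit
    "request": 60,              # indices.breaker.request.limit
    "in_flight_requests": 100,  # network.breaker.inflight_requests.limit
    "accounting": 100,          # indices.breaker.accounting.limit
}


def default_breaker_limits(heap_bytes: int) -> dict:
    """Return the approximate default limit in bytes for each memory breaker."""
    return {
        name: heap_bytes * pct // 100
        for name, pct in DEFAULT_LIMIT_PERCENT.items()
    }


if __name__ == "__main__":
    # Hypothetical heap size (roughly 19.9 GiB), for illustration only.
    heap = 21_361_459_200
    for name, limit in default_breaker_limits(heap).items():
        print(f"{name:<20} {limit:>14} bytes")
```

Note that the child limits add up to far more than the parent limit, which is why the parent breaker is usually the one that trips first on a busy node.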

Elasticsearch circuit breaker status

Command to check:

GET _nodes/stats/breaker
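
Once the JSON from this endpoint has been fetched and parsed, a small helper can rank breakers by how close they are to their limits. The sketch below is one possible way to do that in Python; it assumes the usual response shape (a "nodes" map whose entries carry a "breakers" map with limit_size_in_bytes, estimated_size_in_bytes and tripped). The function name is hypothetical and the sample data is made up, not output from a real cluster.

```python
def summarize_breakers(stats: dict) -> list:
    """Flatten a breaker stats response into
    (node_name, breaker_name, percent_used, tripped) rows, highest usage first."""
    rows = []
    for node in stats.get("nodes", {}).values():
        node_name = node.get("name", "unknown")
        for breaker_name, breaker in node.get("breakers", {}).items():
            limit = breaker.get("limit_size_in_bytes", 0)
            used = breaker.get("estimated_size_in_bytes", 0)
            percent = round(100.0 * used / limit, 1) if limit > 0 else 0.0
            rows.append((node_name, breaker_name, percent, breaker.get("tripped", 0)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up sample in the shape of the response; not real cluster output.
SAMPLE_STATS = {
    "nodes": {
        "node-id-1": {
            "name": "node-a",
            "breakers": {
                "fielddata": {
                    "limit_size_in_bytes": 1000,
                    "estimated_size_in_bytes": 400,
                    "tripped": 0,
                },
                "parent": {
                    "limit_size_in_bytes": 2000,
                    "estimated_size_in_bytes": 1900,
                    "tripped": 3,
                },
            },
        }
    }
}

if __name__ == "__main__":
    for node, breaker, percent, tripped in summarize_breakers(SAMPLE_STATS):
        print(f"{node} {breaker}: {percent}% of limit, tripped {tripped} times")
```

A non-zero tripped count, or a breaker that sits persistently near 100% of its limit, is usually the first thing to look for.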

Example circuit breaker exception in Elasticsearch

[2020-11-02T07:11:45,611][WARN ][o.e.a.b.TransportShardBulkAction] [es-data-1-123456] [[tracking-2020.11][8]] failed to perform indices:data/write/bulk[s] on replica [tracking-2020.11][8], node[c3aC_s52RMe0xv0uzp02nQabc], [R], s[STARTED], a[id=Tgwiiu_GREOAnbKl7BI6Mgcv]
org.elasticsearch.transport.RemoteTransportException: [es-data-3-123456][99.123.11.111:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [20293537836/18.8gb], which is larger than the limit of [20293386240/18.8gb], real usage: [20293530192/18.8gb], new bytes reserved: [7644/7.4kb], usages [request=49320/48.1kb, fielddata=2665302366/2.4gb, in_flight_requests=2491948/2.3mb, accounting=162036398/154.5mb]
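
The exception message packs several useful numbers into one line. As a rough sketch, the Python snippet below pulls them out with regular expressions so they can be logged or alerted on. The patterns are written against the message format shown above and may need adjusting for other versions; the function and field names are hypothetical.

```python
import re

# Message text copied from the "Caused by" line of the log above.
MESSAGE = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[20293537836/18.8gb], which is larger than the limit of "
    "[20293386240/18.8gb], real usage: [20293530192/18.8gb], "
    "new bytes reserved: [7644/7.4kb], usages [request=49320/48.1kb, "
    "fielddata=2665302366/2.4gb, in_flight_requests=2491948/2.3mb, "
    "accounting=162036398/154.5mb]"
)

HEADER = re.compile(
    r"\[(?P<breaker>\w+)\] Data too large, data for \[(?P<label>[^\]]+)\] "
    r"would be \[(?P<would_be>\d+)/[^\]]+\], "
    r"which is larger than the limit of \[(?P<limit>\d+)/[^\]]+\]"
)
USAGE = re.compile(r"(\w+)=(\d+)/")


def parse_breaker_message(message: str):
    """Extract the breaker name, byte counts and per-breaker usages.
    Returns None if the message does not match the expected format."""
    match = HEADER.search(message)
    if match is None:
        return None
    usages = {}
    start = message.find("usages [")
    if start != -1:
        usages = {name: int(size) for name, size in USAGE.findall(message[start:])}
    would_be = int(match.group("would_be"))
    limit = int(match.group("limit"))
    return {
        "breaker": match.group("breaker"),
        "label": match.group("label"),
        "would_be": would_be,
        "limit": limit,
        "over_by": would_be - limit,
        "usages": usages,
    }


if __name__ == "__main__":
    print(parse_breaker_message(MESSAGE))
```

In this example the "would be" figure is the real usage plus the new bytes reserved (20293530192 + 7644), and field data accounts for about 2.4gb of the tracked usage, which makes it the natural first suspect.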

For diagnosing common circuit breaker errors, refer to the Elasticsearch guide on fixing common cluster issues (reference 2 below). Additionally, the following can be looked at:

  1. Enable slow search and indexing logs.
  2. Change the Kibana default index pattern from * (all indices) to the most relevant index. This avoids running ad-hoc queries against every index.
  3. Set the memory for each data node appropriately, based on usage. It is advisable to assign 50–70% of total VM memory to Elasticsearch; note that Elastic's own guidance is to keep the JVM heap itself at no more than 50% of total memory.
  4. Check aggregation queries to make sure they do not aggregate over a large number of buckets.
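
For the first item, slow logs are switched on per index through index settings. The sketch below builds a settings body that could be sent with PUT /<index>/_settings. The setting names are the standard slow log thresholds; the 10s values and the function name are assumptions for illustration and should be tuned to your own latency expectations.

```python
import json


def slowlog_settings(search_warn: str = "10s", index_warn: str = "10s") -> dict:
    """Build an index settings body that enables WARN-level slow logs.
    The threshold values are illustrative, not recommendations."""
    return {
        "index.search.slowlog.threshold.query.warn": search_warn,
        "index.search.slowlog.threshold.fetch.warn": search_warn,
        "index.indexing.slowlog.threshold.index.warn": index_warn,
    }


if __name__ == "__main__":
    # Print the JSON body to send to the index settings endpoint.
    print(json.dumps(slowlog_settings(), indent=2))
```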

Circuit breakers in Solr

Circuit breakers were introduced in Solr 8.7, so compared to Elasticsearch they are fairly new. Use circuit breakers when the stability of the cluster is more important than request throughput. When circuit breakers are enabled, requests may be rejected with an appropriate HTTP error code (503) while the node is under high stress.

It is up to the client to handle this error and, potentially, to build retry logic.

Solr has two types of circuit breaker:
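
One common way to handle the 503 on the client side is to retry with exponential backoff and a little jitter, so that rejected clients do not all come back at the same moment. The following is a minimal sketch of that idea, not a Solr client API: send stands for any hypothetical zero-argument callable that performs the request and returns a (status_code, body) pair, and the attempt count and delays are arbitrary.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a status other than 503 or attempts run out.

    send: zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) seen.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ... plus up to 50% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return status, body


if __name__ == "__main__":
    # Simulated server that rejects twice before accepting.
    responses = iter([(503, "rejected"), (503, "rejected"), (200, "ok")])
    print(call_with_retry(lambda: next(responses), base_delay=0.01))
```

Keeping the number of attempts small matters here: the breaker trips because the node is already overloaded, and aggressive retries would add to that load.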

  1. JVM heap usage based circuit breaker: This circuit breaker tracks JVM heap memory usage and rejects incoming search requests with a 503 error code if the heap usage exceeds a configured percentage of maximum heap allocated to the JVM (-Xmx).
  2. CPU utilisation based circuit breaker: This circuit breaker tracks CPU utilisation and triggers if the average CPU utilisation over the last one minute exceeds a configurable threshold.
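
The two checks above can be modelled in a few lines. This is a simplified Python sketch of the decision logic as described in the list, not Solr's actual implementation; all function and parameter names are hypothetical, and the thresholds are passed in explicitly rather than assuming any default.

```python
def heap_breaker_tripped(heap_used_bytes, heap_max_bytes, threshold_percent):
    """True if heap usage exceeds the configured percentage of max heap (-Xmx)."""
    return heap_used_bytes * 100 > heap_max_bytes * threshold_percent


def cpu_breaker_tripped(cpu_average_1m, threshold):
    """True if the one-minute average CPU figure exceeds the threshold.
    Both values must be expressed in the same unit."""
    return cpu_average_1m > threshold


def admit_search_request(heap_used_bytes, heap_max_bytes, cpu_average_1m,
                         mem_threshold_percent, cpu_threshold):
    """Return 503 if either breaker is tripped, otherwise 200."""
    if heap_breaker_tripped(heap_used_bytes, heap_max_bytes, mem_threshold_percent):
        return 503
    if cpu_breaker_tripped(cpu_average_1m, cpu_threshold):
        return 503
    return 200


if __name__ == "__main__":
    gib = 1024 ** 3
    # 7 GiB used of an 8 GiB heap is 87.5%, above an illustrative 75% threshold.
    print(admit_search_request(7 * gib, 8 * gib, 40, 75, 75))
    # 4 GiB used of 8 GiB with moderate CPU passes both checks.
    print(admit_search_request(4 * gib, 8 * gib, 40, 75, 75))
```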

Example circuit breaker exception in Solr

Summary

Currently, Elasticsearch has more circuit breakers than Solr, along with more dynamic options for controlling them, which makes for a more stable and resilient search service.

References

  1. https://www.elastic.co/guide/en/elasticsearch/reference/7.x/circuit-breaker.html
  2. https://www.elastic.co/guide/en/elasticsearch/reference/7.x/fix-common-cluster-issues.html#diagnose-circuit-breaker-errors
  3. https://solr.apache.org/guide/8_7/circuit-breakers.html

Circuit Breakers in Solr and Elasticsearch was originally published in Walmart Global Tech Blog on Medium.

Article Link: Circuit Breakers in Solr and Elasticsearch | by george chakkalakkal | Walmart Global Tech Blog | Medium