In this edition of our exporter review series, we introduce the Elasticsearch exporter, one of the best-fit exporters used by NexClipper for metrics monitoring. Read on to find out the exporter's most important metrics, recommended alert rules, and the related Grafana dashboard and Helm chart.
Elasticsearch is a RESTful search engine, data store, and analytics solution. It is developed in Java and based on Apache Lucene. Elasticsearch is mainly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence use cases. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Elasticsearch is a NoSQL database, which means it stores data in an unstructured way. You can send data in the form of JSON documents using the API or ingestion tools like Logstash. Elasticsearch will store the data and add searchable references to it. You can then search and retrieve the document using the Elasticsearch API or a visualization tool like Kibana.
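For illustration, here is a minimal sketch of that flow, assuming a local cluster reachable on localhost:9200 and a hypothetical index named logs:

# Index a JSON document (the index name "logs" is just an example)
curl -X POST "http://localhost:9200/logs/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"message": "user login", "level": "info"}'

# Retrieve it back with a full-text match query
curl -X GET "http://localhost:9200/logs/_search" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"message": "login"}}}'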
Elasticsearch used to be open source under the Apache License until 2021, when Elastic NV announced that it would change its software licensing strategy and offer the software under the Elastic License.
Since Elasticsearch, like any other database, is a critical resource, downtime can cause significant financial and reputational losses, so monitoring is a must. The Elasticsearch exporter is required to monitor and expose Elasticsearch metrics. It queries Elasticsearch, scrapes the data, and exposes the metrics to a Kubernetes service endpoint that Prometheus can in turn scrape to ingest time series data. For monitoring Elasticsearch, an external Prometheus exporter is used, which is maintained by the Prometheus Community. Once deployed, the Elasticsearch exporter scrapes a sizable set of metrics from Elasticsearch and provides users with crucial, continuous information that is difficult and time-consuming to extract from Elasticsearch directly.
For this setup, we are using the elastic/elasticsearch Helm chart to start the Elasticsearch cluster.
With the latest version of Prometheus (2.33 as of February 2022), these are the ways to set up a Prometheus exporter:
Native way: supported by Prometheus since the beginning
To set up an exporter the native way, the Prometheus config needs to be updated to add the target.
A sample configuration:
# scrape_config job
scrape_configs:
  - job_name: elasticsearch
    scrape_interval: 45s
    scrape_timeout: 30s
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - <elasticsearch exporter endpoint>
Service discovery with annotations: this method is applicable to Kubernetes deployments only.
A default scrape config can be added to the prometheus.yaml file, and an annotation can be added to the exporter service. With this, Prometheus will automatically start scraping data from the services on the specified path.
prometheus.yaml:
- job_name: kubernetes-services
  scrape_interval: 15s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: service
  relabel_configs:
    # Example relabel to scrape only endpoints that have
    # the prometheus.io/scrape: "true" annotation.
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # prometheus.io/path: "/scrape/path" annotation.
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # prometheus.io/port: "80" annotation.
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: (.+)(?::\d+);(\d+)
      replacement: $1:$2
Exporter service annotations:
annotations:
  prometheus.io/path: /metrics
  prometheus.io/scrape: "true"
Setting up a service monitor
The Prometheus operator supports an automated way of scraping data from exporters by setting up a ServiceMonitor Kubernetes object. For reference, a sample service monitor for Redis can be found here.
These are the necessary steps:
Step 1
Add/update the Prometheus operator's selectors. By default, the Prometheus operator comes with empty selectors, which select every service monitor available in the cluster for scraping data.
To check your Prometheus configuration:
kubectl get prometheus -n <namespace> -o yaml
A sample output will look like this:
ruleNamespaceSelector: {}
ruleSelector:
  matchLabels:
    app: kube-prometheus-stack
    release: kps
scrapeInterval: 1m
scrapeTimeout: 10s
securityContext:
  fsGroup: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
serviceAccountName: kps-kube-prometheus-stack-prometheus
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector:
  matchLabels:
    release: kps
Here you can see that this Prometheus configuration selects all service monitors with the label release = kps.
So, if you are modifying the default Prometheus operator configuration for service monitor scraping, make sure you use the right labels in your service monitor as well.
Step 2
Add a service monitor and make sure it has a matching label and namespace for the Prometheus service monitor selectors (serviceMonitorNamespaceSelector & serviceMonitorSelector).
Sample configuration:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: elasticsearch-exporter
    meta.helm.sh/release-namespace: monitor
  labels:
    app: prometheus-elasticsearch-exporter
    app.kubernetes.io/managed-by: Helm
    chart: prometheus-elasticsearch-exporter-1.1.0
    heritage: Helm
    release: kps
  name: prometheus-elasticsearch-exporter
  namespace: monitor
spec:
  endpoints:
    - interval: 15s
      port: elasticsearch-exporter
  selector:
    matchLabels:
      app: prometheus-elasticsearch-exporter
      release: elasticsearch-exporter
As you can see, the service monitor carries the label release = kps, which matches the selector specified in the Prometheus operator's scraping configuration.
The following handpicked metrics for the Elasticsearch exporter will provide insights into Elasticsearch.
The Elasticsearch exporter is a Prometheus exporter for various metrics about Elasticsearch, written in Go.
For pre-built binaries, please take a look at the releases: https://github.com/prometheus-community/elasticsearch_exporter/releases
docker pull quay.io/prometheuscommunity/elasticsearch-exporter:latest
docker run --rm -p 9114:9114 quay.io/prometheuscommunity/elasticsearch-exporter:latest
Example docker-compose.yml:

elasticsearch_exporter:
  image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
  command:
    - '--es.uri=http://elasticsearch:9200'
  restart: always
  ports:
    - "127.0.0.1:9114:9114"
You can find a Helm chart in the prometheus-community charts repository at https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-elasticsearch-exporter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install [RELEASE_NAME] prometheus-community/prometheus-elasticsearch-exporter
NOTE: The exporter fetches information from an Elasticsearch cluster on every scrape; therefore, a too-short scrape interval can impose load on ES master nodes, particularly if you run with --es.all and --es.indices. We suggest you measure how long fetching /_nodes/stats and /_all/_stats takes for your ES cluster to determine whether your scraping interval is too short. As a last resort, you can scrape this exporter using a dedicated job with its own scraping interval.
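To follow that suggestion, you can time the two endpoints directly; a quick sketch, assuming the cluster is reachable on localhost:9200:

# Time the endpoints the exporter hits on every scrape
time curl -s "http://localhost:9200/_nodes/stats" > /dev/null
time curl -s "http://localhost:9200/_all/_stats" > /dev/null

If either call takes close to (or longer than) your scrape interval, increase the interval or use a dedicated scrape job.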
Below is a summary of the command line options:
elasticsearch_exporter --help
Argument | Introduced in Version | Description | Default |
---|---|---|---|
es.uri | 1.0.2 | Address (host and port) of the Elasticsearch node we should connect to. This could be a local node (localhost:9200, for instance), or the address of a remote Elasticsearch server. When basic auth is needed, specify as: <proto>://<user>:<password>@<host>:<port>, e.g., http://admin:pass@localhost:9200. Special characters in the user credentials need to be URL-encoded. | http://localhost:9200 |
es.all | 1.0.2 | If true, query stats for all nodes in the cluster, rather than just the node we connect to. | false |
es.cluster_settings | 1.1.0rc1 | If true, query stats for cluster settings. | false |
es.indices | 1.0.2 | If true, query stats for all indices in the cluster. | false |
es.indices_settings | 1.0.4rc1 | If true, query settings stats for all indices in the cluster. | false |
es.indices_mappings | 1.2.0 | If true, query stats for mappings of all indices of the cluster. | false |
es.aliases | 1.0.4rc1 | If true, include informational aliases metrics. | true |
es.shards | 1.0.3rc1 | If true, query stats for all indices in the cluster, including shard-level stats (implies es.indices=true ). | false |
es.snapshots | 1.0.4rc1 | If true, query stats for the cluster snapshots. | false |
es.slm | | If true, query stats for SLM. | false |
es.timeout | 1.0.2 | Timeout for trying to get stats from Elasticsearch. (ex: 20s) | 5s |
es.ca | 1.0.2 | Path to PEM file that contains trusted Certificate Authorities for the Elasticsearch connection. | |
es.client-private-key | 1.0.2 | Path to PEM file that contains the private key for client auth when connecting to Elasticsearch. | |
es.client-cert | 1.0.2 | Path to PEM file that contains the corresponding cert for the private key to connect to Elasticsearch. | |
es.clusterinfo.interval | 1.1.0rc1 | Cluster info update interval for the cluster label | 5m |
es.ssl-skip-verify | 1.0.4rc1 | Skip SSL verification when connecting to Elasticsearch. | false |
web.listen-address | 1.0.2 | Address to listen on for web interface and telemetry. | :9114 |
web.telemetry-path | 1.0.2 | Path under which to expose metrics. | /metrics |
version | 1.0.2 | Show version info on stdout and exit. | |
Command line parameters start with a single - for versions less than 1.1.0rc1. For versions greater than 1.1.0rc1, command line parameters are specified with --.
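For example, a typical invocation on a recent version might look as follows (the target URI is a placeholder):

# Versions >= 1.1.0rc1 use double-dash flags
elasticsearch_exporter --es.uri=http://localhost:9200 --es.all --es.indices

# Versions < 1.1.0rc1 used single-dash flags instead:
# elasticsearch_exporter -es.uri=http://localhost:9200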
The API key used to connect can be set with the ES_API_KEY environment variable. Username and password can be passed either directly in the URI or through the ES_USERNAME and ES_PASSWORD environment variables. Specifying these two environment variables will override authentication passed in the URI (if any).
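For instance, credentials can be supplied like this (the user name and password below are placeholders):

# Environment variables override any user:pass embedded in --es.uri
export ES_USERNAME=monitoring_user
export ES_PASSWORD='changeme'
elasticsearch_exporter --es.uri=http://localhost:9200

# Alternatively, authenticate with an API key:
# export ES_API_KEY=<api key>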
Elasticsearch 7.x supports RBAC. The following security privileges are required for the elasticsearch_exporter.
Setting | Privilege Required | Description |
---|---|---|
exporter defaults | cluster monitor | All cluster read-only operations, like cluster health and state, hot threads, node info, node and cluster stats, and pending cluster tasks. |
es.cluster_settings | cluster monitor | |
es.indices | indices monitor (per index or * ) | All actions that are required for monitoring (recovery, segments info, index stats and status) |
es.indices_settings | indices monitor (per index or * ) | |
es.shards | not sure if indices or cluster monitor or both | |
es.snapshots | cluster:admin/snapshot/status and cluster:admin/repository/get | ES Forum Post |
es.slm | read_slm | |
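As a sketch, a role covering the exporter defaults plus es.indices could be created through the Elasticsearch security API as follows (the role name exporter_role is a placeholder):

# Requires a user with permission to manage security
curl -X PUT "http://localhost:9200/_security/role/exporter_role" \
  -H 'Content-Type: application/json' \
  -d '{
        "cluster": ["monitor"],
        "indices": [{"names": ["*"], "privileges": ["monitor"]}]
      }'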
Metrics
Name | Type | Cardinality | Help |
---|---|---|---|
elasticsearch_breakers_estimated_size_bytes | gauge | 4 | Estimated size in bytes of breaker |
elasticsearch_breakers_limit_size_bytes | gauge | 4 | Limit size in bytes for breaker |
elasticsearch_breakers_tripped | counter | 4 | tripped for breaker |
elasticsearch_cluster_health_active_primary_shards | gauge | 1 | The number of primary shards in your cluster. This is an aggregate total across all indices. |
elasticsearch_cluster_health_active_shards | gauge | 1 | Aggregate total of all shards across all indices, which includes replica shards. |
elasticsearch_cluster_health_delayed_unassigned_shards | gauge | 1 | Shards delayed to reduce reallocation overhead |
elasticsearch_cluster_health_initializing_shards | gauge | 1 | Count of shards that are being freshly created. |
elasticsearch_cluster_health_number_of_data_nodes | gauge | 1 | Number of data nodes in the cluster. |
elasticsearch_cluster_health_number_of_in_flight_fetch | gauge | 1 | The number of ongoing shard info requests. |
elasticsearch_cluster_health_number_of_nodes | gauge | 1 | Number of nodes in the cluster. |
elasticsearch_cluster_health_number_of_pending_tasks | gauge | 1 | Cluster level changes which have not yet been executed |
elasticsearch_cluster_health_task_max_waiting_in_queue_millis | gauge | 1 | Max time in millis that a task is waiting in queue. |
elasticsearch_cluster_health_relocating_shards | gauge | 1 | The number of shards that are currently moving from one node to another node. |
elasticsearch_cluster_health_status | gauge | 3 | Whether all primary and replica shards are allocated. |
elasticsearch_cluster_health_timed_out | gauge | 1 | Number of cluster health checks timed out |
elasticsearch_cluster_health_unassigned_shards | gauge | 1 | The number of shards that exist in the cluster state, but cannot be found in the cluster itself. |
elasticsearch_clustersettings_stats_max_shards_per_node | gauge | 0 | Current maximum number of shards per node setting. |
elasticsearch_filesystem_data_available_bytes | gauge | 1 | Available space on block device in bytes |
elasticsearch_filesystem_data_free_bytes | gauge | 1 | Free space on block device in bytes |
elasticsearch_filesystem_data_size_bytes | gauge | 1 | Size of block device in bytes |
elasticsearch_filesystem_io_stats_device_operations_count | gauge | 1 | Count of disk operations |
elasticsearch_filesystem_io_stats_device_read_operations_count | gauge | 1 | Count of disk read operations |
elasticsearch_filesystem_io_stats_device_write_operations_count | gauge | 1 | Count of disk write operations |
elasticsearch_filesystem_io_stats_device_read_size_kilobytes_sum | gauge | 1 | Total kilobytes read from disk |
elasticsearch_filesystem_io_stats_device_write_size_kilobytes_sum | gauge | 1 | Total kilobytes written to disk |
elasticsearch_indices_active_queries | gauge | 1 | The number of currently active queries |
elasticsearch_indices_docs | gauge | 1 | Count of documents on this node |
elasticsearch_indices_docs_deleted | gauge | 1 | Count of deleted documents on this node |
elasticsearch_indices_docs_primary | gauge | | Count of documents with only primary shards on all nodes |
elasticsearch_indices_fielddata_evictions | counter | 1 | Evictions from field data |
elasticsearch_indices_fielddata_memory_size_bytes | gauge | 1 | Field data cache memory usage in bytes |
elasticsearch_indices_filter_cache_evictions | counter | 1 | Evictions from filter cache |
elasticsearch_indices_filter_cache_memory_size_bytes | gauge | 1 | Filter cache memory usage in bytes |
elasticsearch_indices_flush_time_seconds | counter | 1 | Cumulative flush time in seconds |
elasticsearch_indices_flush_total | counter | 1 | Total flushes |
elasticsearch_indices_get_exists_time_seconds | counter | 1 | Total time get exists in seconds |
elasticsearch_indices_get_exists_total | counter | 1 | Total get exists operations |
elasticsearch_indices_get_missing_time_seconds | counter | 1 | Total time of get missing in seconds |
elasticsearch_indices_get_missing_total | counter | 1 | Total get missing |
elasticsearch_indices_get_time_seconds | counter | 1 | Total get time in seconds |
elasticsearch_indices_get_total | counter | 1 | Total get |
elasticsearch_indices_indexing_delete_time_seconds_total | counter | 1 | Total time indexing delete in seconds |
elasticsearch_indices_indexing_delete_total | counter | 1 | Total indexing deletes |
elasticsearch_indices_index_current | gauge | 1 | The number of documents currently being indexed to an index |
elasticsearch_indices_indexing_index_time_seconds_total | counter | 1 | Cumulative index time in seconds |
elasticsearch_indices_indexing_index_total | counter | 1 | Total index calls |
elasticsearch_indices_mappings_stats_fields | gauge | 1 | Count of fields currently mapped by index |
elasticsearch_indices_mappings_stats_json_parse_failures_total | counter | 0 | Number of errors while parsing JSON |
elasticsearch_indices_mappings_stats_scrapes_total | counter | 0 | Current total Elasticsearch Indices Mappings scrapes |
elasticsearch_indices_mappings_stats_up | gauge | 0 | Was the last scrape of the Elasticsearch Indices Mappings endpoint successful |
elasticsearch_indices_merges_docs_total | counter | 1 | Cumulative docs merged |
elasticsearch_indices_merges_total | counter | 1 | Total merges |
elasticsearch_indices_merges_total_size_bytes_total | counter | 1 | Total merge size in bytes |
elasticsearch_indices_merges_total_time_seconds_total | counter | 1 | Total time spent merging in seconds |
elasticsearch_indices_query_cache_cache_total | counter | 1 | Count of query cache |
elasticsearch_indices_query_cache_cache_size | gauge | 1 | Size of query cache |
elasticsearch_indices_query_cache_count | counter | 2 | Count of query cache hit/miss |
elasticsearch_indices_query_cache_evictions | counter | 1 | Evictions from query cache |
elasticsearch_indices_query_cache_memory_size_bytes | gauge | 1 | Query cache memory usage in bytes |
elasticsearch_indices_query_cache_total | counter | 1 | Size of query cache total |
elasticsearch_indices_refresh_time_seconds_total | counter | 1 | Total time spent refreshing in seconds |
elasticsearch_indices_refresh_total | counter | 1 | Total refreshes |
elasticsearch_indices_request_cache_count | counter | 2 | Count of request cache hit/miss |
elasticsearch_indices_request_cache_evictions | counter | 1 | Evictions from request cache |
elasticsearch_indices_request_cache_memory_size_bytes | gauge | 1 | Request cache memory usage in bytes |
elasticsearch_indices_search_fetch_time_seconds | counter | 1 | Total search fetch time in seconds |
elasticsearch_indices_search_fetch_total | counter | 1 | Total number of fetches |
elasticsearch_indices_search_query_time_seconds | counter | 1 | Total search query time in seconds |
elasticsearch_indices_search_query_total | counter | 1 | Total number of queries |
elasticsearch_indices_segments_count | gauge | 1 | Count of index segments on this node |
elasticsearch_indices_segments_memory_bytes | gauge | 1 | Current memory size of segments in bytes |
elasticsearch_indices_settings_stats_read_only_indices | gauge | 1 | Count of indices that have read_only_allow_delete=true |
elasticsearch_indices_settings_total_fields | gauge | | Index setting value for index.mapping.total_fields.limit (total allowable mapped fields in an index) |
elasticsearch_indices_shards_docs | gauge | 3 | Count of documents on this shard |
elasticsearch_indices_shards_docs_deleted | gauge | 3 | Count of deleted documents on each shard |
elasticsearch_indices_store_size_bytes | gauge | 1 | Current size of stored index data in bytes |
elasticsearch_indices_store_size_bytes_primary | gauge | | Current size of stored index data in bytes with only primary shards on all nodes |
elasticsearch_indices_store_size_bytes_total | gauge | | Current size of stored index data in bytes with all shards on all nodes |
elasticsearch_indices_store_throttle_time_seconds_total | counter | 1 | Throttle time for index store in seconds |
elasticsearch_indices_translog_operations | counter | 1 | Total translog operations |
elasticsearch_indices_translog_size_in_bytes | counter | 1 | Total translog size in bytes |
elasticsearch_indices_warmer_time_seconds_total | counter | 1 | Total warmer time in seconds |
elasticsearch_indices_warmer_total | counter | 1 | Total warmer count |
elasticsearch_jvm_gc_collection_seconds_count | counter | 2 | Count of JVM GC runs |
elasticsearch_jvm_gc_collection_seconds_sum | counter | 2 | GC run time in seconds |
elasticsearch_jvm_memory_committed_bytes | gauge | 2 | JVM memory currently committed by area |
elasticsearch_jvm_memory_max_bytes | gauge | 1 | JVM memory max |
elasticsearch_jvm_memory_used_bytes | gauge | 2 | JVM memory currently used by area |
elasticsearch_jvm_memory_pool_used_bytes | gauge | 3 | JVM memory currently used by pool |
elasticsearch_jvm_memory_pool_max_bytes | counter | 3 | JVM memory max by pool |
elasticsearch_jvm_memory_pool_peak_used_bytes | counter | 3 | JVM memory peak used by pool |
elasticsearch_jvm_memory_pool_peak_max_bytes | counter | 3 | JVM memory peak max by pool |
elasticsearch_os_cpu_percent | gauge | 1 | Percent CPU used by the OS |
elasticsearch_os_load1 | gauge | 1 | Short-term load average |
elasticsearch_os_load5 | gauge | 1 | Mid-term load average |
elasticsearch_os_load15 | gauge | 1 | Long-term load average |
elasticsearch_process_cpu_percent | gauge | 1 | Percent CPU used by process |
elasticsearch_process_cpu_seconds_total | counter | 1 | Process CPU time in seconds |
elasticsearch_process_mem_resident_size_bytes | gauge | 1 | Resident memory in use by process in bytes |
elasticsearch_process_mem_share_size_bytes | gauge | 1 | Shared memory in use by process in bytes |
elasticsearch_process_mem_virtual_size_bytes | gauge | 1 | Total virtual memory used in bytes |
elasticsearch_process_open_files_count | gauge | 1 | Open file descriptors |
elasticsearch_snapshot_stats_number_of_snapshots | gauge | 1 | Total number of snapshots |
elasticsearch_snapshot_stats_oldest_snapshot_timestamp | gauge | 1 | Oldest snapshot timestamp |
elasticsearch_snapshot_stats_snapshot_start_time_timestamp | gauge | 1 | Last snapshot start timestamp |
elasticsearch_snapshot_stats_latest_snapshot_timestamp_seconds | gauge | 1 | Timestamp of the latest SUCCESS or PARTIAL snapshot |
elasticsearch_snapshot_stats_snapshot_end_time_timestamp | gauge | 1 | Last snapshot end timestamp |
elasticsearch_snapshot_stats_snapshot_number_of_failures | gauge | 1 | Last snapshot number of failures |
elasticsearch_snapshot_stats_snapshot_number_of_indices | gauge | 1 | Last snapshot number of indices |
elasticsearch_snapshot_stats_snapshot_failed_shards | gauge | 1 | Last snapshot failed shards |
elasticsearch_snapshot_stats_snapshot_successful_shards | gauge | 1 | Last snapshot successful shards |
elasticsearch_snapshot_stats_snapshot_total_shards | gauge | 1 | Last snapshot total shard |
elasticsearch_thread_pool_active_count | gauge | 14 | Thread Pool threads active |
elasticsearch_thread_pool_completed_count | counter | 14 | Thread Pool operations completed |
elasticsearch_thread_pool_largest_count | gauge | 14 | Thread Pool largest threads count |
elasticsearch_thread_pool_queue_count | gauge | 14 | Thread Pool operations queued |
elasticsearch_thread_pool_rejected_count | counter | 14 | Thread Pool operations rejected |
elasticsearch_thread_pool_threads_count | gauge | 14 | Thread Pool current threads count |
elasticsearch_transport_rx_packets_total | counter | 1 | Count of packets received |
elasticsearch_transport_rx_size_bytes_total | counter | 1 | Total number of bytes received |
elasticsearch_transport_tx_packets_total | counter | 1 | Count of packets sent |
elasticsearch_transport_tx_size_bytes_total | counter | 1 | Total number of bytes sent |
elasticsearch_clusterinfo_last_retrieval_success_ts | gauge | 1 | Timestamp of the last successful cluster info retrieval |
elasticsearch_clusterinfo_up | gauge | 1 | Up metric for the cluster info collector |
elasticsearch_clusterinfo_version_info | gauge | 6 | Constant metric with ES version information as labels |
elasticsearch_slm_stats_up | gauge | 0 | Up metric for SLM collector |
elasticsearch_slm_stats_total_scrapes | counter | 0 | Number of scrapes for SLM collector |
elasticsearch_slm_stats_json_parse_failures | counter | 0 | JSON parse failures for SLM collector |
elasticsearch_slm_stats_retention_runs_total | counter | 0 | Total retention runs |
elasticsearch_slm_stats_retention_failed_total | counter | 0 | Total failed retention runs |
elasticsearch_slm_stats_retention_timed_out_total | counter | 0 | Total retention run timeouts |
elasticsearch_slm_stats_retention_deletion_time_seconds | gauge | 0 | Retention run deletion time |
elasticsearch_slm_stats_total_snapshots_taken_total | counter | 0 | Total snapshots taken |
elasticsearch_slm_stats_total_snapshots_failed_total | counter | 0 | Total snapshots failed |
elasticsearch_slm_stats_total_snapshots_deleted_total | counter | 0 | Total snapshots deleted |
elasticsearch_slm_stats_snapshots_taken_total | counter | 1 | Snapshots taken by policy |
elasticsearch_slm_stats_snapshots_failed_total | counter | 1 | Snapshots failed by policy |
elasticsearch_slm_stats_snapshots_deleted_total | counter | 1 | Snapshots deleted by policy |
elasticsearch_slm_stats_snapshot_deletion_failures_total | counter | 1 | Snapshot deletion failures by policy |
elasticsearch_slm_stats_operation_mode | gauge | 1 | SLM operation mode (Running, stopping, stopped) |
We provide examples for Prometheus alerts and recording rules, as well as a Grafana dashboard and a Kubernetes deployment.
The example dashboard needs the node_exporter installed. In order to select the nodes that belong to the Elasticsearch cluster, we rely on a cluster label. Depending on your setup, it can be derived from the platform metadata:
For example, on GCE:
- source_labels: [__meta_gce_metadata_Cluster]
  separator: ;
  regex: (.*)
  target_label: cluster
  replacement: ${1}
  action: replace
Please refer to the Prometheus SD documentation to see which metadata labels can be used to create the cluster label.
elasticsearch_exporter is maintained by the Prometheus Community. It was previously maintained by the nice folks from JustWatch, who transferred the repository to the Prometheus Community in May 2021. The package was originally created and maintained by Eric Richardson, who transferred it to JustWatch in January 2017.
Maintainers of this repository:
Please refer to the Git commit log for a complete list of contributors.
We welcome any contributions. Please fork the project on GitHub and open Pull Requests for any proposed changes.
Please note that we will not merge any changes that encourage insecure behaviour. If in doubt please open an Issue first to discuss your proposal.
The Elasticsearch exporter, alert rules, and dashboard can be deployed in Kubernetes using the Helm chart. The Helm chart used for deployment is taken from the Prometheus community and can be found here.
If your Elasticsearch server is not up and ready yet, you can start it using Helm:
$ helm repo add elastic https://helm.elastic.co
$ helm install elasticsearch elastic/elasticsearch
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install my-release prometheus-community/prometheus-elasticsearch-exporter --set es.uri=http://<elasticsearch>:9200
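To verify the exporter is up, you can port-forward its service and fetch the metrics endpoint (the service name below is an assumption; check kubectl get svc for the actual name):

$ kubectl port-forward svc/my-release-prometheus-elasticsearch-exporter 9114:9114 &
$ curl -s http://localhost:9114/metrics | head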
Some of the common parameters that may need to be changed in the values file include:
es:
  ## Address (host and port) of the Elasticsearch node we should connect to.
  ## This could be a local node (localhost:9200, for instance), or the address
  ## of a remote Elasticsearch server. When basic auth is needed,
  ## specify as: <proto>://<user>:<password>@<host>:<port>. e.g., http://admin:pass@localhost:9200.
  ##
  uri: http://localhost:9200

  ## If true, query stats for all nodes in the cluster, rather than just the
  ## node we connect to.
  ##
  all: true

  ## If true, query stats for all indices in the cluster.
  ##
  indices: true

  ## If true, query settings stats for all indices in the cluster.
  ##
  indices_settings: true

  ## If true, query mapping stats for all indices in the cluster.
  ##
  indices_mappings: true

  ## If true, query stats for shards in the cluster.
  ##
  shards: true

  ## If true, query stats for snapshots in the cluster.
  ##
  snapshots: true

  ## If true, query stats for cluster settings.
  ##
  cluster_settings: false
All these parameters can be tuned via the values.yaml file here.
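One convenient way to apply such changes is to keep them in a custom values file and pass it at install or upgrade time; for example (my-values.yaml is a hypothetical file containing the overrides above):

$ helm upgrade --install my-release prometheus-community/prometheus-elasticsearch-exporter -f my-values.yaml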
There are multiple ways to scrape the metrics, as discussed above. In addition to the native way of setting up Prometheus monitoring, a service monitor can be deployed (if the Prometheus operator is being used) to scrape data from the Elasticsearch exporter. With this approach, multiple Elasticsearch servers can be scraped without altering the Prometheus configuration. Every Elasticsearch exporter comes with its own service monitor.
In the above-mentioned chart, a service monitor can be deployed by turning it on from the values.yaml file here.
serviceMonitor:
  ## If true, a ServiceMonitor CRD is created for a prometheus operator
  ## https://github.com/coreos/prometheus-operator
  ##
  enabled: false
  # namespace: monitoring
  labels: {}
  interval: 10s
  scrapeTimeout: 10s
  scheme: http
  relabelings: []
  targetLabels: []
  metricRelabelings: []
  sampleLimit: 0
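Assuming the kps Prometheus setup shown earlier, the service monitor could be enabled and labeled in one step, for example:

$ helm upgrade --install my-release prometheus-community/prometheus-elasticsearch-exporter \
    --set serviceMonitor.enabled=true \
    --set serviceMonitor.labels.release=kps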
Update the annotations section here in case you are not using the Prometheus operator:
service:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"
After digging into all the valuable metrics, this section explains in detail how we can get critical alerts with the Elasticsearch exporter.
PromQL is a query language for the Prometheus monitoring system. It is designed for building powerful yet simple queries for graphs, alerts, or derived time series (also known as recording rules). PromQL was designed from scratch and has little in common with other query languages used in time series databases, such as SQL in TimescaleDB, InfluxQL, or Flux. More details can be found here.
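For a taste of the language, here are two illustrative queries built from the exporter's metrics:

# Heap usage as a percentage, per node
(elasticsearch_jvm_memory_used_bytes{area="heap"}
  / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100

# Per-second indexing rate over the last five minutes
rate(elasticsearch_indices_indexing_index_total[5m])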
Prometheus integrates with the Alertmanager, which is responsible for sending alerts (via email, Slack, or any other supported channel) when any of the trigger conditions is met. Alerting rules allow users to define alerts based on Prometheus query expressions; they are defined using the metrics scraped by the exporter. Click here for a good source of community-defined alerts.
A general alert looks as follows:
- alert: <alert name>
  expr: <metric exposed by the exporter> <comparison operator (>, <, ==, <=, >=)> <value>
  for: <how long to wait between first encountering a new expression output vector element and counting the alert as firing for it>
  labels: <a set of additional labels to be attached to the alert>
  annotations: <a set of informational labels that can be used to store longer additional information>
Some of the recommended Elasticsearch exporter alerts are:
➡ Alert - Cluster down
- alert: ElasticsearchClusterDown
  expr: elasticsearch_cluster_health_up == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Elasticsearch is Down
    description: "Elasticsearch is down for 5 min\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Health status "yellow"
- alert: ElasticsearchClusterYellow
  expr: elasticsearch_cluster_health_status{color="yellow"} == 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }})
    description: "Elastic Cluster Yellow status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Health status "red"
- alert: ElasticsearchClusterRed
  expr: elasticsearch_cluster_health_status{color="red"} == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch Cluster Red (instance {{ $labels.instance }})
    description: "Elastic Cluster Red status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Elasticsearch heap usage too high
- alert: ElasticsearchHeapUsageTooHigh
  expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})
    description: "The heap usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Disk out of space
- alert: ElasticsearchDiskOutOfSpace
  expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
    description: "The disk usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Unassigned shards
- alert: ElasticsearchUnassignedShards
  expr: elasticsearch_cluster_health_unassigned_shards > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
    description: "Elasticsearch has unassigned shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Elasticsearch no new documents
- alert: ElasticsearchNoNewDocuments
  expr: increase(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Elasticsearch no new documents (instance {{ $labels.instance }})
    description: "No new documents for 10 min!\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
➡ Alert - Elasticsearch missing node
# modify the value with the number of nodes you have in the cluster
- alert: ElasticsearchHealthyNodes
  expr: elasticsearch_cluster_health_number_of_nodes < 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
    description: "Missing node in Elasticsearch cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Graphs are easier to understand and more user-friendly than a row of numbers. For this purpose, users can plot their time series data in visualized format using Grafana.
Grafana is an open-source dashboarding tool used for visualizing metrics with the help of customizable and illustrative charts and graphs. It connects very well with Prometheus and makes monitoring easy and informative. Dashboards in Grafana are made up of panels, with each panel running a PromQL query to fetch metrics from Prometheus.
Grafana supports community-driven dashboards for most of the widely used software, which can be imported directly from the Grafana community.
NexClipper uses the Elasticsearch exporter dashboard by dcwangmit01, which is widely accepted and has a lot of useful panels.
What is a Panel?
Panels are the most basic component of a dashboard and can display information in various ways, such as gauge, text, bar chart, graph, and so on. They provide information in a very interactive way. Users can view every panel separately and check the value of metrics within a specific time range.
The values on the panels are queried using PromQL, the Prometheus Query Language. PromQL is a simple query language used to query metrics within Prometheus. It enables users to query data, aggregate it and apply arithmetic functions to the metrics, and then visualize the results on panels.
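As an illustration, panels for an Elasticsearch dashboard might be backed by queries such as the following:

# Search queries per second
rate(elasticsearch_indices_search_query_total[5m])

# Average query latency in seconds (5-minute window)
rate(elasticsearch_indices_search_query_time_seconds[5m])
  / rate(elasticsearch_indices_search_query_total[5m])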
Here are some examples of panels for metrics from the Elasticsearch exporter: