zfs-localpv/docs/prometheus-monitoring.md
Pawan 896a6032ca chore(metrics): adding list of zfs metrics exposed by prometheus
Signed-off-by: Pawan <pawan@mayadata.io>
2020-01-17 14:40:42 +05:30

21 KiB

Prerequisite

We need to have k8s version 1.15+ to access the CSI volume metrics.

Setup helm

This step uses helm the kubernetes package manager. If you have not setup the helm then do the below configuration, otherwise move to the next step.

$ helm version
Client: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}

$ helm init
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation

$ kubectl create serviceaccount --namespace kube-system tiller
serviceaccount/tiller created

$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
clusterrolebinding.rbac.authorization.k8s.io/tiller-cluster-rule created

$ kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
deployment.extensions/tiller-deploy patched

Install Prometheus Operator

Once the helm is ready and related tiller pods is up and running , use the Prometheus chart from the helm repository

$ helm install stable/prometheus-operator --name prometheus-operator
NAME:   prometheus-operator
LAST DEPLOYED: Thu Jan  9 12:50:03 2020
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Alertmanager
NAME                              AGE
prometheus-operator-alertmanager  54s

==> v1/ClusterRole
NAME                                              AGE
prometheus-operator-alertmanager                  54s
prometheus-operator-grafana-clusterrole           54s
prometheus-operator-operator                      54s
prometheus-operator-operator-psp                  54s
prometheus-operator-prometheus                    54s
prometheus-operator-prometheus-psp                54s
psp-prometheus-operator-kube-state-metrics        54s
psp-prometheus-operator-prometheus-node-exporter  54s

==> v1/ClusterRoleBinding
NAME                                              AGE
prometheus-operator-alertmanager                  54s
prometheus-operator-grafana-clusterrolebinding    54s
prometheus-operator-operator                      54s
prometheus-operator-operator-psp                  54s
prometheus-operator-prometheus                    54s
prometheus-operator-prometheus-psp                54s
psp-prometheus-operator-kube-state-metrics        54s
psp-prometheus-operator-prometheus-node-exporter  54s

==> v1/ConfigMap
NAME                                                   AGE
prometheus-operator-apiserver                          54s
prometheus-operator-cluster-total                      54s
prometheus-operator-controller-manager                 54s
prometheus-operator-etcd                               54s
prometheus-operator-grafana                            54s
prometheus-operator-grafana-config-dashboards          54s
prometheus-operator-grafana-datasource                 54s
prometheus-operator-grafana-test                       54s
prometheus-operator-k8s-resources-cluster              54s
prometheus-operator-k8s-resources-namespace            54s
prometheus-operator-k8s-resources-node                 54s
prometheus-operator-k8s-resources-pod                  54s
prometheus-operator-k8s-resources-workload             54s
prometheus-operator-k8s-resources-workloads-namespace  54s
prometheus-operator-kubelet                            54s
prometheus-operator-namespace-by-pod                   54s
prometheus-operator-namespace-by-workload              54s
prometheus-operator-node-cluster-rsrc-use              54s
prometheus-operator-node-rsrc-use                      54s
prometheus-operator-nodes                              54s
prometheus-operator-persistentvolumesusage             54s
prometheus-operator-pod-total                          54s
prometheus-operator-pods                               54s
prometheus-operator-prometheus                         54s
prometheus-operator-proxy                              54s
prometheus-operator-scheduler                          54s
prometheus-operator-statefulset                        54s
prometheus-operator-workload-total                     54s

==> v1/DaemonSet
NAME                                          AGE
prometheus-operator-prometheus-node-exporter  54s

==> v1/Deployment
NAME                                    AGE
prometheus-operator-grafana             54s
prometheus-operator-kube-state-metrics  54s
prometheus-operator-operator            54s

==> v1/Pod(related)
NAME                                                     AGE
prometheus-operator-grafana-85bb5d49d-bffdg              54s
prometheus-operator-kube-state-metrics-5d46566c59-p8k6s  54s
prometheus-operator-operator-64844759f7-rpwws            54s
prometheus-operator-prometheus-node-exporter-p9rl8       54s

==> v1/Prometheus
NAME                            AGE
prometheus-operator-prometheus  54s

==> v1/PrometheusRule
NAME                                                      AGE
prometheus-operator-alertmanager.rules                    54s
prometheus-operator-etcd                                  54s
prometheus-operator-general.rules                         54s
prometheus-operator-k8s.rules                             54s
prometheus-operator-kube-apiserver-error                  54s
prometheus-operator-kube-apiserver.rules                  54s
prometheus-operator-kube-prometheus-node-recording.rules  54s
prometheus-operator-kube-scheduler.rules                  54s
prometheus-operator-kubernetes-absent                     54s
prometheus-operator-kubernetes-apps                       54s
prometheus-operator-kubernetes-resources                  54s
prometheus-operator-kubernetes-storage                    54s
prometheus-operator-kubernetes-system                     54s
prometheus-operator-kubernetes-system-apiserver           54s
prometheus-operator-kubernetes-system-controller-manager  54s
prometheus-operator-kubernetes-system-kubelet             54s
prometheus-operator-kubernetes-system-scheduler           54s
prometheus-operator-node-exporter                         54s
prometheus-operator-node-exporter.rules                   54s
prometheus-operator-node-network                          54s
prometheus-operator-node-time                             54s
prometheus-operator-node.rules                            54s
prometheus-operator-prometheus                            54s
prometheus-operator-prometheus-operator                   54s

==> v1/Role
NAME                              AGE
prometheus-operator-grafana-test  54s

==> v1/RoleBinding
NAME                              AGE
prometheus-operator-grafana-test  54s

==> v1/Secret
NAME                                           AGE
alertmanager-prometheus-operator-alertmanager  54s
prometheus-operator-grafana                    54s

==> v1/Service
NAME                                          AGE
prometheus-operator-alertmanager              54s
prometheus-operator-coredns                   54s
prometheus-operator-grafana                   54s
prometheus-operator-kube-controller-manager   54s
prometheus-operator-kube-etcd                 54s
prometheus-operator-kube-proxy                54s
prometheus-operator-kube-scheduler            54s
prometheus-operator-kube-state-metrics        54s
prometheus-operator-operator                  54s
prometheus-operator-prometheus                54s
prometheus-operator-prometheus-node-exporter  54s

==> v1/ServiceAccount
NAME                                          AGE
prometheus-operator-alertmanager              54s
prometheus-operator-grafana                   54s
prometheus-operator-grafana-test              54s
prometheus-operator-kube-state-metrics        54s
prometheus-operator-operator                  54s
prometheus-operator-prometheus                54s
prometheus-operator-prometheus-node-exporter  54s

==> v1/ServiceMonitor
NAME                                         AGE
prometheus-operator-alertmanager             53s
prometheus-operator-apiserver                53s
prometheus-operator-coredns                  53s
prometheus-operator-grafana                  53s
prometheus-operator-kube-controller-manager  53s
prometheus-operator-kube-etcd                53s
prometheus-operator-kube-proxy               53s
prometheus-operator-kube-scheduler           53s
prometheus-operator-kube-state-metrics       53s
prometheus-operator-kubelet                  53s
prometheus-operator-node-exporter            53s
prometheus-operator-operator                 53s
prometheus-operator-prometheus               53s

==> v1beta1/ClusterRole
NAME                                    AGE
prometheus-operator-kube-state-metrics  54s

==> v1beta1/ClusterRoleBinding
NAME                                    AGE
prometheus-operator-kube-state-metrics  54s

==> v1beta1/MutatingWebhookConfiguration
NAME                           AGE
prometheus-operator-admission  54s

==> v1beta1/PodSecurityPolicy
NAME                                          AGE
prometheus-operator-alertmanager              54s
prometheus-operator-grafana                   54s
prometheus-operator-grafana-test              54s
prometheus-operator-kube-state-metrics        54s
prometheus-operator-operator                  54s
prometheus-operator-prometheus                54s
prometheus-operator-prometheus-node-exporter  54s

==> v1beta1/Role
NAME                         AGE
prometheus-operator-grafana  54s

==> v1beta1/RoleBinding
NAME                         AGE
prometheus-operator-grafana  54s

==> v1beta1/ValidatingWebhookConfiguration
NAME                           AGE
prometheus-operator-admission  53s


NOTES:
The Prometheus Operator has been installed. Check its status by running:
  kubectl --namespace default get pods -l "release=prometheus-operator"

  Visit https://github.com/coreos/prometheus-operator for instructions on how
  to create & configure Alertmanager and Prometheus instances using the Operator.

Check all the required pods are up and running

$ kubectl get pods -l "release=prometheus-operator"
NAME                                                 READY   STATUS    RESTARTS   AGE
prometheus-operator-grafana-85bb5d49d-bffdg          2/2     Running   0          2m21s
prometheus-operator-operator-64844759f7-rpwws        2/2     Running   0          2m21s
prometheus-operator-prometheus-node-exporter-p9rl8   1/1     Running   0          2m21s

Setup alert rule

Please check the rules there in the system :-

$ kubectl get PrometheusRule
NAME                                                       AGE
prometheus-operator-alertmanager.rules                     4m21s
prometheus-operator-etcd                                   4m21s
prometheus-operator-general.rules                          4m21s
prometheus-operator-k8s.rules                              4m21s
prometheus-operator-kube-apiserver-error                   4m21s
prometheus-operator-kube-apiserver.rules                   4m21s
prometheus-operator-kube-prometheus-node-recording.rules   4m21s
prometheus-operator-kube-scheduler.rules                   4m21s
prometheus-operator-kubernetes-absent                      4m21s
prometheus-operator-kubernetes-apps                        4m21s
prometheus-operator-kubernetes-resources                   4m21s
prometheus-operator-kubernetes-storage                     4m21s
prometheus-operator-kubernetes-system                      4m21s
prometheus-operator-kubernetes-system-apiserver            4m21s
prometheus-operator-kubernetes-system-controller-manager   4m21s
prometheus-operator-kubernetes-system-kubelet              4m21s
prometheus-operator-kubernetes-system-scheduler            4m21s
prometheus-operator-node-exporter                          4m21s
prometheus-operator-node-exporter.rules                    4m21s
prometheus-operator-node-network                           4m21s
prometheus-operator-node-time                              4m21s
prometheus-operator-node.rules                             4m21s
prometheus-operator-prometheus                             4m21s
prometheus-operator-prometheus-operator                    4m21s

You can edit any of the default rules or setup the new rule to get the alerts. Here is the sample rule to start firing the alerts if available storage space is less than 10% :-

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus-operator
    chart: prometheus-operator-8.5.4
    heritage: Tiller
    release: prometheus-operator
  name: prometheus-operator-zfs-alertmanager.rules
  namespace: default
spec:
  groups:
  - name: zfsalertmanager.rules
    rules:
    - alert: ZFSVolumeUsageCritical
      annotations:
        message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
          }} in Namespace {{ $labels.namespace }} is only {{ printf "%0.2f" $value
          }}% free.
      expr: |
        100 * kubelet_volume_stats_available_bytes{job="kubelet"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet"}
          < 10
      for: 1m
      labels:
        severity: critical

Apply the above yaml so that Prometheus can fire the alert when available space is less than 10%

Check the Prometheus alert

To be able to view the Prometheus web UI, expose it through a Service. A simple way to do this is to use a Service of type NodePort

$ cat prometheus-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30090
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus-operator-prometheus

apply the above yaml

$ kubectl apply -f prometheus-service.yaml
service/prometheus-service created

Now you can access the alert manager UI via "nodes-external-ip:30090"

$ kubectl get nodes -owide
NAME                                         STATUS   ROLES    AGE    VERSION          INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
gke-zfspv-pawan-default-pool-3e407350-xvzp   Ready    <none>   103m   v1.15.4-gke.22   10.168.0.45   34.94.3.140   Ubuntu 18.04.3 LTS   5.0.0-1022-gke   docker://19.3.2

In this case we can access the alert manager via url http://34.94.3.140:30090/

Check the Alert Manager

To be able to view the Alert Manager web UI, expose it through a Service of type NodePort

$ cat alertmanager-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: alertmanager-service
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30093
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: prometheus-operator-alertmanager

apply the above yaml

$ kubectl apply -f alertmanager-service.yaml
service/alertmanager-service created

Now you can access the alert manager UI via "nodes-external-ip:30093"

$ kubectl get nodes -owide
NAME                                         STATUS   ROLES    AGE    VERSION          INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
gke-zfspv-pawan-default-pool-3e407350-xvzp   Ready    <none>   103m   v1.15.4-gke.22   10.168.0.45   34.94.3.140   Ubuntu 18.04.3 LTS   5.0.0-1022-gke   docker://19.3.2

In this case we can access the alert manager via url http://34.94.3.140:30093/

ZFS metrics exposed by Prometheus

We can create the rule for ZFS metrics also. Here is the list of ZFS metrics exposed by prometheus :-

node_zfs_abd_linear_cnt
node_zfs_abd_linear_data_size
node_zfs_abd_scatter_chunk_waste
node_zfs_abd_scatter_cnt
node_zfs_abd_scatter_data_size
node_zfs_abd_scatter_order_0
node_zfs_abd_scatter_order_1
node_zfs_abd_scatter_order_10
node_zfs_abd_scatter_order_2
node_zfs_abd_scatter_order_3
node_zfs_abd_scatter_order_4
node_zfs_abd_scatter_order_5
node_zfs_abd_scatter_order_6
node_zfs_abd_scatter_order_7
node_zfs_abd_scatter_order_8
node_zfs_abd_scatter_order_9
node_zfs_abd_scatter_page_alloc_retry
node_zfs_abd_scatter_page_multi_chunk
node_zfs_abd_scatter_page_multi_zone
node_zfs_abd_scatter_sg_table_retry
node_zfs_abd_struct_size
node_zfs_arc_access_skip
node_zfs_arc_anon_evictable_data
node_zfs_arc_anon_evictable_metadata
node_zfs_arc_anon_size
node_zfs_arc_arc_dnode_limit
node_zfs_arc_arc_loaned_bytes
node_zfs_arc_arc_meta_limit
node_zfs_arc_arc_meta_max
node_zfs_arc_arc_meta_min
node_zfs_arc_arc_meta_used
node_zfs_arc_arc_need_free
node_zfs_arc_arc_no_grow
node_zfs_arc_arc_prune
node_zfs_arc_arc_sys_free
node_zfs_arc_arc_tempreserve
node_zfs_arc_bonus_size
node_zfs_arc_c
node_zfs_arc_c_max
node_zfs_arc_c_min
node_zfs_arc_compressed_size
node_zfs_arc_data_size
node_zfs_arc_dbuf_size
node_zfs_arc_deleted
node_zfs_arc_demand_data_hits
node_zfs_arc_demand_data_misses
node_zfs_arc_demand_hit_predictive_prefetch
node_zfs_arc_demand_metadata_hits
node_zfs_arc_demand_metadata_misses
node_zfs_arc_dnode_size
node_zfs_arc_evict_l2_cached
node_zfs_arc_evict_l2_eligible
node_zfs_arc_evict_l2_ineligible
node_zfs_arc_evict_l2_skip
node_zfs_arc_evict_not_enough
node_zfs_arc_evict_skip
node_zfs_arc_hash_chain_max
node_zfs_arc_hash_chains
node_zfs_arc_hash_collisions
node_zfs_arc_hash_elements
node_zfs_arc_hash_elements_max
node_zfs_arc_hdr_size
node_zfs_arc_hits
node_zfs_arc_l2_abort_lowmem
node_zfs_arc_l2_asize
node_zfs_arc_l2_cksum_bad
node_zfs_arc_l2_evict_l1cached
node_zfs_arc_l2_evict_lock_retry
node_zfs_arc_l2_evict_reading
node_zfs_arc_l2_feeds
node_zfs_arc_l2_free_on_write
node_zfs_arc_l2_hdr_size
node_zfs_arc_l2_hits
node_zfs_arc_l2_io_error
node_zfs_arc_l2_misses
node_zfs_arc_l2_read_bytes
node_zfs_arc_l2_rw_clash
node_zfs_arc_l2_size
node_zfs_arc_l2_write_bytes
node_zfs_arc_l2_writes_done
node_zfs_arc_l2_writes_error
node_zfs_arc_l2_writes_lock_retry
node_zfs_arc_l2_writes_sent
node_zfs_arc_memory_all_bytes
node_zfs_arc_memory_direct_count
node_zfs_arc_memory_free_bytes
node_zfs_arc_memory_indirect_count
node_zfs_arc_memory_throttle_count
node_zfs_arc_metadata_size
node_zfs_arc_mfu_evictable_data
node_zfs_arc_mfu_evictable_metadata
node_zfs_arc_mfu_ghost_evictable_data
node_zfs_arc_mfu_ghost_evictable_metadata
node_zfs_arc_mfu_ghost_hits
node_zfs_arc_mfu_ghost_size
node_zfs_arc_mfu_hits
node_zfs_arc_mfu_size
node_zfs_arc_misses
node_zfs_arc_mru_evictable_data
node_zfs_arc_mru_evictable_metadata
node_zfs_arc_mru_ghost_evictable_data
node_zfs_arc_mru_ghost_evictable_metadata
node_zfs_arc_mru_ghost_hits
node_zfs_arc_mru_ghost_size
node_zfs_arc_mru_hits
node_zfs_arc_mru_size
node_zfs_arc_mutex_miss
node_zfs_arc_overhead_size
node_zfs_arc_p
node_zfs_arc_prefetch_data_hits
node_zfs_arc_prefetch_data_misses
node_zfs_arc_prefetch_metadata_hits
node_zfs_arc_prefetch_metadata_misses
node_zfs_arc_size
node_zfs_arc_sync_wait_for_async
node_zfs_arc_uncompressed_size
node_zfs_dmu_tx_dmu_tx_assigned
node_zfs_dmu_tx_dmu_tx_delay
node_zfs_dmu_tx_dmu_tx_dirty_delay
node_zfs_dmu_tx_dmu_tx_dirty_over_max
node_zfs_dmu_tx_dmu_tx_dirty_throttle
node_zfs_dmu_tx_dmu_tx_error
node_zfs_dmu_tx_dmu_tx_group
node_zfs_dmu_tx_dmu_tx_memory_reclaim
node_zfs_dmu_tx_dmu_tx_memory_reserve
node_zfs_dmu_tx_dmu_tx_quota
node_zfs_dmu_tx_dmu_tx_suspended
node_zfs_dnode_dnode_alloc_next_block
node_zfs_dnode_dnode_alloc_next_chunk
node_zfs_dnode_dnode_alloc_race
node_zfs_dnode_dnode_allocate
node_zfs_dnode_dnode_buf_evict
node_zfs_dnode_dnode_free_interior_lock_retry
node_zfs_dnode_dnode_hold_alloc_hits
node_zfs_dnode_dnode_hold_alloc_interior
node_zfs_dnode_dnode_hold_alloc_lock_misses
node_zfs_dnode_dnode_hold_alloc_lock_retry
node_zfs_dnode_dnode_hold_alloc_misses
node_zfs_dnode_dnode_hold_alloc_type_none
node_zfs_dnode_dnode_hold_dbuf_hold
node_zfs_dnode_dnode_hold_dbuf_read
node_zfs_dnode_dnode_hold_free_hits
node_zfs_dnode_dnode_hold_free_lock_misses
node_zfs_dnode_dnode_hold_free_lock_retry
node_zfs_dnode_dnode_hold_free_misses
node_zfs_dnode_dnode_hold_free_overflow
node_zfs_dnode_dnode_hold_free_refcount
node_zfs_dnode_dnode_hold_free_txg
node_zfs_dnode_dnode_move_active
node_zfs_dnode_dnode_move_handle
node_zfs_dnode_dnode_move_invalid
node_zfs_dnode_dnode_move_recheck1
node_zfs_dnode_dnode_move_recheck2
node_zfs_dnode_dnode_move_rwlock
node_zfs_dnode_dnode_move_special
node_zfs_dnode_dnode_reallocate
node_zfs_fm_erpt_dropped
node_zfs_fm_erpt_set_failed
node_zfs_fm_fmri_set_failed
node_zfs_fm_payload_set_failed
node_zfs_vdev_cache_delegations
node_zfs_vdev_cache_hits
node_zfs_vdev_cache_misses
node_zfs_xuio_onloan_read_buf
node_zfs_xuio_onloan_write_buf
node_zfs_xuio_read_buf_copied
node_zfs_xuio_read_buf_nocopy
node_zfs_xuio_write_buf_copied
node_zfs_xuio_write_buf_nocopy
node_zfs_zfetch_hits
node_zfs_zfetch_max_streams
node_zfs_zfetch_misses
node_zfs_zil_zil_commit_count
node_zfs_zil_zil_commit_writer_count
node_zfs_zil_zil_itx_copied_bytes
node_zfs_zil_zil_itx_copied_count
node_zfs_zil_zil_itx_count
node_zfs_zil_zil_itx_indirect_bytes
node_zfs_zil_zil_itx_indirect_count
node_zfs_zil_zil_itx_metaslab_normal_bytes
node_zfs_zil_zil_itx_metaslab_normal_count
node_zfs_zil_zil_itx_metaslab_slog_bytes
node_zfs_zil_zil_itx_metaslab_slog_count
node_zfs_zil_zil_itx_needcopy_bytes
node_zfs_zil_zil_itx_needcopy_count
node_zfs_zpool_nread
node_zfs_zpool_nwritten
node_zfs_zpool_rcnt
node_zfs_zpool_reads
node_zfs_zpool_rlentime
node_zfs_zpool_rtime
node_zfs_zpool_rupdate
node_zfs_zpool_wcnt
node_zfs_zpool_wlentime
node_zfs_zpool_writes
node_zfs_zpool_wtime
node_zfs_zpool_wupdate