Prometheus
https://prometheus.io/docs/prometheus/latest/querying/basics/
https://prometheus.io/docs/prometheus/latest/querying/api/
一、PromQL
1、选择器
- = 与字符串匹配
- != 与字符串不匹配
- =~ 与字符串正则匹配
- !~ 与字符串正则不匹配
http_requests_total{handler=~"/api/v1/.*"}
2、范围查询
ms, s, m, h, d, w, y
http_requests_total[5m]
3、时间位移
查询的时间范围分别前移 5分钟
node_filesystem_free_bytes{mountpoint="/data00"} offset 5m
4、操作符
https://prometheus.io/docs/prometheus/latest/querying/operators/
+, -, *, /, %, ^
==, !=, >, <, >=, <=
and, or, unless
on, ignoring
roup_left, roup_right
5、聚合操作
函数 | 说明 |
---|---|
sum | 求和 |
count | 计数 |
count_values | 对value计数 |
min | 最小值 |
max | 最大值 |
avg | 平均值 |
stddev | 标准差 |
stdvar | 标准方差 |
bottomk | 后n条时序 |
topk | 前n条时序 |
quantile | 分位数 |
@ | 更改查询中各个即时和范围向量的计算时间 |
atan2 | 弧度计算 |
without用于从计算结果中移除列举的标签,而保留其它标签。
by则正好相反,结果向量中只保留列出的标签,其余标签则移除。
sum(http_requests_total) without (instance)
等价于
sum(http_requests_total) by (code,handler,job,method)
# 获取HTTP请求数前5位的时序样本数据
topk(5, http_requests_total)
# quantile用于计算当前样本数据值的分布情况quantile(φ, express)其中0 ≤ φ ≤ 1。
例如,当φ为0.5时,即表示找到当前样本数据中的中位数:
quantile(0.5, http_requests_total)
count by (namespace) (kube_pod_container_resource_limits: {ressource="cpu"})
sum by (node) (node_memory_MemTotal_bytes — node_memory_MemAvailable_bytes)
sum(http_requests_total{method="GET"} @ 1609746000)
rate(http_requests_total[5m])[30m:1m]
sum(apisix_http_requests_total)[1h:]
6、常用方法
https://prometheus.io/docs/prometheus/latest/querying/functions/
1.increase
获取区间向量中第一个样本和最后一个样本,并返回其增长量。
获取节点存储5分钟内的变化量
increase(node_filesystem_free_bytes{mountpoint="/data"}[5m])
2.rate、irate
rate 计算区间向量在时间窗口内平均增长速率,会在单调性发生变化时自动中断。
irate 计算区间向量的增长率,但其反应出的是瞬时增长率。
rate 与 irate 函数仅适用于 Counter 类型的 Metrics。
获取 HTTP Request 请求5 分钟内的变化率。
rate(http_request_total{status="200", method="GET"}[5m])
irate(http_request_total{status="200", method="GET"}[5m])
3.delta、idelta
delta 计算一个区间向量的第一个元素和最后一个元素之间的差值。
idelta与delta函数类似,不同的是它计算最新的2个样本之间的差值。
获取最近两个小时 CPU 的温度差值。
delta(cpu_temp_celsius{host="zeus"}[2h])
4.histogram_quantile
计算分位数
计算延迟的 P99 值。
histogram_quantile(0.99 , rate(prometheus_tsdb_compaction_chunk_range_bucket[5m]))
5.absent
一般用于验证 样本是否存在,如果存在则返回空,如果不存在,则返回 1。
确认节点中 node exporter 是否存在。
absent(up{job="node-exporter", instance="127.0.0.1:9100"})
6.abs、ceil、floor
abs() 绝对值
ceil() 向上取整
floor() 向下取整
7.label_replace
动态标签替换
label_replace(node_boot_time_seconds{instance="10.13.1.10:9100"},"node","$1","instance","(.*):9100")
label_replace(up, "host", "$1", "instance", "(.*):.*")
label_replace(apisix_http_status,"path","$0","matched_uri",".*")
8.label_join
label_join(up{job="api-server",src1="a",src2="b",src3="c"}, "foo", ",", "src1", "src2", "src3")
9.time
当前时间戳
(time()-node_boot_time_seconds)/60/60/24
10.group_right/group_left
a * on (foo, bar) b
a * ignoring (baz) b
a * on (foo, bar) group_left(baz) b
kube_node_info * on (node) group_right() kube_node_status_condition{condition="Ready",status="true"} * on (node) group_right() label_replace(node_boot_time_seconds,"node","$1","instance","(.*):9100")
(sum(label_replace(label_replace(apisix_http_status,"host","$0","matched_host",".*"),"path","$0","matched_uri",".*")) by (code,path,host)) * on (host,path) group_left(service_name,ingress,namespace,service_port) (kube_ingress_path)
(sum(label_replace(label_replace(apisix_http_status,"host","$0","matched_host",".*"),"path","$1","matched_uri","(/[^/.]*).*")) by (code,path,host)) * on (host,path) group_left(service_name,ingress,namespace,service_port) (kube_ingress_path)
(sum(label_replace(apisix_http_status{code!="200"}, "host", "$0", "matched_host", ".*")) by (host) * on (host) group_left(ingress,namespace,path,service_name,service_port) kube_ingress_path{service_name="httpbin"}) -
(sum(label_replace(apisix_http_status{code!="200"} offset 15s, "host", "$0", "matched_host", ".*")) by (host) * on (host) group_left(ingress,namespace,path,service_name,service_port) kube_ingress_path{service_name="httpbin"})
(sum(label_replace(increase(apisix_http_status[30s]), "host", "$0", "matched_host", ".*")) by (host) * on (host) group_left(ingress,namespace,path,service_name,service_port) kube_ingress_path{service_name="httpbin"})
(sum(label_replace(apisix_http_status{code!="200"}, "host", "$0", "matched_host", ".*")) by (host) * on (host) group_left(service_name) kube_ingress_path{service_name="default-svc-infer01"} or sum(kube_ingress_path{service_name="default-svc-infer01"}*0) by (host,service_name)) - (sum(label_replace(apisix_http_status{code!="200"} offset 1d, "host", "$0", "matched_host", ".*")) by (host) * on (host) group_left(service_name) kube_ingress_path{service_name="default-svc-infer01"} or sum(kube_ingress_path{service_name="default-svc-infer01"}*0) by (host,service_name))
11.increase
增长量
increase(http_requests_total{job="apiserver"}[5m])
12.vector
apisix_http_status or vector(0)
7、常用指标
1.QPS,Queries Per Second 每秒查询率
sum(label_replace(rate(apisix_http_status[30s]), "host", "$0", "matched_host", ".*")) by (host) * on (host) group_left(ingress,namespace,path,service_name,service_port) kube_ingress_path{service_name="httpbin"}
2.RT,Response Time 响应时间
rate 计算区间向量在时间窗口内平均增长速率,会在单调性发生变化时自动中断。
过去5分钟内第90个百分位数的请求延迟
过去5分钟内90%请求的平均响应时间
histogram_quantile(0.90, sum(rate(apisix_http_latency_bucket{type="request"}[5m])) by (le))
{le="+Inf"}
是一个特殊的标签选择器,用于选择直方图中所有桶(bucket)
le less than or equal to, <=
+Inf 正无穷大的“溢出”桶
二、HTTP API
https://prometheus.io/docs/prometheus/latest/querying/api/
curl -G 'http://192.168.0.127:32070/api/v1/query_range' \
--data-urlencode 'query=sum(apisix_http_requests_total)' \
--data-urlencode 'start=2024-07-15T20:10:30.781Z' \
--data-urlencode 'end=2024-07-15T20:10:30.781Z' \
--data-urlencode 'step=15s'
三、自定义metric
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
package main
import (
"net/http"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// 定义计数器
var requestTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Number of get requests.",
},
[]string{"code"},
)
func main() {
// 注册指标
prometheus.MustRegister(requestTotal)
// 使用http.HandleFunc来处理metrics请求
http.Handle("/metrics", promhttp.Handler())
// 你的业务逻辑
// ...
// 启动HTTP服务器
http.ListenAndServe(":8080", nil)
}
vim prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'my-custom-metrics'
static_configs:
- targets: ['localhost:8080']
四、kube-state-metrics
https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/cluster/node-metrics.md
1、允许labels
vim kubeasz/roles/kube-prometheus-stack/files/kube-prometheus-stack/charts/kube-state-metrics/values.yaml
...
extraArgs: ["--metric-labels-allowlist=nodes=[*]", "--metric-annotations-allowlist=nodes=[*]"]
...
--metric-labels-allowlist=pods=[*]
--metric-labels-allowlist=nodes=[*],pods=[*],persistentvolumeclaims=[*],deployments=[*],statefulsets=[*],configmaps=[*],secrets=[*],services=[*],replicasets=[*]
--metric-labels-allowlist=*=[*]
--metric-annotations-allowlist=pods=[*]
--metric-labels-allowlist=pods=[*],nodes=[node,failure-domain.beta.kubernetes.io/zone,topology.kubernetes.io/zone]
2、查询指标
curl 'http://127.0.0.1:9090/metrics'
curl -G 'http://127.0.0.1:9090/api/v1/query' \
--data-urlencode 'query=count(kube_node_status_condition{condition="Ready", status="false"})/count(kubelet_node_name)*100'
curl -G 'http://127.0.0.1:30127/api/v1/query_range' \
--data-urlencode 'query=sum(apisix_http_requests_total)' \
--data-urlencode 'start=1721062917' \
--data-urlencode 'end=1721116919' \
--data-urlencode 'step=15s'
3、其他
--metric-denylist=kube_deployment_spec_.*