Prometheus Monitoring, Part 1 | ^画※哲^

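We all know Prometheus is an open-source monitoring system with an active community and a large user base.
Today I want to go over a few points I ran into while configuring Prometheus and give a brief introduction to its configuration. I downloaded a binary release from the official site, so I'll use the stock prometheus.yml that ships with it, shown below: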

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

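In this file:
scrape_interval: 15s is how often targets are scraped (the default is 1m)
evaluation_interval: 15s is how often Prometheus evaluates its alerting rules (the default is 1m)
scrape_timeout is how long a single scrape may take before it times out (not set here, so the global default of 10s applies)

Note that these three settings basically define how data is scraped and how rules are evaluated. The scrape cycle and the rule evaluation cycle are independent of each other: with scrape_interval: 1m and evaluation_interval: 30s, for example, data is scraped once a minute while rules are evaluated every 30 seconds.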
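
To make the relationship between the three settings concrete, here is a minimal sketch that sets them explicitly and overrides two of them for one job. The job name slow_exporter and its target address are made up for illustration; scrape_interval and scrape_timeout are standard per-job options. As far as I know, Prometheus rejects a config in which scrape_timeout is larger than scrape_interval.

```yaml
global:
  scrape_interval: 1m       # scrape every target once a minute
  evaluation_interval: 30s  # evaluate rules every 30 seconds
  scrape_timeout: 15s       # give up on a single scrape after 15 seconds

scrape_configs:
  - job_name: 'slow_exporter'        # hypothetical job name
    scrape_interval: 5m              # overrides the global value for this job only
    scrape_timeout: 1m               # must not exceed this job's scrape_interval
    static_configs:
      - targets: ['localhost:9100']  # hypothetical target
```

Jobs that do not set their own values simply inherit the ones under global.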

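alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093   (the address of Alertmanager)

Prometheus itself is mainly responsible for scraping data, evaluating rules and generating alerts. Once an alert fires, it is sent to Alertmanager, which routes it according to its own rules.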
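
For reference, a sketch of what this block might look like once an Alertmanager is actually in use. The address alertmanager:9093 is taken from the commented-out line in the stock file and is assumed to resolve in your environment.

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'   # assumed address; replace with your own
```

The routing itself (receivers, grouping and so on) is configured in Alertmanager's own config file, not in prometheus.yml.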

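rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

rule_files lists the files that contain the actual rule definitions. The paths can go several directory levels deep, and a pattern such as *.yml can be used to match all the alerting rule files.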
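
A minimal sketch of how a wildcard entry and a rule file could fit together. The directory rules/, the file name instance.yml and the alert itself are examples of mine, not part of the stock config; the two files are shown in one block, separated by ---.

```yaml
# prometheus.yml
rule_files:
  - "rules/*.yml"        # hypothetical directory; matches every .yml file in it
---
# rules/instance.yml (hypothetical file name)
groups:
  - name: instance
    rules:
      - alert: InstanceDown
        expr: up == 0    # the target could not be scraped
        for: 5m          # must stay true for 5 minutes before the alert fires
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```

The expr is checked once per evaluation_interval, which is where that global setting comes into play.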

scrape_configs:
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

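As I understand it, job_name mainly serves to group targets. static_configs means the targets are configured statically: after editing the config file, the change only takes effect once the service is restarted or a hot reload is triggered through the reload endpoint.
targets lists the ip:port addresses to pull monitoring data from. The default path is /metrics, so this target is scraped at http://localhost:9090/metrics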
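
Spelling out the defaults makes it easier to see which URL ends up being scraped. The sketch below is the same job with metrics_path and scheme written explicitly; both values are simply the defaults mentioned in the comments of the stock file.

```yaml
scrape_configs:
  - job_name: 'prometheus'
    scheme: http              # default
    metrics_path: /metrics    # default
    static_configs:
      - targets: ['localhost:9090']   # scraped at http://localhost:9090/metrics
```

To apply an edit without a restart, Prometheus can be sent SIGHUP, or the /-/reload endpoint can be called with an HTTP POST, provided the server was started with --web.enable-lifecycle.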

Note that labels can be attached to a static target group here. Every sample scraped from those targets then carries the labels.
For example:
static_configs:
- targets: ['localhost:9090']
  labels:
    endpoint: "test"
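
Since labels belong to a target group, a single job can hold several groups with different labels. A small sketch, in which the job name node, the IP addresses and the value "prod" are made up:

```yaml
scrape_configs:
  - job_name: 'node'                                   # hypothetical job name
    static_configs:
      - targets: ['10.0.0.1:9100', '10.0.0.2:9100']    # hypothetical hosts
        labels:
          endpoint: "test"
      - targets: ['10.0.0.3:9100']
        labels:
          endpoint: "prod"
```

A query can then select one group with a label matcher such as up{endpoint="test"}.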

An agent collects a lot of data. To save some storage, or simply to have less data to look through, we can filter it with metric_relabel_configs:

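- job_name: host
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (node_cpu.*|node_disk.*)
    action: keep

For example, this keeps only the CPU and disk data. The regular expression is matched against the metric name, and anything that does not match is discarded. The action can be keep, to retain the matching series, or drop, to discard them.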
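
The opposite direction, as a minimal sketch: keep everything except a family of metrics. The target address and the go_.* pattern (the Go runtime metrics that many exporters expose) are my own choices for illustration.

```yaml
scrape_configs:
  - job_name: host
    static_configs:
      - targets: ['localhost:9100']   # assumed exporter address
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'                # the regex is anchored, so it must match the whole name
        action: drop                  # discard matching series, keep the rest
```

metric_relabel_configs runs after the scrape and before the samples are stored, so dropped series never reach disk.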

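Sometimes we want Prometheus to watch a file for changes, so that new targets are picked up automatically as soon as the file is edited.
That can be done with the following configuration: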

- job_name: host
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (node_cpu.*|node_disk.*)
    action: keep
  file_sd_configs:
  - files: ['host.yml']

Note that this only picks up changes to the scrape targets. Changes to the rule files are not detected this way.
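
For completeness, a sketch of what host.yml could contain. The file holds a list of target groups in the same shape as static_configs; the addresses and label values below are hypothetical.

```yaml
# host.yml, the file referenced under file_sd_configs
- targets: ['10.0.0.1:9100', '10.0.0.2:9100']   # hypothetical hosts
  labels:
    endpoint: "test"
- targets: ['10.0.0.3:9100']
  labels:
    endpoint: "prod"
```

Adding or removing an entry in this file is enough; no restart or reload of Prometheus is needed. Besides watching the file, Prometheus also re-reads it periodically, controlled by the refresh_interval option of file_sd_configs (5m by default, as far as I know).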
