1 The Problem

After starting the pmm-server container with docker run, docker ps reports its status as unhealthy. A working container should report healthy, so something is clearly wrong.
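The health status can also be read directly instead of from the docker ps table. This is a minimal sketch, assuming the container is named pmm-server (the name used later in this post; a container ID works as well). The helper name container_health is hypothetical.

```shell
# Hypothetical helper: print the health status Docker records for a container
# (starting, healthy, or unhealthy). Only meaningful if the image defines a
# health check.
container_health() {
    docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Usage (the container name is an assumption; a container ID works too):
# container_health pmm-server
```

Running docker inspect --format '{{json .State.Health}}' against the same container also prints the output of the most recent health check probes, which can hint at why the status is unhealthy.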

2 Troubleshooting

1) Check the pmm-server container logs

[root@zabbix6 ~]# docker logs 7044dd8d6aca > docker.log

The abnormal entries are:

[root@zabbix6 ~]# vim docker.log
2024-01-28 17:04:26,213 INFO spawned: 'clickhouse' with pid 283
2024-01-28 17:04:26,383 INFO exited: clickhouse (exit status 232; not expected)
2024-01-28 17:04:32,683 INFO spawned: 'grafana' with pid 300
2024-01-28 17:04:32,684 INFO spawned: 'qan-api2' with pid 301
2024-01-28 17:04:32,690 INFO exited: grafana (exit status 2; not expected)
2024-01-28 17:04:32,691 INFO exited: qan-api2 (exit status 1; not expected)

According to these messages, the clickhouse, grafana, and qan-api2 processes were spawned and then exited unexpectedly.
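When the saved log is long, the unexpected exits can be pulled out mechanically. The sketch below assumes the log lines have the supervisord format shown above. The function name unexpected_exits is hypothetical.

```shell
# Hypothetical helper: list programs that supervisord reported as exiting
# unexpectedly, together with their exit status, from a saved log file.
unexpected_exits() {
    grep 'not expected' "$1" |
        sed -E 's/.*exited: ([^ ]+) \(exit status ([0-9]+);.*/\1 \2/'
}

# Usage:
# unexpected_exits docker.log
```

Each output line holds a program name followed by its exit status, which makes it easy to see which services to inspect first.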

2) Enter the pmm-server container and check its processes

[root@zabbix6 bin]# docker exec -it 7044dd8d6aca /bin/bash
[root@7044dd8d6aca opt] # supervisorctl
alertmanager                     RUNNING   pid 25, uptime 0:06:00
clickhouse                       FATAL     Exited too quickly (process log may have details)
dbaas-controller                 STOPPED   Not started
grafana                          FATAL     Exited too quickly (process log may have details)
nginx                            RUNNING   pid 22, uptime 0:06:00
pmm-agent                        RUNNING   pid 114, uptime 0:05:57
pmm-managed                      RUNNING   pid 37, uptime 0:06:00
pmm-update-perform               STOPPED   Not started
pmm-update-perform-init          FATAL     Exited too quickly (process log may have details)
postgresql                       RUNNING   pid 13, uptime 0:06:00
prometheus                       STOPPED   Not started
qan-api2                         BACKOFF   Exited too quickly (process log may have details)
victoriametrics                  RUNNING   pid 23, uptime 0:06:00
vmalert                          RUNNING   pid 24, uptime 0:06:00
vmproxy                          RUNNING   pid 32, uptime 0:06:00

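This confirms the inference above: clickhouse, grafana, and pmm-update-perform-init are in the FATAL state and qan-api2 is in BACKOFF, so several processes are not running.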
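To pick out only the failing programs from a listing like the one above, the status column can be filtered. This is a sketch that assumes the two-column layout shown (program name, then state). The function name failed_programs is hypothetical.

```shell
# Hypothetical helper: read a supervisorctl status listing on stdin and print
# the names of programs whose state is FATAL or BACKOFF.
failed_programs() {
    awk '$2 == "FATAL" || $2 == "BACKOFF" { print $1 }'
}

# Usage inside the container:
# supervisorctl status | failed_programs
```

Programs in the STOPPED state are deliberately ignored here, since "Not started" does not by itself indicate a crash.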

3) Check the logs of the individual processes

Start with the logs of the clickhouse, grafana, and qan-api2 processes.

clickhouse:

supervisor> tail clickhouse
. main @ 0x0000000007111f8f in /usr/bin/clickhouse
1.  ? @ 0x00007f5ed0cd1eb0 in ?
2.  ? @ 0x00007f5ed0cd1f60 in ?
3.  _start @ 0x000000000634716e in /usr/bin/clickhouse
 (version 23.8.2.7 (official build))
Processing configuration file '/etc/clickhouse-server/config.xml'.
Logging information to /srv/logs/clickhouse-server.log
Poco::Exception. Code: 1000, e.code() = 0, Exception: Could not determine local time zone: filesystem error: in canonical: Operation not permitted ["/usr/share/zoneinfo/"] [""], Stack trace (when copying this message, always include the lines below):

1. DateLUT::DateLUT() @ 0x000000000c5f13d8 in /usr/bin/clickhouse
2. OwnPatternFormatter::OwnPatternFormatter(bool) @ 0x000000000c8e224e in /usr/bin/clickhouse
3. Loggers::buildLoggers(Poco::Util::AbstractConfiguration&, Poco::Logger&, String const&) @ 0x000000000c8d846d in /usr/bin/clickhouse
4. BaseDaemon::initialize(Poco::Util::Application&) @ 0x000000000c8b6082 in /usr/bin/clickhouse
5. DB::Server::initialize(Poco::Util::Application&) @ 0x000000000c68bef8 in /usr/bin/clickhouse
6. Poco::Util::Application::run() @ 0x0000000015b1e6fa in /usr/bin/clickhouse
7. DB::Server::run() @ 0x000000000c68bcbe in /usr/bin/clickhouse
8. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000015b2d819 in /usr/bin/clickhouse
9. mainEntryClickHouseServer(int, char**) @ 0x000000000c688a8a in /usr/bin/clickhouse
10. main @ 0x0000000007111f8f in /usr/bin/clickhouse
11. ? @ 0x00007f85da433eb0 in ?
12. ? @ 0x00007f85da433f60 in ?
13. _start @ 0x000000000634716e in /usr/bin/clickhouse
 (version 23.8.2.7 (official build))

grafana:

supervisor> tail grafana
000
0x00007ffdba484180:  0x0000000000000001  0xc1999515713efe00
0x00007ffdba484190:  0x00007ffdba4843a0  0x00000000004324db <runtime.(*pageAlloc).allocRange+0x000000000000021b>
0x00007ffdba4841a0:  0x0000000005d202e8  0x0000000002030000 <github.com/grafana/grafana/pkg/services/libraryelements.(*LibraryElementService).getLibraryElementByUid+0x0000000000000240>
0x00007ffdba4841b0:  0x0000000000000004  0x0000000000000000
0x00007ffdba4841c0:  0x0000000000000002  0xc1999515713efe00
0x00007ffdba4841d0:  0x00007efd709d6740  0x0000000000000006
0x00007ffdba4841e0:  0x0000000000000001  0x00007ffdba484510
0x00007ffdba4841f0:  0x0000000005cf3920  0x00007efd70a2dd06
0x00007ffdba484200:  0x00007efd70bd4e90  0x00007efd70a017f3
0x00007ffdba484210:  0x0000000000000020  0x0000000000000000
0x00007ffdba484220:  0x000000000361013e  0x0000000000000006
0x00007ffdba484230:  0x0000000005eb988a  0x0000000000000000

goroutine 1 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc000072740 sp=0xc000072730 pc=0x4737c8
runtime.main()
        /usr/local/go/src/runtime/proc.go:169 +0x6d fp=0xc0000727e0 sp=0xc000072740 pc=0x441aed
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x4757a1

rax    0x0
rbx    0x7efd709d6740
rcx    0x7efd70a7a58c
rdx    0x6
rdi    0x187
rsi    0x187
rbp    0x187
rsp    0x7ffdba484140
r8     0x7ffdba484210
r9     0x7efd70b8a4e0
r10    0x8
r11    0x246
r12    0x6
r13    0x7ffdba484510
r14    0x5cf3920
r15    0x6
rip    0x7efd70a7a58c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

qan-api2:

supervisor> tail qan-api2
: connect: connection refused
stdlog: qan-api2 v2.41.0.
time="2024-01-28T17:52:00.514+00:00" level=info msg="Log level: info."
time="2024-01-28T17:52:00.514+00:00" level=info msg="DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2" component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v2.41.0.
time="2024-01-28T17:52:25.252+00:00" level=info msg="Log level: info."
time="2024-01-28T17:52:25.252+00:00" level=info msg="DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2" component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v2.41.0.
time="2024-01-28T17:52:50.486+00:00" level=info msg="Log level: info."
time="2024-01-28T17:52:50.486+00:00" level=info msg="DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2" component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v2.41.0.
time="2024-01-28T17:53:17.248+00:00" level=info msg="Log level: info."
time="2024-01-28T17:53:17.248+00:00" level=info msg="DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2" component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v2.41.0.
time="2024-01-28T17:53:45.089+00:00" level=info msg="Log level: info."
time="2024-01-28T17:53:45.089+00:00" level=info msg="DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2" component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused

The key information in the logs of these three processes is:

  • Insufficient permissions: filesystem error: in canonical: Operation not permitted ["/usr/share/zoneinfo/"]

  • Connection failure: stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
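The qan-api2 DSN in its log points at clickhouse://127.0.0.1:9000, so the refused connection is most likely a consequence of clickhouse not running rather than a separate fault. A quick way to check whether anything is listening on that port is sketched below. It assumes bash, which the container provides, because it relies on bash's /dev/tcp pseudo-device. The function name port_open is hypothetical.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp pseudo-device, so it must run under bash.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Usage inside the container:
# port_open 127.0.0.1 9000 && echo "port 9000 is open" || echo "nothing listening on 9000"
```

If the port stays closed while clickhouse is in the FATAL state, fixing clickhouse first is the sensible order.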

3 Solution

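Start with the first problem, the permission error. A Google search turned up a GitHub issue describing a similar problem: https://github.com/ClickHouse/ClickHouse/issues/48296

Based on the solution given there, the likely cause is a version-related restriction: the docker run command used for the original installation did not include the --privileged flag.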

1) Remove the container and recreate it with docker run, this time adding the --privileged flag:

[root@zabbix6 _data]# docker stop 7044dd8d6aca
[root@zabbix6 _data]# docker rm 7044dd8d6aca
[root@zabbix6 _data]# docker run --privileged --detach --restart always \
--publish 443:443 \
--volumes-from pmm-data \
--name pmm-server \
percona/pmm-server:2

[root@zabbix6 ~]# docker ps
CONTAINER ID   IMAGE                  COMMAND                CREATED          STATUS                    PORTS                          NAMES
c99a87c5718b   percona/pmm-server:2   "/opt/entrypoint.sh"   18 minutes ago   Up 18 minutes (healthy)   80/tcp, 0.0.0.0:443->443/tcp   pmm-server

STATUS now shows healthy, so the problem is solved.
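A freshly created container does not report healthy right away, so it can help to poll until it does. This is a minimal sketch, assuming the container name pmm-server from the command above. The function name wait_healthy and its default attempt count and delay are hypothetical choices, not values from PMM.

```shell
# Hypothetical helper: poll a container's health status until it is healthy
# or the attempts run out. Returns 0 once healthy, 1 otherwise.
wait_healthy() {
    name=$1
    attempts=${2:-30}   # number of polls (assumed default)
    delay=${3:-5}       # seconds between polls (assumed default)
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
        [ "$status" = "healthy" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage:
# wait_healthy pmm-server && echo "pmm-server is healthy"
```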
