文章目录

文档用途

本文主要介绍检查数据库单机, 流复制, repmgr集群服务是否运行及检查运行状态的方法。

注意事项:

该文档涉及到recovery.conf或done文件只是针对安全版V4.3.X或企业版V5版本,之后的安全版V4.5.X或企业版V6版本不再有recovery.conf或done文件,该文件内容将写入postgresql.auto,conf,流复制备库标识问文件为standby.signal。

详细信息

一、单机版数据库的运行状态查看

1、通过数据库自带的操作系统命令 pg_ctl 查看数据库运行状态

# 数据库正常运行显示如下:

[highgo@localhost ~]$ pg_ctl status 

pg_ctl: server is running (PID: 7482)

/opt/HighGoDB-5.6.4/bin/postgres "-D" "/opt/HighGoDB-5.6.4/data"



# 数据库关闭状态显示如下:

[highgo@localhost ~]$ pg_ctl status 

pg_ctl: no server running

2、通过数据库自带的操作系统命令 pg_controldata 查看数据库运行状态

# 数据库正常运行显示如下:

[highgo@localhost ~]$ pg_controldata 

...

Database cluster state:               in production

..



# 数据库关闭状态显示如下:

[highgo@localhost ~]$ pg_controldata 

...

Database cluster state:               shut down

...

3、通过后台进程看数据库运行状态

# 正常运行的数据库会存在如下进程

[highgo@localhost ~]$ ps -ef | grep -v grep | grep postgres 

highgo     8002      1  0 10:41 pts/0    00:00:00 /opt/HighGoDB-5.6.4/bin/postgres

highgo     8007   8002  0 10:41 ?        00:00:00 postgres: logger process   

highgo     8009   8002  0 10:41 ?        00:00:00 postgres: checkpointer process  

highgo     8010   8002  0 10:41 ?        00:00:00 postgres: writer process   

highgo     8011   8002  0 10:41 ?        00:00:00 postgres: wal writer process   

highgo     8012   8002  0 10:41 ?        00:00:00 postgres: autovacuum launcher process  

highgo     8013   8002  0 10:41 ?        00:00:00 postgres: archiver process   

highgo     8014   8002  0 10:41 ?        00:00:00 postgres: stats collector process  

highgo     8015   8002  0 10:41 ?        00:00:00 postgres: bgworker: pg_wait_sampling collector  

highgo     8017   8002  0 10:41 ?        00:00:00 postgres: bgworker: performance diagnosis collector  

highgo     8018   8002  0 10:41 ?        00:00:00 postgres: bgworker: hg_job_scheduler  

highgo     8019   8002  0 10:41 ?        00:00:00 postgres: bgworker: logical replication launcher 

ps:(1)显示内容第一列代表数据库在哪个用户底下运行,上方显示数据库运行在highgo用户底下。

(2)显示内容第一行代表数据库运行的主进程。

4、通过文件系统查看数据库是否处于运行状态

# 数据库处于运行状态时会存在下列文件:
[highgo@localhost data]$ ls -lrth | grep postmaster.pid

-rw-------  1 highgo highgo   82 Jul 23 10:41 postmaster.pid

# 数据库处于关闭状态时:

[highgo@localhost data]$ ls -lrth | grep postmaster

ps: (1)运行上方命令需要进到数据库data目录,命令:cd $PGDATA

(2)但是如果数据库由于异常宕机,可能会存在残留进程,所以不能仅凭这一条来判定数据库处于正常运行状态

(3)如果查看数据库处于运行状态,但是连接不到数据库,查看进程存在很少,有可能是数据库异常宕机,此时需要删除掉数据目录中的 postmaster.pid 文件并重启数据库

二、主备流复制数据库运行状态的查询

1、通过数据库自带的操作系统命令 pg_ctl 查看数据库运行状态

# 数据库正常运行显示如下:

[highgo@localhost ~]$ pg_ctl status 

pg_ctl: server is running (PID: 7465)

/opt/HighGoDB-5.6.4/bin/postgres "-D" "/opt/HighGoDB-5.6.4/data"



# 数据库关闭状态显示如下:

[highgo@localhost ~]$ pg_ctl status 

pg_ctl: no server running

2、通过数据库自带的操作系统命令 pg_controldata 查看数据库运行状态

# 流复制主库正常运行时:

[highgo@node1 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6837288046332662480

Database cluster state:               in production

...

# 流复制主库处于关闭状态时:

[highgo@node1 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6837288046332662480

Database cluster state:               shut down

...



# 流复制备库正常运行时:

[highgo@node2 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6837288046332662480

Database cluster state:               in archive recovery

...



# 流复制备库处于关闭状态时:

[highgo@node2 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6837288046332662480

Database cluster state:               shut down in recovery

...

3、通过后台进程看数据库运行状态

# 流复制主库正常运行时:

[highgo@node1 ~]$ ps -ef | grep postgres 

highgo     8038      1  0 11:07 pts/0    00:00:00 /opt/HighGoDB-5.6.4/bin/postgres

highgo     8043   8038  0 11:07 ?        00:00:00 postgres: logger process   

highgo     8045   8038  0 11:07 ?        00:00:00 postgres: checkpointer process  

highgo     8046   8038  0 11:07 ?        00:00:00 postgres: writer process   

highgo     8047   8038  0 11:07 ?        00:00:00 postgres: wal writer process   

highgo     8048   8038  0 11:07 ?        00:00:00 postgres: autovacuum launcher process  

highgo     8049   8038  0 11:07 ?        00:00:00 postgres: archiver process   

highgo     8050   8038  0 11:07 ?        00:00:00 postgres: stats collector process  

highgo     8051   8038  0 11:07 ?        00:00:00 postgres: bgworker: pg_wait_sampling collector  

highgo     8053   8038  0 11:07 ?        00:00:00 postgres: bgworker: performance diagnosis collector  

highgo     8054   8038  0 11:07 ?        00:00:00 postgres: bgworker: logical replication launcher  

highgo     8055   8038  0 11:07 ?        00:00:00 postgres: wal sender process hgreplica x.x.179.161(35068) streaming 0/4A000098



# 流复制备库正常运行时:

[highgo@node2 ~]$ ps -ef | grep postgres 

highgo     8004      1  1 11:07 pts/0    00:00:00 /opt/HighGoDB-5.6.4/bin/postgres

highgo     8009   8004  0 11:07 ?        00:00:00 postgres: logger process   

highgo     8010   8004  0 11:07 ?        00:00:00 postgres: startup process   recovering 00000001000000000000004A

highgo     8011   8004  0 11:07 ?        00:00:00 postgres: checkpointer process  

highgo     8012   8004  0 11:07 ?        00:00:00 postgres: writer process   

highgo     8013   8004  0 11:07 ?        00:00:00 postgres: stats collector process  

highgo     8014   8004  0 11:07 ?        00:00:00 postgres: bgworker: pg_wait_sampling collector  

highgo     8015   8004  1 11:07 ?        00:00:00 postgres: wal receiver process   streaming 0/4A000098

从上述加粗的内容可以看到,主备库的 wal 进程是有区别的,流复制中主库会向备库发送 wal 日志,所以主库存在 wal sender 进程,备库则会有 wal receiver 进程

ps:

(1)也可以直接进行 wal 进程的查看来进行主备库的区分

# 流复制主库显示如下: 

[highgo@node1 ~]$ ps -ef | grep wal 

highgo     8047   8038  0 11:07 ?        00:00:00 postgres: wal writer process   

highgo     8055   8038  0 11:07 ?        00:00:00 postgres: wal sender process hgreplica x.x.179.161(35068) streaming 0/4A00C5B8

# 流复制备库显示如下: 

[highgo@node2 ~]$ ps -ef | grep wal 

highgo     8015   8004  0 11:07 ?        00:00:00 postgres: wal receiver process   streaming 0/4A00C5B8

(2)另外,如果存在级联备库(向流复制备库请求 wal ),则流复制备库上也会存在 wal sender 进程

4、通过文件系统查看数据库的运行状态

(1)数据库是否处于运行状态:

# 数据库处于运行状态时会存在下列文件: [highgo@localhost data]$ ls -lrth | grep postmaster.pid

-rw-------  1 highgo highgo   82 Jul 23 10:41 postmaster.pid

# 数据库处于关闭状态时:

[highgo@localhost data]$ ls -lrth | grep postmaster

(2)主备库的差别

# 备库存在 recovery.conf 文件

[highgo@node2 data]$ ls -lrth | grep recovery

-rw-rw-r-- 1 highgo highgo  375 Jun 17 10:12 recovery.conf



# 如果主备流复制已经做过至少一次主备切换,主库上会存在 recovery.done 文件

[highgo@node2 data]$ ls -lrth | grep recovery

-rw-rw-r-- 1 highgo highgo  375 Jun 15 10:12 recovery.done

三、repmgr 数据库集群运行状态的查询

1、通过数据库自带的操作系统命令 pg_ctl 查看数据库运行状态

# 数据库正常运行显示如下:

[highgo@node1 ~]$ pg_ctl status 

pg_ctl: server is running (PID: 11727)

/opt/HighGo5.6.5-cluster/bin/postgres "-D" "/opt/HighGo5.6.5-cluster/data"



# 数据库关闭状态显示如下:

[highgo@localhost ~]$ pg_ctl status 

pg_ctl: no server running

2、通过数据库自带的操作系统命令 pg_controldata 查看数据库运行状态

## 集群主库正常运行状态

[highgo@node2 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6836156492751540059

Database cluster state:               in production

...

## 集群备库正常运行状态

[highgo@node1 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6836156492751540059

Database cluster state:               in archive recovery

...



# 集群主库处于关闭状态时:

[highgo@node2 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6836156492751540059

Database cluster state:               shut down

...



# 集群备库处于关闭状态时:

[highgo@node2 ~]$ pg_controldata

pg_control version number:            1002

Catalog version number:               201707211

Database system identifier:           6836156492751540059

Database cluster state:               shut down in recovery

3、通过后台进程看数据库运行状态

# 集群主库正常运行时:

[highgo@node2 ~]$ ps -ef | grep postgres 

highgo    19930      1  0 14:21 pts/1    00:00:00 /opt/HighGo5.6.5-cluster/bin/postgres

highgo    19934  19930  0 14:21 ?        00:00:00 postgres: logger process   

highgo    19936  19930  0 14:21 ?        00:00:00 postgres: checkpointer process   

highgo    19937  19930  0 14:21 ?        00:00:00 postgres: writer process   

highgo    19938  19930  0 14:21 ?        00:00:00 postgres: wal writer process   

highgo    19939  19930  0 14:21 ?        00:00:00 postgres: autovacuum launcher process  

highgo    19940  19930  0 14:21 ?        00:00:00 postgres: archiver process   

highgo    19941  19930  0 14:21 ?        00:00:00 postgres: stats collector process   

highgo    19942  19930  0 14:21 ?        00:00:00 postgres: bgworker: pg_wait_sampling collector  

highgo    19944  19930  0 14:21 ?        00:00:00 postgres: bgworker: performance diagnosis collector  

highgo    19945  19930  0 14:21 ?        00:00:00 postgres: bgworker: logical replication launcher  

highgo    19946  19930  0 14:21 ?        00:00:00 postgres: wal sender process highgo x.x.179.153(63390) streaming 0/27023650



# 集群备库正常运行时:

[highgo@node1 ~]$ ps -ef | grep postgres 

highgo    13966      1  0 14:20 pts/1    00:00:00 /opt/HighGo5.6.5-cluster/bin/postgres

highgo    13970  13966  0 14:20 ?        00:00:00 postgres: logger process   

highgo    13971  13966  0 14:20 ?        00:00:00 postgres: startup process   recovering 000000060000000000000027

highgo    13972  13966  0 14:20 ?        00:00:00 postgres: checkpointer process   

highgo    13973  13966  0 14:20 ?        00:00:00 postgres: writer process   

highgo    13974  13966  0 14:20 ?        00:00:00 postgres: stats collector process   

highgo    13975  13966  0 14:20 ?        00:00:00 postgres: bgworker: pg_wait_sampling collector  

highgo    13981  13966  0 14:21 ?        00:00:00 postgres: wal receiver process   streaming 0/27023650

ps: 也可以使用流复制中所描述的另一个命令(ps -ef | grep wal)来看 wal 进程

4、通过文件系统查看数据库的运行状态

(1)数据库是否处于以运行状态:

# 数据库处于运行状态时会存在下列文件:
[highgo@localhost data]$ ls -lrth | grep postmaster.pid

-rw-------  1 highgo highgo   82 Jul 23 10:41 postmaster.pid

# 数据库处于关闭状态时:

[highgo@localhost data]$ ls -lrth | grep postmaster

(2)主库备库的判断

# 没有做过主备切换时,备库中只有一个 recovery.conf 文件

# 做过主备切换后,主备库都会有 recovery.conf 和 recovery.done 两个文件

[highgo@node1 data]$ ls -lrth | grep recovery

-rw------- 1 highgo highgo  229 Jul 23 09:10 recovery.done

-rw------- 1 highgo highgo  229 Jul 23 09:12 recovery.conf

# 主库中 recovery.conf 时间早于 recovery.done; 备库中 recovery.done 早于 recovery.conf

# 可以通过查看数据库集群状态的命令;如果主备库显示一致且如下所示,说明集群状态正常

## 在 status 一栏中出现 * 表示正常状态,如果出现了其他的符号(比如 !,?),表明集群状态出现了问题

[highgo@node2 ~]$ repmgr cluster show 

 ID | Name            | Role    | Status    | Upstream        | Location | Priority | Replication lag | Last replayed LSN

----+-----------------+---------+-----------+-----------------+----------+----------+-----------------+-------------------

 1  | x.x.179.153 | standby |   running | x.x.179.154 | default  | 100      | 0 bytes         | 0/2703D720       

 2  | x.x.179.154 | primary | * running |                 | default  | 100      | n/a             | none



# 如果两个数据库中都出现了 recovery.conf 时间早于 recovery.done 的情况,数据库集群出错出现了双主

# 出现双主的例子(status 中出现了 !,表明集群状态不正确)

[highgo@node2 ~]$ repmgr cluster show 

 ID | Name            | Role    | Status               | Upstream        | Location | Priority | Replication lag | Last replayed LSN

----+-----------------+---------+----------------------+-----------------+----------+----------+-----------------+-------------------

 1  | x.x.179.153 | standby | ! running as primary | x.x.179.154 | default  | 100      | 624 MB          | 0/2703D720       

 2  | x.x.179.154 | primary | * running            |                 | default  | 100      | n/a             | none             



WARNING: following issues were detected

  - node "x.x.179.153" (ID: 1) is registered as standby but running as primary

四、HGHAC集群数据库运行状态查询

1.通过pg_ctl status、pg_controldata、后台进程及文件系统文件等命令查看方式与流复制环境查询一致,可参考上文第二章节1-4小节。

2.通过systemctl查看集群服务状态

systemctl status hghac

各节点分别执行,查看各节点集群服务状态,正常状态下该服务为running状态,数据库进程信息均可显示。

3.通过hghactl命令查询当前集群状态

/usr/local/hghac/hac/hghactl/hghactl -c /usr/local/hghac/hac/hghac.yml list

该命令可显示当前集群状态,可重点查看Role、State、TL列,正常状态为Role列含有一个leader,其余节点均为replica或sync;State列均为running;TL列数字相同

五、DB_HA集群数据库运行状态查询

1.通过pg_ctl status、pg_controldata、后台进程及文件系统文件等命令查看方式与流复制环境查询一致,可参考上文第二章节1-4小节。

2.通过db_ha命令查看集群状态

/usr/local/db_ha/bin/db_ha select -f /usr/local/db_ha/conf/db_ha.conf

该命令可显示db_ha集群状态,可查看nodety、streamingState、healthy、agentState值,其中nodety值应为一个PRIMARY,其余节点为SATNDBY;streamingState值主节点为none,备节点均为streaming;healthy值均为t;agentState值均为NORMAL。

3.通过sql命令在主库上进行查询

select  * from dn_ha;

该命令可显示当前集群各节点信息,包括节点状态信息、节点角色、节点ip、节点agent端口等,可重点查看node_healthy及iseffective两列值,正常状态两参数值均为t

Logo

更多推荐