1. Introduction

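The pg_hba.conf configuration from the previous article still had a problem. This article uses a debugger to trace that problem to its root cause.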

2. The Problem

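Because postgres is the superuser account, it would never be handed to ordinary users in practice. To simulate a production environment, we create a new database and a new user.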
[postgres@k8s03 ~]$ psql -hk8s01 -Upostgres -p5433
psql (PGXL 10r1, based on PG 10.5 (Postgres-XL 10r1))
Type "help" for help.

postgres=# revoke all on schema public from public;
REVOKE
postgres=# revoke all on DATABASE postgres from public;
REVOKE
postgres=# create user produ1 password 'produ1';
CREATE ROLE
postgres=# create database prod1 owner produ1;
CREATE DATABASE
postgres=# \c prod1
You are now connected to database "prod1" as user "postgres".
prod1=# revoke all on schema public from public;
REVOKE
prod1=# revoke all on DATABASE prod1 from public;
REVOKE
prod1=# \q
[postgres@k8s03 ~]$ psql -hk8s01 -p5433 -Uprodu1 -W -dprod1
Password for user produ1:
psql (PGXL 10r1, based on PG 10.5 (Postgres-XL 10r1))
Type "help" for help.

prod1=> create schema s1_1;
WARNING: failed to receive file descriptors for connections
ERROR: Failed to get pooled connections
HINT: This may happen because one or more nodes are currently unreachable, either because of node or network failure.
Its also possible that the target node may have hit the connection limit or the pooler is configured with low connections.
Please check if all nodes are running fine and also review max_connections and max_pool_size configuration parameters

prod1=>
This error carries too little useful information, so let's look at the log on the k8s01 server:
[postgres@k8s01 ~]$ more logfile
[1494] LOG: failed to connect to node, connection string (host=k8s02 port=5433 dbname=prod1 user=produ1 application_name='pgxc:coord1' sslmode=disable options='-c remotetype=coordinator -c parentnode=coord1
-c DateStyle=iso,mdy -c timezone=prc -c geqo=on -c intervalstyle=postgres -c lc_monetary=en_US.UTF-8'), connection error (fe_sendauth: no password supplied)

[1494] WARNING: can not connect to node 16384
[1494] WARNING: Health map updated to reflect DOWN node (16384)
[1494] LOG: Pooler could not open a connection to node 16384
[1527] LOG: failed to acquire connections
[1527] STATEMENT: create schema s1_1;
[1527] WARNING: failed to receive file descriptors for connections
[1527] ERROR: Failed to get pooled connections
[1527] HINT: This may happen because one or more nodes are currently unreachable, either because of node or network failure.
Its also possible that the target node may have hit the connection limit or the pooler is configured with low connections.
Please check if all nodes are running fine and also review max_connections and max_pool_size configuration parameters
[1527] STATEMENT: create schema s1_1;
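
The connection string in the first log line is worth a closer look before going to the source. Below is a minimal Python sketch that splits such a string into key/value pairs and checks whether a password keyword is present. It is an approximation, not libpq's own parser: it relies on shlex for the single-quoted values and does not handle libpq's backslash escapes. The helper name parse_conninfo is made up for this illustration, and the options value is shortened.

```python
import shlex


def parse_conninfo(conninfo):
    """Split a libpq-style 'key=value' string into a dict (approximate)."""
    params = {}
    for token in shlex.split(conninfo):
        # shlex has already removed the single quotes around quoted values.
        key, _, value = token.partition("=")
        params[key] = value
    return params


# Connection string copied from the pooler log above (options shortened).
LOGGED = (
    "host=k8s02 port=5433 dbname=prod1 user=produ1 "
    "application_name='pgxc:coord1' sslmode=disable "
    "options='-c remotetype=coordinator -c parentnode=coord1'"
)

params = parse_conninfo(LOGGED)
print(sorted(params))
print("password keyword present:", "password" in params)
```

The parsed keys include host, port, dbname and user, but no password, which is consistent with the "no password supplied" error at the end of the log line.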

Searching the source code for the error keywords gives the following:

Error message                                                         Source location
failed to connect to node, connection string                          poolmgr.c (line 3045)
can not connect to node                                               poolmgr.c (line 2881)
Health map updated to reflect DOWN node                               poolmgr.c (line 2890)
Pooler could not open a connection to node                            poolmgr.c (lines 1968, 2007)
failed to acquire connections                                         poolcomm.c (line 643)
STATEMENT: create schema s1_1                                         -
failed to receive file descriptors for connections                    poolmgr.c (line 1273)
Failed to get pooled connections                                      pgxcnode.c (lines 2089, 2332)
This may happen because one or more nodes are currently unreachable   pgxcnode.c (lines 2090, 2333)
node (coord2:16384) down! Trying ping                                 poolmgr.c (line 1050)
Health map updated to reflect HEALTHY node                            poolmgr.c (line 1069)
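
The keyword search itself can be done with grep or any editor. As an illustration only, here is a small Python sketch of the same idea: walk a source tree and report the file name and line number of every line containing a message fragment. The function name find_message and the path in the usage note are hypothetical, and line numbers will differ between Postgres-XL versions.

```python
from pathlib import Path


def find_message(source_root, needle, pattern="*.c"):
    """Return (file name, line number) pairs for lines that contain needle."""
    hits = []
    for path in sorted(Path(source_root).rglob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path.name, lineno))
    return hits
```

For example, find_message("/path/to/postgres-xl/src", "failed to connect to node") would list candidate locations, assuming the source tree is unpacked at that (made-up) path.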

Most of the errors come from poolmgr.c. Next, let's look at the Postgres-XL processes:
[postgres@k8s01 ~]$ ps -ef|grep postgres
root 1447 1432 0 07:05 pts/0 00:00:00 su - postgres
postgres 1448 1447 0 07:05 pts/0 00:00:00 -bash
postgres 1468 1448 0 07:07 pts/0 00:00:00 gtm -D /data/pgxl10r1/gtm
postgres 1469 1448 0 07:07 pts/0 00:00:13 gtm_proxy -D /data/pgxl10r1/gtm_proxy
postgres 1480 1 0 07:08 pts/0 00:00:00 /u01/app/pgxl10r1/bin/postgres --datanode -D /data/pgxl10r1/datanode
postgres 1482 1480 0 07:08 ? 00:00:00 postgres: pooler process
postgres 1483 1480 0 07:08 ? 00:00:00 postgres: checkpointer process
postgres 1484 1480 0 07:08 ? 00:00:00 postgres: writer process
postgres 1485 1480 0 07:08 ? 00:00:00 postgres: wal writer process
postgres 1486 1480 0 07:08 ? 00:00:00 postgres: autovacuum launcher process
postgres 1487 1480 0 07:08 ? 00:00:00 postgres: stats collector process
postgres 1488 1480 0 07:08 ? 00:00:00 postgres: cluster monitor process
postgres 1489 1480 0 07:08 ? 00:00:00 postgres: bgworker: logical replication launcher
postgres 1492 1 0 07:08 pts/0 00:00:00 /u01/app/pgxl10r1/bin/postgres --coordinator -D /data/pgxl10r1/coord
postgres 1494 1492 0 07:08 ? 00:00:00 postgres: pooler process
postgres 1495 1492 0 07:08 ? 00:00:00 postgres: checkpointer process
postgres 1496 1492 0 07:08 ? 00:00:00 postgres: writer process
postgres 1497 1492 0 07:08 ? 00:00:00 postgres: wal writer process
postgres 1498 1492 0 07:08 ? 00:00:00 postgres: autovacuum launcher process
postgres 1499 1492 0 07:08 ? 00:00:00 postgres: stats collector process
postgres 1500 1492 0 07:08 ? 00:00:00 postgres: cluster monitor process
postgres 1501 1492 0 07:08 ? 00:00:00 postgres: bgworker: logical replication launcher
postgres 1527 1492 0 07:13 ? 00:00:00 postgres: produ1 prod1 192.168.100.103(54762) idle
postgres 0720 1448 0 07:33 pts/0 00:00:00 ps -ef
postgres 0721 1448 0 07:33 pts/0 00:00:00 grep --color=auto postgres
That's the one: "postgres: pooler process", PID 1494.
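
The match between the log and the process list was made by eye. The sketch below shows the same step in Python, assuming the bracketed number at the start of each log line is the process ID (this depends on the log_line_prefix setting, which the article does not show). The sample lines are abridged from the output above, and the helper names are made up for this illustration.

```python
import re

LOG_LINES = [
    "[1494] WARNING: can not connect to node 16384",
    "[1494] LOG: Pooler could not open a connection to node 16384",
]
PS_LINES = [
    "postgres 1494 1492 0 07:08 ? 00:00:00 postgres: pooler process",
    "postgres 1495 1492 0 07:08 ? 00:00:00 postgres: checkpointer process",
]


def pids_in_log(lines):
    """Collect the bracketed numbers that start each log line."""
    pids = set()
    for line in lines:
        match = re.match(r"\[(\d+)\]", line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def process_titles(ps_lines):
    """Map PID to command text from 'ps -ef' style lines."""
    titles = {}
    for line in ps_lines:
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[1].isdigit():
            titles[int(fields[1])] = fields[7]
    return titles


titles = process_titles(PS_LINES)
culprits = {pid: titles.get(pid) for pid in pids_in_log(LOG_LINES)}
print(culprits)
```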

3. Debugging Basics

3.1 Install the GDB debugger (on all nodes)
yum -y install gdb
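3.2 Common GDB commands
Plenty of material is available online, so only the most basic commands are listed here:
attach: "attach pid" debugs a process that is already running.
continue (abbreviated c): resume execution until the next breakpoint (or until the program ends).
next (abbreviated n): step one line; a function call is executed as a whole without entering the function body.
step (abbreviated s): step one line; unlike n, a function call is entered.

break n (abbreviated b n): set a breakpoint at line n; a source file name can be included, for example b OAGUPDATE.cpp:578.
clear n: remove the breakpoint at line n.

print expression (abbreviated p): the expression can be any expression valid in the program being debugged. When debugging a C program, for example, it can be any valid C expression, including numbers, variables, and even function calls.
print a: display the value of the integer a.
set follow-fork-mode child: after the target forks a child process, GDB follows the child.

quit (abbreviated q): exit gdb.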

4. GDB Debugging Walkthrough

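The Postgres-XL code base is very large, so reading it front to back is not advisable, and neither is single-stepping through all of it. Debugging that way is time-consuming, and more importantly some logic is time-limited: a socket connection, for example, will time out if you pause for too long, which creates unnecessary trouble. A better approach combines the two: read enough of the code to set breakpoints at the key locations. The walkthrough below is the cleaned-up result of several debugging passes, arranged so that the logic is easier to follow.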
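
Because time-limited code paths make slow interactive stepping risky, it can help to have the breakpoints ready before attaching. The Python sketch below only builds a command line; it does not run GDB. It uses GDB's -p option to attach to a process and -ex to run a command at startup. The function name gdb_command is made up, and the file:line locations are the ones that appear in this article's session, so they are only valid for the same Postgres-XL 10r1 build.

```python
import shlex


def gdb_command(pid, breakpoints):
    """Build an argv list that attaches GDB to pid with breakpoints preset."""
    argv = ["gdb", "-p", str(pid)]
    for location in breakpoints:
        argv += ["-ex", "break " + location]
    # Resume the process so it runs until the first breakpoint is hit.
    argv += ["-ex", "continue"]
    return argv


# Locations taken from the pooler session in this section (assumed build).
argv = gdb_command(1494, ["poolmgr.c:3336", "poolmgr.c:2992", "poolmgr.c:3040"])
print(shlex.join(argv))
```

The printed line can be pasted into a shell on the node where the process runs.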
4.1 Debugging the pooler process
SSH session 1: used mainly for GDB debugging
===========================
[postgres@k8s01 ~]$ gdb attach 1494
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/
attach: No such file or directory.
Attaching to process 1494
Reading symbols from /u01/app/pgxl10r1/bin/postgres…done.
Reading symbols from /lib64/libpthread.so.0…(no debugging symbols found)…done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1…(no debugging symbols found)…done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007f0a703aacb0 in __poll_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64
(gdb) n
Single stepping until exit from function __poll_nocancel,
which has no line number information. # Note: run the SQL statement in session 2 at this point
PoolerLoop () at poolmgr.c:3315
3315 if (retval < 0)
(gdb) b 3336
Breakpoint 1 at 0x74c3d1: file poolmgr.c, line 3336.
(gdb) c
Continuing.

Breakpoint 1, PoolerLoop () at poolmgr.c:3336
3336 agent_handle_input(agent, &input_message);
(gdb) b 1806
Breakpoint 2 at 0x7497e6: file poolmgr.c, line 1806.
(gdb) c
Continuing.

Breakpoint 2, agent_handle_input (agent=0x1658be0, s=0x7ffe4a453d40) at poolmgr.c:1806
1806 handle_get_connections(agent, s);
(gdb) b 1703
Breakpoint 3 at 0x7494ae: file poolmgr.c, line 1703.
(gdb) c
Continuing.

Breakpoint 3, handle_get_connections (agent=0x1658be0, s=0x7ffe4a453d40) at poolmgr.c:1703
1703 fds = agent_acquire_connections(agent, datanodelist, coordlist, &pids);
(gdb) b 2000
Breakpoint 4 at 0x749e44: file poolmgr.c, line 2000.
(gdb) c
Continuing.

Breakpoint 4, agent_acquire_connections (agent=0x1658be0, datanodelist=0x0, coordlist=0x16579a0, pids=0x7ffe4a453cb0) at poolmgr.c:2000
2000 PGXCNodePoolSlot *slot = acquire_connection(agent->pool, agent->coord_conn_oids[node]);
(gdb) b 2824
Breakpoint 5 at 0x74b616: file poolmgr.c, line 2824.
(gdb) c
Continuing.

Breakpoint 5, acquire_connection (dbPool=0x165abf0, node=16384) at poolmgr.c:2824
2824 nodePool = grow_pool(dbPool, node);
(gdb) b 2992
Breakpoint 6 at 0x74ba22: file poolmgr.c, line 2992.
(gdb) c
Continuing.

Breakpoint 6, grow_pool (dbPool=0x165abf0, node=16384) at poolmgr.c:2992
2992 nodePool->connstr = build_node_conn_str(node, dbPool); # The connection string is built here; reading this function shows that it has no way to supply a password.
(gdb) n
2994 if (!nodePool->connstr)
(gdb) print nodePool->connstr
$2 = 0x165f4d0 "host=k8s02 port=5433 dbname=prod1 user=produ1 application_name='pgxc:coord1' sslmode=disable options='-c remotetype=coordinator -c parentnode=coord1 -c DateStyle=iso,mdy -c timezone=prc -c geqo=on -c"…
(gdb) b 3040
Breakpoint 7 at 0x74bbb6: file poolmgr.c, line 3040.
(gdb) c
Continuing.

Breakpoint 7, grow_pool (dbPool=0x165abf0, node=16384) at poolmgr.c:3040
3040 slot->conn = PGXCNodeConnect(nodePool->connstr);
(gdb) s
PGXCNodeConnect (
connstr=0x165f4d0 "host=k8s02 port=5433 dbname=prod1 user=produ1 application_name='pgxc:coord1' sslmode=disable options='-c remotetype=coordinator -c parentnode=coord1 -c DateStyle=iso,mdy -c timezone=prc -c geqo=on -c"…) at poolmgr.c:3748
3748 conn = PQconnectdb(connstr);
(gdb) s
PQconnectdb (
conninfo=0x165f4d0 "host=k8s02 port=5433 dbname=prod1 user=produ1 application_name='pgxc:coord1' sslmode=disable options='-c remotetype=coordinator -c parentnode=coord1 -c DateStyle=iso,mdy -c timezone=prc -c geqo=on -c"…) at fe-connect.c:646
646 PGconn *conn = PQconnectStart(conninfo);
(gdb) b 649
Breakpoint 8 at 0xaab522: file fe-connect.c, line 649.
(gdb) c
Continuing.

Breakpoint 8, PQconnectdb (
conninfo=0x165f4d0 "host=k8s02 port=5433 dbname=prod1 user=produ1 application_name='pgxc:coord1' sslmode=disable options='-c remotetype=coordinator -c parentnode=coord1 -c DateStyle=iso,mdy -c timezone=prc -c geqo=on -c"…) at fe-connect.c:649
649 (void) connectDBComplete(conn);
(gdb) s
connectDBComplete (conn=0x16552e0) at fe-connect.c:1904
1904 PostgresPollingStatusType flag = PGRES_POLLING_WRITING;
(gdb) b 1994
Breakpoint 9 at 0xaad8a0: file fe-connect.c, line 1994.
(gdb) c
Continuing.

Breakpoint 9, connectDBComplete (conn=0x16552e0) at fe-connect.c:1994
1994 flag = PQconnectPoll(conn);
(gdb) s
PQconnectPoll (conn=0x16552e0) at fe-connect.c:2068
2068 bool reset_connection_state_machine = false;
(gdb) b 2095
Breakpoint 10 at 0xaada6d: file fe-connect.c, line 2095.
(gdb) c
Continuing.

Breakpoint 9, connectDBComplete (conn=0x16552e0) at fe-connect.c:1994
1994 flag = PQconnectPoll(conn);
(gdb) c
Continuing.

Breakpoint 9, connectDBComplete (conn=0x16552e0) at fe-connect.c:1994
1994 flag = PQconnectPoll(conn);
(gdb) c
Continuing.

Breakpoint 10, PQconnectPoll (conn=0x16552e0) at fe-connect.c:2095
2095 int n = pqReadData(conn); # Read the data sent by the server and store it in conn
(gdb) b 2943
Breakpoint 11 at 0xaae8f5: file fe-connect.c, line 2943.
(gdb) c
Continuing.

Breakpoint 11, PQconnectPoll (conn=0x16552e0) at fe-connect.c:2943
2943 if (pqGetInt((int *) &areq, 4, conn)) # Read the authentication request type from conn
(gdb) n
2948 msgLength -= 4;
(gdb) print areq # 10 is AUTH_REQ_SASL
$4 = 10
(gdb) b 2984
Breakpoint 12 at 0xaae9bf: file fe-connect.c, line 2984.
(gdb) c
Continuing.

Breakpoint 12, PQconnectPoll (conn=0x16552e0) at fe-connect.c:2984
2984 res = pg_fe_sendauth(areq, msgLength, conn);
(gdb) s
pg_fe_sendauth (areq=10, payloadlen=15, conn=0x16552e0) at fe-auth.c:822
822 switch (areq)
(gdb) b 978
Breakpoint 13 at 0xac1596: file fe-auth.c, line 978.
(gdb) c
Continuing.

Breakpoint 13, pg_fe_sendauth (areq=10, payloadlen=15, conn=0x16552e0) at fe-auth.c:978
978 if (pg_SASL_init(conn, payloadlen) != STATUS_OK)
(gdb) s
pg_SASL_init (conn=0x16552e0, payloadlen=15) at fe-auth.c:491
491 char *initialresponse = NULL;
(gdb) b 515
Breakpoint 14 at 0xac0d38: file fe-auth.c, line 515.
(gdb) c
Continuing.

Breakpoint 14, pg_SASL_init (conn=0x16552e0, payloadlen=15) at fe-auth.c:515
515 if (pqGets(&mechanism_buf, conn))
(gdb) b 538
Breakpoint 15 at 0xac0da8: file fe-auth.c, line 538.
(gdb) c
Continuing.

Breakpoint 15, pg_SASL_init (conn=0x16552e0, payloadlen=15) at fe-auth.c:538
538 if (strcmp(mechanism_buf.data, SCRAM_SHA_256_NAME) == 0)
(gdb) print mechanism_buf.data # This is the authentication mechanism sent by the server.
$5 = 0x16564d0 "SCRAM-SHA-256"
(gdb) n
542 conn->password_needed = true;
(gdb) n
543 password = conn->connhost[conn->whichhost].password;
(gdb) n
544 if (password == NULL)
(gdb) n
545 password = conn->pgpass;
(gdb) n
546 if (password == NULL || password[0] == '\0')
(gdb) n
548 printfPQExpBuffer(&conn->errorMessage,
(gdb) n
550 goto error;
(gdb) print conn->errorMessage.data
$7 = 0x1654d00 "fe_sendauth: no password supplied\n" # This is the error we saw in the log; after the function returns, it is joined with other text to form the complete error message.
(gdb) c
Continuing.

SSH session 2: used mainly for running SQL statements
=============================
[postgres@k8s03 ~]$ psql -hk8s01 -p5433 -Uprodu1 -W -dprod1
Password for user produ1:
psql (PGXL 10r1, based on PG 10.5 (Postgres-XL 10r1))
Type "help" for help.

prod1=> create schema s1_1;
After we type the final c and press Enter in session 1, the error appears:
WARNING: failed to receive file descriptors for connections
ERROR: Failed to get pooled connections
HINT: This may happen because one or more nodes are currently unreachable, either because of node or network failure.
Its also possible that the target node may have hit the connection limit or the pooler is configured with low connections.
Please check if all nodes are running fine and also review max_connections and max_pool_size configuration parameters
prod1=>

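Stage 1 summary: the connection string shows that k8s01 connected to port 5433 on k8s02, which is the port the coordinator listens on. We need to debug the coordinator next to see where the information it sends to k8s01 comes from.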
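
The last few steps of the session (fe-auth.c lines 542 to 550) come down to a short fallback chain. The Python model below mirrors only the statements visible in the GDB output; it is a simplified illustration, not libpq code. The Conn class and the pick_password function are invented for this sketch, and the two password fields stand in for conn->connhost[conn->whichhost].password and conn->pgpass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conn:
    host_password: Optional[str] = None  # stands in for connhost[...].password
    pgpass: Optional[str] = None         # stands in for conn->pgpass
    password_needed: bool = False
    error_message: str = ""


def pick_password(conn):
    """Follow the same fallback order as the steps seen in the debugger."""
    conn.password_needed = True
    password = conn.host_password
    if password is None:
        password = conn.pgpass
    if password is None or password == "":
        conn.error_message = "fe_sendauth: no password supplied\n"
        return None
    return password


pooler_conn = Conn()  # the pooler's connection string carried no password
print(pick_password(pooler_conn), repr(pooler_conn.error_message))
```

With neither source set, the function ends on the error path, which is the situation the pooler was in when it connected to k8s02.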

4.2 Debugging the coordinator
[postgres@k8s02 ~]$ ps -ef|grep postgres
postgres 1460 1 0 14:15 ? 00:00:08 gtm_proxy -D /data/pgxl10r1/gtm_proxy
postgres 1465 1 0 14:16 ? 00:00:00 /u01/app/pgxl10r1/bin/postgres --datanode -D /data/pgxl10r1/datanode
postgres 1467 1465 0 14:16 ? 00:00:00 postgres: pooler process
postgres 1468 1465 0 14:16 ? 00:00:00 postgres: checkpointer process
postgres 1469 1465 0 14:16 ? 00:00:00 postgres: writer process
postgres 1470 1465 0 14:16 ? 00:00:00 postgres: wal writer process
postgres 1471 1465 0 14:16 ? 00:00:00 postgres: autovacuum launcher process
postgres 1472 1465 0 14:16 ? 00:00:00 postgres: stats collector process
postgres 1473 1465 0 14:16 ? 00:00:00 postgres: cluster monitor process
postgres 1474 1465 0 14:16 ? 00:00:00 postgres: bgworker: logical replication launcher
postgres 1477 1 0 14:16 ? 00:00:00 /u01/app/pgxl10r1/bin/postgres --coordinator -D /data/pgxl10r1/coord
postgres 1479 1477 0 14:16 ? 00:00:00 postgres: pooler process
postgres 1480 1477 0 14:16 ? 00:00:00 postgres: checkpointer process
postgres 1481 1477 0 14:16 ? 00:00:00 postgres: writer process
postgres 1482 1477 0 14:16 ? 00:00:00 postgres: wal writer process
postgres 1483 1477 0 14:16 ? 00:00:00 postgres: autovacuum launcher process
postgres 1484 1477 0 14:16 ? 00:00:00 postgres: stats collector process
postgres 1485 1477 0 14:16 ? 00:00:00 postgres: cluster monitor process
postgres 1486 1477 0 14:16 ? 00:00:00 postgres: bgworker: logical replication launcher
root 1530 1515 0 14:20 pts/1 00:00:00 su - postgres
postgres 1531 1530 0 14:20 pts/1 00:00:00 -bash
postgres 1594 1531 0 14:30 pts/1 00:00:00 ps -ef
postgres 1595 1531 0 14:30 pts/1 00:00:00 grep --color=auto postgres
[postgres@k8s02 ~]$ gdb attach 1477
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/
attach: No such file or directory.
Attaching to process 1477
Reading symbols from /u01/app/pgxl10r1/bin/postgres…done.
Reading symbols from /lib64/libpthread.so.0…(no debugging symbols found)…done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1…(no debugging symbols found)…done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007fd88f443a13 in __select_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64
(gdb) n
Single stepping until exit from function __select_nocancel,
which has no line number information.
ServerLoop () at postmaster.c:1834
1834 PG_SETMASK(&BlockSig);

(gdb) n
1868 BackendStartup(port);
(gdb) s
BackendStartup (port=0x2b38550) at postmaster.c:4252
4252 bn = (Backend *) malloc(sizeof(Backend));

(gdb) n
4296 pid = fork_process();
(gdb) s
fork_process () at fork_process.c:47
47 fflush(stdout);
(gdb) n
48 fflush(stderr);
(gdb) set follow-fork-mode child
(gdb) n
61 result = fork();
(gdb) n
[Attaching after process 1569 fork to child process 1569]
[New inferior 2 (process 1569)]
[Detaching after fork from parent process 1477]
[Inferior 1 (process 1477) detached]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Switching to Thread 0x7fd89025b740 (LWP 1569)]
62 if (result == 0)
(gdb) n
85 oomfilename = getenv("PG_OOM_ADJUST_FILE");
Make breakpoint pending on future shared library load? (y or [n])
(gdb) n
87 if (oomfilename != NULL)
(gdb) n
118 return result;
(gdb) n
119 }
(gdb) n
BackendStartup (port=0x1efb550) at postmaster.c:4297
4297 if (pid == 0) /* child */
(gdb) b 4311
Breakpoint 3 at 0x83d4d2: file postmaster.c, line 4311.
(gdb) c
Continuing.

Breakpoint 3, BackendStartup (port=0x1efb550) at postmaster.c:4311
4311 BackendRun(port);
(gdb) s
BackendRun (port=0x1efb550) at postmaster.c:4592
4592 TimestampDifference(0, port->SessionStartTime, &secs, &usecs);
(gdb) n
4593 srandom((unsigned int) (MyProcPid ^ (usecs << 12) ^ secs));
(gdb) n
4602 maxac = 2; /* for fixed args supplied below */
...
(gdb) n
4639 PostgresMain(ac, av, port->database_name, port->user_name);
(gdb) s
PostgresMain (argc=1, argv=0x1f04528, dbname=0x1f04458 "prod1", username=0x1f04438 "produ1") at postgres.c:4071
4071 volatile bool send_ready_for_query = true;
(gdb) n
...
4226 InitProcess();
(gdb) n
4230 PG_SETMASK(&UnBlockSig);
(gdb) n
4239 InitPostgres(dbname, InvalidOid, username, InvalidOid, NULL);
(gdb) s
InitPostgres (in_dbname=0x1f04458 "prod1", dboid=0, username=0x1f04438 "produ1", useroid=0, out_dbname=0x0) at postinit.c:569
569 bool bootstrap = IsBootstrapProcessingMode();
(gdb) n
574 elog(DEBUG3, "InitPostgres");
...
748 PerformAuthentication(MyProcPort);
(gdb) s
PerformAuthentication (port=0x1efb550) at postinit.c:191
191 ClientAuthInProgress = true; /* limit visibility of log messages */
(gdb) n
237 enable_timeout_after(STATEMENT_TIMEOUT, AuthenticationTimeout * 1000);
(gdb) n
242 ClientAuthentication(port); /* might not return, if failure */
(gdb) s
ClientAuthentication (port=0x27a0550) at auth.c:341
341 int status = STATUS_ERROR;
(gdb) n
342 char *logdetail = NULL;
(gdb) n
350 hba_getauthmethod(port); # Get the local pg_hba.conf configuration
(gdb) n
352 CHECK_FOR_INTERRUPTS();
(gdb) n
359 if (port->hba->clientcert)
(gdb) n
382 switch (port->hba->auth_method) # Uses the authentication method obtained from pg_hba.conf above
(gdb) n
549 status = CheckPWChallengeAuth(port, &logdetail);
(gdb) s
CheckPWChallengeAuth (port=0x217c550, logdetail=0x7ffd5594e880) at auth.c:768
768 Assert(port->hba->auth_method == uaSCRAM ||
(gdb) n
772 shadow_pass = get_role_password(port->user_name, logdetail);
(gdb) n
783 if (!shadow_pass)
(gdb) n
786 pwtype = get_password_type(shadow_pass);
(gdb) n
797 if (port->hba->auth_method == uaMD5 && pwtype == PASSWORD_TYPE_MD5)
(gdb) n
800 auth_result = CheckSCRAMAuth(port, shadow_pass, logdetail);
(gdb) s
CheckSCRAMAuth (port=0x217c550, shadow_pass=0x2208f90 "SCRAM-SHA-256$4096:gMP7klb55ApmrMVxK/z0Kg==$3oUyfKv72IwmLMF13s43eYjH8U9ZK9Gsrd4r4PkEDzY=:dCytRkDwmj1QU5fFBHcSKOyB3N25fWyVZY2hxTkWv6k=", logdetail=0x7ffd5594e880) at auth.c:860
860 char *output = NULL;
(gdb) n
861 int outputlen = 0;
(gdb) n
875 if (PG_PROTOCOL_MAJOR(FrontendProtocol) < 3)
(gdb) n
886 sendAuthRequest(port, AUTH_REQ_SASL, SCRAM_SHA_256_NAME "\0", # Sends AUTH_REQ_SASL and SCRAM_SHA_256_NAME to the client, which matches what we saw in stage 1 of the debugging.
(gdb) n
900 scram_opaq = pg_be_scram_init(port->user_name, shadow_pass);
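
The shadow_pass value passed to CheckSCRAMAuth above is the stored SCRAM verifier for produ1, in the form SCRAM-SHA-256$iterations:salt$StoredKey:ServerKey with the last three parts base64-encoded. As a side illustration, the Python sketch below takes it apart; the function name parse_scram_verifier is made up, and the string is the one shown in the GDB output.

```python
import base64

# Verifier string copied from the CheckSCRAMAuth frame above.
VERIFIER = (
    "SCRAM-SHA-256$4096:gMP7klb55ApmrMVxK/z0Kg==$"
    "3oUyfKv72IwmLMF13s43eYjH8U9ZK9Gsrd4r4PkEDzY=:"
    "dCytRkDwmj1QU5fFBHcSKOyB3N25fWyVZY2hxTkWv6k="
)


def parse_scram_verifier(verifier):
    """Split a SCRAM-SHA-256 verifier into its named parts."""
    scheme, iter_and_salt, keys = verifier.split("$")
    iterations, salt_b64 = iter_and_salt.split(":")
    stored_b64, server_b64 = keys.split(":")
    return {
        "scheme": scheme,
        "iterations": int(iterations),
        "salt": base64.b64decode(salt_b64),
        "stored_key": base64.b64decode(stored_b64),
        "server_key": base64.b64decode(server_b64),
    }


parts = parse_scram_verifier(VERIFIER)
print(parts["scheme"], parts["iterations"], len(parts["salt"]), len(parts["stored_key"]))
```

Because the role's password is stored in this form, the coordinator on k8s02 asks the connecting side for SCRAM-SHA-256 authentication, and the pooler on k8s01 has no password to answer with.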

5. Summary

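Back to the problem we started with. Having learned how Postgres-XL works through debugging, we know that cluster nodes trust one another and that, by default, connections between them carry no password (you could configure a password file, but that is not recommended). Connections between cluster nodes therefore have to be configured as trust (for security, list the specific node IPs rather than an IP range), while connections from external clients to the cluster should use scram-sha-256 to keep them secure.
Modify pg_hba.conf as follows: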

host    all             all             192.168.100.101/32      trust
host    all             all             192.168.100.102/32      trust
host    all             all             0.0.0.0/0               scram-sha-256
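
PostgreSQL uses the first pg_hba.conf record that matches a connection, so the order of these three lines matters: the two node addresses must come before the catch-all rule. The Python sketch below models only the address column and the method column of the lines above; it ignores the type, database and user columns (all set to "all" here) and is not how the server implements the lookup. The names RULES and auth_method are made up for this illustration.

```python
import ipaddress

# (address, method) pairs in the same order as the pg_hba.conf lines above.
RULES = [
    ("192.168.100.101/32", "trust"),
    ("192.168.100.102/32", "trust"),
    ("0.0.0.0/0", "scram-sha-256"),
]


def auth_method(client_ip, rules=RULES):
    """Return the method of the first rule whose network contains client_ip."""
    addr = ipaddress.ip_address(client_ip)
    for cidr, method in rules:
        if addr in ipaddress.ip_network(cidr):
            return method
    return None  # no matching record: the connection would be rejected


for ip in ("192.168.100.101", "192.168.100.102", "192.168.100.103"):
    print(ip, auth_method(ip))
```

The two cluster nodes resolve to trust, while any other IPv4 client, such as the psql session from k8s03, falls through to scram-sha-256.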