Note: this article is an entry in the openGauss technical article contest. The contest is still running; to submit an article, contact WeChat ID modb666.

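1. Overall page layout

openGauss is developed from PostgreSQL (PG) 9.2, and its on-disk data structures largely follow PG's, with minor modifications. The overall layout is introduced here using a diagram borrowed from PG.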

[Figure: PG page layout diagram, 0172cf15e42d0cdf32cd84974e5d23bf.png]

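The page diagram omits the special space at the end of the page. That area usually holds index-specific data; table relation pages do not have it.

The overall layout has five parts:

1. PageHeaderData/HeapPageHeaderData: the page header (heap page header).
2. linp array: the line pointer array.
3. Free space.
4. Tuple data: the row data.
5. Special space: located at the end of the page and addressed through pd_special in the page header. It usually holds index-specific data; it is empty on table relation pages and is not drawn in the diagram.

Each part is described below.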

2. HeapPageHeaderData

./src/include/storage/bufpage.h

    typedef struct {
        /* XXX LSN is member of *any* block, not only page-organized ones */
        PageXLogRecPtr pd_lsn;    /* LSN: next byte after last byte of xlog
                                   * record for last change to this page */
        uint16 pd_checksum;       /* checksum */
        uint16 pd_flags;          /* flag bits, see below */
        LocationIndex pd_lower;   /* offset to start of free space */
        LocationIndex pd_upper;   /* offset to end of free space */
        LocationIndex pd_special; /* offset to start of special space */
        uint16 pd_pagesize_version;
        ShortTransactionId pd_prune_xid;           /* oldest prunable XID, or zero if none */
        TransactionId pd_xid_base;                 /* base value for transaction IDs on page */
        TransactionId pd_multi_base;               /* base value for multixact IDs on page */
        ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* beginning of line pointer array */
    } HeapPageHeaderData;

    disk page organization
    space management information generic to any page
      pd_lsn       - identifies xlog record for last change to this page.
      pd_checksum  - page checksum, if set.
      pd_flags     - flag bits.
      pd_lower     - offset to start of free space.
      pd_upper     - offset to end of free space.
      pd_special   - offset to start of special space.
      pd_pagesize_version - size in bytes and page layout version number.
      pd_prune_xid - oldest XID among potentially prunable tuples on page.

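PG's PageHeaderData is 24 bytes (8 + 6 × 2 + 4). The openGauss HeapPageHeaderData adds two 8-byte fields on top of that, pd_xid_base and pd_multi_base, for 40 bytes in total. This also shows that openGauss has widened the transaction ID from 4 bytes to 8 bytes: the page header carries the 8-byte base values, while pd_prune_xid and the XIDs in tuple headers are still declared as 4-byte ShortTransactionId.
  • Free space boundaries
    pd_lower points to the start of free space, i.e. the end of the linp array.
    pd_upper points to the end of free space, i.e. the start of tuple data.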
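
The header sizes above can be checked with a small sketch. The struct below is a hypothetical mirror of HeapPageHeaderData written with fixed-width types; it assumes PageXLogRecPtr is a pair of 32-bit fields and LocationIndex is a 16-bit integer. The `full_xid` helper is also an assumption: it treats a normal 4-byte on-page XID as an offset from pd_xid_base and leaves the special values below 3 unchanged, as in PG's reserved XIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width mirror of HeapPageHeaderData (without pd_linp[]). */
typedef struct {
    uint32_t pd_lsn_xlogid;       /* assumed high half of PageXLogRecPtr */
    uint32_t pd_lsn_xrecoff;      /* assumed low half of PageXLogRecPtr */
    uint16_t pd_checksum;
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint16_t pd_pagesize_version;
    uint32_t pd_prune_xid;        /* ShortTransactionId, 4 bytes */
    uint64_t pd_xid_base;         /* TransactionId, 8 bytes */
    uint64_t pd_multi_base;       /* TransactionId, 8 bytes */
} HeapPageHeaderMirror;

#define FIRST_NORMAL_XID 3u       /* assumption: XIDs 0..2 are reserved, as in PG */

/* Size of the part shared with PG's PageHeaderData. */
size_t pg_header_size(void)
{
    return offsetof(HeapPageHeaderMirror, pd_xid_base);
}

/* Size of the openGauss heap page header, i.e. where pd_linp[] starts. */
size_t heap_header_size(void)
{
    return sizeof(HeapPageHeaderMirror);
}

/* Assumed mapping from a 4-byte on-page XID to a full 8-byte XID. */
uint64_t full_xid(uint64_t xid_base, uint32_t short_xid)
{
    if (short_xid < FIRST_NORMAL_XID)
        return short_xid;         /* special XIDs are not rebased */
    return xid_base + short_xid;
}
```

With this layout, `pg_header_size()` yields 24 and `heap_header_size()` yields 40, matching the byte counts derived from the field list.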

3. linp: line pointers

[Figure: 14937200a3d5d9401d570a846c8349bc.png]

    typedef struct ItemIdData {
        unsigned lp_off : 15, /* offset to tuple (from start of page) */
            lp_flags : 2,     /* state of item pointer, see below */
            lp_len : 15;      /* byte length of tuple */
    } ItemIdData;

    /* lp_flags */
    #define LP_UNUSED 0   /* unused (should always have lp_len=0) */
    #define LP_NORMAL 1   /* used (should always have lp_len>0) */
    #define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
    #define LP_DEAD 3     /* dead, may or may not have storage */

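  • Each line pointer is 4 bytes, i.e. 32 bits.

  • The low 15 bits are lp_off, the offset of the tuple from the start of the page.

  • The middle 2 bits are lp_flags, the state of the pointer:
    LP_UNUSED 0: unused, lp_len = 0
    LP_NORMAL 1: in use, lp_len > 0
    LP_REDIRECT 2: HOT redirect that points to the row's new location, lp_len = 0
    LP_DEAD 3: the row is dead

  • The high 15 bits are lp_len, the length in bytes of the tuple.

  • The linp array starts right after the heap page header and ends at pd_lower.
    Number of line pointers: (pd_lower - start of linp) / 4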
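
The bit layout and the pointer count can be sketched as follows. This is an illustration, not openGauss code: it unpacks a line pointer with explicit shifts instead of C bitfields, assumes lp_off sits in the low 15 bits as described above (the layout a typical little-endian build produces), and assumes the linp array starts at byte 40. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HEAP_PAGE_HEADER_SIZE 40u /* assumed start of the linp array */
#define ITEM_ID_SIZE 4u           /* each line pointer is 32 bits */

typedef struct {
    unsigned lp_off;   /* offset of the tuple from the start of the page */
    unsigned lp_flags; /* LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD */
    unsigned lp_len;   /* tuple length in bytes */
} LinePointer;

/* Build a raw 32-bit line pointer: low 15 bits, middle 2 bits, high 15 bits. */
uint32_t pack_line_pointer(unsigned off, unsigned flags, unsigned len)
{
    return (uint32_t)(off & 0x7FFFu)
         | ((uint32_t)(flags & 0x3u) << 15)
         | ((uint32_t)(len & 0x7FFFu) << 17);
}

/* Split a raw 32-bit line pointer into its three fields. */
LinePointer unpack_line_pointer(uint32_t raw)
{
    LinePointer lp;
    lp.lp_off = raw & 0x7FFFu;
    lp.lp_flags = (raw >> 15) & 0x3u;
    lp.lp_len = (raw >> 17) & 0x7FFFu;
    return lp;
}

/* Number of line pointers on a page: (pd_lower - start of linp) / 4. */
unsigned line_pointer_count(uint16_t pd_lower)
{
    assert(pd_lower >= HEAP_PAGE_HEADER_SIZE);
    return (pd_lower - HEAP_PAGE_HEADER_SIZE) / ITEM_ID_SIZE;
}
```

For example, a pointer with lp_off = 8128, lp_flags = 1 and lp_len = 64 survives a pack/unpack round trip, and a page whose pd_lower is 60 holds (60 - 40) / 4 = 5 line pointers.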

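4. Free space

This is the unused space inside the page.

The linp array grows forward from pd_lower.

Tuple data grows backward from pd_upper, i.e. from the end of the page toward the header.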
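
Because the two regions grow toward each other, the room left for a new row is the gap between pd_lower and pd_upper, minus the 4-byte line pointer the new row also needs. A minimal sketch, assuming 8-byte maximum alignment for the tuple; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define ITEM_ID_SIZE 4u                    /* one new line pointer */
#define MAXALIGN(len) (((len) + 7u) & ~7u) /* assumed 8-byte alignment */

/* Bytes available for one more tuple after reserving its line pointer. */
unsigned page_free_space(uint16_t pd_lower, uint16_t pd_upper)
{
    assert(pd_lower <= pd_upper);
    unsigned gap = (unsigned)pd_upper - (unsigned)pd_lower;
    return gap < ITEM_ID_SIZE ? 0u : gap - ITEM_ID_SIZE;
}

/* Nonzero if a tuple of tuple_len bytes (rounded up to 8) still fits. */
int tuple_fits(uint16_t pd_lower, uint16_t pd_upper, unsigned tuple_len)
{
    return MAXALIGN(tuple_len) <= page_free_space(pd_lower, pd_upper);
}
```

On a hypothetical page with pd_lower = 60 and pd_upper = 7872, the free space is 7872 - 60 - 4 = 7808 bytes, so another 64-byte tuple fits.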

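5. Tuple data

Tuple data is the most important part of the page. Each tuple consists of a tuple header followed by the tuple data.

tuple header

  • The tuple header consists of a fixed 23-byte prefix and an optional null bitmap.

[Figure: c81caf803cfd79e40fdd0a32bd64da8c.png]

[Figure: 0f8d781aa6b4c45880ea6a064885d0b3.png]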

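  • t_xmin: the XID of the transaction that inserted this tuple.

  • t_xmax: the XID of the transaction that updated or deleted this tuple. If the tuple has not been updated or deleted since it was inserted, t_xmax = 0.

  • t_cid: the command ID, i.e. how many SQL commands the current transaction had already executed. The first command has cid = 0, the second has cid = 1.

  • t_ctid: still to be investigated for openGauss. In PG, after an update it is the pointer from the old version to the new tuple.

  • t_infomask: flag bits that record, for example, whether the tuple has NULL columns, variable-length columns, or an OID column.

  • The null bitmap is present only when the tuple actually contains a NULL value, which can be tested with t_infomask & 0x0001 (HEAP_HASNULL). Its size depends on the number of columns: one bit per column.

  • t_hoff: the size of the header, including the null bitmap and padding.

  • The tuple header is followed by padding so that its total size is a multiple of 8.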
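
The relationship between the 23-byte prefix, the null bitmap and t_hoff can be sketched as below. This is an illustration under stated assumptions: 8-byte maximum alignment, one bitmap bit per column, and no OID column (an OID would add further bytes). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TUPLE_FIXED_HEADER 23u                       /* fixed prefix size */
#define MAXALIGN(len) (((len) + 7u) & ~(size_t)7u)   /* assumed 8-byte alignment */

/* Bytes needed for a null bitmap covering natts columns (one bit each). */
size_t null_bitmap_bytes(unsigned natts)
{
    return (natts + 7u) / 8u;
}

/* Expected t_hoff for a tuple without an OID column. */
size_t tuple_header_size(unsigned natts, int has_null)
{
    size_t len = TUPLE_FIXED_HEADER;
    if (has_null)
        len += null_bitmap_bytes(natts); /* bitmap only when HEAP_HASNULL is set */
    return MAXALIGN(len);                /* pad to a multiple of 8 */
}
```

A five-column tuple has a 24-byte header with or without NULLs (23 + 1 still rounds to 24), while a nine-column tuple with a NULL needs 23 + 2 = 25 bytes, which rounds up to 32.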

    typedef struct HeapTupleHeaderData {
        union {
            HeapTupleFields t_heap;
            DatumTupleFields t_datum;
        } t_choice;

        ItemPointerData t_ctid; /* current TID of this or newer tuple */

        /* Fields below here must match MinimalTupleData! */

        uint16 t_infomask2; /* number of attributes + various flags */

        uint16 t_infomask; /* various flag bits, see below */

        int8 t_hoff; /* sizeof header incl. bitmap, padding */

        /* ^ - 23 bytes - ^ */

        bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs -- VARIABLE LENGTH */

        /* MORE DATA FOLLOWS AT END OF STRUCT */
    } HeapTupleHeaderData;

    typedef struct HeapTupleFields {
        ShortTransactionId t_xmin; /* inserting xact ID */
        ShortTransactionId t_xmax; /* deleting or locking xact ID */

        union {
            CommandId t_cid;           /* inserting or deleting command ID, or both */
            ShortTransactionId t_xvac; /* old-style VACUUM FULL xact ID */
        } t_field3;
    } HeapTupleFields;

    /*
     * information stored in t_infomask:
     */
    #define HEAP_HASNULL 0x0001          /* has null attribute(s) */
    #define HEAP_HASVARWIDTH 0x0002      /* has variable-width attribute(s) */
    #define HEAP_HASEXTERNAL 0x0004      /* has external stored attribute(s) */
    #define HEAP_HASOID 0x0008           /* has an object-id field */
    #define HEAP_COMPRESSED 0x0010       /* has compressed data */
    #define HEAP_COMBOCID 0x0020         /* t_cid is a combo cid */
    #define HEAP_XMAX_EXCL_LOCK 0x0040   /* xmax is exclusive locker */
    #define HEAP_XMAX_SHARED_LOCK 0x0080 /* xmax is shared locker */
    /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
    #define HEAP_IS_LOCKED (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
    #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
    #define HEAP_XMIN_INVALID 0x0200   /* t_xmin invalid/aborted */
    #define HEAP_XMIN_FROZEN (HEAP_XMIN_INVALID | HEAP_XMIN_COMMITTED)
    #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
    #define HEAP_XMAX_INVALID 0x0800   /* t_xmax invalid/aborted */
    #define HEAP_XMAX_IS_MULTI 0x1000  /* t_xmax is a MultiXactId */
    #define HEAP_UPDATED 0x2000        /* this is UPDATEd version of row */
    #define HEAP_MOVED_OFF                          \
        0x4000 /* moved to another place by pre-9.0 \
                * VACUUM FULL; kept for binary      \
                * upgrade support */
    #define HEAP_MOVED_IN                             \
        0x8000 /* moved from another place by pre-9.0 \
                * VACUUM FULL; kept for binary        \
                * upgrade support */
    #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

    #define HEAP_XACT_MASK 0xFFE0 /* visibility-related bits */

    /*
     * information stored in t_infomask2:
     */
    #define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
    /* bits 0x1800 are available */
    #define HEAP_HAS_REDIS_COLUMNS 0x2000 /* tuple has hidden columns added by redis */
    #define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
    #define HEAP_ONLY_TUPLE 0x8000  /* this is heap-only tuple */

    #define HEAP2_XACT_MASK 0xC000 /* visibility-related bits */

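
To read these masks in practice, the sketch below decodes a t_infomask value into flag names and extracts the attribute count from t_infomask2. The bit values are copied from the definitions above; the helper names `describe_infomask` and `tuple_natts` are hypothetical and only the single-bit t_infomask flags are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HEAP_NATTS_MASK 0x07FF /* low 11 bits of t_infomask2 */

/* Number of attributes stored in the tuple. */
unsigned tuple_natts(uint16_t t_infomask2)
{
    return t_infomask2 & HEAP_NATTS_MASK;
}

/* Write the names of all set t_infomask bits into buf, separated by '|'.
 * Returns the number of characters written; stops early if buf is too small. */
size_t describe_infomask(uint16_t mask, char *buf, size_t size)
{
    static const struct { uint16_t bit; const char *name; } flags[] = {
        { 0x0001, "HEAP_HASNULL" },        { 0x0002, "HEAP_HASVARWIDTH" },
        { 0x0004, "HEAP_HASEXTERNAL" },    { 0x0008, "HEAP_HASOID" },
        { 0x0010, "HEAP_COMPRESSED" },     { 0x0020, "HEAP_COMBOCID" },
        { 0x0040, "HEAP_XMAX_EXCL_LOCK" }, { 0x0080, "HEAP_XMAX_SHARED_LOCK" },
        { 0x0100, "HEAP_XMIN_COMMITTED" }, { 0x0200, "HEAP_XMIN_INVALID" },
        { 0x0400, "HEAP_XMAX_COMMITTED" }, { 0x0800, "HEAP_XMAX_INVALID" },
        { 0x1000, "HEAP_XMAX_IS_MULTI" },  { 0x2000, "HEAP_UPDATED" },
        { 0x4000, "HEAP_MOVED_OFF" },      { 0x8000, "HEAP_MOVED_IN" },
    };
    size_t used = 0;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (!(mask & flags[i].bit))
            continue;
        int n = snprintf(buf + used, size - used, "%s%s",
                         used ? "|" : "", flags[i].name);
        if (n < 0 || (size_t)n >= size - used)
            break; /* output truncated */
        used += (size_t)n;
    }
    return used;
}
```

For instance, t_infomask = 2306 (0x0902) decodes to HEAP_HASVARWIDTH, HEAP_XMIN_COMMITTED and HEAP_XMAX_INVALID: a tuple with a variable-length column, inserted by a committed transaction, and never deleted.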
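tuple data

1. Tuple data consists of multiple columns (attributes).

2. Each attribute is either fixed-length or variable-length.

3. Each attribute has its own alignment requirement.

4. Each attribute has its own storage strategy.

5. NULL values are not stored in tuple data.

6. Tuple data starts at an 8-byte-aligned offset.

Storage strategies for variable-length types (typlen = -1)
  • PLAIN: avoids both compression and out-of-line storage.

  • EXTENDED: allows both compression and out-of-line storage. This is the default for most TOAST-able data types.

  • EXTERNAL: allows out-of-line storage but not compression.

  • MAIN: allows compression and avoids out-of-line storage where possible.

See pg_type.typstorage: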

  • p: Value must always be stored plain.

  • e: Value can be stored in a “secondary” relation (if relation has one, see pg_class.reltoastrelid).

  • m: Value can be stored compressed inline.

  • x: Value can be stored compressed inline or stored in “secondary” storage.

Alignment (the starting address must be a multiple of a given size)

See pg_type.typalign:

  • c = char alignment, i.e., no alignment needed.

  • s = short alignment (2 bytes on most machines).

  • i = int alignment (4 bytes on most machines).

  • d = double alignment (8 bytes on many machines, but by no means all).

     typalign | typstorage | count
    ----------+------------+-------
     c        | p          |     6
     i        | m          |     3
     d        | x          |   269
     s        | p          |     3
     d        | p          |    17
     i        | p          |    39
     i        | x          |    70
    (7 rows)

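pg_type shows the storage length, alignment, and storage strategy of each data type.

The same properties can also be looked up per column of a table through pg_attribute.

The table t below serves as an example.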

    create table t(id int,id2 bigint,c varchar(100),d date,ts timestamp);
    insert into t select i,i,'test'||i,now(),now() from generate_series(1,1000)i;

    test=# SELECT a.attname,
    test-#   pg_catalog.format_type(a.atttypid, a.atttypmod),attlen,
    test-#   (SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid) for 128)
    test(#    FROM pg_catalog.pg_attrdef d
    test(#    WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
    test-#   a.attnotnull, a.attnum,
    test-#   (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type t
    test(#    WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND a.attcollation <> t.typcollation) AS attcollation,
    test-#   a.attidentity,
    test-#   NULL AS indexdef,
    test-#   NULL AS attfdwoptions,
    test-#   a.attstorage,attalign,
    test-#   CASE WHEN a.attstattarget=-1 THEN NULL ELSE a.attstattarget END AS attstattarget, pg_catalog.col_description(a.attrelid, a.attnum)
    test-# FROM pg_catalog.pg_attribute a
    test-# WHERE a.attrelid = 't'::regclass AND a.attnum > 0 AND NOT a.attisdropped
    test-# ORDER BY a.attnum;
     attname |         format_type         | attlen | substring | attnotnull | attnum | attstorage | attalign
    ---------+-----------------------------+--------+-----------+------------+--------+------------+----------
     id      | integer                     |      4 |           | f          |      1 | p          | i
     id2     | bigint                      |      8 |           | f          |      2 | p          | d
     c       | character varying(100)      |     -1 |           | f          |      3 | x          | i
     d       | date                        |      4 |           | f          |      4 | p          | i
     ts      | timestamp without time zone |      8 |           | f          |      5 | p          | d
    (5 rows)

    test=# select * from heap_page_items(get_raw_page('t',0));
     lp  | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
    -----+--------+----------+--------+-----------+--------+----------+---------+-------------+------------+--------+--------+-------+------------------------------------------------------------------------------------
       1 |   8128 |        1 |     64 | 377048720 |      0 |        0 | (0,1)   |           5 |       2306 |     24 |        |       | \x010000000000000001000000000000000d74657374310000e91d000000000000961c1f75bc590200
       2 |   8064 |        1 |     64 | 377048720 |      0 |        0 | (0,2)   |           5 |       2306 |     24 |        |       | \x020000000000000002000000000000000d74657374320000e91d000000000000961c1f75bc590200
       3 |   8000 |        1 |     64 | 377048720 |      0 |        0 | (0,3)   |           5 |       2306 |     24 |        |       | \x030000000000000003000000000000000d74657374330000e91d000000000000961c1f75bc590200
       4 |   7936 |        1 |     64 | 377048720 |      0 |        0 | (0,4)   |           5 |       2306 |     24 |        |       | \x040000000000000004000000000000000d74657374340000e91d000000000000961c1f75bc590200
       5 |   7872 |        1 |     64 | 377048720 |      0 |        0 | (0,5)   |           5 |       2306 |     24 |        |       | \x050000000000000005000000000000000d74657374350000e91d000000000000961c1f75bc590200

    t_data of row 1, split into 8-byte groups:
    0100000000000000 0100000000000000 0d74657374310000 e91d000000000000 961c1f75bc590200

    test=# select * from heap_page_item_attrs(get_raw_page('t',0),'t');
     lp  | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_attrs
    -----+--------+----------+--------+-----------+--------+----------+---------+-------------+------------+--------+--------+-------+-------------------------------------------------------------------------------------------------
       1 |   8128 |        1 |     64 | 377048720 |      0 |        0 | (0,1)   |           5 |       2306 |     24 |        |       | {"\\x01000000","\\x0100000000000000","\\x0d7465737431","\\xe91d0000","\\x961c1f75bc590200"}
       2 |   8064 |        1 |     64 | 377048720 |      0 |        0 | (0,2)   |           5 |       2306 |     24 |        |       | {"\\x02000000","\\x0200000000000000","\\x0d7465737432","\\xe91d0000","\\x961c1f75bc590200"}
       3 |   8000 |        1 |     64 | 377048720 |      0 |        0 | (0,3)   |           5 |       2306 |     24 |        |       | {"\\x03000000","\\x0300000000000000","\\x0d7465737433","\\xe91d0000","\\x961c1f75bc590200"}
       4 |   7936 |        1 |     64 | 377048720 |      0 |        0 | (0,4)   |           5 |       2306 |     24 |        |       | {"\\x04000000","\\x0400000000000000","\\x0d7465737434","\\xe91d0000","\\x961c1f75bc590200"}
       5 |   7872 |        1 |     64 | 377048720 |      0 |        0 | (0,5)   |           5 |       2306 |     24 |        |       | {"\\x05000000","\\x0500000000000000","\\x0d7465737435","\\xe91d0000","\\x961c1f75bc590200"}

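
The padding visible in t_data follows from the attlen and attalign values listed for table t. The sketch below walks the 40 bytes of row 1 and decodes the five columns. It is an illustration only: it assumes little-endian storage, handles only the short (1-byte header) varlena form, and follows the PG convention that a non-zero byte at the current offset is a short header that needs no alignment padding. `RowT` and `decode_row_t` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* t_data of row 1, copied from the heap_page_items output (40 bytes). */
const uint8_t ROW1[40] = {
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* id + 4 bytes padding */
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* id2 */
    0x0d, 0x74, 0x65, 0x73, 0x74, 0x31, 0x00, 0x00, /* c + 2 bytes padding */
    0xe9, 0x1d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* d + 4 bytes padding */
    0x96, 0x1c, 0x1f, 0x75, 0xbc, 0x59, 0x02, 0x00  /* ts */
};

typedef struct {
    int32_t id;
    int64_t id2;
    char c[101];   /* varchar(100) plus terminator */
    int32_t d;     /* raw date value */
    int64_t ts;    /* raw timestamp value */
} RowT;

static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

static uint64_t read_le(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Decode one row of table t. Returns bytes consumed, or 0 on failure. */
size_t decode_row_t(const uint8_t *data, size_t len, RowT *out)
{
    size_t off = 0;

    if (off + 4 > len) return 0;            /* id: attlen 4, attalign i */
    out->id = (int32_t)(uint32_t)read_le(data + off, 4);
    off += 4;

    off = align_up(off, 8);                 /* id2: attlen 8, attalign d */
    if (off + 8 > len) return 0;
    out->id2 = (int64_t)read_le(data + off, 8);
    off += 8;

    if (off >= len) return 0;               /* c: attlen -1, attalign i */
    if (data[off] == 0)
        off = align_up(off, 4);             /* zero byte: padding before a 4-byte header */
    if (off >= len) return 0;
    uint8_t hdr = data[off];
    if (hdr == 0x01 || (hdr & 0x01) == 0)
        return 0;                           /* TOAST pointer or 4-byte header: not handled */
    size_t total = hdr >> 1;                /* length including the header byte */
    if (off + total > len || total - 1 >= sizeof out->c) return 0;
    memcpy(out->c, data + off + 1, total - 1);
    out->c[total - 1] = '\0';
    off += total;

    off = align_up(off, 4);                 /* d: attlen 4, attalign i */
    if (off + 4 > len) return 0;
    out->d = (int32_t)(uint32_t)read_le(data + off, 4);
    off += 4;

    off = align_up(off, 8);                 /* ts: attlen 8, attalign d */
    if (off + 8 > len) return 0;
    out->ts = (int64_t)read_le(data + off, 8);
    off += 8;

    return off;
}
```

Decoding ROW1 places the columns at offsets 0, 8, 16, 24 and 32 and consumes all 40 bytes, which together with t_hoff = 24 accounts for lp_len = 64.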
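Storage of variable-length types

Fixed-length types are aligned and stored directly in tuple data according to pg_attribute.attlen. Variable-length types need special handling because their length is not fixed: the length and related information must be stored together with the data.
pg_attribute.attlen == -1 indicates a variable-length type.

1. TOAST pointer (TOAST mechanism)
The first byte (va_header) is 0x01 and the second byte (va_tag) is 18, the tag value that PG names VARTAG_ONDISK.
Layout: 1-byte va_header + 1-byte va_tag + 4*4-byte varatt_external (the TOAST pointer), 18 bytes in total.

    typedef struct varatt_external {
        int32 va_rawsize;  /* Original data size (includes header) */
        int32 va_extsize;  /* External saved size (doesn't) */
        Oid va_valueid;    /* Unique ID of value within TOAST table */
        Oid va_toastrelid; /* RelID of TOAST table containing it */
    } varatt_external;

2. Short varlena, used when the data length is <= 126 bytes
Test: va_header != 0x01 and (va_header & 0x01) == 0x01

The upper 7 bits of the first byte store the size, which includes the header byte itself; the lowest bit is 1. For example, 0x0d >> 1 = 6: one header byte plus the five bytes of 'test1'.
3. Full 4-byte header varlena
(first 4 bytes >> 2) & 0x3FFFFFFF gives the length, which includes the 4-byte header.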