PostgreSQL GIN + B-Tree
在之前的帖子中,我展示了如何构建一个表,除了数组数据类型之外,它还充当索引,但由触发器维护并使用显式连接进行查询。 GIN 索引可能更可取,因为它无需声明触发器即可维护,而且即使我们查询表时查询计划器也可以使用它。但是,查询表更有效,因为我的查询除了过滤tag_ids或group_ids之外,还过滤了created_date时间戳。我能够将它添加到我的表自定义表中,但不能添加到 GIN 索引中:
postgres=> create index posts_by_user_group_ids_created on posts_by_user using gin (group_ids, created_date);
ERROR: data type timestamp with time zone has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
进入全屏模式 退出全屏模式
但是,如果我们仍想使用 GIN 索引,则可以将 GIN 索引列与 BTREE 索引列结合起来,这要归功于btree_gin扩展:
postgres=> create extension btree_gin;
CREATE EXTENSION
进入全屏模式 退出全屏模式
现在可以使用created_date timestamptz对group_ids[]和tag_ids[]进行索引:
postgres=> create index posts_by_user_group_ids_created on posts_by_user using gin (group_ids, created_date);
CREATE INDEX
Time: 67771.641 ms (01:07.772)
postgres=> create index posts_by_user_tag_ids_created on posts_by_user using gin (tag_ids, created_date);
CREATE INDEX
Time: 68495.108 ms (01:08.495)
进入全屏模式 退出全屏模式
首先,让我们看一下之前的执行计划。
之前的执行计划
我在上一篇文章中创建的表上做了这个,其中填充了 500 万行:
postgres=> \d posts_by_user
Table "public.posts_by_user"
Column | Type | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+------------------------------
user_id | bigint | | not null |
post_id | bigint | | not null | generated always as identity
group_ids | bigint[] | | |
tag_ids | bigint[] | | |
content | text | | |
created_date | timestamp with time zone | | not null |
Indexes:
"posts_by_user_pkey" PRIMARY KEY, btree (user_id, created_date, post_id)
"posts_by_user_user_id_post_id_key" UNIQUE CONSTRAINT, btree (user_id, post_id)
postgres=> select count(*) from posts_by_user;
count
---------
5000000
(1 row)
Time: 417.648 ms
进入全屏模式 退出全屏模式
在创建 GIN 索引之前,这是我的查询:
postgres=> explain (analyze, buffers)
select *
from posts_by_user
where created_date > now() - interval '1 month'
and tag_ids @>'{1}'
order by created_date desc limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=275177.48..275177.73 rows=100 width=405) (actual time=188.100..188.121 rows=100 loops=1)
Buffers: shared hit=69605
-> Sort (cost=275177.48..275178.02 rows=217 width=405) (actual time=188.098..188.111 rows=100 loops=1)
Sort Key: created_date DESC
Sort Method: quicksort Memory: 124kB
Buffers: shared hit=69605
-> Bitmap Heap Scan on posts_by_user (cost=174740.49..275169.19 rows=217 width=405) (actual time=149.292..187.996 rows=193 loops=1)
Recheck Cond: (created_date > (now() - '1 mon'::interval))
Filter: (tag_ids @> '{1}'::bigint[])
Rows Removed by Filter: 37305
Heap Blocks: exact=35233
Buffers: shared hit=69605
-> Bitmap Index Scan on posts_by_user_pkey (cost=0.00..174740.44 rows=35513 width=0) (actual time=140.588..140.589 rows=39145 loops=1)
Index Cond: (created_date > (now() - '1 mon'::interval))
Buffers: shared hit=32824
Planning Time: 0.118 ms
Execution Time: 188.340 ms
(17 rows)
Time: 282.833 ms
进入全屏模式 退出全屏模式
这效率不高:从主键索引中读取 32824 个页面以过滤时间戳,但稍后过滤Filter: (tag_ids @> '{1}'::bigint[])的标签。这可以从Rows Removed by Filter: 38714被读取到稍后丢弃。
使用(tag_ids)的 GIN 索引,它更好,但仍然不是最优的:
postgres=> explain (analyze, buffers)
select *
from posts_by_user
where created_date > now() - interval '1 month'
and tag_ids @>'{1}'
order by created_date desc limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=89306.17..89306.42 rows=100 width=405) (actual time=9.542..9.562 rows=100 loops=1)
Buffers: shared hit=1530
-> Sort (cost=89306.17..89306.71 rows=217 width=405) (actual time=9.540..9.552 rows=100 loops=1)
Sort Key: created_date DESC
Sort Method: quicksort Memory: 124kB
Buffers: shared hit=1530
-> Bitmap Heap Scan on posts_by_user (cost=288.80..89297.88 rows=217 width=405) (actual time=2.009..9.478 rows=193 loops=1)
Recheck Cond: (tag_ids @> '{1}'::bigint[])
Filter: (created_date > (now() - '1 mon'::interval))
Rows Removed by Filter: 26807
Heap Blocks: exact=1521
Buffers: shared hit=1530
-> Bitmap Index Scan on posts_by_user_tag_ids (cost=0.00..288.75 rows=30500 width=0) (actual time=1.682..1.682 rows=27000 loops=1)
Index Cond: (tag_ids @> '{1}'::bigint[])
Buffers: shared hit=9
Planning Time: 0.129 ms
Execution Time: 9.594 ms
(17 rows)
Time: 103.207 ms
进入全屏模式 退出全屏模式
现在,GIN 索引用于访问来自tag_ids的特定标签,但稍后对时间戳Filter: (created_date > (now() - '1 mon'::interval))的过滤会丢弃Rows Removed by Filter: 26807行。
带btree_gin的执行计划
这是btree_gin创建using gin (tag_ids, created_date)的计划:
postgres=> explain (analyze, buffers)
select *
from posts_by_user
where created_date > now() - interval '1 month'
and tag_ids @>'{1}'
order by created_date desc limit 100;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=917.04..917.29 rows=100 width=405) (actual time=56.695..56.710 rows=100 loops=1)
Buffers: shared hit=454
-> Sort (cost=917.04..917.58 rows=217 width=405) (actual time=56.694..56.700 rows=100 loops=1)
Sort Key: created_date DESC
Sort Method: quicksort Memory: 124kB
Buffers: shared hit=454
-> Bitmap Heap Scan on posts_by_user (cost=54.23..908.74 rows=217 width=405) (actual time=56.456..56.624 rows=193 loops=1)
Recheck Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))
Heap Blocks: exact=182
Buffers: shared hit=454
-> Bitmap Index Scan on posts_by_user_tag_ids_created (cost=0.00..54.17 rows=217 width=0) (actual time=56.427..56.428 rows=193 loops=1)
Index Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))
Buffers: shared hit=272
Planning Time: 0.155 ms
Execution Time: 57.426 ms
(15 rows)
Time: 151.462 ms
进入全屏模式 退出全屏模式
这一次,所有谓词都是索引条件:Index Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))并且没有Rows Removed by Filter稍后要删除。
这是 GIN 索引的最佳用法,在 57 毫秒内读取大约 454 个缓冲区(在缓存中时 - 从磁盘读取的时间可能要长得多)。
但是,上一篇文章中提出的解决方案,如果您可以构建和查询附加表,则效率更高:
postgres=> explain (analyze,buffers)
select posts_by_user.*
from posts_by_user
join posts_by_tag
using(user_id, created_date, post_id)
where posts_by_tag.created_date
> now() - interval '1 month'
and tag_id =1
order by created_date desc limit 100
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.00..741.19 rows=59 width=405) (actual time=0.039..0.696 rows=100 loops=1)
Buffers: shared hit=419
-> Nested Loop (cost=1.00..741.19 rows=59 width=405) (actual time=0.038..0.685 rows=100 loops=1)
Buffers: shared hit=419
-> Index Only Scan Backward using posts_by_tag_pkey on posts_by_tag (cost=0.57..241.75 rows=59 width=24) (actual time=0.021..0.143
rows=100 loops=1)
Index Cond: ((tag_id = 1) AND (created_date > (now() - '1 mon'::interval)))
Heap Fetches: 100
Buffers: shared hit=19
-> Index Scan using posts_by_user_user_id_post_id_key on posts_by_user (cost=0.43..8.45 rows=1 width=405) (actual time=0.005..0.005 rows=1 loops=100)
Index Cond: ((user_id = posts_by_tag.user_id) AND (post_id = posts_by_tag.post_id))
Filter: (posts_by_tag.created_date = created_date)
Buffers: shared hit=400
Planning Time: 0.261 ms
Execution Time: 0.731 ms
(14 rows)
Time: 95.173 ms
进入全屏模式 退出全屏模式
即使在读取与 GIN 索引相同数量的块时,由于结构相似,简单的 Index Scan 和 Index Only Scan 比 GIN 在 PostgreSQL 上所需的 Bitmap Scans 更快。
我写这些帖子是为了表明在 PostgreSQL 中,你有很多可能性,不仅限于规范化的堆表和 btree 索引。正确的选择需要对访问模式有很好的理解,并通过执行计划进行验证。我将在下一个 PGDay 讨论现代数据建模:
裕嘉数据库
@yugabute
我们很高兴能参加 Nordic PGDay 2022,这是一场#PostgreSQL面对面会议,将于 2022 年 3 月 22 日在芬兰赫尔辛基举行。
不要错过 Yugabyte 的@FranckPachot演讲“现代 SQL 应用程序的数据建模:3NF?ARRAY?JSONB?”美国东部时间上午 11:15(格林威治标准时间 +2)。 🚀
hubs.ly/Q015X49d0
上午 11:00 - 2022 年 3 月 16 日
[
](https://twitter.com/intent/retweet?tweet_id u003d1504049841827160066)
这不仅限于 PostgreSQL,而且应该在所有 PostgreSQL 兼容的数据库中都可用。我已经在兼容 PostgreSQL 12.7 的 RDS Aurora 中运行了它。上一篇文章是在 YugabyteDB 上,它有 GIN 索引,没有 Bitmap Scan,与我构建的表非常相似,但完全自动化和透明。btree_gin扩展尚未出现,但很快 - 在问题 7850中跟踪 GIN 功能
更多推荐


所有评论(0)