在之前的帖子中,我展示了如何构建一个表,除了数组数据类型之外,它还充当索引,但由触发器维护并使用显式连接进行查询。 GIN 索引可能更可取,因为它无需声明触发器即可维护,而且即使我们查询表时查询计划器也可以使用它。但是,查询表更有效,因为我的查询除了过滤tag_idsgroup_ids之外,还过滤了created_date时间戳。我能够将它添加到我的表自定义表中,但不能添加到 GIN 索引中:

postgres=> create index posts_by_user_group_ids_created on posts_by_user using gin (group_ids, created_date);
ERROR:  data type timestamp with time zone has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

进入全屏模式 退出全屏模式

但是,如果我们仍想使用 GIN 索引,则可以将 GIN 索引列与 BTREE 索引列结合起来,这要归功于btree_gin扩展:

postgres=> create extension btree_gin;
CREATE EXTENSION

进入全屏模式 退出全屏模式

现在可以使用created_date timestamptzgroup_ids[]tag_ids[]进行索引:

postgres=> create index posts_by_user_group_ids_created on posts_by_user using gin (group_ids, created_date);
CREATE INDEX
Time: 67771.641 ms (01:07.772)
postgres=> create index posts_by_user_tag_ids_created   on posts_by_user using gin (tag_ids, created_date);
CREATE INDEX
Time: 68495.108 ms (01:08.495)

进入全屏模式 退出全屏模式

首先,让我们看一下之前的执行计划。

之前的执行计划

我在上一篇文章中创建的表上做了这个,其中填充了 500 万行:

postgres=> \d posts_by_user
                                 Table "public.posts_by_user"
    Column    |           Type           | Collation | Nullable |           Default
--------------+--------------------------+-----------+----------+------------------------------
 user_id      | bigint                   |           | not null |
 post_id      | bigint                   |           | not null | generated always as identity
 group_ids    | bigint[]                 |           |          |
 tag_ids      | bigint[]                 |           |          |
 content      | text                     |           |          |
 created_date | timestamp with time zone |           | not null |
Indexes:
    "posts_by_user_pkey" PRIMARY KEY, btree (user_id, created_date, post_id)
    "posts_by_user_user_id_post_id_key" UNIQUE CONSTRAINT, btree (user_id, post_id)

postgres=> select count(*) from posts_by_user;
  count
---------
 5000000
(1 row)

Time: 417.648 ms

进入全屏模式 退出全屏模式

在创建 GIN 索引之前,这是我的查询:

postgres=> explain (analyze, buffers)
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                       QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=275177.48..275177.73 rows=100 width=405) (actual time=188.100..188.121 rows=100 loops=1)
   Buffers: shared hit=69605
   ->  Sort  (cost=275177.48..275178.02 rows=217 width=405) (actual time=188.098..188.111 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: quicksort  Memory: 124kB
         Buffers: shared hit=69605
         ->  Bitmap Heap Scan on posts_by_user  (cost=174740.49..275169.19 rows=217 width=405) (actual time=149.292..187.996 rows=193 loops=1)
               Recheck Cond: (created_date > (now() - '1 mon'::interval))
               Filter: (tag_ids @> '{1}'::bigint[])
               Rows Removed by Filter: 37305
               Heap Blocks: exact=35233
               Buffers: shared hit=69605
               ->  Bitmap Index Scan on posts_by_user_pkey  (cost=0.00..174740.44 rows=35513 width=0) (actual time=140.588..140.589 rows=39145 loops=1)
                     Index Cond: (created_date > (now() - '1 mon'::interval))
                     Buffers: shared hit=32824
 Planning Time: 0.118 ms
 Execution Time: 188.340 ms
(17 rows)

Time: 282.833 ms

进入全屏模式 退出全屏模式

这效率不高:从主键索引中读取 32824 个页面以过滤时间戳,但稍后过滤Filter: (tag_ids @> '{1}'::bigint[])的标签。这可以从Rows Removed by Filter: 38714被读取到稍后丢弃。

使用(tag_ids)的 GIN 索引,它更好,但仍然不是最优的:

postgres=> explain (analyze, buffers)
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=89306.17..89306.42 rows=100 width=405) (actual time=9.542..9.562 rows=100 loops=1)
   Buffers: shared hit=1530
   ->  Sort  (cost=89306.17..89306.71 rows=217 width=405) (actual time=9.540..9.552 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: quicksort  Memory: 124kB
         Buffers: shared hit=1530
         ->  Bitmap Heap Scan on posts_by_user  (cost=288.80..89297.88 rows=217 width=405) (actual time=2.009..9.478 rows=193 loops=1)
               Recheck Cond: (tag_ids @> '{1}'::bigint[])
               Filter: (created_date > (now() - '1 mon'::interval))
               Rows Removed by Filter: 26807
               Heap Blocks: exact=1521
               Buffers: shared hit=1530
               ->  Bitmap Index Scan on posts_by_user_tag_ids  (cost=0.00..288.75 rows=30500 width=0) (actual time=1.682..1.682 rows=27000 loops=1)
                     Index Cond: (tag_ids @> '{1}'::bigint[])
                     Buffers: shared hit=9
 Planning Time: 0.129 ms
 Execution Time: 9.594 ms
(17 rows)

Time: 103.207 ms

进入全屏模式 退出全屏模式

现在,GIN 索引用于访问来自tag_ids的特定标签,但稍后对时间戳Filter: (created_date > (now() - '1 mon'::interval))的过滤会丢弃Rows Removed by Filter: 26807行。

带btree_gin的执行计划

这是btree_gin创建using gin (tag_ids, created_date)的计划:

postgres=> explain (analyze, buffers)
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                       QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=917.04..917.29 rows=100 width=405) (actual time=56.695..56.710 rows=100 loops=1)
   Buffers: shared hit=454
   ->  Sort  (cost=917.04..917.58 rows=217 width=405) (actual time=56.694..56.700 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: quicksort  Memory: 124kB
         Buffers: shared hit=454
         ->  Bitmap Heap Scan on posts_by_user  (cost=54.23..908.74 rows=217 width=405) (actual time=56.456..56.624 rows=193 loops=1)
               Recheck Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))
               Heap Blocks: exact=182
               Buffers: shared hit=454
               ->  Bitmap Index Scan on posts_by_user_tag_ids_created  (cost=0.00..54.17 rows=217 width=0) (actual time=56.427..56.428 rows=193 loops=1)
                     Index Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))
                     Buffers: shared hit=272
 Planning Time: 0.155 ms
 Execution Time: 57.426 ms
(15 rows)

Time: 151.462 ms

进入全屏模式 退出全屏模式

这一次,所有谓词都是索引条件:Index Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))并且没有Rows Removed by Filter稍后要删除。

这是 GIN 索引的最佳用法,在 57 毫秒内读取大约 454 个缓冲区(在缓存中时 - 从磁盘读取的时间可能要长得多)。

但是,上一篇文章中提出的解决方案,如果您可以构建和查询附加表,则效率更高:

postgres=> explain (analyze,buffers)
           select posts_by_user.*
            from posts_by_user
            join posts_by_tag
            using(user_id, created_date, post_id)
            where posts_by_tag.created_date
                   > now() - interval '1 month'
                  and tag_id =1
            order by created_date desc limit 100
 ;
                                                                           QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.00..741.19 rows=59 width=405) (actual time=0.039..0.696 rows=100 loops=1)
   Buffers: shared hit=419
   ->  Nested Loop  (cost=1.00..741.19 rows=59 width=405) (actual time=0.038..0.685 rows=100 loops=1)
         Buffers: shared hit=419
         ->  Index Only Scan Backward using posts_by_tag_pkey on posts_by_tag  (cost=0.57..241.75 rows=59 width=24) (actual time=0.021..0.143
rows=100 loops=1)
               Index Cond: ((tag_id = 1) AND (created_date > (now() - '1 mon'::interval)))
               Heap Fetches: 100
               Buffers: shared hit=19
         ->  Index Scan using posts_by_user_user_id_post_id_key on posts_by_user  (cost=0.43..8.45 rows=1 width=405) (actual time=0.005..0.005 rows=1 loops=100)
               Index Cond: ((user_id = posts_by_tag.user_id) AND (post_id = posts_by_tag.post_id))
               Filter: (posts_by_tag.created_date = created_date)
               Buffers: shared hit=400
 Planning Time: 0.261 ms
 Execution Time: 0.731 ms
(14 rows)

Time: 95.173 ms

进入全屏模式 退出全屏模式

即使在读取与 GIN 索引相同数量的块时,由于结构相似,简单的 Index Scan 和 Index Only Scan 比 GIN 在 PostgreSQL 上所需的 Bitmap Scans 更快。

我写这些帖子是为了表明在 PostgreSQL 中,你有很多可能性,不仅限于规范化的堆表和 btree 索引。正确的选择需要对访问模式有很好的理解,并通过执行计划进行验证。我将在下一个 PGDay 讨论现代数据建模:

未知的推文媒体内容

YugabyteDB 个人资料图片

裕嘉数据库

@yugabute

推特徽标

我们很高兴能参加 Nordic PGDay 2022,这是一场#PostgreSQL面对面会议,将于 2022 年 3 月 22 日在芬兰赫尔辛基举行。

不要错过 Yugabyte 的@FranckPachot演讲“现代 SQL 应用程序的数据建模:3NF?ARRAY?JSONB?”美国东部时间上午 11:15(格林威治标准时间 +2)。 🚀

hubs.ly/Q015X49d0

上午 11:00 - 2022 年 3 月 16 日

Twitter 回复操作[Twitter 转发操作](https://twitter.com/intent/retweet?tweet_id u003d1504049841827160066)推特赞动作

这不仅限于 PostgreSQL,而且应该在所有 PostgreSQL 兼容的数据库中都可用。我已经在兼容 PostgreSQL 12.7 的 RDS Aurora 中运行了它。上一篇文章是在 YugabyteDB 上,它有 GIN 索引,没有 Bitmap Scan,与我构建的表非常相似,但完全自动化和透明。btree_gin扩展尚未出现,但很快 - 在问题 7850中跟踪 GIN 功能

Logo

PostgreSQL社区为您提供最前沿的新闻资讯和知识内容

更多推荐