PostgreSQL GIN + B-Tree

Postgredaxiang

77人浏览 · 2022-08-15 06:16:54

Postgredaxiang · 2022-08-15 06:16:54 发布

在之前的帖子中,我展示了如何构建一个表,除了数组数据类型之外,它还充当索引,但由触发器维护并使用显式连接进行查询。 GIN 索引可能更可取,因为它无需声明触发器即可维护,而且即使我们查询表时查询计划器也可以使用它。但是,查询表更有效,因为我的查询除了过滤tag_ids或group_ids之外,还过滤了created_date时间戳。我能够将它添加到我的表自定义表中,但不能添加到 GIN 索引中:

postgres=> create index posts_by_user_group_ids_created on posts_by_user using gin (group_ids, created_date);
ERROR:  data type timestamp with time zone has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

进入全屏模式退出全屏模式

但是,如果我们仍想使用 GIN 索引,则可以将 GIN 索引列与 BTREE 索引列结合起来,这要归功于btree_gin扩展:

postgres=> create extension btree_gin;
CREATE EXTENSION

进入全屏模式退出全屏模式

现在可以使用created_date timestamptz对group_ids[]和tag_ids[]进行索引:

postgres=> create index posts_by_user_group_ids_created on posts_by_user using gin (group_ids, created_date);
CREATE INDEX
Time: 67771.641 ms (01:07.772)
postgres=> create index posts_by_user_tag_ids_created   on posts_by_user using gin (tag_ids, created_date);
CREATE INDEX
Time: 68495.108 ms (01:08.495)

进入全屏模式退出全屏模式

首先,让我们看一下之前的执行计划。

之前的执行计划

我在上一篇文章中创建的表上做了这个,其中填充了 500 万行:

postgres=> \d posts_by_user
                                 Table "public.posts_by_user"
    Column    |           Type           | Collation | Nullable |           Default
--------------+--------------------------+-----------+----------+------------------------------
 user_id      | bigint                   |           | not null |
 post_id      | bigint                   |           | not null | generated always as identity
 group_ids    | bigint[]                 |           |          |
 tag_ids      | bigint[]                 |           |          |
 content      | text                     |           |          |
 created_date | timestamp with time zone |           | not null |
Indexes:
    "posts_by_user_pkey" PRIMARY KEY, btree (user_id, created_date, post_id)
    "posts_by_user_user_id_post_id_key" UNIQUE CONSTRAINT, btree (user_id, post_id)

postgres=> select count(*) from posts_by_user;
  count
---------
 5000000
(1 row)

Time: 417.648 ms

进入全屏模式退出全屏模式

在创建 GIN 索引之前,这是我的查询:

postgres=> explain (analyze, buffers)
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                       QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=275177.48..275177.73 rows=100 width=405) (actual time=188.100..188.121 rows=100 loops=1)
   Buffers: shared hit=69605
   ->  Sort  (cost=275177.48..275178.02 rows=217 width=405) (actual time=188.098..188.111 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: quicksort  Memory: 124kB
         Buffers: shared hit=69605
         ->  Bitmap Heap Scan on posts_by_user  (cost=174740.49..275169.19 rows=217 width=405) (actual time=149.292..187.996 rows=193 loops=1)
               Recheck Cond: (created_date > (now() - '1 mon'::interval))
               Filter: (tag_ids @> '{1}'::bigint[])
               Rows Removed by Filter: 37305
               Heap Blocks: exact=35233
               Buffers: shared hit=69605
               ->  Bitmap Index Scan on posts_by_user_pkey  (cost=0.00..174740.44 rows=35513 width=0) (actual time=140.588..140.589 rows=39145 loops=1)
                     Index Cond: (created_date > (now() - '1 mon'::interval))
                     Buffers: shared hit=32824
 Planning Time: 0.118 ms
 Execution Time: 188.340 ms
(17 rows)

Time: 282.833 ms

进入全屏模式退出全屏模式

这效率不高:从主键索引中读取 32824 个页面以过滤时间戳,但稍后过滤Filter: (tag_ids @> '{1}'::bigint[])的标签。这可以从Rows Removed by Filter: 38714被读取到稍后丢弃。

使用(tag_ids)的 GIN 索引,它更好,但仍然不是最优的:

postgres=> explain (analyze, buffers)
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=89306.17..89306.42 rows=100 width=405) (actual time=9.542..9.562 rows=100 loops=1)
   Buffers: shared hit=1530
   ->  Sort  (cost=89306.17..89306.71 rows=217 width=405) (actual time=9.540..9.552 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: quicksort  Memory: 124kB
         Buffers: shared hit=1530
         ->  Bitmap Heap Scan on posts_by_user  (cost=288.80..89297.88 rows=217 width=405) (actual time=2.009..9.478 rows=193 loops=1)
               Recheck Cond: (tag_ids @> '{1}'::bigint[])
               Filter: (created_date > (now() - '1 mon'::interval))
               Rows Removed by Filter: 26807
               Heap Blocks: exact=1521
               Buffers: shared hit=1530
               ->  Bitmap Index Scan on posts_by_user_tag_ids  (cost=0.00..288.75 rows=30500 width=0) (actual time=1.682..1.682 rows=27000 loops=1)
                     Index Cond: (tag_ids @> '{1}'::bigint[])
                     Buffers: shared hit=9
 Planning Time: 0.129 ms
 Execution Time: 9.594 ms
(17 rows)

Time: 103.207 ms

进入全屏模式退出全屏模式

现在,GIN 索引用于访问来自tag_ids的特定标签,但稍后对时间戳Filter: (created_date > (now() - '1 mon'::interval))的过滤会丢弃Rows Removed by Filter: 26807行。

带btree_gin的执行计划

这是btree_gin创建using gin (tag_ids, created_date)的计划:

postgres=> explain (analyze, buffers)
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                       QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=917.04..917.29 rows=100 width=405) (actual time=56.695..56.710 rows=100 loops=1)
   Buffers: shared hit=454
   ->  Sort  (cost=917.04..917.58 rows=217 width=405) (actual time=56.694..56.700 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: quicksort  Memory: 124kB
         Buffers: shared hit=454
         ->  Bitmap Heap Scan on posts_by_user  (cost=54.23..908.74 rows=217 width=405) (actual time=56.456..56.624 rows=193 loops=1)
               Recheck Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))
               Heap Blocks: exact=182
               Buffers: shared hit=454
               ->  Bitmap Index Scan on posts_by_user_tag_ids_created  (cost=0.00..54.17 rows=217 width=0) (actual time=56.427..56.428 rows=193 loops=1)
                     Index Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))
                     Buffers: shared hit=272
 Planning Time: 0.155 ms
 Execution Time: 57.426 ms
(15 rows)

Time: 151.462 ms

进入全屏模式退出全屏模式

这一次,所有谓词都是索引条件:Index Cond: ((tag_ids @> '{1}'::bigint[]) AND (created_date > (now() - '1 mon'::interval)))并且没有Rows Removed by Filter稍后要删除。

这是 GIN 索引的最佳用法,在 57 毫秒内读取大约 454 个缓冲区(在缓存中时 - 从磁盘读取的时间可能要长得多)。

但是,上一篇文章中提出的解决方案,如果您可以构建和查询附加表,则效率更高:

postgres=> explain (analyze,buffers)
           select posts_by_user.*
            from posts_by_user
            join posts_by_tag
            using(user_id, created_date, post_id)
            where posts_by_tag.created_date
                   > now() - interval '1 month'
                  and tag_id =1
            order by created_date desc limit 100
 ;
                                                                           QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.00..741.19 rows=59 width=405) (actual time=0.039..0.696 rows=100 loops=1)
   Buffers: shared hit=419
   ->  Nested Loop  (cost=1.00..741.19 rows=59 width=405) (actual time=0.038..0.685 rows=100 loops=1)
         Buffers: shared hit=419
         ->  Index Only Scan Backward using posts_by_tag_pkey on posts_by_tag  (cost=0.57..241.75 rows=59 width=24) (actual time=0.021..0.143
rows=100 loops=1)
               Index Cond: ((tag_id = 1) AND (created_date > (now() - '1 mon'::interval)))
               Heap Fetches: 100
               Buffers: shared hit=19
         ->  Index Scan using posts_by_user_user_id_post_id_key on posts_by_user  (cost=0.43..8.45 rows=1 width=405) (actual time=0.005..0.005 rows=1 loops=100)
               Index Cond: ((user_id = posts_by_tag.user_id) AND (post_id = posts_by_tag.post_id))
               Filter: (posts_by_tag.created_date = created_date)
               Buffers: shared hit=400
 Planning Time: 0.261 ms
 Execution Time: 0.731 ms
(14 rows)

Time: 95.173 ms

进入全屏模式退出全屏模式

即使在读取与 GIN 索引相同数量的块时,由于结构相似,简单的 Index Scan 和 Index Only Scan 比 GIN 在 PostgreSQL 上所需的 Bitmap Scans 更快。

我写这些帖子是为了表明在 PostgreSQL 中,你有很多可能性,不仅限于规范化的堆表和 btree 索引。正确的选择需要对访问模式有很好的理解,并通过执行计划进行验证。我将在下一个 PGDay 讨论现代数据建模:

裕嘉数据库

@yugabute

我们很高兴能参加 Nordic PGDay 2022,这是一场#PostgreSQL面对面会议,将于 2022 年 3 月 22 日在芬兰赫尔辛基举行。

不要错过 Yugabyte 的@FranckPachot演讲“现代 SQL 应用程序的数据建模:3NF?ARRAY?JSONB?”美国东部时间上午 11:15(格林威治标准时间 +2)。 🚀

hubs.ly/Q015X49d0

上午 11:00 - 2022 年 3 月 16 日

[](https://twitter.com/intent/retweet?tweet_id u003d1504049841827160066)

这不仅限于 PostgreSQL,而且应该在所有 PostgreSQL 兼容的数据库中都可用。我已经在兼容 PostgreSQL 12.7 的 RDS Aurora 中运行了它。上一篇文章是在 YugabyteDB 上,它有 GIN 索引,没有 Bitmap Scan,与我构建的表非常相似,但完全自动化和透明。btree_gin扩展尚未出现,但很快 - 在问题 7850中跟踪 GIN 功能

PostgreSQL

PostgreSQL社区为您提供最前沿的新闻资讯和知识内容

更多推荐

PostgreSQL 计数查询效率,物化视图 [重复]

问题:PostgreSQL 计数查询效率,物化视图 [重复] 可能重复: PostgreSQL 计数查询优化使用 PostgreSQL 9.2,我们试图弄清楚是否有一种方法可以跟踪查询的结果数量,并以有效的方式返回该数字。这个查询应该每秒执行几次(可能几十到几百甚至几千次)。我们现在的查询看起来像这样,但我们想知道这是否效率低下: -- Get # of rows that do not hav

PostgreSQL

多对多中的唯一性

问题:多对多中的唯一性我无法弄清楚谷歌的哪些术语,所以帮助标记这个问题或只是以相关问题的方式向我指出会有所帮助。我相信我有一个典型的多对多关系: CREATE TABLE groups ( id integer PRIMARY KEY); CREATE TABLE elements ( id integer PRIMARY KEY); CREATE TABLE groups_elements

PostgreSQL

Django 与 postgresql - manage.py syncdb 返回错误

问题:Django 与 postgresql - manage.py syncdb 返回错误我从 Django 开始。我设置了一些使用 SQLite 工作的站点,但是在将 DB 引擎更改为 postgresql manage.py syncdb 后返回错误。我已经用谷歌搜索了 2 天,但对我仍然没有任何作用。Postgres 用户 'joe' 具有超级用户权限和本地 'joe ' 数据库存在。