What is a "bitmap index"?
Answer a question
I have a PostgreSQL query that is taking longer than I'd like. I'm looking at the output of EXPLAIN ANALYZE and it mentions a Bitmap Index Scan. I've been searching the 'net and reading for about 10 minutes, but I cannot figure out:
Is a Bitmap Index a manufactured thing—something that I could improve if I added a real index to some column somewhere—or is it a specific type of a real index?
Here's the single table that I'm querying:
bugbot4b=> \d bug_snapshots
             Table "public.bug_snapshots"
   Column   |            Type             | Modifiers
------------+-----------------------------+-----------
 fixin_id   | integer                     | not null
 created_on | timestamp without time zone | not null
 pain       | integer                     | not null
 status_id  | integer                     | not null
Indexes:
    "bug_snapshots_pkey" PRIMARY KEY, btree (fixin_id, created_on)
Foreign-key constraints:
    "bug_snapshots_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE SET NULL
    "bug_snapshots_status_id_fkey" FOREIGN KEY (status_id) REFERENCES statuses(id)
And here's the result of analyzing the query. Note that there are about 3k distinct fixin_id literally within the query (elided in the below), and the table has 900k rows. Counting just those rows within the particular time range yields 15,000 rows.
EXPLAIN ANALYZE SELECT "created_on", sum("pain") AS "sum_pain" FROM "bug_snapshots"
WHERE (("fixin_id" IN (11,12,33,…,5351))
   AND ("status_id" IN (2, 7, 5, 3))
   AND ("created_on" >= '2013-10-08 16:42:26.994994-0700')
   AND ("created_on" <= '2013-11-07 15:42:26.994994-0800')
   AND ("pain" < 999))
GROUP BY "created_on"
ORDER BY "created_on";
Sort  (cost=59559.33..59559.38 rows=20 width=12) (actual time=19.472..19.494 rows=30 loops=1)
 Sort Key: created_on
 Sort Method:  quicksort  Memory: 18kB
 ->  HashAggregate  (cost=59558.64..59558.89 rows=20 width=12) (actual time=19.401..19.428 rows=30 loops=1)
       ->  Bitmap Heap Scan on bug_snapshots  (cost=9622.42..59509.25 rows=9878 width=12) (actual time=6.849..13.420 rows=6196 loops=1)
             Recheck Cond: ((fixin_id = ANY ('{11,12,33,…,5351}'::integer[])) AND (created_on >= '2013-10-08 16:42:26.994994'::timestamp without time zone) AND (created_on <= '2013-11-07 15:42:26.994994'::timestamp without time zone))
             Filter: ((pain < 999) AND (status_id = ANY ('{2,7,5,3}'::integer[])))
             ->  Bitmap Index Scan on bug_snapshots_pkey  (cost=0.00..9619.95 rows=11172 width=0) (actual time=6.801..6.801 rows=6196 loops=1)
                   Index Cond: ((fixin_id = ANY ('{11,12,33,…,5351}'::integer[])) AND (created_on >= '2013-10-08 16:42:26.994994'::timestamp without time zone) AND (created_on <= '2013-11-07 15:42:26.994994'::timestamp without time zone))
Total runtime: 19.646 ms
(10 rows)
Is the result of ANALYZE telling me that I need to add an index to the fixin_id (and/or other fields) to improve the speed? Or that this is just "slow" because of its size?
Answers
"Bitmap index scan"
Postgres doesn't have "bitmap indexes" per se. A "bitmap index scan" is an index access method that's allowed for certain index types (including default btree indexes). It's especially useful to combine multiple index lookups. The manual:
An index access method can support "plain" index scans, "bitmap" index scans, or both.
You can disable bitmap-scanning (for debugging purposes only!) by setting:
SET enable_bitmapscan = FALSE;
Optimize query performance
With long lists, joining to a derived table is often faster than a lengthy IN expression. You can use VALUES or unnest() for that purpose. Or even a temporary table, possibly with indexes. See:
- Query table by indexes from integer array
SELECT created_on, sum(pain) AS sum_pain
FROM   unnest('{11,12,33,…,5351}'::int[]) AS f(fixin_id)
JOIN   bug_snapshots USING (fixin_id)
WHERE  status_id IN (2, 7, 5, 3)
AND    created_on >= '2013-10-08 16:42:26.994994-0700'::timestamptz
AND    created_on <= '2013-11-07 15:42:26.994994-0800'::timestamptz
AND    pain < 999
GROUP  BY created_on
ORDER  BY created_on;
A partial multicolumn index would probably help (a lot). That depends on details like data distribution, load, stable query conditions, etc. Most importantly, the selectivity of WHERE expressions: a partial index typically only makes sense if many or most rows are excluded. Something like:
CREATE INDEX bug_snapshots_part_idx ON bug_snapshots (fixin_id, created_on, pain)
WHERE  status_id IN (2, 7, 5, 3)
AND    pain < 999;
The sequence of columns in the index matters. That is also true for your primary key, btw, which implements another multicolumn index. See:
- Is a composite index also good for queries on the first field?
db<>fiddle here
  sqlfiddle
Performance testing in fiddles is hardly reliable. Run your own tests! Also there were many improvements to Postgres since 2013, when this answer was written!
timestamp [without time zone]
One more thing: bug_snapshots.created_on is type timestamp. Timestamps are interpreted according to your current time zone setting.
  But in the query you try to compare to literals with time zone (timestamptz). This would work with an explicit cast:
WHERE  created_on >= '2013-10-08 16:42:26.994994-0700'::timestamptz
Your literal would be cast do timestamptz and translated to your local time zone accordingly. However, since you don't provide the data type, Postgres casts your literal to the matching type timestamp (not timestamptz) ignoring the time zone offset. Most probably not your intention!
Consider this test:
SELECT min(created_on), max(created_on)
FROM   bug_snapshots
WHERE  created_on >= '2013-10-08 16:42:26.994994-0700'
AND    created_on <= '2013-11-07 15:42:26.994994-0800'
See:
- Ignoring time zones altogether in Rails and PostgreSQL
更多推荐
 
 
所有评论(0)