我有一个简单的查询,如select * from xxx where col is not null limit 10
. 我不知道为什么 Postgres 更喜欢 seq 扫描,这比部分索引慢得多(我已经分析了表)。如何调试这样的问题?
该表有超过 400 万行。大约 350,000 行满意pid is not null
。
我认为成本估算可能有问题。seq 扫描的成本低于索引扫描。但是如何深入研究呢?
我有一个猜测,但不确定。非空行约占总行数的 10%。这意味着当 seq 扫描 100 行时,它可能会得到 10 行不为空。并且它认为 seq 扫描 100 行的成本低于索引扫描 10 行然后随机获取 10 行的成本。是吗?
> \d data_import
+--------------------+--------------------------+----------------------------------------------------------------------------+
| Column | Type | Modifiers |
|--------------------+--------------------------+----------------------------------------------------------------------------|
| id | integer | not null default nextval('data_import_id_seq'::regclass) |
| name | character varying(64) | |
| market_activity_id | integer | not null |
| hmsr_id | integer | not null default (-1) |
| site_id | integer | not null default (-1) |
| hmpl_id | integer | not null default (-1) |
| hmmd_id | integer | not null default (-1) |
| hmci_id | integer | not null default (-1) |
| hmkw_id | integer | not null default (-1) |
| creator_id | integer | |
| created_at | timestamp with time zone | |
| updated_at | timestamp with time zone | |
| bias | integer | |
| pid | character varying(128) | default NULL::character varying |
+--------------------+--------------------------+----------------------------------------------------------------------------+
Indexes:
"data_import_pkey" PRIMARY KEY, btree (id)
"unique_hmxx" UNIQUE, btree (site_id, hmsr_id, hmpl_id, hmmd_id, hmci_id, hmkw_id) WHERE pid IS NULL
"data_import_pid_idx" UNIQUE, btree (pid) WHERE pid IS NOT NULL
"data_import_created_at_idx" btree (created_at)
"data_import_hmsr_id" btree (hmsr_id)
"data_import_updated_at_idx" btree (updated_at)
> set enable_seqscan to false;
apollon> explain (analyse, verbose) select * from data_import where pid is not null limit 10
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
| QUERY PLAN
|-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Limit (cost=0.42..5.68 rows=10 width=84) (actual time=0.059..0.142 rows=10 loops=1)
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
| -> Index Scan using data_import_pid_idx on public.data_import (cost=0.42..184158.08 rows=350584 width=84) (actual time
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
| Index Cond: (data_import.pid IS NOT NULL)
| Planning time: 0.126 ms
| Execution time: 0.177 ms
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
EXPLAIN
Time: 0.054s
> set enable_seqscan to true;
> explain (analyse, verbose) select * from data_import where pid is not null limit 10
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit (cost=0.00..2.37 rows=10 width=84) (actual time=407.042..407.046 rows=10 loops=1) |
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid |
| -> Seq Scan on public.data_import (cost=0.00..83016.60 rows=350584 width=84) (actual time=407.041..407.045 rows=10 loops=1) |
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid |
| Filter: (data_import.pid IS NOT NULL) |
| Rows Removed by Filter: 3672502 |
| Planning time: 0.116 ms |
| Execution time: 407.078 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.426s
你的问题是
Rows Removed by Filter: 3672502
PostgreSQL 知道值的分布以及它们如何与物理表布局相关联,但它不知道表开头的行都为 NULL pid
。
如果 NULL 分布均匀,顺序扫描将很快找到 10 个命中并停止,但实际上,它必须读取 3672512 行才能找到 10 个匹配的行。
如果您在 之前添加ORDER BY pid
(即使您不需要它)LIMIT
,优化器将做正确的事情。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句