可以肯定地说,这是一个奇怪的问题,但是对于这种行为的解释我有些困惑:
背景:(不需要知道)
因此,开始时,我正在编写一个快速查询,并粘贴的列表,UNIQUERIDENTIFIER
并希望它们在WHERE X IN (...)
子句中统一。过去,我UNIQUERIDENTIFIER
在列表的顶部使用了一个空(全零),因此我可以粘贴以下形式的统一集UNIQUERIDENTIFIER
:,'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'
。这次,为避免碰到零,我插入了一个NEWID()
想法,认为几乎不可能发生碰撞的可能性,这令我惊讶地产生了成千上万的额外结果,例如桌子的50%以上。
开始提问:(您确实需要了解的部分)
该查询:
-- SETUP: (i boiled this down to the bare minimum)
-- just creating a table with 500 PK UNIQUERIDENTIFIERs
IF (OBJECT_ID('tempdb..#wtfTable') IS NOT NULL) DROP TABLE #wtfTable;
CREATE TABLE #wtfTable (WtfId UNIQUEIDENTIFIER PRIMARY KEY);
INSERT INTO #wtfTable
SELECT TOP(500) NEWID()
FROM master.sys.all_objects o1 (NOLOCK)
CROSS JOIN master.sys.all_objects o2 (NOLOCK);
-- ACTUAL QUERY:
SELECT *
FROM #wtfTable
WHERE [WtfId] IN ('00000000-0000-0000-0000-000000000000', NEWID());
...应该从统计上产生bupkis。但是,如果您运行十次左右,有时您会得到大量选择。例如,在上一次运行中,我收到465/500行,这意味着返回了93%以上的行。
虽然我知道NEWID()
将按行重新计算,但在地狱中并没有统计到如此之高的机会。我在这里写的所有内容都是产生细微差别所必需的SELECT
,删除任何东西都可以防止它的发生。顺便说一句,您可以将替换IN
为WHERE WtfId = '...' OR WtfId = NEWID()
,仍然会得到相同的结果。我正在使用最新的SQL SERVER 2014 Standard修补程序,没有我知道的任何奇特设置被激活。
那么外面有人知道这是怎么回事吗?提前致谢。
编辑:
这'00000000-0000-0000-0000-000000000000'
是一个红色鲱鱼,这是一个适用于整数的版本:(有趣的是,我需要将表的大小用整数提高到1000,以产生有问题的查询计划...)
IF (OBJECT_ID('tempdb..#wtfTable') IS NOT NULL) DROP TABLE #wtfTable;
CREATE TABLE #wtfTable (WtfId INT PRIMARY KEY);
INSERT INTO #wtfTable
SELECT DISTINCT TOP(1000) CAST(CAST('0x' + LEFT(NEWID(), 8) AS VARBINARY) AS INT)
FROM sys.tables o1 (NOLOCK)
CROSS JOIN sys.tables o2 (NOLOCK);
SELECT *
FROM #wtfTable
WHERE [WtfId] IN (0, CAST(CAST('0x' + LEFT(NEWID(), 8) AS VARBINARY) AS INT));
或者您可以只替换文字UNIQUEIDENTIFIER
并执行以下操作:
DECLARE @someId UNIQUEIDENTIFIER = NEWID();
SELECT *
FROM #wtfTable
WHERE [WtfId] IN (@someId, NEWID());
两者都产生相同的结果...问题是为什么会这样?
让我们看一下执行计划。
在此查询的特定运行中,Seek
返回了51行,而不是估计的1行。
以下实际查询生成的形状相同的计划,但是更容易分析,因为我们有两个变量@ID1
和@ID2
,您可以在计划中对其进行跟踪。
CREATE TABLE #wtfTable (WtfId UNIQUEIDENTIFIER PRIMARY KEY);
INSERT INTO #wtfTable
SELECT TOP(500) NEWID()
FROM master.sys.all_objects o1 (NOLOCK)
CROSS JOIN master.sys.all_objects o2 (NOLOCK);
DECLARE @ID1 UNIQUEIDENTIFIER;
DECLARE @ID2 UNIQUEIDENTIFIER;
SELECT TOP(1) @ID1 = WtfId
FROM #wtfTable
ORDER BY WtfId;
SELECT TOP(1) @ID2 = WtfId
FROM #wtfTable
ORDER BY WtfId DESC;
-- ACTUAL QUERY:
SELECT *
FROM #wtfTable
WHERE WtfId IN (@ID1, @ID2);
DROP TABLE #wtfTable;
如果仔细检查该计划中的运算符,您会看到IN
查询的一部分被转换为具有两行三列的表。该Concatenation
运营商将返回该表。该帮助器表中的每一行都定义了索引范围。
ExpFrom ExpTo ExpFlags
@ID1 @ID1 62
@ID2 @ID2 62
Internal ExpFlags
specify what kind of range seek is needed (<
, <=
, >
, >=
). If you add more variables to IN
clause you'll see them in the plan concatenated to this helper table.
Sort
and Merge Interval
operators make sure that any possible overlapping ranges are merged. See detailed post about Merge Interval
operator by Fabiano Amorim which examines the plans with this shape. Here is another good post about this plan shape by Paul White.
In the end the helper table with two rows is joined with the main table and for each row in the helper table there is a range seek in the clustered index from ExpFrom
to ExpTo
, which is shown in the Index Seek
operator. The Seek
operator shows <
and >
, but it is misleading. The actual comparison is defined internally by the Flags
value.
If you had some different set of ranges, for example:
WHERE
([WtfId] >= @ID1 AND [WtfId] < @ID2)
OR [WtfId] = @ID3
, you would still see the same shape of the plan with the same seek predicate, but different Flags
values.
So, there are two seeks:
from @ID1 to @ID1, which returns one row
from @ID2 to @ID2, which returns one row
In the query with variables internal expressions lead to getting values from the variables when needed. The value of the variable doesn't change during the query execution and everything behaves correctly as expected.
NEWID()
affects itWhen we use NEWID
as in your example:
SELECT *
FROM #wtfTable
WHERE WtfId IN ('00000000-0000-0000-0000-000000000000', NEWID());
the plan and all internal processing is the same as for variables.
The difference is that this internal table effectively becomes:
ExpFrom ExpTo ExpFlags
0...0 0...0 62
NEWID() NEWID() 62
NEWID()
is called two times. Naturally, each call produces a different value, which by chance results in a range that covers some existing values in the table.
There are two range scans of the clustered index with ranges
from `0...0` to `0...0`
from `some_id_1` to `some_id_2`
Now it is easy to see how such query can return some rows, even though the chances of NEWID
collision is very small.
Apparently, optimiser thinks that it can call NEWID
twice instead of remembering the first generated random value and using it further in the query. There have been other cases when optimiser called NEWID
more times than expected producing similar seemingly impossible results.
For example:
Is it legal for SQL Server to fill PERSISTED columns with data that does not match the definition?
Inconsistent results with NEWID() and PERSISTED computed column
Optimiser should know that NEWID()
is non-deterministic. Overall, it feels like a bug.
I don't know anything about SQL Server internals, but my wild guess looks like this: There are runtime constant functions like RAND()
. NEWID()
was put into this category by mistake. Then somebody noticed that people do not expect it to return the same ID in the same fashion as RAND()
returns the same random number for each invocation. And they patched it by actually regenerating new ID each time NEWID()
appears in expressions. But overall rules for optimiser remained the same as for RAND()
, so higher level optimiser thinks that all invocations of NEWID()
return the same value and freely rearranges expressions with NEWID()
which leads to unexpected results.
There is another question about a similar strange behaviour of NEWID()
:
答案是说,有一个Connect错误报告,并且将其关闭为“无法修复”。Microsoft的评论实质上说这种行为是设计使然。
优化器不保证标量函数的执行时间或执行次数。这是一个长久以来的宗旨。这是基本的“余地”,它使优化程序有足够的自由来获得查询计划执行的重大改进。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句