Normally I write in German, but I hope that maybe some Microsoft Engineer might read it and put the problem on their schedule. So, I apologise to my German readers for writing in poor English… 😉
When the SQL Server optimizer internally transforms a SQL Statement to an other, it normally results in a better execution plan. When You write for example a subquery the optimizer tries to make a join instead. This is fine as long a the result set is the same. But now we discovered a situation when this is not true: If you use a non-deterministic function in a query, the optimizer use the same query plan as if the function were deterministic. Let us examine this simple example:
We have a table with 5 rows: the key consists of the tupel (ID1,ID2). We add a new column to add a GUID later.
ID1 | ID2 | GUID |
---|---|---|
1 | 1 | NULL |
1 | 2 | NULL |
2 | 1 | NULL |
2 | 2 | NULL |
3 | 1 | NULL |
For each different value of ID1 we want to add a GUID with the non-deterministic function NewID():
SELECT
ID1
, NEWID() as [NewId]
FROM Table1
GROUP BY ID1
This results in something like:
ID1 | NewId |
---|---|
1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
3 | 12E9F691-7F8B-41F8-829B-8CED5E26C127 |
Now, we want to use the above query to assign the new values to the column GUID. To make the execution plan easier to understand, I used a SELECT, not an UPDATE. But the result is the same:
SELECT
t.ID1
, sub.NewId
FROM (SELECT
ID1
, NEWID() as [NewId]
FROM Table1
GROUP BY ID1) as sub
JOIN Table1 as t
ON t.ID1=sub.ID1
It should result in something like this:
ID1 | NewId |
---|---|
1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
3 | 12E9F691-7F8B-41F8-829B-8CED5E26C127 |
But this is not the case. The result is:
ID1 | NewId |
---|---|
1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
1 | 4EC62D6E-8ABC-4563-8A81-2DB8E655D3CA |
2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
2 | 73B7386E-688D-48D2-A585-C66C6911EFC3 |
3 | 12E9F691-7F8B-41F8-829B-8CED5E26C127 |
Do you see the difference? The reason for this is the wrong execution plan:
|--Compute Scalar(DEFINE:([Expr1004]=newid()))
|–Merge Join(Inner Join, MERGE:([tempdb].[dbo].[Table1].[ID1])=([t].[ID1]), RESIDUAL:([tempdb].[dbo].[Table1].[ID1]=[tempdb].[dbo].[Table1].[ID1] as [t].[ID1]))
|–Sort(DISTINCT ORDER BY:([tempdb].[dbo].[Table1].[ID1] ASC))
| |–Table Scan(OBJECT:([tempdb].[dbo].[Table1]))
|–Sort(ORDER BY:([t].[ID1] ASC))
|–Table Scan(OBJECT:([tempdb].[dbo].[Table1] AS [t]))
The optimizer choose to do the join first and to execute the function for each row. This is not correct.
So the general rule I recommend: Do not use non-deterministic functions in complex SQL queries. The result depends on the chosen query plan. Today the result may be correct, but if the row numbers change or you add an index the result may get incorrect! This is not as it should be.
Unfortunately I was not able to convince the German support that this is a bug. They send me a workaround like this:
DECLARE @Rows AS BigInt;
SELECT @Rows = Count(*) FROM Table1;
SELECT
t.ID1
, sub.NewId
FROM (SELECT TOP(@Rows)
ID1
, NEWID() as [NewId]
FROM Table1
GROUP BY ID1
ORDER BY NewId) as sub
JOIN Table1 as t
ON t.ID1=sub.ID1
The performance is very poor and I would recommend using a temporary table to store the new GUIDs instead, if you have a big row number. This seems to perform a little better.
Update (21.5.2008): Encouraged by Christoph I send the problem to Craig Freedman. He answered the very next day, that he presented the problem to the developer responsible for this code for his comments on that question. Here is the response of the developer Craig forwarded to me:
In general, SQL Server does not guarantee the timing of execution of scalar operators. For non-deterministic scalars (built-in and user-defined), that means the timing semantics (number of times executed and when) is not defined. In addition, it may change from one plan to another, or from one release to the next.
Craig added:
Unfortunately, I am not aware of (and was unable to identify) a better workaround than simply storing the GUIDs in a temp table before joining with the original table.
(My Thanks to Craig! I am very impressed that he ask the responsible developer. And thanks to the unknown developer.)
Update (11.6.2008): Itzik Ben-Gan did it! Microsoft confirmed to him this is a bug.
Currently I read the very good book "T-SQL Querying" written by Itzik. In the very first chapter he described that the optimizer can make shurtcuts in the physical execution plan if the result set is the same like using the logical execution plan. I asked him about his opinion about this problem and he made Microsoft to accept that this is a bug. He wrote:
I got back a response from Microsoft saying it is a bug, and apparently a regression from SQL Server 2000.
If you think Microsoft should fix this bug, then "vote" for this bug, please. The more people vote, the more likely MS will fix it.
Update (19.6.2008): Thanks for your support. The MS Developer Jim answered to this bug:
Thankyou for this bug report. One of the most interesting discussions around.
This hits to the very heart of the issue – is optimization allowed to change a program's semantics? Ie: if a program yields certain answers, but runs slowly, is it legitimate for a Query Optimizer make that program run faster, yet also change the results given?
Before shouting "NO!" (my own personal inclination too :-), consider: the good news is that, in 99% of cases, the answers ARE the same. So Query Optimization is a clear win. The bad news is that, if the query contains side-effecting code, then different plans CAN indeed yield different results. And NEWID() is one such side-effecting (non-deterministic) 'function' that exposes the difference. [Actually, if you experiment, you can devise others – for example, short-circuit evaluation of AND clauses: make the second clause throw an arithmetic divide-by-zero – different optimizations may execute that second clause BEFORE the first clause] This reflects Craig's explanation, elsewhere in this thread, that SqlServer does not guarantee when scalar operators are executed.
So, we have a choice: if we want to guarantee a certain behavior in the presence of non-deterministic (side-effecting) code – so that results of JOINs, for example, follow the semantics of a nested-loop execution – then we can use appropriate OPTIONs to force that behavior – as UC points out. But the resulting code will run slow – that's the cost of, in effect, hobbling the Query Optimizer.
All that said, we are moving the Query Optimizer in the direction of "as expected" behavior for NEWID() – trading off performance for "results as expected".
[…]
Anyhow, this bug is now assigned to the QO Dev team for a deeper look.
I presume "QO Dev team" is the "Query Optimizer Development team". 😉
Update (27.7.2008): In the last issue of the SQL Server Magazine Newsletter Itzik Ben-Gan wrote some bad news about this bug:
I posted the bug on Microsoft Connect (FeedbackID=350485), and after consideration, Microsoft decided to close the item and mark it as “Won’t Fix”. The reasoning behind the decision not to fix the bug is that in the vast majority of the cases, the optimization aspects that lead to the bug yield better performance without sacrificing the correctness of the query, and if you fall into one of the unusual cases where the correctness of the query is compromised, you can consider alternatives (e.g., physically materialize the data along with the NEWID values in a table).
This was quite unexpected for me. But I think there is nothing we can do about it. Thats a pity…
Hallo,
müssen es unbedingt GUIDs sein?
Alternativ ginge es sonst auch so:
with cte as
(
select
ID1
,row_number() over(order by (select 0)) Grp
from #t
group by
ID1
)
select
t.ID1
,cte.Grp
from cte
inner join #t t
on
t.ID1 = cte.ID1
Bei dem Änderungsvorschlag von MS finde ich es bedenklich, das @Rows als bigint deklariert wird, aber count() int als Rückgabetypen liefert und dieses count() überhaupt durchgeführt wird. Stattdessen liesse sich der maximale int-wert von 2147483647 (oder der von bigint) benutzen:
select
t.ID1
,sub.newid
from
(
select top(2147483647)
ID1
,newid() newid
from #t
group by
ID1
order by
newid
) sub
inner join #t t
on
t.ID1 = sub.ID1
"I hope that maybe some Microsoft Engineer might read it and put the problem on their schedule" -> das kann ich mir nicht vorstellen, aber vielleicht hat Craig Freedman Lust, sich damit auseinander zu setzen: http://blogs.msdn.com/craigfr/
Vg
Christoph Ingenhaag
Hallo,
Deine Idee mit der ROW_NUMBER() gefällt mir gut. Sie dürfte sogar sehr performant sein, wenn der Index passt. Leider sind die Kollegen in dem Projekt ziemlich auf GUIDs festgelegt, das weiss ich noch von anderen Beratungen. Ich gestehe: deswegen habe ich gar nicht versucht ihnen das in dem konkreten Fall auszureden. Außerdem war ich ziemlich sicher, dass das ratz-fatz als Bug anerkannt wird und machte gleich einen Incident dazu auf. Andere weniger offensichtliche Probleme wurden meiste recht bald anerkannt, was mir leider falsche Hoffnung machte. Aber das Problem greift sehr tief in den Optimizer ein. Deswegen waren sie möglicherweise auf dem Dampfer: Was nicht sein darf, das ist auch nicht.
Die Idee mit Craig Freedman greife ich auf, Danke. Ich wollte mir sowieso überlegen, wo ich das am besten ins Rennen schicke… 🙂
Die Sache mit dem COUNT war mir noch gar nicht aufgefallen! Aber stimmt, du hast recht. Du hast einen guten Blick.
PS: Damit Dein SQL so hübsch angezeigt wird, wie Du ihn gemacht hast, habe ich sie mittels "<code lang="sql"></code>" geklammert. Das ist auch grundsätzlich für alle in Kommentaren erlaubt.
Oh incidentally, and also to make things clear, these products
are the ones that I might have done differently if I were to
start my career across again. However, with all the variables who have the possibility to have in the way of constructing a
healthy portfolio, it makes you wonder if it's even possible.
While private investors are usually necessary to invest at the
least R 300 000 in funds 12 months, the idea could be that the network's individuals collectively invest to have
businesses going and finally generate income by selling their holdings within three to five years.
The land is attracting with fresh atmosphere, the National park nearby,
gardens and hills. Moreover, this location looks like it's an added good thing about the
project as it is well-known being connected on the major destinations
of your respective choices. Kandivali goes wrong with own so
much constructions suiting to several class of income groups and offering various prices.
These may 't be capable of heating a full unit however, these may
make some areas more comfortable and cozy. Those seeking a top-notch life and accommodations can decide the high
class constructions. The wellbeing manifestation of these structures additionally assumed a part inside the expanded patterns
of steel buildings in United Kingdom.
Hence make haste and then try to gather different instructions concerning
the real estate affairs. Right in Alabama, the
Alabama State Board of Licensure for Qualified Engineers and Land
Surveyors was placed in 1935 to shield the general public by
aiding "to safeguard way of life, well-being, and property, and to advertise the general public welfare by furnishing for that licensing and regulation of people within the practices of engineering and land surveying. All ALTA surveys must fulfill the minimum requirements as determined jointly through the American Land Title Association as well as the American Congress on Surveying and Mapping (ACSM).
Lucknow is a city that is known for its traditional culture and fascination with art, music and food, but these days it really is learning to be a favorite destination for many big names like HCL, Wipro, IBM etc.
Baroda also ranks amongst the top 3 cities to be the safest cities to
call home in. Godrej Garden City presented by Godrej Properties
is precisely located inside the North-West region of Ahmedabad considered to be Jagatpur.
As such, investment advice is similar to a guide to direct one's path on the correct direction.
5 percent of GDP you commence to acquire an inkling
why bond investors don't want to part with any
further cash. Mutual fund ratings run on the existing and also the past only, with no relation to
the future performance of the particular fund.
Have you ever thought about creating an e-book or guest
authoring on other sites? I have a blog based on the same topics you discuss and would really like to
have you share some stories/information. I know my audience would value your work.
If you're even remotely interested, feel free to shoot me
an email.
A look at yoga weight loss reality andd fiction along with completely different points of a yoga weight reduction plan.
Superdry men's Germany Trophy Collection t-shirt.
They should be sure to look special and naturally feel special to offer the ideal look you desire.
Most make-up artists do charge for a trial because they
have to schedule time on the appointment book.
Get them sorted for your beauticians or think of visiting a threading salon.
Die Trophäe ist allerdings dennoch erspielbar.
Aw, this was an extremely nice post. Taking a few minutes and actual effort to generate
a great article… but what can I say… I put things off a lot and don't manage to get
nearly anything done.
Thanks for some other informative web site.
Where else may I am getting that kind of information written in such a perfect means?
I've a undertaking that I am just now operating on, and I've
been on the glance out for such info.