Improving horrible MERGE performance

Question

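I recently asked a question about how to solve a problem in a T-SQL query, which led me to use a MERGE statement. This has proven problematic, however, because it performs horribly.

What I need to do is insert rows based on a result set, and save the ID of each inserted row together with the data it came from (see the related question).

I ended up with a query like this.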

DECLARE @temp AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

;WITH person AS
(
    SELECT top 1
        t.[Personnumber]
        ,t.[Firstname]
        ,t.[Lastname]
    FROM [temp].[RawRoles] t
    WHERE t.Personnumber NOT IN 
        (
            SELECT i.Account FROM [security].[Accounts] i
        )
)

MERGE [security].[Identities] AS tar
USING person AS src
ON 0 = 1 -- all rows from src need to be inserted, ive already filtered out using CTE Query.
WHEN NOT MATCHED THEN
   INSERT
   (
     [Created], [Updated]
   )
   VALUES
   (
        GETUTCDATE(), GETUTCDATE()
   )
OUTPUT $action, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname]  INTO @temp;


SELECT * FROM @temp

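With this query I insert all the rows and then save them, together with the source values, to a temporary table for later processing.

This works fine for fewer than 10k rows. But the dataset I'm running this on is close to 2 million rows. I ran this query for about an hour and it still had not finished (on a beefed-up premium Azure database).

Question: How can I make this faster? Can I achieve the same result without a merge?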

Answer 1

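It seems to me that your Identities table is only being used as a sequence generator, since you are not inserting anything into it except timestamps. Have you considered using a SEQUENCE instead of a table to generate the keys? Using a sequence might eliminate this step entirely, since you could generate keys whenever you need them.

Outputting millions of rows into a table variable is unlikely to be workable. Table variables are generally good for a few thousand rows at most.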
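If the MERGE has to stay, one option suggested by the point about table variables is to send the OUTPUT rows to a temporary table instead. The following is only a sketch under assumptions: #inserted is a hypothetical name, its columns are copied from @temp in the question, and the TOP 1 from the question's CTE is left out. Whether this is actually faster on the 2 million row set has not been measured here.

```sql
-- Hypothetical temp table; columns mirror @temp from the question.
CREATE TABLE #inserted (
      [action]       NVARCHAR(20)
     ,[GlobalId]     BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname]    NVARCHAR(100)
     ,[Lastname]     NVARCHAR(100)
);

MERGE [security].[Identities] AS tar
USING
(
    -- Same filter as the question's CTE, without TOP 1.
    SELECT t.[Personnumber], t.[Firstname], t.[Lastname]
    FROM [temp].[RawRoles] AS t
    WHERE t.Personnumber NOT IN
        (
            SELECT i.Account FROM [security].[Accounts] AS i
        )
) AS src
ON 0 = 1 -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT ([Created], [Updated])
    VALUES (GETUTCDATE(), GETUTCDATE())
OUTPUT $action, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname]
INTO #inserted ([action], [GlobalId], [Personnumber], [Firstname], [Lastname]);

-- Later processing reads from #inserted instead of @temp.
SELECT COUNT(*) AS rows_captured FROM #inserted;

DROP TABLE #inserted;
```

Unlike a table variable, a temporary table carries column statistics, which gives the optimizer row count information for any later joins against the captured rows.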

INSERT INTO security.Accounts (GlobalId, Account, Firstname, Lastname)
SELECT NEXT VALUE FOR AccountSeq, r.Personnumber, r.Firstname, r.Lastname
FROM temp.RawRoles AS r
LEFT JOIN security.Accounts AS a ON r.Personnumber = a.Account
WHERE a.Account IS NULL;

INSERT INTO security.identities (GlobalId, Created, Updated)
SELECT a.GlobalId, GETUTCDATE() AS Created, GETUTCDATE() AS Updated
FROM security.Accounts AS a
LEFT JOIN security.identities AS i ON a.GlobalId = i.GlobalId
WHERE i.GlobalId IS NULL;
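The statements above rely on a sequence named AccountSeq that is not shown. A minimal sketch of creating it follows, with several assumptions: the dbo schema, the BIGINT type, the cache size, and starting just above the highest GlobalId already in [security].[Identities] are all illustrative choices, not taken from the question. It also assumes GlobalId is no longer an IDENTITY column, since explicit values are inserted into it.

```sql
-- START WITH only accepts a constant, so the statement is built dynamically.
DECLARE @start BIGINT =
    (SELECT ISNULL(MAX(GlobalId), 0) + 1 FROM [security].[Identities]);

DECLARE @sql NVARCHAR(400) =
      N'CREATE SEQUENCE dbo.AccountSeq AS BIGINT START WITH '
    + CAST(@start AS NVARCHAR(20))
    + N' INCREMENT BY 1 CACHE 1000;';  -- cache size is an arbitrary example value

EXEC sys.sp_executesql @sql;

-- Quick check: fetch one value from the new sequence (this consumes it).
SELECT NEXT VALUE FOR dbo.AccountSeq AS first_key;
```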

Answer 2

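At first glance, MERGE does not seem to be the culprit for the poor performance. The merge condition is always false (0=1), and the insert (into [security].[Identities]) is the only possible path forward.

How long does it take to insert the 2 million rows into @temp, bypassing [security].[Identities] and the MERGE?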

DECLARE @temp AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

--is this fast?!?
INSERT INTO @temp(action, GlobalId, Personnumber, Firstname, LastName)
SELECT 'insert', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN 
(
    SELECT i.Account FROM [security].[Accounts] i
);
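To put a number on that question, the same insert can be wrapped in a simple timer. This is a sketch only: @timed and @started are hypothetical names, and the whole script has to run as a single batch so the variables stay in scope.

```sql
DECLARE @timed AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

-- Record the start time just before the insert.
DECLARE @started DATETIME2(3) = SYSUTCDATETIME();

INSERT INTO @timed([action], GlobalId, Personnumber, Firstname, Lastname)
SELECT 'insert', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
    SELECT i.Account FROM [security].[Accounts] i
);

-- @@ROWCOUNT still refers to the INSERT because nothing ran in between.
SELECT @@ROWCOUNT AS rows_inserted,
       DATEDIFF(MILLISECOND, @started, SYSUTCDATETIME()) AS elapsed_ms;
```

If this plain insert is already slow, the time is going into reading and filtering [temp].[RawRoles] or into filling the table variable, not into the MERGE itself.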

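Things to check:

What is the data type of [temp].[RawRoles].Personnumber? Is Personnumber an nvarchar(100)?
Does a person number need to store foreign-language characters? nchar takes twice the storage of char. If you have alphanumerics (common Latin characters) or numbers with leading zeros, varchar/char may be a better choice. If your requirements can be met with a numeric data type, then int/bigint/decimal would be preferred.
Is there an index on [temp].[RawRoles].Personnumber? Without an index, the existence check will need to sort or hash [temp].[RawRoles].Personnumber, which can be an extra cost in resource throughput/DTU. A clustered index on [temp].RawRoles is probably the most beneficial, considering that most of temp.RawRoles will eventually get processed/inserted.
What is the data type of [security].[Accounts].Account? Is there an index on the column? The two columns [security].[Accounts].Account and [temp].[RawRoles].Personnumber should be the same data type, and ideally both are indexed. If [security].[Accounts] is the final destination of the processed [temp].[RawRoles], then the table could hold millions of rows, and any future processing will need an index on the Account column. The downside of an index is slower inserts. If the 2 million rows are the first bulk load, it is best not to have an index on Account while inserting the bulk into security.Accounts (create it afterwards).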
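The data type and index questions above can be answered from the catalog views. A minimal sketch, assuming the table and column names are exactly as written in the question:

```sql
-- Data types of the two columns being compared.
-- max_length is reported in bytes, so nvarchar(100) shows as 200.
SELECT s.name  AS schema_name,
       t.name  AS table_name,
       c.name  AS column_name,
       ty.name AS type_name,
       c.max_length,
       c.is_nullable
FROM sys.columns AS c
JOIN sys.tables  AS t  ON t.object_id = c.object_id
JOIN sys.schemas AS s  ON s.schema_id = t.schema_id
JOIN sys.types   AS ty ON ty.user_type_id = c.user_type_id
WHERE (s.name = N'temp'     AND t.name = N'RawRoles' AND c.name = N'Personnumber')
   OR (s.name = N'security' AND t.name = N'Accounts' AND c.name = N'Account');

-- Indexes that include either column, with the column's position in the key
-- (key_ordinal = 0 means it is only an included column).
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
       OBJECT_NAME(i.object_id)        AS table_name,
       i.name                          AS index_name,
       i.type_desc,
       c.name                          AS column_name,
       ic.key_ordinal
FROM sys.indexes       AS i
JOIN sys.index_columns AS ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns       AS c  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id IN (OBJECT_ID(N'temp.RawRoles'), OBJECT_ID(N'security.Accounts'))
  AND c.name IN (N'Personnumber', N'Account');
```

An empty result from the second query means neither column is indexed, which is the case where the existence check has to sort or hash.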

To sum up:

--contemplate&decide whether a change of the Account datatype is needed. (a datatype change can have many implications, for applications using the db)

--change the data type of Personnumber to the datatype of Account(security.Accounts)
ALTER TABLE temp.RawRoles ALTER COLUMN Personnumber "datatype of security.Accounts.Account" NOT NULL; -- rows having undefined Personnumber?

--clustered index Personnumber
CREATE /*UNIQUE*/ CLUSTERED INDEX uclxPersonnumber ON temp.RawRoles(Personnumber); --unique preferred, if possible

--index on account (not needed[?] when security.Accounts is empty)
CREATE INDEX idxAccount ON [security].Accounts(Account);


--baseline, how fast can we do a straight forward insertion of 2 million rows?
DECLARE @tempbaseline AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100) --ignore this for now
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

INSERT INTO @tempbaseline([action], GlobalId, Personnumber, Firstname, LastName)
SELECT 'INSERT', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
    WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber);
--if the execution time (baseline) is acceptable, proceed with the merge code
--"merge with output into" should be "slightly"/s slower than the baseline.
--if the baseline is not acceptable (simple insertion takes too much time) then merge is futile

/*
DECLARE @temp....


MERGE [security].[Identities] AS tar
USING 
(
    SELECT --top 1
        t.[Personnumber]
        ,t.[Firstname]
        ,t.[Lastname]
    FROM [temp].[RawRoles] t
    WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
) AS src
ON 0 = 1 -- all rows from src need to be inserted, ive already filtered out in the USING Query.
WHEN NOT MATCHED THEN
   INSERT
   (
     [Created], [Updated]
   )
   VALUES
   (
        GETUTCDATE(), GETUTCDATE()
   )
OUTPUT 'INSERT' /** only insert is possible $action */, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname]  INTO @temp;   


--delete the index on Account (the process will insert 2mil)
DROP INDEX idxAccount ON [security].Accounts --review and create this index after the bulk of accounts is inserted.

...your process

*/

tsql

sql-server

performance

sql-merge