Special character (Hawaiian 'Okina) leads to weird string behavior
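The Hawaiian quote has some weird behavior in T-SQL when used in conjunction with string functions. What is going on here? Am I missing something? Do other characters suffer from this same problem?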
SELECT UNICODE(N'ʻ') -- Returns 699 as expected.
SELECT REPLACE(N'"ʻ', '"', '_') -- Returns "ʻ, I expected _ʻ
SELECT REPLACE(N'aʻ', 'a', '_') -- Returns aʻ, I expected _ʻ
SELECT REPLACE(N'"ʻ', N'ʻ', '_') -- Returns __, I expected "_
SELECT REPLACE(N'-', N'ʻ', '_') -- Returns -, I expected -
It also behaves strangely when used in a LIKE, for example:
DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
@table
VALUES
('John'),
('Jane')
SELECT
*
FROM
@table
WHERE
[Name] LIKE N'%ʻ%' -- This returns both records. I expected none.
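I can't provide a detailed answer, but I can offer a solution that meets your expectations.
It has to do with the collation, though I'm not sure why the Windows collations produce the unexpected results. If you use a binary collation, you get the expected results (see Solomon's excellent answer for which BIN to use):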
SELECT REPLACE(N'aʻ' COLLATE Latin1_General_BIN, N'a', N'_')
Returns _ʻ
DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
@table
VALUES
(N'John'),
(N'Jane'),
(N'Hawaiʻi'),
(N'Hawai''i'),
(NCHAR(699))
SELECT
*
FROM
@table
WHERE
[Name] like N'%ʻ%' COLLATE Latin1_General_BIN
Returns:
Hawaiʻi
ʻ
You can use the following code (adapted from @SolomonRutzky's code) to check which collations meet your expectations. For every collation, it tests whether (SELECT REPLACE(N'"ʻ', N'ʻ', N'_')) = '"_':
DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';
SELECT @SQL += REPLACE(N'
IF((SELECT REPLACE(N''"ʻ'' COLLATE {Name}, N''ʻ'', N''_'')) = ''"_'')
BEGIN
RAISERROR(N''%4d. {Name}'', 10, 1, @Counter) WITH NOWAIT;
SET @Counter += 1;
END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM sys.fn_helpcollations() col
ORDER BY col.[name]
--PRINT @SQL;
EXEC (@SQL);
The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. ... Do other characters suffer from this same problem?
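A few things:
- This is not a Hawaiian "quote": it is a "glottal stop", which affects pronunciation.
- This is not "weird" behavior: it is simply not what you were expecting.
This behavior is not specifically a "problem", though yes, there are other characters that exhibit similar behavior. For example, the following character (U+02DA Ring Above) behaves slightly differently depending on which side of a character it is on: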
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'˚a', N'_'); -- Returns a_a
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'a˚', N'_'); -- Returns _aa
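Now, anyone using SQL Server 2008 or newer should be using a 100 (or newer) level collation. The 100 series includes improvements that are not in the non-numbered series or in the mostly obsolete SQL Server collations (those with names starting with SQL_).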
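To see which collation is actually in effect before deciding whether to move to a 100-level one, a quick check along these lines may help. This is only a sketch: it reads the instance and current-database defaults, and individual columns can carry their own collation that differs from both.

```sql
-- Default collation of the instance and of the current database.
-- Individual columns may override these; see sys.columns.collation_name.
SELECT SERVERPROPERTY('Collation') AS [ServerCollation],
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DatabaseCollation];
```

A name without a version number (for example Latin1_General_CI_AS), or one starting with SQL_, indicates one of the older collations.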
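The issue here is not that it fails to equate to any other character (outside of a binary collation). In fact, it does equate to one other character (U+0312 Combining Turned Comma Above):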
;WITH nums AS
(
SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
FROM [master].sys.all_columns ac1
CROSS JOIN [master].sys.all_columns ac2
)
SELECT nums.[num] AS [INTvalue],
CONVERT(BINARY(2), nums.[num]) AS [BINvalue],
NCHAR(nums.[num]) AS [Character]
FROM nums
WHERE NCHAR(nums.[num]) = NCHAR(0x02BB) COLLATE Latin1_General_100_CI_AS;
/*
INTvalue BINvalue Character
699 0x02BB ʻ
786 0x0312 ̒
*/
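The same equivalence can be checked directly for just these two code points. This is a small sketch that compares them once under the linguistic collation used in the query and once under a binary collation (Latin1_General_100_BIN2, which assumes SQL Server 2008 or newer):

```sql
-- Compare U+02BB and U+0312 under a linguistic and a binary collation.
SELECT CASE WHEN NCHAR(0x02BB) = NCHAR(0x0312) COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'different' END AS [Linguistic],
       CASE WHEN NCHAR(0x02BB) = NCHAR(0x0312) COLLATE Latin1_General_100_BIN2
            THEN 'equal' ELSE 'different' END AS [Binary];
```

Given the query result shown, the first column should report 'equal'. A binary collation compares code points, so the second should report 'different'.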
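The issue is that this is a "spacing modifier" character, so it attaches to the character before or after it and modifies that character's meaning / pronunciation, depending on which modifier character you are dealing with.
According to the Unicode Standard, Chapter 7 (Europe-I), Section 7.8 (Modifier Letters), page 323 (of the document, not of the PDF):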
7.8 Modifier Letters
Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc = Mn or gc = Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.
...
Spacing Modifier Letters: U+02B0–U+02FF
Phonetic Usage. The majority of the modifier letters in this block are phonetic modifiers, including the characters required for coverage of the International Phonetic Alphabet. In many cases, modifier letters are used to indicate that the pronunciation of an adjacent letter is different in some way—hence the name “modifier.” They are also used to mark stress or tone, or may simply represent their own sound.
The following examples should help illustrate. I am using a 100-level collation, and it needs to be accent-sensitive (i.e. the name contains _AS):
SELECT REPLACE(N'ʻ' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _
SELECT REPLACE(N'ʻa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _a
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns __aa
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns ʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns aʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻa', N'_'); -- Returns _a
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'a', N'_'); -- Returns aʻ__
SELECT REPLACE(N'אʻaa' COLLATE Latin1_General_100_CI_AS, N'א', N'_'); -- Returns אʻaa
SELECT REPLACE(N'ffʻaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns ffʻaa
SELECT REPLACE(N'ffaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns _aa
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AS); -- 3
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AI); -- 1
SELECT 1 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AS; -- (0 rows returned)
SELECT 2 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AI; -- 2
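If you need to deal with such characters in a way that ignores their intended linguistic behavior, then yes, you must use a binary collation. In that case, use the most recent level of collation, and use BIN2 rather than BIN (assuming you are on SQL Server 2005 or newer). Meaning: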
- SQL Server 2000: Latin1_General_BIN
- SQL Server 2005: Latin1_General_BIN2
- SQL Server 2008, 2008 R2, 2012, 2014, and 2016: Latin1_General_100_BIN2
- SQL Server 2017 and newer: Japanese_XJIS_140_BIN2
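Applied to the statements from the question, the recommendation might look like the following. This is a sketch that assumes SQL Server 2008 or newer (so that Latin1_General_100_BIN2 exists); the table variable @names is made up for the illustration.

```sql
-- REPLACE with an explicit binary collation on the input string.
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'ʻ', N'_') AS [Replaced];

-- LIKE with the same binary collation applied to the comparison.
DECLARE @names TABLE ([Name] NVARCHAR(MAX));
INSERT INTO @names ([Name]) VALUES (N'John'), (N'Jane'), (N'Hawaiʻi');

SELECT [Name]
FROM @names
WHERE [Name] LIKE N'%ʻ%' COLLATE Latin1_General_100_BIN2;
```

Under a binary collation the ʻOkina is compared purely by its code point, so the REPLACE should touch only that character and the LIKE should match only the row that actually contains it.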
If you are curious why I make that recommendation, please see:
Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)
And, for more info on collations / Unicode / encodings / etc., please visit: Collations Info