How to find out whether collation uses word sort or string sort?

Question

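There is discussion of the differences between "word sort" and "string sort".

How can one programmatically query when a SQL collation will use "word sort" vs. "string sort"?

Corollary: do all collations use "word sort" for Unicode strings and "string sort" for non-Unicode strings?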

SELECT * FROM sys.fn_helpcollations()
WHERE name = 'SQL_Latin1_General_CP1_CI_AS';

This returns a lot of detail about the collation, but note that it does not mention "word sort".

Answer 1

Let's start with the definitions of these types as given by Microsoft (taken from the "Remarks" section of the CompareOptions Enumeration MSDN page):

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

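Unicode sorting is culture-sensitive and weighted, and the XML and N-prefixed types are Unicode, so they could be saying that data in Unicode types uses "word sort" while data in non-Unicode types uses "string sort". Ordinal refers to the BIN and BIN2 collations, though the BIN collations are not 100% ordinal because of how they handle the first character.

But let's see what SQL Server says it is doing. Run the following: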

DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
INSERT INTO @SampleData (ANSI, UTF16) VALUES 
    ('a-b-c', N'a-b-c'),
    ('ac', N'ac'),
    ('aba', N'aba'),
    ('a-b', N'a-b'),
    ('ab', N'ab');

SELECT sd.ANSI AS [ANSI-Latin1_General_100_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE Latin1_General_100_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-Latin1_General_100_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE Latin1_General_100_CI_AS ASC;

SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

Results:

ANSI-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

UTF16-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

ANSI-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
a-b
a-b-c
ab
aba
ac

UTF16-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

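Hmm. Only the combination of a SQL_ collation and a VARCHAR field appears to be doing what could be considered a "string sort". It makes sense that a SQL_ collation combined with an NVARCHAR field does a "word sort", given that Unicode handling is the same for SQL_ and non-SQL_ collations. But is there something other than being a SQL Server collation (i.e. one whose name starts with SQL_) that determines "string" vs. "word" sorting? Let's look at the only properties of a collation that we can extract: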

SELECT N'Latin1_General_100_CI_AS' AS [CollationName],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage') AS [CodePage],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'LCID') AS [LCID],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'ComparisonStyle') AS [ComparisonStyle]
UNION ALL
SELECT N'SQL_Latin1_General_CP1_CI_AS' AS [CollationName],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage'),
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID'),
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'ComparisonStyle');

Results:

CollationName                  CodePage   LCID    ComparisonStyle
----------------------------   --------   ----    ---------------
Latin1_General_100_CI_AS       1252       1033    196609
SQL_Latin1_General_CP1_CI_AS   1252       1033    196609

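So, no discernible difference there. That seems to leave us with this:

A string sort is done when:

- the collation name starts with SQL_, AND
- the data (field, variable, string literal) is non-Unicode (i.e. CHAR / VARCHAR / TEXT)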
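
The two conditions above can be checked against the catalog views of the current database. The following is only a minimal sketch: it assumes that the SQL_ name prefix is a reliable marker, and it excludes binary collations (names containing _BIN) on the assumption that they compare by code value rather than by word or string rules. Nothing in the metadata labels a column as "string sort", so this is an inference from the test results, not a documented property.

```sql
-- Sketch: list columns that would be expected to use a "string sort",
-- i.e. non-Unicode columns that have a non-binary SQL_ collation.
SELECT  OBJECT_SCHEMA_NAME(c.[object_id]) AS [SchemaName],
        OBJECT_NAME(c.[object_id])        AS [ObjectName],
        c.[name]                          AS [ColumnName],
        t.[name]                          AS [BaseDataType],
        c.[collation_name]                AS [CollationName]
FROM    sys.columns c
INNER JOIN sys.types t
        -- joining on system_type_id resolves alias types to their base type
        ON t.[user_type_id] = c.[system_type_id]
WHERE   c.[collation_name] LIKE N'SQL[_]%'        -- [_] matches a literal underscore
AND     c.[collation_name] NOT LIKE N'%[_]BIN%'   -- assumption: skip binary collations
AND     t.[name] IN (N'char', N'varchar', N'text');
```

An empty result would suggest that no column in the database falls under the string sort case described above. Variables and string literals are not covered, since they take their collation from the database default or from an explicit COLLATE clause.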

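For more information on Unicode sorting, check out the following resources:

- Unicode Collation Charts (per language) - shows the characters for each language in the order in which they sort
- Unicode Collation Algorithm (UCA) - some "light" (ha!) reading on the algorithm used to sort Unicode data; this is the default algorithm unless overridden by locale-specific rules
- Collation Guidelines - how to read the locale-specific override charts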

Answer 2

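The excellent answer above shows that, except for non-Unicode types handled by SQL_ collations, all other data is sorted according to "Unicode collation" rules.
What is confusing is that Microsoft does not use the Unicode standard's collation rules.
According to https://support.microsoft.com/en-us/kb/322112: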
SQL Server 2000 supports two types of collations:
- SQL collations
- Windows collations
[...]

For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data.

[...]

A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules.
My interpretation is:
- SQL_ collations are "SQL collations".
- All other collations are "Windows collations".
- Except for non-Unicode types handled by SQL_ collations, all other data is sorted according to "Windows collations".
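
This interpretation can be turned into a rough classification of every collation installed on the instance. The query below is a sketch under two assumptions that are not taken from the quoted KB article: that the SQL_ name prefix identifies a "SQL collation", and that any name containing _BIN is a binary collation that follows neither word nor string sort rules. The column aliases are made up for illustration.

```sql
-- Sketch: classify installed collations by name and show the sort type
-- each one would be expected to apply to non-Unicode (CHAR/VARCHAR) data.
SELECT  col.[name] AS [CollationName],
        CASE
            WHEN col.[name] LIKE N'%[_]BIN%' THEN 'Binary'
            WHEN col.[name] LIKE N'SQL[_]%'  THEN 'SQL collation'
            ELSE 'Windows collation'
        END AS [CollationType],
        CASE
            WHEN col.[name] LIKE N'%[_]BIN%' THEN 'binary comparison'
            WHEN col.[name] LIKE N'SQL[_]%'  THEN 'string sort'
            ELSE 'word sort'
        END AS [ExpectedNonUnicodeSort]
FROM    sys.fn_helpcollations() col
ORDER BY col.[name];
```

Under the same interpretation, Unicode (NCHAR / NVARCHAR) data would be expected to use a word sort for every non-binary row, so no separate column is shown for it.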

So, let's dig into "Windows collations".

According to https://msdn.microsoft.com/en-us/library/ms143515(v=sql.105).aspx:

For Unicode data types, data comparisons are based on the Unicode code points.
The Windows header winnls.h contains a description of "word sort":

//  Sorting Flags.
//
//    WORD Sort:    culturally correct sort
//                  hyphen and apostrophe are special cased
//                  example: "coop" and "co-op" will sort together in a list
//
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        co-op     <-------  hyphen (punctuation)
//                        cork
//                        went
//                        were
//                        we're     <-------  apostrophe (punctuation)
//
//
//    STRING Sort:  hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <-------  hyphen (punctuation)
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we're     <-------  apostrophe (punctuation)
//                        went
//                        were
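
The word list from the header comment can be replayed in SQL Server to see which of the two orderings a given collation produces for VARCHAR data. This is a sketch in the style of the test in the first answer; the table variable @Words is a made-up name, and the two collations are simply the ones already used above. No expected output is shown, since it depends on the collation being tested.

```sql
-- Sketch: sort the winnls.h sample words as non-Unicode data under a
-- Windows collation and under a SQL_ collation, then compare each result
-- with the WORD Sort and STRING Sort listings in the header comment.
DECLARE @Words TABLE (Word VARCHAR(20));
INSERT INTO @Words (Word) VALUES
    ('co_op'), ('coat'), ('comb'), ('coop'), ('co-op'),
    ('cork'), ('went'), ('were'), ('we''re');  -- '' escapes the apostrophe

SELECT w.Word AS [VARCHAR-Latin1_General_100_CI_AS]
FROM   @Words w
ORDER BY w.Word COLLATE Latin1_General_100_CI_AS ASC;

SELECT w.Word AS [VARCHAR-SQL_Latin1_General_CP1_CI_AS]
FROM   @Words w
ORDER BY w.Word COLLATE SQL_Latin1_General_CP1_CI_AS ASC;
```

Swapping in a different collation name in the COLLATE clause, or changing the column to NVARCHAR, gives a quick empirical check for any other combination.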

Finally, according to https://msdn.microsoft.com/en-us/library/windows/desktop/dd318144(v=vs.85).aspx:

[...] all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently from the other nonalphanumeric characters to ensure that words such as "coop" and "co-op" stay together in a sorted list.
