C#: Different string encoding on attribute vs. constant

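I am writing tests for a function that is meant to remove invalid code points, such as orphaned surrogates (one half of a surrogate pair without its partner). However, the surrogate is encoded differently depending on how I write the test.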

This version of the test passes:

        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrhpanedSurrogatePair()
        {
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

This one does not:

        [TestCategory("UnitTest")]
        [TestMethod]
        [DataRow("\uDDDD1975")]
        public void RemoveOrhpanedSurrogatePair(string input)
        {
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

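Looking in the debugger, the first variant holds the string as "\uDDDD1975", but the second variant produces "��1975", which shows up as two valid characters rather than one orphaned surrogate.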
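To make the failure concrete, here is a minimal sketch of what a function like `ReplaceInvalidCodePoints` might do. The real implementation is not shown in the question, so `RemoveOrphanedSurrogates` below is a hypothetical stand-in that simply drops unpaired surrogates. It shows that a string starting with two U+FFFD characters contains nothing to remove, so the length assertion can no longer hold.

```csharp
using System;
using System.Text;

static class OrphanedSurrogateSketch
{
    // Hypothetical stand-in for ReplaceInvalidCodePoints (assumption: the real
    // function drops unpaired surrogates and keeps everything else).
    public static string RemoveOrphanedSurrogates(this string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // Well-formed pair: keep both code units and skip the low half.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c))
            {
                // Ordinary character, including U+FFFD, which is valid on its own.
                sb.Append(c);
            }
            // Otherwise c is an unpaired surrogate and is dropped.
        }
        return sb.ToString();
    }

    static void Main()
    {
        // What the first test sees: one orphaned surrogate, which is removed.
        Console.WriteLine("\uDDDD1975".RemoveOrphanedSurrogates().Length);       // 4
        // What the second test sees: two valid U+FFFD characters, nothing removed.
        Console.WriteLine("\uFFFD\uFFFD1975".RemoveOrphanedSurrogates().Length); // 6
    }
}
```

With the `DataRow` input, `input.Length - 1` is 5 while the cleaned text still has 6 characters, which matches the failing assertion.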

I think the clue to the answer can be found in @jonskeet's blog post (where else). Apparently C# uses UTF-16 to encode strings everywhere, except in attribute constructor arguments, where UTF-8 is used. The compiler seems to see that this is an orphaned surrogate and treats it, via its UTF-8 value, as two invalid Unicode characters. Those are then replaced by a pair of \uFFFD characters (the Unicode replacement character, which is used to indicate corrupted data when decoding binary to text).

using System;
using System.ComponentModel;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "\uDDDD";
 
    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }
 
    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

This code will produce:

Attribute: fffd fffd
Constant: dddd
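As a rough illustration of why UTF-8 is the problem, the sketch below pushes the same orphaned surrogate through the framework's own UTF-8 encoders. This is not the compiler's code path, and the number of replacement characters it yields may differ from the attribute case, but it shows that a lone surrogate has no valid UTF-8 form. The last step shows one possible workaround, which is my assumption rather than something from the blog post: keep the attribute argument pure ASCII by escaping the backslash, and turn it back into a real code unit at runtime with `Regex.Unescape`.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class SurrogateUtf8Sketch
{
    // Formats each UTF-16 code unit as four hex digits.
    public static string Dump(string text) =>
        string.Join(" ", text.Select(c => ((int)c).ToString("x4")));

    static void Main()
    {
        const string orphan = "\uDDDD";

        // A strict UTF-8 encoder (no BOM, throw on invalid data) rejects a lone surrogate.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetBytes(orphan);
            Console.WriteLine("strict UTF-8: encoded without error");
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("strict UTF-8: lone surrogate rejected");
        }

        // The default encoder substitutes U+FFFD, so the original code unit is lost.
        string roundTripped = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(orphan));
        Console.WriteLine("round trip: " + Dump(roundTripped));

        // Possible workaround for attribute arguments: pass the escape sequence as
        // plain ASCII text and decode it inside the test.
        string input = Regex.Unescape(@"\uDDDD1975");
        Console.WriteLine("unescaped: " + Dump(input));
    }
}
```

In a data-driven test, that would mean writing `[DataRow(@"\uDDDD1975")]` and calling `Regex.Unescape(input)` as the first line of the test method, so the string stored in the attribute contains only ASCII characters.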