Sanitize a unicode string for logging

Question

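I am writing a string sanitizer to use before writing data to a log file, with the following rules:

1. Specified characters are whitelisted (A-Za-z0-9, plus <>[],.:_- and space)
2. Specified characters are converted to the English version of their name inside angle brackets (e.g. "," => "<comma>", "%" => "<percent>")
3. Anything else is converted to its unicode number inside angle brackets (e.g. "φ" => "<U+03C6>", "π" => "<U+03C0>")

So far 1 and 2 are working, but 3 is not. Here is what I have so far: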

    public static string Safe(string s)
    {
        s = s
            .Replace("<", "ooopen-angle-brackettt") // must come first
            .Replace(">", "ccclose-angle-brackettt") // must come first
            //.Replace(",", "<comma>") // allow
            //.Replace(".", "<dot>") // allow
            //.Replace(":", "<colon>") // allow
            .Replace(";", "<semi-colon>")
            .Replace("{", "<open-curly-bracket>")
            .Replace("}", "<close-curly-bracket>")
            //.Replace("[", "<open-square-bracket>") // allow
            //.Replace("]", "<close-square-bracket>") // allow
            .Replace("(", "<open-bracket>")
            .Replace(")", "<close-bracket>")
            .Replace("!", "<exclamation-mark>")
            .Replace("@", "<at>")
            .Replace("#", "<hash>")
            .Replace("$", "<dollar>")
            .Replace("%", "<percent>")
            .Replace("^", "<hat>")
            .Replace("&", "<and>")
            .Replace("*", "<asterisk>")
            //.Replace("-", "<dash>") // allow
            //.Replace("_", "<underscore>") // allow
            .Replace("+", "<plus>")
            .Replace("=", "<equals>")
            .Replace("\\", "<back-slash>")
            .Replace("\"", "<double-quote>")
            .Replace("'", "<single-quote>")
            .Replace("/", "<forward-slash>")
            .Replace("?", "<question-mark>")
            .Replace("|", "<pipe>")
            .Replace("~", "<tilde>")
            .Replace("`", "<backtick>")
            .Replace("ooopen-angle-brackettt", "<open-angle-bracket>")
            .Replace("ccclose-angle-brackettt", "<close-angle-bracket>");
        // all working up to here; broken below:

        Regex itemRegex = new Regex(@"[^A-Za-z0-9<>[\]:.,_\s-]", RegexOptions.Compiled);
        foreach (Match itemMatch in itemRegex.Matches(s))
        {
            // the reason for [0] and [1] is that I read that unicode consists of 2 characters
            s = s.Replace(
                itemMatch.ToString(),
                "<U+" +
                    (((int)(itemMatch.ToString()).ToCharArray()[0]).ToString("X4")).ToString() +
                    (((int)(itemMatch.ToString()).ToCharArray()[1]).ToString("X4")).ToString() +
                ">"
            );
        }
        return s;
    }

The regex part does not catch the unicode characters in the input string. How can I fix this?

Answer 1

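The problem was my assumption that a single unicode value in a C# string turns into multiple items when the string is converted to a char array (char[]). If you hover over the string and char types in Visual Studio, it tells you how these types relate to unicode:

string: Represents text as a series of Unicode characters
char: Represents a character as a UTF-16 code unit

This means each "letter" (i.e. character) in a C# string is stored as one char, so when you convert the string to a char array, each item of the array holds one UTF-16 code unit. For characters in the Basic Multilingual Plane, such as φ and π, that single code unit is the whole character. Only characters outside that plane are stored as a surrogate pair of two chars.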
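To make the string/char relationship concrete, here is a minimal sketch (the class name CharDemo is made up for illustration). It assumes nothing beyond the standard System namespace and shows that φ occupies one char, while a character outside the Basic Multilingual Plane occupies two:

```csharp
using System;

public static class CharDemo
{
    public static void Main()
    {
        string phi = "φ";            // U+03C6, inside the Basic Multilingual Plane
        string clef = "\U0001D11E";  // U+1D11E (musical G clef), outside the BMP

        // One char (one UTF-16 code unit) is enough for φ.
        Console.WriteLine(phi.ToCharArray().Length);      // 1
        Console.WriteLine(((int)phi[0]).ToString("X4"));  // 03C6

        // The clef needs a surrogate pair, i.e. two chars.
        Console.WriteLine(clef.ToCharArray().Length);     // 2
        Console.WriteLine(char.IsSurrogatePair(clef, 0)); // True

        // char.ConvertToUtf32 combines the pair back into one code point.
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X4")); // 1D11E
    }
}
```

So indexing [1] on a one-character match such as "φ" is out of range, which is what the loop in the question runs into.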

There is one more missing piece of the puzzle: how do we know that Regex.Match() operates on 1 unicode character at a time? Does it use UTF-16 or UTF-32? To answer this I looked up the documentation:

\unnnn - Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn).

So C# regex works on UTF-16 code units (2 bytes each), not on UTF-32 code points. A pattern like .{1} will capture exactly 1 UTF-16 code unit.
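A quick way to check this claim is to count how often the pattern "." matches. The sketch below is only an illustration (RegexUnitDemo and CountDotMatches are hypothetical names), and it uses RegexOptions.Singleline so that "." matches every code unit:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexUnitDemo
{
    // Counts how many separate matches "." produces for the input.
    public static int CountDotMatches(string input)
    {
        return Regex.Matches(input, ".", RegexOptions.Singleline).Count;
    }

    public static void Main()
    {
        // φ is a single UTF-16 code unit, so "." matches once.
        Console.WriteLine(CountDotMatches("φ"));          // 1

        // U+1D11E is a surrogate pair, so "." matches each half separately.
        Console.WriteLine(CountDotMatches("\U0001D11E")); // 2
    }
}
```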

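The solution, then, is not to try to take 2 items from itemMatch.ToString().ToCharArray() in the original question, because there is only 1 item there! Here is the missing solution for rule 3 (the part I was stuck on):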

        Regex itemRegex = new Regex(@"[^A-Za-z0-9<>[\]:\.,_\s-]", RegexOptions.Compiled); // {1} is implied

        foreach (Match itemMatch in itemRegex.Matches(s))
        {
            char unicodeChar = itemMatch.ToString().ToCharArray()[0]; // 1 char = 16 bits
            int unicodeNumber = (int)unicodeChar;
            string unicodeHex = unicodeNumber.ToString("X4");
            s = s.Replace(itemMatch.ToString(), "<U+" + unicodeHex + ">");
        }
        return s;
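As a possible variation on the loop above (not part of the original answer), rule 3 could also be done in one pass with Regex.Replace and a match evaluator. This is a sketch under a few assumptions: LogSanitizer and EscapeDisallowed are made-up names, the whitelist is copied from the pattern above, and an extra alternative is added so that a surrogate pair is reported as one code point instead of two halves:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogSanitizer
{
    // Matches a full surrogate pair first, otherwise any single char
    // that is not on the whitelist (assumption: same whitelist as above).
    private static readonly Regex Disallowed = new Regex(
        @"[\uD800-\uDBFF][\uDC00-\uDFFF]|[^A-Za-z0-9<>\[\]:.,_\s-]",
        RegexOptions.Compiled);

    // Replaces every disallowed character with <U+XXXX> in a single pass.
    public static string EscapeDisallowed(string s)
    {
        return Disallowed.Replace(s, m =>
        {
            // Two chars means a surrogate pair: combine them into one code point.
            int codePoint = m.Length == 2
                ? char.ConvertToUtf32(m.Value[0], m.Value[1])
                : m.Value[0];
            return "<U+" + codePoint.ToString("X4") + ">";
        });
    }

    public static void Main()
    {
        Console.WriteLine(EscapeDisallowed("a φ b"));      // a <U+03C6> b
        Console.WriteLine(EscapeDisallowed("\U0001D11E")); // <U+1D11E>
    }
}
```

Because the replacement text is produced inside the evaluator, it is never scanned again, so the "+" in "<U+03C6>" is not itself escaped.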

c#

regex

unicode-string