UTF-16 safe substring in C# .NET

Question

I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.

For example, see the following code:

var str = "Hello😀 world!";
var substr = str.Substring(0, 6);

Here substr is an invalid string, because the smiley character has been cut in half.
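
A minimal sketch of how the broken result can be detected, assuming the smiley is U+1F600 (stored in UTF-16 as a high surrogate followed by a low surrogate). The class and method names here are hypothetical, chosen only for illustration.

```csharp
using System;

public static class SurrogateCheck
{
    // Returns true when the string ends with a high surrogate,
    // i.e. the low half of the pair has been cut off.
    public static bool EndsWithDanglingHighSurrogate(string s)
    {
        if (s == null)
        {
            throw new ArgumentNullException(nameof(s));
        }

        return s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]);
    }
}
```

With "Hello\U0001F600 world!", Substring(0, 6) ends on the high surrogate, so the check reports true; Substring(0, 5) and Substring(0, 7) both end on a complete character.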

Instead, I want a function that works as follows:

var str = "Hello😀 world!";
var substr = str.UnicodeSafeSubstring(0, 6);

where substr contains "Hello😀".

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange:

NSString* str = @"Hello😀 world!";
NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?

Answer 1

It looks like you want to split the string on graphemes, that is, on single displayed characters.

In that case, there is a handy method, StringInfo.SubstringByTextElements:

var str = "Hello😀 world!";
var substr = new StringInfo(str).SubstringByTextElements(0, 6);
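
Note that SubstringByTextElements counts text elements, not UTF-16 code units, and it throws when asked for more elements than the string contains. The following is a small sketch of a clamped wrapper; the names TextElementDemo and FirstTextElements are hypothetical, and U+1F600 is assumed as the smiley.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Returns up to `count` leading text elements of `str`.
    public static string FirstTextElements(string str, int count)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }

        var info = new StringInfo(str);
        int available = info.LengthInTextElements;

        // SubstringByTextElements rejects empty input and oversized counts,
        // so handle those cases before calling it.
        if (count <= 0 || available == 0)
        {
            return string.Empty;
        }

        return info.SubstringByTextElements(0, Math.Min(count, available));
    }
}
```

For "Hello\U0001F600 world!" and a count of 6, the result is "Hello" plus the smiley, which is 7 UTF-16 code units long. If the limit has to be expressed in code units (for example a 150-char column), this method alone is not enough.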

Answer 2

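This should return the maximal substring of "complete" graphemes that starts at index startIndex and is at most length long. So initial/final "split" surrogate pairs are removed, initial combining marks are removed, and a final character that is missing its combining marks is removed.

Note that this probably isn't what you asked for. You seem to want to use graphemes as the unit of measure (or perhaps you want to include the last grapheme even if that takes the result past the length parameter).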

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();

            // startIndex now tracks the position just past this grapheme
            startIndex += grapheme.Length;

            // Compare against the absolute end position, not the length,
            // otherwise any call with startIndex > 0 stops too early
            if (startIndex > end)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);

            if (startIndex == end)
            {
                break;
            }
        }

        return sb.ToString();
    }
}

A variant that simply includes the "extra" characters at the end of the substring when they are needed to complete a grapheme:

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            // Compare against the absolute end position, not the length
            if (startIndex >= end)
            {
                break;
            }

            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);
        }

        return sb.ToString();
    }
}

This returns what you asked for: "Hello😀 world!".UnicodeSafeSubstring(0, 6) == "Hello😀".
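
For comparison, both behaviours (drop a trailing partial grapheme, or extend the result to include it) can also be sketched on top of StringInfo.ParseCombiningCharacters, which returns the start index of every text element. This is an alternative sketch, not the code above; the names GraphemeBoundary, Slice and includePartial are hypothetical.

```csharp
using System;
using System.Globalization;

public static class GraphemeBoundary
{
    // Returns the part of str between startIndex and startIndex + length,
    // with both ends snapped to text element boundaries.
    public static string Slice(string str, int startIndex, int length, bool includePartial)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (startIndex < 0 || length < 0 || startIndex > str.Length - length)
        {
            throw new ArgumentOutOfRangeException();
        }

        int[] starts = StringInfo.ParseCombiningCharacters(str);
        int requestedEnd = startIndex + length;

        // A leading partial grapheme is always dropped.
        int from = SnapUp(starts, startIndex, str.Length);

        // A trailing partial grapheme is either completed or dropped.
        int to = includePartial
            ? SnapUp(starts, requestedEnd, str.Length)
            : SnapDown(starts, requestedEnd, str.Length);

        return to > from ? str.Substring(from, to - from) : string.Empty;
    }

    // First boundary at or after index.
    private static int SnapUp(int[] starts, int index, int total)
    {
        foreach (int b in starts)
        {
            if (b >= index) return b;
        }
        return total;
    }

    // Last boundary at or before index.
    private static int SnapDown(int[] starts, int index, int total)
    {
        if (index >= total) return total;
        int result = 0;
        foreach (int b in starts)
        {
            if (b > index) break;
            result = b;
        }
        return result;
    }
}
```

With "Hello\U0001F600 world!" and the range (0, 6), includePartial: false gives "Hello" and includePartial: true gives "Hello" followed by the smiley. The linear scans keep the sketch short; a binary search over starts would suit long strings better.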

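Note: it is worth pointing out that both solutions rely on StringInfo.GetTextElementEnumerator. This method did not work as expected prior to a fix in .NET 5, so if you are on an earlier version of .NET, it will split the more complex multi-character emoji.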
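
One way to see which behaviour the current runtime has is to count the text elements in a zero-width-joiner emoji sequence. The sketch below assumes the family emoji U+1F468 U+200D U+1F469 U+200D U+1F467 as the probe; the class and method names are hypothetical.

```csharp
using System.Globalization;

public static class GraphemeSupport
{
    // Man + ZWJ + woman + ZWJ + girl: 5 code points, 8 UTF-16 code units,
    // displayed as a single emoji.
    private const string Family = "\U0001F468\u200D\U0001F469\u200D\U0001F467";

    // True when StringInfo treats the whole ZWJ sequence as one text element.
    public static bool KeepsZwjSequencesTogether()
    {
        return new StringInfo(Family).LengthInTextElements == 1;
    }
}
```

If this returns false, the grapheme-based methods above can still cut such an emoji apart, although plain surrogate pairs remain safe.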

Answer 3

Here is a simple implementation for truncation (startIndex = 0):

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;
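
The same surrogate check can be extended to an arbitrary start index. The following is a sketch under the assumption that only surrogate pairs need protecting (combining marks and multi-code-point emoji are not considered); SurrogateSafe and Substring are hypothetical names.

```csharp
using System;

public static class SurrogateSafe
{
    // Like string.Substring, but never starts or ends inside a surrogate pair.
    // The result may be shorter than `length`, never longer.
    public static string Substring(string str, int startIndex, int length)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (startIndex < 0 || length < 0 || startIndex > str.Length - length)
        {
            throw new ArgumentOutOfRangeException();
        }

        int end = startIndex + length;

        // The start splits a pair when the char before it is the high half.
        if (startIndex > 0 && startIndex < str.Length && char.IsSurrogatePair(str, startIndex - 1))
        {
            startIndex++;
        }

        // The end splits a pair when the last included char is the high half.
        if (end > 0 && end < str.Length && char.IsSurrogatePair(str, end - 1))
        {
            end--;
        }

        return end > startIndex ? str.Substring(startIndex, end - startIndex) : string.Empty;
    }
}
```

This works on any .NET version because it relies only on char.IsSurrogatePair, at the cost of ignoring grapheme clusters.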
