Get index of first non-standard English character

Question

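I'm trying to process a string and split it into two parts when I find a character that is not part of the standard English alphabet. For example, given This is a stríng with áccents., I need to know the index of the first, or of every, accented character (í).

I think the solution lies somewhere between System.Text.Encoding and System.Globalization, but I'm missing something...

What matters is knowing whether a character is accented and, if possible, excluding the space.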

void Main()
{
    var str = "This is a stríng with áccents.";
    var strBeforeFirstAccent = str.Substring(0, getIndexOfFirstCharWithAccent(str));
    Console.WriteLine(strBeforeFirstAccent);

}

int getIndexOfFirstCharWithAccent(string str){
    //Process logic
    return 13;
}

Thanks!

Answer 1

The regex [^a-zA-Z ] will find any character other than unaccented Roman letters and the space.

So:

var regex = new Regex("[^a-zA-Z ]");
var match = regex.Match("This is a stríng with áccents.");

will return í

and match.Index will contain its position.
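As a rough sketch of how this could be plugged into the helper from the question (the class and method names here are made up for illustration, and returning -1 when nothing matches is an assumption about the desired behavior):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AccentFinder
{
    // Matches anything that is not an unaccented Latin letter or a space.
    private static readonly Regex NonPlain = new Regex("[^a-zA-Z ]");

    // Returns the index of the first match, or -1 when there is none.
    public static int IndexOfFirstNonPlainChar(string str)
    {
        Match match = NonPlain.Match(str);
        return match.Success ? match.Index : -1;
    }

    // Returns the text before the first match, or the whole string when nothing matches.
    public static string TextBeforeFirstNonPlainChar(string str)
    {
        int index = IndexOfFirstNonPlainChar(str);
        return index < 0 ? str : str.Substring(0, index);
    }
}
```

Keep in mind that this pattern also flags digits and punctuation (such as the final period), so add those to the character class if they should be allowed. Regex.Matches can be used the same way to get every position instead of only the first.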

Answer 2

Another possible solution (fixed/adapted from Cortright's answer) is to enumerate the Unicode byte pairs.

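using System;
using System.Text;

// \uD852\uDF62 is the surrogate pair for U+24B62, the character behind the last two output lines.
const string input = "This is a stríng with áccents \uD852\uDF62.";
byte[] array = Encoding.Unicode.GetBytes(input); // UTF-16LE: two bytes per code unit

for (int i = 0; i < array.Length; i += 2)
{
    // Reassemble the 16-bit code unit from its low and high bytes.
    int codeUnit = array[i] | (array[i + 1] << 8);
    if (codeUnit > 127) // ASCII is 0-127
    {
        Console.WriteLine(codeUnit + " at index " + (i / 2) + " is not within the ASCII range");
    }
}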

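This prints a list of all values that fall outside the allowed ASCII range. (I took the original definition of ASCII, 0-127.)

Personally, I recommend David Arno's solution. I'm only posting this as a potential alternative. (If you benchmark it, it might be faster. Likewise, it might be more manageable.)

Update: I just tested it, and it still seems to correctly flag characters in the higher ranges (U+10000 - U+10FFFF) as not allowed. This is because surrogate pairs also fall outside the ASCII range. The only catch is that it reports each of them as two separate code units rather than as one character.

Output: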

237 at index 13 is not within the ASCII range
225 at index 22 is not within the ASCII range
55378 at index 30 is not within the ASCII range
57186 at index 31 is not within the ASCII range
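
If reporting a surrogate pair as two entries is a problem, one option is to walk the string by code point instead of by byte pair. This is only a sketch, not something benchmarked; the NonAsciiScanner class and its Find method are hypothetical names, and it assumes C# 7 or later for the tuple syntax. char.IsSurrogatePair and char.ConvertToUtf32 are the standard .NET methods.

```csharp
using System;
using System.Collections.Generic;

public static class NonAsciiScanner
{
    // Yields (index, code point) for every code point above 127.
    // A surrogate pair is reported once, at the index of its high surrogate.
    public static IEnumerable<(int Index, int CodePoint)> Find(string input)
    {
        for (int i = 0; i < input.Length; i++)
        {
            int codePoint = input[i];
            bool isPair = char.IsSurrogatePair(input, i);
            if (isPair)
            {
                // Combine the high and low surrogate into one code point.
                codePoint = char.ConvertToUtf32(input, i);
            }

            if (codePoint > 127)
            {
                yield return (i, codePoint);
            }

            if (isPair)
            {
                i++; // skip the low surrogate that was just consumed
            }
        }
    }
}
```

For the same input as above, this should yield three entries instead of four, with the last one being the single code point U+24B62 at index 30.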

Tags: c#, linq, globalization, diacritics, character-encoding