How can I know which characters inside a string are compositions of a single accentuated character in C?

Question

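My native language is not English; it is Brazilian Portuguese, and we have these accentuated characters (á, à, ã, õ, and so on).

So, my problem is: if I put one of these characters inside a string and try to iterate over each char in it, I find that two chars are needed to display "ã" on the screen.

Here is a picture of me iterating over the string "(Não Informado)", which means "(Not Informed)". The string should have a length of 15 if we count each character one by one. But if we call strlen("(Não Informado)");, the result is 16.

The code I used to print each character in this picture is this: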

#include <stdio.h>
#include <string.h>

void print_buffer (const char * buffer) {
    int size = strlen(buffer);
    printf("BUFFER: %s / %i\n", buffer, size);

    for (int i = 0; buffer[i] != '\0'; ++i) {
        printf("[%i]: %i\n", i, (unsigned char) buffer[i]);
    }
}

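So, in a graphical application, a buffer could display "ãbc", and in the raw string we would not have 3 chars, but actually 4.

So here is my question: is there a way to know which characters inside a string are a composition of those special characters? Is there a rule to design and restrict this occurrence? Is it always a composition of 2 characters? Could a special character be composed of 3 or 4, for example?

Thanks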

Answer 1

is there a way to know which characters inside a string are a composition of those special characters?

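Yes. Assuming the string is UTF-8 encoded, to check whether a byte is part of a multibyte character you only need a bitwise operation, (c & 0x80):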

#include <stdio.h>

int is_multibyte(int c)
{
    return c & 0x80;
}

int main(void)
{
    const char *str = "ãbc";

    while (*str != 0)
    {
        printf(
            "%c %s part of a multibyte\n",
            *str, is_multibyte(*str) ? "is" : "is not"
        );
        str++;
    }
    return 0;
}

Output:

� is part of a multibyte
� is part of a multibyte
b is not part of a multibyte
c is not part of a multibyte

The string should have a length of 15 if we count each character one by one. But if we call strlen("(Não Informado)");, the result is 16.

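It seems that you are interested in the number of code points rather than the number of bytes, aren't you?

In that case you want to mask with (c & 0xc0) != 0x80, which skips UTF-8 continuation bytes (those of the form 10xxxxxx) and counts every other byte: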

#include <stdio.h>

size_t mylength(const char *str)
{
    size_t len = 0;

    while (*str != 0)
    {
        if ((*str & 0xc0) != 0x80)
        {
            len++;
        }
        str++;
    }
    return len;
}

int main(void)
{
    const char *str = "ãbc";

    printf("Length of \"%s\" = %zu\n", str, mylength(str));
    return 0;
}

Output:

Length of "ãbc" = 3

Could a special character be composed of 3 or 4, for example?

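Sure, the euro sign € is an example (3 bytes). From this nice answer:

Anything up to U+007F takes 1 byte: Basic Latin
Then up to U+07FF it takes 2 bytes: Greek, Arabic, Cyrillic, Hebrew, etc.
Then up to U+FFFF it takes 3 bytes: Chinese, Japanese, Korean, Devanagari, etc.
Beyond that it takes 4 bytes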

Is there a rule to design and restrict this occurrence?

If you mean being able to handle all characters with the same width, C has a dedicated library for wide characters:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    const wchar_t *str = L"ãbc";

    while (*str != 0)
    {
        printf("%lc\n", *str);
        str++;
    }
    return 0;
}

Output:

ã
b
c

To get the length you can use wcslen:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    const wchar_t *str = L"ãbc";

    printf("Length of \"%ls\" = %zu\n", str, wcslen(str));
    return 0;
}

Output:

Length of "ãbc" = 3

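But if by "restrict" you mean "avoid" those multibyte characters, you can transliterate UTF-8 to plain ASCII. If POSIX is an option, take a look at iconv; you have an example here:

El cañón de María vale 1000 €

is converted to

El canon de Maria vale 1000 EUR

In your case

Não Informado

is converted to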

Nao Informado

c

string

design-patterns

character-encoding

special-characters