How do I get single characters from a Unicode string and compare and print them?

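I am processing Unicode strings in C with libunistring. I can't use any other library. My goal is to read a single character from a Unicode string at a given index, print it, and compare it with a fixed value. This should be really simple, but...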

Here is my attempt (a complete C program):

/* This file must be UTF-8 encoded in order to work */

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>


int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}


int main() {
    setlocale(LC_ALL, "");     /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);

    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];

    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last  : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}

To run this program:

  1. Save the program to a UTF-8 encoded file named ustridx.c
  2. sudo apt-get install libunistring-dev
  3. gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
  4. Make sure the terminal is set to a UTF-8 locale
  5. Run it with ./ustridx

Output:

Current locale charset: UTF-8 (should be UTF-8)

foo 楽あり bébé
 - char 0: f
 - char 5: あ
 - last  : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.

The desired behavior is that char 5 and the last char are recognized correctly and printed correctly in the last two lines of the output.

From libunistring's documentation:

 Compares S1 and S2, each of length N, lexicographically.  Returns a
 negative value if S1 compares smaller than S2, a positive value if
 S1 compares larger than S2, or 0 if they compare equal.

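The comparison in the if statement is wrong: a return value of 0 means the two sequences are equal. That is the reason for the mismatch. Of course, this reveals other, unrelated issues that need to be fixed too. But it is the reason for the puzzling result of the comparison.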
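To make the quoted contract concrete, here is a minimal sketch of a comparison that follows it. Both lex_cmp_u32 and same_char are hypothetical names used only for illustration; this is a stand-in written from the documentation text, not libunistring's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in that follows the documented contract of u32_cmp:
 * negative if s1 sorts before s2, positive if it sorts after, and 0 if
 * the first n units are equal. Not libunistring's actual code. */
int lex_cmp_u32(const uint32_t *s1, const uint32_t *s2, size_t n) {
    assert(s1 != NULL && s2 != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s1[i] != s2[i]) {
            return s1[i] < s2[i] ? -1 : 1;
        }
    }
    return 0;
}

/* Hypothetical helper: "equal" is exactly the "returns 0" case. */
int same_char(uint32_t expected, uint32_t actual) {
    return lex_cmp_u32(&expected, &actual, 1) == 0;
}
```

For a single code point this reduces to a plain ==, so same_char(0x3042, 0x3042) is true and same_char(0xE38182, 0x3042) is false.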

'あ' and 'é' are invalid character literals. Only characters from the basic source character set and escape sequences are allowed in character literals.

GCC accepts them anyway but emits a warning (see godbolt): warning: multi-character character constant. That warning is really about a different case, character constants such as 'abc', which are multi-character literals. It appears here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on it being the corresponding Unicode code point. GCC specifically does not do this: in the output above, 'é' printed as '쎩' (U+C3A9), which is its two UTF-8 bytes 0xC3 0xA9 packed into one int.
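The following sketch reproduces that packing by hand. It assumes GCC's documented rule for multi-character constants (shift the previous value left by one character width, then OR in the next byte); pack_bytes is a hypothetical helper, and other compilers may compute a different value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold the bytes of s into one integer, shifting the
 * previous value left by 8 bits for each new byte. This mirrors how GCC
 * documents its evaluation of multi-character constants. Only the last
 * four bytes survive in a 32-bit result. */
uint32_t pack_bytes(const char *s) {
    assert(s != NULL);
    uint32_t value = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        value = (value << 8) | (unsigned char)s[i];
    }
    return value;
}
```

With the UTF-8 bytes written as escapes, pack_bytes("\xC3\xA9") gives 0xC3A9 rather than the code point 0xE9, and pack_bytes("\xE3\x81\x82") gives 0xE38182 rather than 0x3042. The second value is above 0x10FFFF, so it is not a valid code point at all.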

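Since C11 you can use UTF-32 character literals such as U'あ', which yield a char32_t value equal to the character's Unicode code point. Although by my reading the standard does not permit characters such as あ in literals, the examples on cppreference seem to indicate that compilers commonly allow it.

A standard-compliant, portable solution is to use Unicode escape sequences in the character literal, such as U'\u3042' for あ, but this is hardly any different from using an integer constant such as 0x3042.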
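A minimal sketch of the escape-sequence approach, using only standard C11 and no libunistring calls. The sample array and code_point_at are hypothetical and exist only for illustration; the array spells out the question's string "foo 楽あり bébé" as code points, on the assumption that the converted UTF-32 buffer holds one code point per element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The escape form names the code point directly, independent of the
 * encoding of the source file. */
static_assert(U'\u3042' == 0x3042, "escape form should equal the code point");

/* Hypothetical sample: the question's string as zero-terminated UTF-32. */
const uint32_t sample[] = {
    U'f', U'o', U'o', U' ', U'\u697D', U'\u3042', U'\u308A', U' ',
    U'b', U'\u00E9', U'b', U'\u00E9', 0
};

/* Hypothetical helper: code point at index, or 0 if index is past the
 * terminating zero. */
uint32_t code_point_at(const uint32_t *s, size_t index) {
    assert(s != NULL);
    for (size_t i = 0; i < index; i++) {
        if (s[i] == 0) {
            return 0;
        }
    }
    return s[index];
}
```

With this, code_point_at(sample, 5) == U'\u3042' and code_point_at(sample, 11) == U'\u00E9' both hold. A single code point can be compared with a plain ==, so no library comparison function is needed for this case.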