How to get single characters from a Unicode string and compare and print them?
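I'm using libunistring to handle Unicode strings in C. I can't use another library. My goal is to read a single character from a Unicode string at a given index position, print it, and compare it to a fixed value. This should be really simple, but...
Here's my attempt (complete C program):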
/* This file must be UTF-8 encoded in order to work */
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>
int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}

int main() {
    setlocale(LC_ALL, ""); /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);
    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);
    printf("%s\n", u32_strconv_to_locale(mbcs));
    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];
    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last : %lc\n", cLast);
    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);
    return 0;
}
To run this program:
- Save the program to a UTF-8 encoded file called ustridx.c
- sudo apt-get install libunistring-dev
- gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
- Make sure the terminal is set to a UTF-8 locale (check with locale)
- Run it with ./ustridx
Output:
Current locale charset: UTF-8 (should be UTF-8)
foo 楽あり bébé
- char 0: f
- char 5: あ
- last : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.
The desired behaviour is that char 5 and the last char are recognized correctly, and printed correctly in the last two lines of the output.
From libunistring's documentation for u32_cmp:
Compares S1 and S2, each of length N, lexicographically. Returns a
negative value if S1 compares smaller than S2, a positive value if
S1 compares larger than S2, or 0 if they compare equal.
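The comparison in the if statement is wrong. That is the reason for the mismatch. Of course, this reveals other, unrelated issues that also need to be fixed. But that is the reason for the puzzling result of the comparison.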
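To make the quoted contract concrete, here is a minimal sketch in standard C. The names cmp_code_units and same_code_point are hypothetical and not part of libunistring; the loop only mirrors the documented return-value contract (negative, zero, positive) and should not be read as the library's implementation. For a single code point (N = 1), equality means the result is 0, which is the same as comparing with ==.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in that follows the documented contract:
 * negative if s1 < s2, positive if s1 > s2, 0 if the first n units are equal. */
int cmp_code_units(const uint32_t *s1, const uint32_t *s2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (s1[i] != s2[i])
            return s1[i] < s2[i] ? -1 : 1;
    }
    return 0;
}

/* Equality test for one code point: the comparison result must be 0.
 * Returns 1 when equal, 0 otherwise. */
int same_code_point(uint32_t expected, uint32_t actual) {
    return cmp_code_units(&expected, &actual, 1) == 0;
}
```

With this helper, same_code_point(0x3042, c5) would be true only when c5 holds the code point of あ.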
'あ' and 'é' are invalid character literals. Only characters from the basic source character set and escape sequences are allowed in character literals.
GCC, however, merely issues a warning: "warning: multi-character character constant". That warning is really about a different case, namely character constants such as 'abc', which are multicharacter literals. It fires here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on its value being the corresponding Unicode code point. GCC specifically does not produce the code point.
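As a rough illustration of why the values differ, the sketch below reproduces the rule the GCC manual describes for multi-character constants: each character shifts the previous value left by the width of a target character (assumed here to be 8 bits) and is then OR-ed in. The function name multichar_value is hypothetical, and the bytes are written out explicitly so the example does not depend on the source file's encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Mimics GCC's documented evaluation of a multi-character constant,
 * assuming 8-bit chars: shift the running value left by 8, OR in the next byte. */
uint32_t multichar_value(const unsigned char *bytes, size_t n) {
    uint32_t value = 0;
    for (size_t i = 0; i < n; i++)
        value = (value << 8) | bytes[i];
    return value;
}
```

Under that rule, the UTF-8 bytes of é (0xC3 0xA9) give 0xC3A9 rather than the code point 0xE9, and U+C3A9 is a Hangul syllable, which is consistent with the '쎩' in the output. The UTF-8 bytes of あ (0xE3 0x81 0x82) give 0xE38182, which is larger than 0x10FFFF and therefore not a valid code point at all.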
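Since C11 you can use UTF-32 character literals such as U'あ', which yield a char32_t value holding the character's Unicode code point. By my reading, the standard does not allow characters such as あ in a literal, but the examples on cppreference seem to suggest that compilers commonly accept this.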
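A standard-compliant, portable solution is to use Unicode escape sequences in the character literal, such as U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.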
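Putting these suggestions together, a comparison helper might look like the sketch below. The name check_code_point is hypothetical. It assumes a C11 compiler whose U'...' literals hold UTF-32 code points (as GCC's do), and it prints code points in U+XXXX notation so the output does not depend on the terminal's locale.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Compares two code points directly and reports the result.
 * Returns 1 on a match and 0 on a mismatch. */
int check_code_point(const char *label, uint32_t expected, uint32_t actual) {
    int equal = (expected == actual);
    printf("%s: expected U+%04" PRIX32 ", got U+%04" PRIX32 " (%s)\n",
           label, expected, actual, equal ? "match" : "mismatch");
    return equal;
}
```

In the original program, the last two checks would then become check_code_point("Char 5", U'\u3042', c5) and check_code_point("Last char", U'\u00E9', cLast), which keeps the source file free of non-ASCII character literals.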