How to compress non-ASCII characters to 1 byte in C on Linux?

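I have a list of Turkish words and I need to compare their lengths. Because some Turkish characters are not ASCII, I can't compare the lengths correctly: each non-ASCII Turkish character takes up 2 bytes.

For example: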

#include <stdio.h>
#include <string.h>

int main()
{
    char s1[] = "ab";
    char s2[] = "çş";

    printf("%zu\n", strlen(s1)); // prints 2
    printf("%zu\n", strlen(s2)); // prints 4 (each character is 2 bytes in UTF-8)

    return 0;
}

A friend of mine said this can be done on Windows with the following line of code:

system("chcp 1254");

He said it maps the Turkish characters into the extended ASCII table. But it does not work on Linux.

Is there a way to do this on Linux?

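One possibility is to store the words as wide-character strings. This does not store each character in one byte, but it solves your main problem, and you get a set of functions that work for your language. The program would look like this: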

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main()
{
    wchar_t s1[] = L"ab";
    wchar_t s2[] = L"çş";

    printf("%zu\n", wcslen(s1)); // prints 2
    printf("%zu\n", wcslen(s2)); // prints 2

    return 0;
}

It is now 2017, and soon 2018. So use UTF-8 everywhere (on recent Linux distributions, UTF-8 is the most common encoding for most locale(7)-s, and very probably the default on your system). Of course, a Unicode character encoded in UTF-8 takes one to four bytes (the original design allowed up to six, but RFC 3629 limits it to four), so the number of Unicode characters in a UTF-8 string is not given by strlen. Consider using a UTF-8 library, like libunistring (or others, e.g. in Glib).

chcp 1254 is a Windows-specific thing that is irrelevant on UTF-8 systems. So forget about it.

If you code a GUI application, use a widget toolkit like GTK or Qt. Both handle Unicode and are able to accept (or convert to) UTF-8. Notice that even simply displaying Unicode (e.g. some UTF-8 or UTF-16 string) is non-trivial, because a string could mix e.g. Arabic, Japanese, Cyrillic and English words (which need to be displayed in both left-to-right and right-to-left directions), so better find a library (or other tool, e.g. a UTF-8 capable terminal emulator) to do that.

If you happen to get some file, you need to know the encoding it uses (and that is only a convention that you need to learn and follow). In some cases, the file(1) command might help you guess that encoding, but you need to understand the encoding convention used to make that file. If it is not UTF-8 encoded, you can convert it (provided you know the source encoding), perhaps with the iconv(1) command.