How to compress non-ASCII characters to 1 byte in C on Linux?

Question

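I have a list of Turkish words and I need to compare their lengths. Because some Turkish characters are not ASCII, I cannot compare the lengths correctly: each non-ASCII Turkish character takes up 2 bytes.

For example: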

#include <stdio.h>
#include <string.h>

int main(void)
{
    char s1[] = "ab";
    char s2[] = "çş";

    printf("%zu\n", strlen(s1)); // prints 2
    printf("%zu\n", strlen(s2)); // prints 4 (each letter is 2 bytes in UTF-8)

    return 0;
}

A friend of mine said this can be done on Windows with the following line of code:

system("chcp 1254");

He said it maps the Turkish characters into the extended ASCII table. But it does not work on Linux.

Is there a way to do this on Linux?

Answer 1

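One possibility is to use wide strings to store the words. This does not store each character in one byte, but it solves your main problem, and you get a set of functions that work with your language. The program would look like this: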

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t s1[] = L"ab";
    wchar_t s2[] = L"çş";

    printf("%zu\n", wcslen(s1)); // prints 2
    printf("%zu\n", wcslen(s2)); // prints 2

    return 0;
}

Answer 2

It is 2017, soon to be 2018. So use UTF-8 everywhere (on recent Linux distributions, UTF-8 is the most common encoding for most locale(7)-s, and most likely the default on your system). Of course, a Unicode character encoded in UTF-8 takes one to four bytes, so the number of Unicode characters in a UTF-8 string is not given by strlen. Consider using a UTF-8 library such as libunistring (or others, e.g. in GLib).

chcp 1254 is a Windows-specific thing that is irrelevant on UTF-8 systems, so forget about it.

If you write a GUI application, use a widget toolkit like GTK or Qt. Both handle Unicode and can accept (or convert to) UTF-8. Notice that even simply displaying Unicode (e.g. a UTF-8 or UTF-16 string) is nontrivial, because a string could mix, for example, Arabic, Japanese, Cyrillic and English words (which must be displayed in both left-to-right and right-to-left directions), so it is better to find a library (or other tool, e.g. a UTF-8 capable terminal emulator) that does this for you.

If you happen to get some file, you need to know the encoding it uses (this is just a convention that you need to find out and follow). In some cases the file(1) command might help you guess that encoding, but you need to understand the encoding convention used to make that file. If it is not UTF-8 encoded, you can convert it (provided you know the source encoding), perhaps with the iconv(1) command.
