Why is the printed string shorter when a locale-specific character is used

Question

I wrote the code below. I am trying to print a string padded to a specified width, using non-ASCII characters.

#include <locale.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    setlocale(LC_ALL,"pl_PL");
    printf("%-10sx\n","ą");
    printf("%-10sx\n","a");
    return 0;
}

The output is:

ą        x
a         x

There is one fewer space when a non-ASCII character is used. Why does it behave like this?

Answer 1

Why is the printed string shorter when a locale-specific character is used

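Because the number of columns needed to display a multi-byte string is not equal to the number of bytes it occupies, and printf counts the field width of %s in bytes.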

Why does it behave like this?

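The string "ą" occupies 2 bytes in UTF-8 (plus 1 more byte for the zero terminator), but it is displayed in 1 column. So there are 8 spaces of padding.

The string "a" is 1 byte long, so there are 9 spaces of padding.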

Is there any way to overcome this issue without manually changing the desired length when the string contains a non-ASCII character?

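Use a library that has a database mapping characters to their display widths for the encoding you are working with. Traverse the string and get the number of columns needed to display it. Then increase the field width you pass to printf by the difference between the string's length in bytes and its displayed width. Overall, determining the width of displayed characters is a non-trivial task, with problems and edge cases.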

Enter the Unicode world with https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ . Read about normal, multi-byte and wide strings, and about char8_t, char16_t and char32_t strings: https://en.cppreference.com/w/c/string/wide and https://en.cppreference.com/w/c/string/multibyte

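On Linux, your locale most probably uses UTF-8 by default, your terminal most probably uses UTF-8, and your compiler uses UTF-8 encoding for string literals (these are all separate properties and can be mixed). On Linux, you can convert the string to a wide string (which is hard in itself), then iterate over it and use wcswidth to get the number of columns. There are also libraries: libunistring has the u8_width function, and ICU has u_countChar32 and similar functions (note that u_countChar32 counts code points, not columns).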

I can see something along these lines:

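#define _XOPEN_SOURCE 700  // for wcwidth on Linux
#include <wchar.h>
#include <assert.h>
#include <string.h>
#include <stdio.h>
#include <locale.h>

// Similar to wcswidth, but for a multi-byte string: returns the number of
// columns needed to display at most n bytes of s, or (size_t)-1 on error.
size_t mbswidth(const char *s, size_t n) {
   mbstate_t ps;
   memset(&ps, 0, sizeof(ps));
   size_t ret = 0;
   while (n != 0 && *s != 0) {
      wchar_t wc;
      const size_t rr = mbrtowc(&wc, s, n, &ps);
      if (rr == (size_t)-1 || rr == (size_t)-2) {
          return (size_t)-1;  // invalid or incomplete multi-byte sequence
      }
      assert(rr != 0);  // *s != 0, so a null character cannot be decoded here
      n -= rr;
      s += rr;
      const int w = wcwidth(wc);
      if (w < 0) {
          return (size_t)-1;  // non-printable character
      }
      ret += (size_t)w;
   }
   return ret;
}

int main() {
   setlocale(LC_ALL, "pl_PL.UTF-8"); // see https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html
   const char *s = "ą";  // I think I would `= u8"ą";` on newer compilers
   const size_t len = strlen(s);
   const size_t cols = mbswidth(s, len);
   assert(cols != (size_t)-1);
   // printf counts the field width in bytes, so widen the field by the
   // bytes that do not take up a column of their own
   printf("%-*sx\n", 10 + (int)(len - cols), s);
   printf("%-*sx\n", 10, "a");
}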