How to do operations with 'æ', 'ø' and 'å' in C

Question

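I have written a C program that can either replace or remove all vowels in a string. In addition, I would like it to work for these characters: 'æ', 'ø', 'å'.

I have tried using strstr(), but I could not get it to work without replacing every character on a line that contains 'æ', 'ø' or 'å'. I have also read about wchar, but that only seems to complicate everything.

The program works with this array of chars: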

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

But it gives these warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

And if I want to replace every vowel with 'a', it replaces 'å' with '�a'.

I have also tried the UTF-8 hex values of 'æ', 'ø' and 'å':

char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

But it gives this error:

excess elements in char array initializer

Is there a way to make this work without making it too complicated?

Answer 1

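There are two approaches to making those characters usable. The first is code pages, which would allow you to use extended ASCII characters (values 128-255), but code pages are system and locale dependent, so it's a bad idea in general.

The better alternative is to use Unicode. The typical case with Unicode is to use wide character literals, for example: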

wchar_t str[] = L"αγρω";

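The key problem with your code is that you are trying to compare ASCII with UTF-8, which can be a problem. The solution is simple: convert all of your literals, as well as your strings, to the same wide-character representation. You need to work with a common encoding rather than mixing encodings, unless you have conversion functions to help out.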
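As a rough illustration of the "everything wide" approach, here is a minimal sketch. The names `WIDE_VOWELS` and `replace_vowels_wide` are made up for this example, and it assumes the text has already been converted to `wchar_t` (that conversion is not shown). The vowels are written with `\u` escapes so the result does not depend on how the source file is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical vowel set as a wide string. The \u escapes stand for
   æ (U+00E6), ø (U+00F8) and å (U+00E5). */
static const wchar_t WIDE_VOWELS[] = L"aeiouy\u00e6\u00f8\u00e5";

/* Replace every vowel in the wide string s, in place, with repl.
   Returns the number of characters replaced. */
size_t replace_vowels_wide(wchar_t *s, wchar_t repl) {
    assert(s != NULL);
    size_t replaced = 0;
    for (; *s != L'\0'; ++s) {
        /* Each element of s is one whole character here, so a plain
           membership test against the vowel set is enough. */
        if (wcschr(WIDE_VOWELS, *s) != NULL) {
            *s = repl;
            ++replaced;
        }
    }
    return replaced;
}
```

Calling `replace_vowels_wide(buf, L'a')` on a buffer holding `L"bl\u00e5b\u00e6r"` (blåbær) should leave `L"blabar"` in the buffer. Getting real input into `wchar_t` form (for instance with `mbstowcs` after `setlocale`) is the part that depends on the platform and locale.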

Answer 2

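Learn about UTF-8 (including its relationship to Unicode) and use some UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU ....

You need to understand what character encoding you are using.

I strongly recommend UTF-8 in all cases (it is the default on most Linux systems and on nearly all of the Internet and web servers; read locale(7) and utf8(7)). Read utf8everywhere.....

I don't recommend wchar_t, whose width, range and signedness are implementation specific (you can't be sure that Unicode fits in a wchar_t; rumor has it that it does not on Windows). Also, converting UTF-8 input to Unicode/UCS4 can be time-consuming, more so than handling the UTF-8 directly...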

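Understand that in UTF-8 a character can be encoded in several bytes. For example, ê (French e circonflexe, lower-case) is encoded in the two bytes 0xc3, 0xaa, and ы (Russian yery, lower-case) is encoded in the two bytes 0xd1, 0x8b. Both are considered vowels, but neither fits in a single char (which is an 8-bit byte on your machine and on mine).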
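To make the byte counts concrete, here is a small sketch in plain C. The helper names `utf8_seq_len` and `utf8_count_code_points` are invented for this example, and the second one assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence introduced by `lead`, judged
   from its bit pattern alone. Returns 0 for a continuation byte
   (10xxxxxx) and for 0xF8..0xFF. Overlong forms are not rejected. */
size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Count code points (not bytes) in s, assuming s is valid UTF-8:
   every byte that is not a continuation byte starts a code point. */
size_t utf8_count_code_points(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With these, `strlen("\xc3\xaa")` is 2 while `utf8_count_code_points("\xc3\xaa")` is 1, which is exactly the mismatch behind the multi-character constant warning: 'æ' in a UTF-8 source file is two bytes, not one char.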

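The notion of vowel is complicated (e.g. what are vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ....), so there might be no simple solution to your problem (since Unicode, and hence UTF-8, has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are classified as letters and as lowercase in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't mention them as letters, so œuf is in a dictionary at the place of oeuf, which means egg). But I am not an expert about this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on recent distributions), I don't recommend using wchar_t. Use UTF-8 in plain char instead (with functions that handle the multi-byte UTF-8 encoding), for example (using the Glib UTF-8 and Unicode functions):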

  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <glib.h>

  unsigned count_norvegian_lowercase_vowels(const char *s) {
    assert(s != NULL);
    // s should be a not-too-big string
    // (its `strlen` should be less than UINT_MAX)
    // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
    if (!g_utf8_validate(s, -1, NULL)) {
      fprintf(stderr, "invalid UTF-8 string %s\n", s);
      exit(EXIT_FAILURE);
    }
    unsigned count = 0;
    // walk the string one code point (not one byte) at a time
    for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
      gunichar u = g_utf8_get_char(pc);
      // comments from OP make me believe these are the only Norwegian vowels.
      if (u=='a' || u=='e' || u=='i' || u=='o' || u=='u' || u=='y'
          || u==(gunichar)0xe6 //æ U+00E6 LATIN SMALL LETTER AE
          || u==(gunichar)0xf8 //ø U+00F8 LATIN SMALL LETTER O WITH STROKE
          || u==(gunichar)0xe5 //å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
       /* notice that for me  ы & ê are also vowels but œ is a ligature ... */
      )
        count++;
    }
    return count;
  }

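I'm not sure the name of my function is correct; but you told me in comments that Norwegian (which I don't know) has no more vowel characters than those my function counts.

It is on purpose that I did not put UTF-8 in literal strings or wide char literals (only in comments). There are other, obsolete, character encodings (read about EBCDIC or KOI8), and you might want to cross-compile the code.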
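For the replacement task in the question, a library-free variant is also possible. The following is only a sketch under the assumption that the input is valid UTF-8; the names `VOWELS`, `vowel_len_at` and `replace_vowels_utf8` are invented here. Each vowel is stored as a byte string, so 'æ', 'ø' and 'å' are simply two-byte entries written as hex escapes. An array of strings like this is also what the `char extended[3]` attempt in the question needed to be.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each vowel as a UTF-8 byte string; the last three are two bytes long. */
static const char *const VOWELS[] = {
    "a", "e", "i", "o", "u", "y",
    "\xc3\xa6", /* æ U+00E6 */
    "\xc3\xb8", /* ø U+00F8 */
    "\xc3\xa5", /* å U+00E5 */
};

/* Length in bytes of the vowel starting at p, or 0 if p does not
   start with one of the vowels above. */
static size_t vowel_len_at(const char *p) {
    for (size_t i = 0; i < sizeof VOWELS / sizeof VOWELS[0]; ++i) {
        size_t len = strlen(VOWELS[i]);
        if (strncmp(p, VOWELS[i], len) == 0)
            return len;
    }
    return 0;
}

/* Replace each vowel in the UTF-8 string s with the ASCII char repl,
   in place. The result is never longer than the input, because a
   two-byte vowel shrinks to one byte. Returns the number replaced. */
size_t replace_vowels_utf8(char *s, char repl) {
    assert(s != NULL);
    char *out = s;
    size_t replaced = 0;
    while (*s != '\0') {
        size_t len = vowel_len_at(s);
        if (len > 0) {
            *out++ = repl;   /* write one byte ... */
            s += len;        /* ... and skip the whole vowel */
            ++replaced;
        } else {
            *out++ = *s++;   /* copy any other byte unchanged */
        }
    }
    *out = '\0';
    return replaced;
}
```

Skipping the whole two-byte sequence is what avoids the "�a" result from the question, where only one of the two bytes of 'å' was overwritten. Matching byte strings is safe in valid UTF-8 because the lead byte 0xc3 never occurs as a continuation byte. Removing vowels instead of replacing them would be the same loop without the `*out++ = repl;` line.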
