C String encoding Windows/Linux
If I take the length of a string that contains a character outside the 7-bit ASCII table, I get different results on Windows and Linux:
Windows: strlen("ö") = 1
Linux: strlen("ö") = 2
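On the Windows machine the string is apparently encoded in "extended" ASCII as the single byte 0xF6, whereas on the Linux machine it is encoded in UTF-8 as 0xC3 0xB6, giving a length of 2 bytes.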
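Question:
Why is a C string encoded differently on a Windows machine and a Linux machine?
This question came up in a discussion I had with a fellow forum member on Code Review (see this thread).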
Why does a C string get encoded differently on a Windows and a Linux machine?
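First, this is not a Windows/Linux (operating system) issue but a compiler one, since there are compilers on Windows that encode like gcc (common on Linux).
Both results are allowed by C, and the two compiler makers have made different implementation choices according to their own programming goals, MS using CP-1252 and Linux using Unicode (in practice UTF-8 with gcc). MS's choice pre-dates Unicode. It is not surprising that different compiler makers adopt different solutions.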
5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined. C11dr §5.2.1 1 (My emphasis)
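strlen("ö") = 1 and strlen("ö") = 2 are therefore both valid results. "ö" is an extended character: the compiler reads it according to its source character set and stores it in the string literal using its execution character set, whose values are implementation-defined.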
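I suspect MS is focused on maintaining their code base and encourages other languages. Linux is simply an earlier adopter of Unicode in C, even though MS has been an early influencer of Unicode.
As Unicode support grows, I expect it to become the solution of the future.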