How to change wchar.h to make wchar_t the same type as wint_t?

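wchar_t is defined in wchar.h.

Currently, developers who want to use only wchar_t cannot do so without getting type conversion warnings from the compiler. If wchar_t were made the same type as wint_t, both groups would benefit. Developers who want both wint_t and wchar_t in their programs (for example, because their code must compile not only under glibc) could use both without compiler warnings. Developers who want only wchar_t (to avoid the hassle of wint_t and explicit typecasts) could also do so without compiler warnings. It would bring no incompatibility or portability problems, with one exception: if code that uses only wchar_t is compiled on a machine with the original wchar.h, the compiler prints those pesky warnings (if -Wconversion is enabled), but the compiled program works exactly the same way.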
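To make the "hassle" concrete, here is a minimal sketch of the two-type style the question wants to avoid. The helper name store_if_char is hypothetical and not part of any library; the sketch assumes a hosted C implementation that provides <wchar.h>.

```c
#include <wchar.h>

/* Hypothetical helper (not part of any library): stores the result of a
   wide-character read in *out unless it is the end-of-file marker.
   Returns 1 if a character was stored, 0 if r was WEOF. */
int store_if_char(wint_t r, wchar_t *out)
{
    if (r == WEOF)
        return 0;
    *out = (wchar_t)r;   /* explicit cast from wint_t to wchar_t */
    return 1;
}
```

A caller would typically write something like store_if_char(getwchar(), &c). Without the cast, the assignment is the kind of line that -Wconversion tends to flag when the two types differ in signedness.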

The C standard (9899:201x 7.29) mentions:

wchar_t and wint_t can be the same integer type.

Also, in glibc wide characters are always ISO10646/Unicode/UCS-4, so they always use 4 bytes. Thus, nothing prevents wchar_t from being the same type as wint_t in glibc.
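The size claim can be checked on a given system with a small sketch like the following. Both function names are hypothetical, and _Generic requires a C11 compiler. The results are platform dependent: on x86-64 Linux with glibc, for instance, wchar_t is int and wint_t is unsigned int, so the sizes match but the types do not.

```c
#include <stddef.h>
#include <wchar.h>

/* Returns 1 if wchar_t and wint_t occupy the same number of bytes. */
int wide_types_same_size(void)
{
    return sizeof(wchar_t) == sizeof(wint_t);
}

/* Returns 1 if the compiler treats wchar_t and wint_t as the very same
   type, 0 otherwise (for example when they differ only in signedness). */
int wide_types_identical(void)
{
    return _Generic((wchar_t)0, wint_t: 1, default: 0);
}
```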

But it seems that the glibc developers do not want to make wint_t and wchar_t the same type for some reason. Therefore, I want to change my local copy of wchar.h.

ISO10646/Unicode/UCS-4 uses 2^31 values for the extended character set (the MSB is unused):

0xxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

Note that a 4-byte type can hold 2^31 extra values (those whose MSB is "1"):

1xxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

Any of those extra values can be used to represent WEOF, so a single 4-byte type can hold the whole character set and WEOF.

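Note that no recompilation of glibc is needed to use the modified wchar.h, because wint_t can be signed or unsigned (both -1 and 0xffffffff have the MSB set to "1" in any representation, and the MSB is not used in ISO10646/Unicode/UCS-4).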
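The bit-level argument can be sketched as follows. The function names are hypothetical, and the sketch uses int32_t, which is two's complement by definition, so it only illustrates the common case rather than every possible signed representation.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the unsigned value 0xffffffff and the signed value -1
   have identical object representations in a 32-bit integer. */
int weof_matches_minus_one(void)
{
    uint32_t u = 0xffffffffu;
    int32_t s = -1;
    return memcmp(&u, &s, sizeof u) == 0;
}

/* Returns 1 if the most significant bit of a 32-bit value is set. */
int msb_is_set(uint32_t v)
{
    return (int)((v >> 31) & 1u);
}
```

The largest 31-bit code value, 0x7fffffff, has the MSB clear, while either spelling of WEOF has it set, so the two ranges cannot collide.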

The definition of wchar_t happens somewhere in the following excerpt from wchar.h. How can I change it to make wchar_t the same type as wint_t?

#ifndef _WCHAR_H

#if !defined __need_mbstate_t && !defined __need_wint_t
# define _WCHAR_H 1
# include <features.h>
#endif

#ifdef _WCHAR_H
/* Get FILE definition.  */
# define __need___FILE
# if defined __USE_UNIX98 || defined __USE_XOPEN2K
#  define __need_FILE
# endif
# include <stdio.h>
/* Get va_list definition.  */
# define __need___va_list
# include <stdarg.h>

# include <bits/wchar.h>

/* Get size_t, wchar_t, wint_t and NULL from <stddef.h>.  */
# define __need_size_t
# define __need_wchar_t
# define __need_NULL
#endif
#if defined _WCHAR_H || defined __need_wint_t || !defined __WINT_TYPE__
# undef __need_wint_t
# define __need_wint_t
# include <stddef.h>

/* We try to get wint_t from <stddef.h>, but not all GCC versions define it
   there.  So define it ourselves if it remains undefined.  */
# ifndef _WINT_T
/* Integral type unchanged by default argument promotions that can
   hold any value corresponding to members of the extended character
   set, as well as at least one value that does not correspond to any
   member of the extended character set.  */
#  define _WINT_T
typedef unsigned int wint_t;
# else
/* Work around problems with the <stddef.h> file which doesn't put
   wint_t in the std namespace.  */
#  if defined __cplusplus && defined _GLIBCPP_USE_NAMESPACES \
      && defined __WINT_TYPE__
__BEGIN_NAMESPACE_STD
typedef __WINT_TYPE__ wint_t;
__END_NAMESPACE_STD
#  endif
# endif

/* Tell the caller that we provide correct C++ prototypes.  */
# if defined __cplusplus && __GNUC_PREREQ (4, 4)
#  define __CORRECT_ISO_CPP_WCHAR_H_PROTO
# endif
#endif

#if (defined _WCHAR_H || defined __need_mbstate_t) && !defined ____mbstate_t_defined
# define ____mbstate_t_defined  1
/* Conversion state information.  */
typedef struct
{
  int __count;
  union
  {
# ifdef __WINT_TYPE__
    __WINT_TYPE__ __wch;
# else
    wint_t __wch;
# endif
    char __wchb[4];
  } __value;        /* Value so far.  */
} __mbstate_t;
#endif
#undef __need_mbstate_t


/* The rest of the file is only used if used if __need_mbstate_t is not
   defined.  */
#ifdef _WCHAR_H

# ifndef __mbstate_t_defined
__BEGIN_NAMESPACE_C99
/* Public type.  */
typedef __mbstate_t mbstate_t;
__END_NAMESPACE_C99
#  define __mbstate_t_defined 1
# endif

#ifdef __USE_GNU
__USING_NAMESPACE_C99(mbstate_t)
#endif

#ifndef WCHAR_MIN
/* These constants might also be defined in <inttypes.h>.  */
# define WCHAR_MIN __WCHAR_MIN
# define WCHAR_MAX __WCHAR_MAX
#endif

#ifndef WEOF
# define WEOF (0xffffffffu)
#endif

/* For XPG4 compliance we have to define the stuff from <wctype.h> here
   as well.  */
#if defined __USE_XOPEN && !defined __USE_UNIX98
# include <wctype.h>
#endif


__BEGIN_DECLS

__BEGIN_NAMESPACE_STD
/* This incomplete type is defined in <time.h> but needed here because
   of `wcsftime'.  */
struct tm;
__END_NAMESPACE_STD
/* XXX We have to clean this up at some point.  Since tm is in the std
   namespace but wcsftime is in __c99 the type wouldn't be found
   without inserting it in the global namespace.  */
__USING_NAMESPACE_STD(tm)

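If we need to avoid type conversion warnings when the -Wconversion compiler option is used, we need to change wint_t to wchar_t in the prototypes of all library functions, and put '#define WEOF (-1)' at the beginning of wchar.h and wctype.h.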

For wchar.h, the command is:

sudo perl -i -pe 'print qq(#define WEOF (-1)\n) if $.==1; next unless /Copy SRC to DEST\./..eof; s/\bwint_t\b/wchar_t/g' /usr/include/wchar.h

For wctype.h, the command is:

sudo perl -i -pe 'print qq(#define WEOF (-1)\n) if $.==1; next unless /Wide-character classification functions/..eof; s/\bwint_t\b/wchar_t/g' /usr/include/wctype.h

Similarly, if you use other header files that use wint_t, simply change wint_t to wchar_t in the prototypes in those headers.

The explanation follows.

Some Unix systems define wchar_t as a 16-bit type and thereby follow Unicode very strictly. This definition is perfectly fine with the standard, but it also means that to represent all characters from Unicode and ISO 10646 one has to use UTF-16 surrogate characters, which is in fact a multi-wide-character encoding. But resorting to multi-wide-character encoding contradicts the purpose of the wchar_t type.

Nowadays, the only encoding that can be used for data exchange is UTF-8, and the maximum number of data bits it can hold is 31:

1111110x    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx
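The 31-bit figure comes from the six-byte pattern shown above: one payload bit in the lead byte plus six in each of the five continuation bytes. A minimal sketch of that arithmetic follows; both function names are hypothetical. Note that the six-byte form belongs to the original UTF-8 design, whereas RFC 3629 restricts UTF-8 to four bytes and code points up to U+10FFFF.

```c
#include <stdint.h>

/* Payload bits carried by a sequence of `len` bytes in the original
   UTF-8 scheme (1 to 6 bytes). Returns -1 for an invalid length. */
int utf8_payload_bits(int len)
{
    if (len == 1)
        return 7;                       /* 0xxxxxxx */
    if (len < 2 || len > 6)
        return -1;
    return (7 - len) + 6 * (len - 1);   /* lead byte + continuation bytes */
}

/* Decodes a six-byte sequence of the form shown above, without validation. */
uint32_t utf8_decode6(const unsigned char b[6])
{
    uint32_t v = b[0] & 0x01u;          /* the single x in 1111110x */
    for (int i = 1; i < 6; ++i)
        v = (v << 6) | (b[i] & 0x3Fu);  /* six x bits from 10xxxxxx */
    return v;
}
```

With every payload bit set, the decoded value is 0x7FFFFFFF, which fits in a 32-bit type with the MSB still clear.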

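So you can see that in practice there is no need for wint_t as a separate type (because 4-byte, i.e. 32-bit, data types are used to store Unicode code points). Maybe it has some applications for "backward compatibility" or the like, but in new code it is pointless. Again, a narrower type would defeat the whole purpose of having wide characters (wide characters that cannot hold everything UTF-8 can encode make no sense nowadays).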

Note that, de facto, wint_t is not used anyway. For example, see the example in man mbstowcs. There, a variable of type wchar_t is passed to iswlower() and other functions from wctype.h, which take wint_t.
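The wchar_t-only style described here looks roughly like the following sketch. It is not the man page's code: the function name count_lower_mb and the buffer size are assumptions made for illustration, and the conversion uses whatever locale is current (the default "C" locale handles ASCII input).

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define MAX_WIDE 64   /* arbitrary buffer size chosen for this sketch */

/* Converts a multibyte string to wide characters and counts the lowercase
   ones. Each wchar_t is passed straight to iswlower(), whose parameter is
   declared as wint_t. Returns 0 if the conversion fails. */
size_t count_lower_mb(const char *mbs)
{
    wchar_t buf[MAX_WIDE];
    size_t n = mbstowcs(buf, mbs, MAX_WIDE);
    size_t hits = 0;

    if (n == (size_t)-1)
        return 0;
    for (size_t i = 0; i < n; ++i) {
        if (iswlower(buf[i]))   /* implicit wchar_t -> wint_t conversion */
            hits++;
    }
    return hits;
}
```

The program behaves the same with or without a cast at the iswlower() call; the implicit conversion only matters to warning options such as -Wconversion.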

Note that wint_t was introduced because wchar_t might be a type subject to the 'default promotion' rules when passed to printf() et al. This matters, for example, when calling printf():

wchar_t wc = …;
printf("%lc", wc);

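The value in wc may be converted to wint_t. If you are writing a function like printf() that needs to use the va_arg() macro from <stdarg.h>, then you should use the type wint_t to fetch the value.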
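A minimal sketch of such a variadic function is shown below. The name count_matching and its interface are hypothetical; the point is only that the value is fetched with va_arg(ap, wint_t) rather than va_arg(ap, wchar_t).

```c
#include <stdarg.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical variadic helper: counts how many of the `count` wide
   characters passed after `count` are equal to `target`. */
size_t count_matching(wchar_t target, int count, ...)
{
    va_list ap;
    size_t hits = 0;

    va_start(ap, count);
    for (int i = 0; i < count; ++i) {
        /* Fetch as wint_t: that is the type the argument has after the
           default argument promotions. */
        wint_t w = va_arg(ap, wint_t);
        if ((wchar_t)w == target)
            hits++;
    }
    va_end(ap);
    return hits;
}
```

A call might look like count_matching(L'a', 3, (wint_t)L'a', (wint_t)L'b', (wint_t)L'a'); the casts at the call site make the promoted type explicit.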

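The standard notes that wint_t may be the same type as wchar_t, but if wchar_t is a (16-bit) short (or unsigned short), wint_t may be a (32-bit) int. To a first approximation, wint_t only matters when wchar_t is a 16-bit type. The full rules are more complex, of course. For example, int could be a 16-bit type, but that is seldom a problem.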
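The promotion behaviour can be observed with C11 _Generic, as in this sketch. The function names are hypothetical. Unary plus applies the integer promotions, which for integer types are the same conversions the default argument promotions perform. The first result assumes a platform where int is wider than short, which is typical but not guaranteed.

```c
#include <wchar.h>

/* Returns 1 if a promoted unsigned short has type int, which is the case
   whenever int can represent every unsigned short value. */
int ushort_promotes_to_int(void)
{
    return _Generic(+(unsigned short)1, int: 1, default: 0);
}

/* Returns 1 if wint_t keeps its type under promotion, as the standard
   requires of it. */
int wint_survives_promotion(void)
{
    return _Generic(+(wint_t)1, wint_t: 1, default: 0);
}
```

On an implementation with a 16-bit wchar_t, the first function models what happens to a wchar_t argument passed through "...", and the second shows why wint_t is the right type to read it back with.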

ISO/IEC 9899:2011

7.29 Extended multibyte and wide character utilities <wchar.h>

7.29.1 Introduction

¶1 The header <wchar.h> defines four macros, and declares four data types, one tag, and many functions.326)

¶2 The types declared are wchar_t and size_t (both described in 7.19);

mbstate_t

which is a complete object type other than an array type that can hold the conversion state information necessary to convert between sequences of multibyte characters and wide characters;

wint_t

which is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set (see WEOF below);327)

326) See ‘‘future library directions’’ (7.31.16).
327) wchar_t and wint_t can be the same integer type.

§7.19 Common definitions <stddef.h>

¶2 … and

wchar_t

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero. Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__.

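See Why the argument type of putchar(), fputc(), and putc() is not char for one place where the 'default promotion' rules in the C standard are quoted. There are probably other questions where the information is available too.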