Perl + Curses:期望来自 getchar() 的 UTF-8 编码多字节字符,但没有得到任何

Perl + Curses: Expecting a UTF-8 encoded multibyte character from getchar(), but not getting any

我正在试用 Bryan Henderson 的 Perl 接口到 ncurses 库:Curses

作为一个简单的练习,我尝试获取在屏幕上键入的单个字符。这直接基于 NCURSES Programming HOWTO,并进行了改编。

当我调用 Perl 库的 getchar() 时,我希望收到一个字符,可能是多字节(如 this part of the library manpage 中所述,它有点复杂,因为必须处理函数的特殊情况键,没有输入,但这只是通常的花饰)。

就是下面代码中的子程序read1ch()

这适用于 ASCII 字符,但不适用于 0x7F 以上的字符。例如,当点击 è (Unicode 0x00E8, UTF-8: 0xC3, 0xA8) 时,我实际上获得了代码 0xE8 而不是 UTF-8 编码的东西。将它打印到 LANG=en_GB.UTF-8 无法正常工作的终端,无论如何我都期待 0xC3A8。

我需要更改什么才能使其正常工作,即将 è 作为正确的字符或 Perl 字符串获取?

getchar() 截取的 C 代码是 here 顺便说一下。也许它只是没有用 C_GET_WCH 集编译?如何查明?

附录

附录 1

尝试使用

设置binmode
binmode STDERR, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

这应该可以解决任何编码问题,因为终端期望并发送 UTF-8,但这没有帮助。

还尝试使用 use open 设置流编码(不太确定这与上述方法之间的区别),但这也没有帮助

use open qw(:std :encoding(UTF-8));

附录 2

Perl Curses shim 的联机帮助页说:

If wget_wch() is not available (i.e. The Curses library does not understand wide characters), this calls wgetch() [get a 1-byte char from a curses window], but returns the values described above nonetheless. This can be a problem because with a multibyte character encoding like UTF-8, you will receive two one-character strings for a two-byte-character (e.g. "Ã" and "¤" for "ä").

这里可能是这种情况,但是 wget_wch() 确实存在于这个系统上。

附录 3

试图查看 C 代码的作用,并直接将 fprintf 添加到 curses/Curses-1.36/CursesFunWide.c 的多字节处理代码中,重新编译,但未能覆盖系统 Curses.so我自己的 via LD_LIBRARY_PATH(为什么不呢?为什么只有一半时间一切正常?),所以直接替换了系统库(接受那个!)。

#ifdef C_GET_WCH
    wint_t wch;
    int ret = wget_wch(win, &wch);
    if (ret == OK) {
        ST(0) = sv_newmortal();
        fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
        c_wchar2sv(ST(0), wch);
        XSRETURN(1);
    } else if (ret == KEY_CODE_YES) {
        XST_mUNDEF(0);
        ST(1) = sv_newmortal();
        sv_setiv(ST(1), (IV)wch);
        XSRETURN(2);
    } else {
        XSRETURN_UNDEF;
    }
#else

这只是一个胖子 NOPE,当按下 ü 时,您会看到:

Obtained win_t 0x00fc

所以正确的编码是运行,但是数据是ISO-8859-1,不是UTF-8 .所以 wget_wch 表现不好。所以这是一个 curses 配置问题。呵呵

附录 4

让我吃惊的是,也许 ncurses 正在假设默认语言环境,即 C。要使其 ncurses 与宽字符一起工作,必须 "initialize the locale",这可能意味着从 "unset" 移动状态(从而使 ncurses 回落到 C ) 到 "set to what the system indicates" (这应该是 LANG 环境变量中的内容)。 ncurses 的手册页说:

The library uses the locale which the calling program has initialized. That is normally done with setlocale:

setlocale(LC_ALL, "");

If the locale is not initialized, the library assumes that characters are printable as in ISO-8859-1, to work with certain legacy programs. You should initialize the locale and not rely on specific details of the library when the locale has not been setup.

这也没有用,但我觉得解决方案就在这条路上。

附录 5

win_t(显然与wchar_t相同)转换代码来自CursesWide.c,转换从wint_t(此处视为wchar_twget_wch() 转换成 Perl 字符串。 SV 是 "scalar value" 类型。

另请参阅:https://perldoc.perl.org/perlguts.html

这里插入两个fprintf看看是怎么回事:

static void
c_wchar2sv(SV *    const sv,
           wchar_t const wc) {
/*----------------------------------------------------------------------------
  Set SV to a one-character (not -byte!) Perl string holding a given wide
  character
-----------------------------------------------------------------------------*/
    if (wc <= 0xff) {
        char s[] = { wc, 0 };
        fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_off(sv);
    } else {
        char s[UTF8_MAXBYTES + 1] = { 0 };
        char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
        *s_end = 0;
        fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_on(sv);
    }
}

使用 perl-Curses 测试代码

如果您尝试,请按 BACKSPACE 退出循环,因为不再解释 CTRL-C。

下面代码很多,但是临界区用----- Testing:

标示
#!/usr/bin/perl

# pmap -p PID
# shows the per process using 
# /usr/lib64/libncursesw.so.6.1
# /usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so

# Trying https://metacpan.org/release/Curses

use warnings;
use strict;
use utf8;          # Meaning "This lexical scope (i.e. file) contains utf8"

use Curses;        # On Fedora: dnf install perl-Curses

# This didn't fix it 
# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

# https://perldoc.perl.org/perllocale.html#The-setlocale-function

use POSIX ();
my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");

# ---
# Surrounds the actual program
# ---

sub setup() {
   initscr();
   raw();
   keypad(1);
   noecho();
}

sub teardown {
   endwin();
}

# ---
# Mainly for prettyprinting
# ---

my $special_keys = setup_special_keys();

# ---
# Error printing
# ---

sub mt {
   return sprintf("%i: ",time());
}

sub ae {
   my ($x,$fname) = @_;
   if ($x == ERR) { 
      printw mt();
      printw "Got error code from '$fname': $x\n"
   }
}

# ---
# Where the action is
# ---

sub announce {
   my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
   ae($res, "printw");
   return { refresh => 1 }
}

sub read1ch {
   # Read a next character, waiting until it is there.
   # Use the wide-character aware functions unless you want to deal with
   # collating individual bytes yourself!
   # Readings:
   # https://metacpan.org/pod/Curses#Wide-Character-Aware-Functions
   # https://perldoc.perl.org/perlunicode.html#Unicode-Character-Properties
   # https://www.ahinea.com/en/tech/perl-unicode-struggle.html
   # https://hexdump.wordpress.com/2009/06/19/character-encoding-issues-part-ii-perl/
   my ($ch, $key) = getchar();
   if (defined $key) {
      # it's a function key
      printw "Function key pressed: $key"; 
      printw " with known alias '" . $$special_keys{$key} . "'" if (exists $$special_keys{$key});
      printw "\n";
      # done if backspace was hit
      return { done => ($key == KEY_BACKSPACE()) }
   }
   elsif (defined $ch) {
      # "$ch" should be a String of 1 character

      # ----- Testing

      printw "Locale: $loc\n";
      printw "Multibyte output test: öüäéèà периоду\n";
      printw sprintf("Received string '%s' of length %i with ordinal 0x%x\n", $ch, length($ch), ord($ch));

      {
         # https://perldoc.perl.org/bytes.html
         use bytes;
         printw sprintf("... length is %i\n"     , length($ch));
         printw sprintf("... contents are %vd\n" , $ch);
      }

      # ----- Testing

      return { ch => $ch }
   }
   else {
      # it's an error
      printw "getchar() failed\n";
      return {}
   }
}

sub feedback {
   my ($ch) = @_;
   printw "The pressed key is: ";
   attron(A_BOLD);
   printw("%s\n","$ch"); # do not print $txt directly to make sure escape sequences are not interpreted!
   attroff(A_BOLD);
   return { refresh => 1 }  # should refresh
}

sub do_curses_run {

   setup;

   my $done = 0;
   while (!$done) {
      my $bubl;
      $bubl = announce(); 
      refresh() if $$bubl{refresh};
      $bubl = read1ch();
      $done = $$bubl{done};
      if (defined $$bubl{ch}) {
         $bubl = feedback($$bubl{ch}); 
         refresh() if $$bubl{refresh};
      }
   }

   teardown;
}

# ---
# main
# ---

do_curses_run();


sub setup_special_keys {
   # the key codes on the left must be called once to resolve to a numeric constant!
   my $res = {
      KEY_BREAK()       => "Break key",
      KEY_DOWN()        => "Arrow down",
      KEY_UP()          => "Arrow up",
      KEY_LEFT()        => "Arrow left",
      KEY_RIGHT()       => "Arrow right",
      KEY_HOME()        => "Home key",
      KEY_BACKSPACE()   => "Backspace",
      KEY_DL()          => "Delete line",
      KEY_IL()          => "Insert line",
      KEY_DC()          => "Delete character",
      KEY_IC()          => "Insert char or enter insert mode",
      KEY_EIC()         => "Exit insert char mode",
      KEY_CLEAR()       => "Clear screen",
      KEY_EOS()         => "Clear to end of screen",
      KEY_EOL()         => "Clear to end of line",
      KEY_SF()          => "Scroll 1 line forward",
      KEY_SR()          => "Scroll 1 line backward (reverse)",
      KEY_NPAGE()       => "Next page",
      KEY_PPAGE()       => "Previous page",
      KEY_STAB()        => "Set tab",
      KEY_CTAB()        => "Clear tab",
      KEY_CATAB()       => "Clear all tabs",
      KEY_ENTER()       => "Enter or send",
      KEY_SRESET()      => "Soft (partial) reset",
      KEY_RESET()       => "Reset or hard reset",
      KEY_PRINT()       => "Print or copy",
      KEY_LL()          => "Home down or bottom (lower left)",
      KEY_A1()          => "Upper left of keypad",
      KEY_A3()          => "Upper right of keypad",
      KEY_B2()          => "Center of keypad",
      KEY_C1()          => "Lower left of keypad",
      KEY_C3 ()         => "Lower right of keypad",
      KEY_BTAB()        => "Back tab key",
      KEY_BEG()         => "Beg(inning) key",
      KEY_CANCEL()      => "Cancel key",
      KEY_CLOSE()       => "Close key",
      KEY_COMMAND()     => "Cmd (command) key",
      KEY_COPY()        => "Copy key",
      KEY_CREATE()      => "Create key",
      KEY_END()         => "End key",
      KEY_EXIT()        => "Exit key",
      KEY_FIND()        => "Find key",
      KEY_HELP()        => "Help key",
      KEY_MARK()        => "Mark key",
      KEY_MESSAGE()     => "Message key",
      KEY_MOUSE()       => "Mouse event read",
      KEY_MOVE()        => "Move key",
      KEY_NEXT()        => "Next object key",
      KEY_OPEN()        => "Open key",
      KEY_OPTIONS()     => "Options key",
      KEY_PREVIOUS()    => "Previous object key",
      KEY_REDO()        => "Redo key",
      KEY_REFERENCE()   => "Ref(erence) key",
      KEY_REFRESH()     => "Refresh key",
      KEY_REPLACE()     => "Replace key",
      KEY_RESIZE()      => "Screen resized",
      KEY_RESTART()     => "Restart key",
      KEY_RESUME()      => "Resume key",
      KEY_SAVE()        => "Save key",
      KEY_SBEG()        => "Shifted beginning key",
      KEY_SCANCEL()     => "Shifted cancel key",
      KEY_SCOMMAND()    => "Shifted command key",
      KEY_SCOPY()       => "Shifted copy key",
      KEY_SCREATE()     => "Shifted create key",
      KEY_SDC()         => "Shifted delete char key",
      KEY_SDL()         => "Shifted delete line key",
      KEY_SELECT()      => "Select key",
      KEY_SEND()        => "Shifted end key",
      KEY_SEOL()        => "Shifted clear line key",
      KEY_SEXIT()       => "Shifted exit key",
      KEY_SFIND()       => "Shifted find key",
      KEY_SHELP()       => "Shifted help key",
      KEY_SHOME()       => "Shifted home key",
      KEY_SIC()         => "Shifted input key",
      KEY_SLEFT()       => "Shifted left arrow key",
      KEY_SMESSAGE()    => "Shifted message key",
      KEY_SMOVE()       => "Shifted move key",
      KEY_SNEXT()       => "Shifted next key",
      KEY_SOPTIONS()    => "Shifted options key",
      KEY_SPREVIOUS()   => "Shifted prev key",
      KEY_SPRINT()      => "Shifted print key",
      KEY_SREDO()       => "Shifted redo key",
      KEY_SREPLACE()    => "Shifted replace key",
      KEY_SRIGHT()      => "Shifted right arrow",
      KEY_SRSUME()      => "Shifted resume key",
      KEY_SSAVE()       => "Shifted save key",
      KEY_SSUSPEND()    => "Shifted suspend key",
      KEY_SUNDO()       => "Shifted undo key",
      KEY_SUSPEND()     => "Suspend key",
      KEY_UNDO()        => "Undo key"
   };

   for (my $f = 1; $f <= 64; $f++) {
      $$res{KEY_F($f)} = "KEY_F($f)"   
   }

   return $res

}

实际上它看起来是正确的。

运行 你的带有 strace 的脚本可以提供帮助...我这样做是为了查看系统调用:

strace -fo strace.out -s 1024 ./foo

并且可以看到读取、消息等。可以使用调试库为 ncurses 获取类似的跟踪,尽管打包者在提供启用跟踪的方面并不一致。

ü在UTF-8中是34(八进制),它的Unicode值为252(十进制),或 0xfc(十六进制)。这部分问题好像漏了一点:

That's just a fat NOPE, when pressing ü one sees:

Obtained win_t 0x00fc

So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.

wget_wch returns(出于实用目的)一个 Unicode 值(不是 UTF-8 字节序列)。 ISO-8859-1 代码 160-255 碰巧(并非巧合)匹配 Unicode 代码点,尽管后者在 UTF-8 中肯定 encoded 不同。

wgetch 会 return UTF-8 字节,但 Perl 脚本只会将其用作后备(因为这会导致 Perl 脚本将 UTF-8 字符串转换为 Unicode值)。

Thomas Dickey 正确地注意到 收到了正确的数据

我花了一些时间才真正确定。

混淆是因为 Perl 的 sprintf 无法处理 UTF-8 而 Perl Curses printw 无法处理区域 0x800x7F.

这需要更长的时间才能确定。

其实我新开了一个关于这个的问题:

Are there one (or two) solid bugs in the `curses` shim for Perl?

[ 此答案假定 libncursesw 可用且正在使用。在没有宽字符支持的情况下尝试输出 "wide characters" 是没有意义的:) ]


简答

getchar 工作正常。它 returns 一串 Unicode 代码点(又名解码文本),非常理想。

printw 已损坏,但可以通过在程序中添加以下内容使其接受一串 Unicode 代码点(也称为解码文本):

{
   # Add wide character support to printw.
   # This only modifies the current package (main),
   # so it won't affect any code by ours.
   no warnings qw( redefine );
   sub printw { addstring(sprintf shift, @_) }
}


getchar有问题吗?

所以您认为 getchar 有问题。让我们尝试通过检查 getchar returns 来确认这一点。我们将通过添加以下内容来做到这一点:

printw("String received from getchar: %vX\n", $ch);

%vX 将以十六进制打印字符串的每个字符的值,由句点连接。)

  • 当按下e(U+0065),一个7位的字符,可以看到:

    String received from getchar: 65
    
  • 当按é(U+00E9)时,一个8位字符,可以看到:

    String received from getchar: E9
    
  • 当按ē(U+0113)时,一个9位字符,可以看到:

    String received from getchar: 113
    

在这三种情况下,我们得到的字符串正好是一个字符长,并且该字符由输入的 Unicode 代码点组成。[1] 这正是我们想要。应用和删除字符编码应该在外围完成,这样程序的主要逻辑就不必担心编码,并且正在这样做。

结论:getchar没有问题。


printw有问题吗?

所以问题一定出在输出上。为了确认这一点,我在您的程序中添加了以下内容:

sub _d { utf8::downgrade( my $s = shift ); $s }
sub _u { utf8::upgrade(   my $s = shift ); $s }

for (
   [ "7-bit, UTF8=0" => _d(chr(0x65)) ],   # Expect e
   [ "7-bit, UTF8=1" => _u(chr(0x65)) ],   # Expect e
   [ "8-bit, UTF8=0" => _d(chr(0xE9)) ],   # Expect é
   [ "8-bit, UTF8=1" => _u(chr(0xE9)) ],   # Expect é
   [ "9-bit, UTF8=1" => chr(0x113)    ],   # Expect ē
) {
   my ($name, $chr) = @$_;
   printw("%s: %s\n", $name, $chr);
}

输出:

7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0:
8-bit, UTF8=1: é
9-bit, UTF8=1:  S

从上面我们观察到:

  • 我们看到 _d(chr(0xE9))_u(chr(0xE9)) 的结果之间存在差异,即使两个标量包含相同的字符串(_d(chr(0xE9)) eq _u(chr(0xE9)) 为真)。因此,该函数存在 The Unicode Bug。
  • 根据 8 位测试,它似乎接受 Unicode 代码点(解码文本)而不是 UTF-8。这是理想的。
  • 根据 9 位测试,它似乎不接受 Unicode 代码点。随后的测试表明它也不接受 chr(0x113) 的 UTF-8 编码。

结论:printw存在重大问题。


解决 printw

的问题

解决 Unicode Bug 很容易,但缺乏对 0​​xFF 以上字符的支持是一个阻碍。让我们深入研究代码。

好的,我们不必寻找问题。我们看到 printw 是根据 addstr 定义的,而 addstr 早于宽字符支持。 addstring 是支持宽字符的对应项,所以让我们让 printw 使用 addstring 而不是 addstr

{
   # Add wide character support to printw.
   # This only modifies the current package (main),
   # so it won't affect any code by ours.
   no warnings qw( redefine );
   sub printw { addstring(sprintf shift, @_) }
}

输出:

7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0: é
8-bit, UTF8=1: é
9-bit, UTF8=1: ē

宾果!

从上面我们观察到:

  • 我们发现 UTF8=0 测试的结果与其对应的 UTF8=1 测试的结果没有差异。因此,此函数不会受到 Unicode Bug 的影响。
  • 它始终接受 Unicode 代码点(解码文本)字符串。值得注意的是,它不需要 UTF-8 或语言环境的编码。

这正是我们expect/desire。


  1. 具体来说,getchar 没有像您认为的那样返回输入的 iso-8859-1 编码。这种混淆是可以理解的,因为 Unicode 是 iso-8859-1 的扩展。