Perl + Curses:期望来自 getchar() 的 UTF-8 编码多字节字符,但没有得到任何
Perl + Curses: Expecting a UTF-8 encoded multibyte character from getchar(), but not getting any
我正在试用 Bryan Henderson 的 Perl 接口到 ncurses 库:Curses
作为一个简单的练习,我尝试获取在屏幕上键入的单个字符。这直接基于 NCURSES Programming HOWTO,并进行了改编。
当我调用 Perl 库的 getchar()
时,我希望收到一个字符,可能是多字节(如 this part of the library manpage 中所述,它有点复杂,因为必须处理函数的特殊情况键,没有输入,但这只是通常的花饰)。
就是下面代码中的子程序read1ch()
。
这适用于 ASCII 字符,但不适用于 0x7F 以上的字符。例如,当点击 è
(Unicode 0x00E8, UTF-8: 0xC3, 0xA8) 时,我实际上获得了代码 0xE8 而不是 UTF-8 编码的东西。将它打印到 LANG=en_GB.UTF-8
无法正常工作的终端,无论如何我都期待 0xC3A8。
我需要更改什么才能使其正常工作,即将 è
作为正确的字符或 Perl 字符串获取?
为 getchar()
截取的 C 代码是 here 顺便说一下。也许它只是没有用 C_GET_WCH
集编译?如何查明?
附录
附录 1
尝试使用
设置binmode
binmode STDERR, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
这应该可以解决任何编码问题,因为终端期望并发送 UTF-8,但这没有帮助。
还尝试使用 use open 设置流编码(不太确定这与上述方法之间的区别),但这也没有帮助
use open qw(:std :encoding(UTF-8));
附录 2
Perl Curses shim 的联机帮助页说:
If wget_wch()
is not available (i.e. The Curses library does not
understand wide characters), this calls wgetch()
[get a 1-byte char
from a curses window], but returns the
values described above nonetheless. This can be a problem because with a
multibyte character encoding like UTF-8, you will receive two
one-character strings for a two-byte-character (e.g. "Ã" and "¤" for
"ä").
这里可能是这种情况,但是 wget_wch()
确实存在于这个系统上。
附录 3
试图查看 C 代码的作用,并直接将 fprintf
添加到 curses/Curses-1.36/CursesFunWide.c
的多字节处理代码中,重新编译,但未能覆盖系统 Curses.so
我自己的 via LD_LIBRARY_PATH
(为什么不呢?为什么只有一半时间一切正常?),所以直接替换了系统库(接受那个!)。
#ifdef C_GET_WCH
wint_t wch;
int ret = wget_wch(win, &wch);
if (ret == OK) {
ST(0) = sv_newmortal();
fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
c_wchar2sv(ST(0), wch);
XSRETURN(1);
} else if (ret == KEY_CODE_YES) {
XST_mUNDEF(0);
ST(1) = sv_newmortal();
sv_setiv(ST(1), (IV)wch);
XSRETURN(2);
} else {
XSRETURN_UNDEF;
}
#else
这只是一个胖子 NOPE,当按下 ü
时,您会看到:
Obtained win_t 0x00fc
所以正确的编码是运行,但是数据是ISO-8859-1,不是UTF-8 .所以 wget_wch
表现不好。所以这是一个 curses 配置问题。呵呵
附录 4
让我吃惊的是,也许 ncurses
正在假设默认语言环境,即 C
。要使其 ncurses
与宽字符一起工作,必须 "initialize the locale",这可能意味着从 "unset" 移动状态(从而使 ncurses
回落到 C
) 到 "set to what the system indicates" (这应该是 LANG
环境变量中的内容)。 ncurses
的手册页说:
The library uses the locale which the calling program has initialized.
That is normally done with setlocale:
setlocale(LC_ALL, "");
If the locale is not initialized, the library assumes that characters
are printable as in ISO-8859-1, to work with certain legacy programs.
You should initialize the locale and not rely on specific details of the
library when the locale has not been setup.
这也没有用,但我觉得解决方案就在这条路上。
附录 5
win_t
(显然与wchar_t
相同)转换代码来自CursesWide.c
,转换从wint_t
(此处视为wchar_t
) wget_wch()
转换成 Perl 字符串。 SV
是 "scalar value" 类型。
另请参阅:https://perldoc.perl.org/perlguts.html
这里插入两个fprintf
看看是怎么回事:
static void
c_wchar2sv(SV * const sv,
wchar_t const wc) {
/*----------------------------------------------------------------------------
Set SV to a one-character (not -byte!) Perl string holding a given wide
character
-----------------------------------------------------------------------------*/
if (wc <= 0xff) {
char s[] = { wc, 0 };
fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
sv_setpv(sv, s);
SvPOK_on(sv);
SvUTF8_off(sv);
} else {
char s[UTF8_MAXBYTES + 1] = { 0 };
char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
*s_end = 0;
fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
sv_setpv(sv, s);
SvPOK_on(sv);
SvUTF8_on(sv);
}
}
使用 perl-Curses 测试代码
- 尝试使用 perl-Curses-1.36-9.fc30。x86_64
- 尝试使用 perl-Curses-1.36-11.fc31。x86_64
如果您尝试,请按 BACKSPACE 退出循环,因为不再解释 CTRL-C。
下面代码很多,但是临界区用----- Testing
:
标示
#!/usr/bin/perl
# pmap -p PID
# shows the per process using
# /usr/lib64/libncursesw.so.6.1
# /usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so
# Trying https://metacpan.org/release/Curses
use warnings;
use strict;
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
use Curses; # On Fedora: dnf install perl-Curses
# This didn't fix it
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
# https://perldoc.perl.org/perllocale.html#The-setlocale-function
use POSIX ();
my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");
# ---
# Surrounds the actual program
# ---
sub setup() {
initscr();
raw();
keypad(1);
noecho();
}
sub teardown {
endwin();
}
# ---
# Mainly for prettyprinting
# ---
my $special_keys = setup_special_keys();
# ---
# Error printing
# ---
sub mt {
return sprintf("%i: ",time());
}
sub ae {
my ($x,$fname) = @_;
if ($x == ERR) {
printw mt();
printw "Got error code from '$fname': $x\n"
}
}
# ---
# Where the action is
# ---
sub announce {
my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
ae($res, "printw");
return { refresh => 1 }
}
sub read1ch {
# Read a next character, waiting until it is there.
# Use the wide-character aware functions unless you want to deal with
# collating individual bytes yourself!
# Readings:
# https://metacpan.org/pod/Curses#Wide-Character-Aware-Functions
# https://perldoc.perl.org/perlunicode.html#Unicode-Character-Properties
# https://www.ahinea.com/en/tech/perl-unicode-struggle.html
# https://hexdump.wordpress.com/2009/06/19/character-encoding-issues-part-ii-perl/
my ($ch, $key) = getchar();
if (defined $key) {
# it's a function key
printw "Function key pressed: $key";
printw " with known alias '" . $$special_keys{$key} . "'" if (exists $$special_keys{$key});
printw "\n";
# done if backspace was hit
return { done => ($key == KEY_BACKSPACE()) }
}
elsif (defined $ch) {
# "$ch" should be a String of 1 character
# ----- Testing
printw "Locale: $loc\n";
printw "Multibyte output test: öüäéèà периоду\n";
printw sprintf("Received string '%s' of length %i with ordinal 0x%x\n", $ch, length($ch), ord($ch));
{
# https://perldoc.perl.org/bytes.html
use bytes;
printw sprintf("... length is %i\n" , length($ch));
printw sprintf("... contents are %vd\n" , $ch);
}
# ----- Testing
return { ch => $ch }
}
else {
# it's an error
printw "getchar() failed\n";
return {}
}
}
sub feedback {
my ($ch) = @_;
printw "The pressed key is: ";
attron(A_BOLD);
printw("%s\n","$ch"); # do not print $txt directly to make sure escape sequences are not interpreted!
attroff(A_BOLD);
return { refresh => 1 } # should refresh
}
sub do_curses_run {
setup;
my $done = 0;
while (!$done) {
my $bubl;
$bubl = announce();
refresh() if $$bubl{refresh};
$bubl = read1ch();
$done = $$bubl{done};
if (defined $$bubl{ch}) {
$bubl = feedback($$bubl{ch});
refresh() if $$bubl{refresh};
}
}
teardown;
}
# ---
# main
# ---
do_curses_run();
sub setup_special_keys {
# the key codes on the left must be called once to resolve to a numeric constant!
my $res = {
KEY_BREAK() => "Break key",
KEY_DOWN() => "Arrow down",
KEY_UP() => "Arrow up",
KEY_LEFT() => "Arrow left",
KEY_RIGHT() => "Arrow right",
KEY_HOME() => "Home key",
KEY_BACKSPACE() => "Backspace",
KEY_DL() => "Delete line",
KEY_IL() => "Insert line",
KEY_DC() => "Delete character",
KEY_IC() => "Insert char or enter insert mode",
KEY_EIC() => "Exit insert char mode",
KEY_CLEAR() => "Clear screen",
KEY_EOS() => "Clear to end of screen",
KEY_EOL() => "Clear to end of line",
KEY_SF() => "Scroll 1 line forward",
KEY_SR() => "Scroll 1 line backward (reverse)",
KEY_NPAGE() => "Next page",
KEY_PPAGE() => "Previous page",
KEY_STAB() => "Set tab",
KEY_CTAB() => "Clear tab",
KEY_CATAB() => "Clear all tabs",
KEY_ENTER() => "Enter or send",
KEY_SRESET() => "Soft (partial) reset",
KEY_RESET() => "Reset or hard reset",
KEY_PRINT() => "Print or copy",
KEY_LL() => "Home down or bottom (lower left)",
KEY_A1() => "Upper left of keypad",
KEY_A3() => "Upper right of keypad",
KEY_B2() => "Center of keypad",
KEY_C1() => "Lower left of keypad",
KEY_C3 () => "Lower right of keypad",
KEY_BTAB() => "Back tab key",
KEY_BEG() => "Beg(inning) key",
KEY_CANCEL() => "Cancel key",
KEY_CLOSE() => "Close key",
KEY_COMMAND() => "Cmd (command) key",
KEY_COPY() => "Copy key",
KEY_CREATE() => "Create key",
KEY_END() => "End key",
KEY_EXIT() => "Exit key",
KEY_FIND() => "Find key",
KEY_HELP() => "Help key",
KEY_MARK() => "Mark key",
KEY_MESSAGE() => "Message key",
KEY_MOUSE() => "Mouse event read",
KEY_MOVE() => "Move key",
KEY_NEXT() => "Next object key",
KEY_OPEN() => "Open key",
KEY_OPTIONS() => "Options key",
KEY_PREVIOUS() => "Previous object key",
KEY_REDO() => "Redo key",
KEY_REFERENCE() => "Ref(erence) key",
KEY_REFRESH() => "Refresh key",
KEY_REPLACE() => "Replace key",
KEY_RESIZE() => "Screen resized",
KEY_RESTART() => "Restart key",
KEY_RESUME() => "Resume key",
KEY_SAVE() => "Save key",
KEY_SBEG() => "Shifted beginning key",
KEY_SCANCEL() => "Shifted cancel key",
KEY_SCOMMAND() => "Shifted command key",
KEY_SCOPY() => "Shifted copy key",
KEY_SCREATE() => "Shifted create key",
KEY_SDC() => "Shifted delete char key",
KEY_SDL() => "Shifted delete line key",
KEY_SELECT() => "Select key",
KEY_SEND() => "Shifted end key",
KEY_SEOL() => "Shifted clear line key",
KEY_SEXIT() => "Shifted exit key",
KEY_SFIND() => "Shifted find key",
KEY_SHELP() => "Shifted help key",
KEY_SHOME() => "Shifted home key",
KEY_SIC() => "Shifted input key",
KEY_SLEFT() => "Shifted left arrow key",
KEY_SMESSAGE() => "Shifted message key",
KEY_SMOVE() => "Shifted move key",
KEY_SNEXT() => "Shifted next key",
KEY_SOPTIONS() => "Shifted options key",
KEY_SPREVIOUS() => "Shifted prev key",
KEY_SPRINT() => "Shifted print key",
KEY_SREDO() => "Shifted redo key",
KEY_SREPLACE() => "Shifted replace key",
KEY_SRIGHT() => "Shifted right arrow",
KEY_SRSUME() => "Shifted resume key",
KEY_SSAVE() => "Shifted save key",
KEY_SSUSPEND() => "Shifted suspend key",
KEY_SUNDO() => "Shifted undo key",
KEY_SUSPEND() => "Suspend key",
KEY_UNDO() => "Undo key"
};
for (my $f = 1; $f <= 64; $f++) {
$$res{KEY_F($f)} = "KEY_F($f)"
}
return $res
}
实际上它看起来是正确的。
运行 你的带有 strace 的脚本可以提供帮助...我这样做是为了查看系统调用:
strace -fo strace.out -s 1024 ./foo
并且可以看到读取、消息等。可以使用调试库为 ncurses 获取类似的跟踪,尽管打包者在提供启用跟踪的方面并不一致。
ü
在UTF-8中是34
(八进制),它的Unicode值为252
(十进制),或 0xfc
(十六进制)。这部分问题好像漏了一点:
That's just a fat NOPE, when pressing ü one sees:
Obtained win_t 0x00fc
So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.
wget_wch
returns(出于实用目的)一个 Unicode 值(不是 UTF-8 字节序列)。 ISO-8859-1 代码 160-255 碰巧(并非巧合)匹配 Unicode 代码点,尽管后者在 UTF-8 中肯定 encoded 不同。
wgetch
会 return UTF-8 字节,但 Perl 脚本只会将其用作后备(因为这会导致 Perl 脚本将 UTF-8 字符串转换为 Unicode值)。
Thomas Dickey 正确地注意到 收到了正确的数据。
我花了一些时间才真正确定。
混淆是因为 Perl 的 sprintf
无法处理 UTF-8 而 Perl Curses printw
无法处理区域 0x80
到 0x7F
.
这需要更长的时间才能确定。
其实我新开了一个关于这个的问题:
Are there one (or two) solid bugs in the `curses` shim for Perl?
[ 此答案假定 libncursesw 可用且正在使用。在没有宽字符支持的情况下尝试输出 "wide characters" 是没有意义的:) ]
简答
getchar
工作正常。它 returns 一串 Unicode 代码点(又名解码文本),非常理想。
printw
已损坏,但可以通过在程序中添加以下内容使其接受一串 Unicode 代码点(也称为解码文本):
{
# Add wide character support to printw.
# This only modifies the current package (main),
# so it won't affect any code by ours.
no warnings qw( redefine );
sub printw { addstring(sprintf shift, @_) }
}
getchar
有问题吗?
所以您认为 getchar
有问题。让我们尝试通过检查 getchar
returns 来确认这一点。我们将通过添加以下内容来做到这一点:
printw("String received from getchar: %vX\n", $ch);
(%vX
将以十六进制打印字符串的每个字符的值,由句点连接。)
当按下e
(U+0065),一个7位的字符,可以看到:
String received from getchar: 65
当按é
(U+00E9)时,一个8位字符,可以看到:
String received from getchar: E9
当按ē
(U+0113)时,一个9位字符,可以看到:
String received from getchar: 113
在这三种情况下,我们得到的字符串正好是一个字符长,并且该字符由输入的 Unicode 代码点组成。[1] 这正是我们想要。应用和删除字符编码应该在外围完成,这样程序的主要逻辑就不必担心编码,并且正在这样做。
结论:getchar
没有问题。
printw
有问题吗?
所以问题一定出在输出上。为了确认这一点,我在您的程序中添加了以下内容:
sub _d { utf8::downgrade( my $s = shift ); $s }
sub _u { utf8::upgrade( my $s = shift ); $s }
for (
[ "7-bit, UTF8=0" => _d(chr(0x65)) ], # Expect e
[ "7-bit, UTF8=1" => _u(chr(0x65)) ], # Expect e
[ "8-bit, UTF8=0" => _d(chr(0xE9)) ], # Expect é
[ "8-bit, UTF8=1" => _u(chr(0xE9)) ], # Expect é
[ "9-bit, UTF8=1" => chr(0x113) ], # Expect ē
) {
my ($name, $chr) = @$_;
printw("%s: %s\n", $name, $chr);
}
输出:
7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0:
8-bit, UTF8=1: é
9-bit, UTF8=1: S
从上面我们观察到:
- 我们看到
_d(chr(0xE9))
和 _u(chr(0xE9))
的结果之间存在差异,即使两个标量包含相同的字符串(_d(chr(0xE9)) eq _u(chr(0xE9))
为真)。因此,该函数存在 The Unicode Bug。
- 根据 8 位测试,它似乎接受 Unicode 代码点(解码文本)而不是 UTF-8。这是理想的。
- 根据 9 位测试,它似乎不接受 Unicode 代码点。随后的测试表明它也不接受
chr(0x113)
的 UTF-8 编码。
结论:printw
存在重大问题。
解决 printw
的问题
解决 Unicode Bug 很容易,但缺乏对 0xFF 以上字符的支持是一个阻碍。让我们深入研究代码。
好的,我们不必寻找问题。我们看到 printw
是根据 addstr
定义的,而 addstr
早于宽字符支持。 addstring
是支持宽字符的对应项,所以让我们让 printw
使用 addstring
而不是 addstr
。
{
# Add wide character support to printw.
# This only modifies the current package (main),
# so it won't affect any code by ours.
no warnings qw( redefine );
sub printw { addstring(sprintf shift, @_) }
}
输出:
7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0: é
8-bit, UTF8=1: é
9-bit, UTF8=1: ē
宾果!
从上面我们观察到:
- 我们发现
UTF8=0
测试的结果与其对应的 UTF8=1
测试的结果没有差异。因此,此函数不会受到 Unicode Bug 的影响。
- 它始终接受 Unicode 代码点(解码文本)字符串。值得注意的是,它不需要 UTF-8 或语言环境的编码。
这正是我们expect/desire。
- 具体来说,
getchar
没有像您认为的那样返回输入的 iso-8859-1 编码。这种混淆是可以理解的,因为 Unicode 是 iso-8859-1 的扩展。
我正在试用 Bryan Henderson 的 Perl 接口到 ncurses 库:Curses
作为一个简单的练习,我尝试获取在屏幕上键入的单个字符。这直接基于 NCURSES Programming HOWTO,并进行了改编。
当我调用 Perl 库的 getchar()
时,我希望收到一个字符,可能是多字节(如 this part of the library manpage 中所述,它有点复杂,因为必须处理函数的特殊情况键,没有输入,但这只是通常的花饰)。
就是下面代码中的子程序read1ch()
。
这适用于 ASCII 字符,但不适用于 0x7F 以上的字符。例如,当点击 è
(Unicode 0x00E8, UTF-8: 0xC3, 0xA8) 时,我实际上获得了代码 0xE8 而不是 UTF-8 编码的东西。将它打印到 LANG=en_GB.UTF-8
无法正常工作的终端,无论如何我都期待 0xC3A8。
我需要更改什么才能使其正常工作,即将 è
作为正确的字符或 Perl 字符串获取?
为 getchar()
截取的 C 代码是 here 顺便说一下。也许它只是没有用 C_GET_WCH
集编译?如何查明?
附录
附录 1
尝试使用
设置binmodebinmode STDERR, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
这应该可以解决任何编码问题,因为终端期望并发送 UTF-8,但这没有帮助。
还尝试使用 use open 设置流编码(不太确定这与上述方法之间的区别),但这也没有帮助
use open qw(:std :encoding(UTF-8));
附录 2
Perl Curses shim 的联机帮助页说:
If
wget_wch()
is not available (i.e. The Curses library does not understand wide characters), this callswgetch()
[get a 1-byte char from a curses window], but returns the values described above nonetheless. This can be a problem because with a multibyte character encoding like UTF-8, you will receive two one-character strings for a two-byte-character (e.g. "Ã" and "¤" for "ä").
这里可能是这种情况,但是 wget_wch()
确实存在于这个系统上。
附录 3
试图查看 C 代码的作用,并直接将 fprintf
添加到 curses/Curses-1.36/CursesFunWide.c
的多字节处理代码中,重新编译,但未能覆盖系统 Curses.so
我自己的 via LD_LIBRARY_PATH
(为什么不呢?为什么只有一半时间一切正常?),所以直接替换了系统库(接受那个!)。
#ifdef C_GET_WCH
wint_t wch;
int ret = wget_wch(win, &wch);
if (ret == OK) {
ST(0) = sv_newmortal();
fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
c_wchar2sv(ST(0), wch);
XSRETURN(1);
} else if (ret == KEY_CODE_YES) {
XST_mUNDEF(0);
ST(1) = sv_newmortal();
sv_setiv(ST(1), (IV)wch);
XSRETURN(2);
} else {
XSRETURN_UNDEF;
}
#else
这只是一个胖子 NOPE,当按下 ü
时,您会看到:
Obtained win_t 0x00fc
所以正确的编码是运行,但是数据是ISO-8859-1,不是UTF-8 .所以 wget_wch
表现不好。所以这是一个 curses 配置问题。呵呵
附录 4
让我吃惊的是,也许 ncurses
正在假设默认语言环境,即 C
。要使其 ncurses
与宽字符一起工作,必须 "initialize the locale",这可能意味着从 "unset" 移动状态(从而使 ncurses
回落到 C
) 到 "set to what the system indicates" (这应该是 LANG
环境变量中的内容)。 ncurses
的手册页说:
The library uses the locale which the calling program has initialized. That is normally done with setlocale:
setlocale(LC_ALL, "");
If the locale is not initialized, the library assumes that characters are printable as in ISO-8859-1, to work with certain legacy programs. You should initialize the locale and not rely on specific details of the library when the locale has not been setup.
这也没有用,但我觉得解决方案就在这条路上。
附录 5
win_t
(显然与wchar_t
相同)转换代码来自CursesWide.c
,转换从wint_t
(此处视为wchar_t
) wget_wch()
转换成 Perl 字符串。 SV
是 "scalar value" 类型。
另请参阅:https://perldoc.perl.org/perlguts.html
这里插入两个fprintf
看看是怎么回事:
static void
c_wchar2sv(SV * const sv,
wchar_t const wc) {
/*----------------------------------------------------------------------------
Set SV to a one-character (not -byte!) Perl string holding a given wide
character
-----------------------------------------------------------------------------*/
if (wc <= 0xff) {
char s[] = { wc, 0 };
fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
sv_setpv(sv, s);
SvPOK_on(sv);
SvUTF8_off(sv);
} else {
char s[UTF8_MAXBYTES + 1] = { 0 };
char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
*s_end = 0;
fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
sv_setpv(sv, s);
SvPOK_on(sv);
SvUTF8_on(sv);
}
}
使用 perl-Curses 测试代码
- 尝试使用 perl-Curses-1.36-9.fc30。x86_64
- 尝试使用 perl-Curses-1.36-11.fc31。x86_64
如果您尝试,请按 BACKSPACE 退出循环,因为不再解释 CTRL-C。
下面代码很多,但是临界区用----- Testing
:
#!/usr/bin/perl
# pmap -p PID
# shows the per process using
# /usr/lib64/libncursesw.so.6.1
# /usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so
# Trying https://metacpan.org/release/Curses
use warnings;
use strict;
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
use Curses; # On Fedora: dnf install perl-Curses
# This didn't fix it
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
# https://perldoc.perl.org/perllocale.html#The-setlocale-function
use POSIX ();
my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");
# ---
# Surrounds the actual program
# ---
sub setup() {
initscr();
raw();
keypad(1);
noecho();
}
sub teardown {
endwin();
}
# ---
# Mainly for prettyprinting
# ---
my $special_keys = setup_special_keys();
# ---
# Error printing
# ---
sub mt {
return sprintf("%i: ",time());
}
sub ae {
my ($x,$fname) = @_;
if ($x == ERR) {
printw mt();
printw "Got error code from '$fname': $x\n"
}
}
# ---
# Where the action is
# ---
sub announce {
my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
ae($res, "printw");
return { refresh => 1 }
}
sub read1ch {
# Read a next character, waiting until it is there.
# Use the wide-character aware functions unless you want to deal with
# collating individual bytes yourself!
# Readings:
# https://metacpan.org/pod/Curses#Wide-Character-Aware-Functions
# https://perldoc.perl.org/perlunicode.html#Unicode-Character-Properties
# https://www.ahinea.com/en/tech/perl-unicode-struggle.html
# https://hexdump.wordpress.com/2009/06/19/character-encoding-issues-part-ii-perl/
my ($ch, $key) = getchar();
if (defined $key) {
# it's a function key
printw "Function key pressed: $key";
printw " with known alias '" . $$special_keys{$key} . "'" if (exists $$special_keys{$key});
printw "\n";
# done if backspace was hit
return { done => ($key == KEY_BACKSPACE()) }
}
elsif (defined $ch) {
# "$ch" should be a String of 1 character
# ----- Testing
printw "Locale: $loc\n";
printw "Multibyte output test: öüäéèà периоду\n";
printw sprintf("Received string '%s' of length %i with ordinal 0x%x\n", $ch, length($ch), ord($ch));
{
# https://perldoc.perl.org/bytes.html
use bytes;
printw sprintf("... length is %i\n" , length($ch));
printw sprintf("... contents are %vd\n" , $ch);
}
# ----- Testing
return { ch => $ch }
}
else {
# it's an error
printw "getchar() failed\n";
return {}
}
}
sub feedback {
my ($ch) = @_;
printw "The pressed key is: ";
attron(A_BOLD);
printw("%s\n","$ch"); # do not print $txt directly to make sure escape sequences are not interpreted!
attroff(A_BOLD);
return { refresh => 1 } # should refresh
}
sub do_curses_run {
setup;
my $done = 0;
while (!$done) {
my $bubl;
$bubl = announce();
refresh() if $$bubl{refresh};
$bubl = read1ch();
$done = $$bubl{done};
if (defined $$bubl{ch}) {
$bubl = feedback($$bubl{ch});
refresh() if $$bubl{refresh};
}
}
teardown;
}
# ---
# main
# ---
do_curses_run();
sub setup_special_keys {
# the key codes on the left must be called once to resolve to a numeric constant!
my $res = {
KEY_BREAK() => "Break key",
KEY_DOWN() => "Arrow down",
KEY_UP() => "Arrow up",
KEY_LEFT() => "Arrow left",
KEY_RIGHT() => "Arrow right",
KEY_HOME() => "Home key",
KEY_BACKSPACE() => "Backspace",
KEY_DL() => "Delete line",
KEY_IL() => "Insert line",
KEY_DC() => "Delete character",
KEY_IC() => "Insert char or enter insert mode",
KEY_EIC() => "Exit insert char mode",
KEY_CLEAR() => "Clear screen",
KEY_EOS() => "Clear to end of screen",
KEY_EOL() => "Clear to end of line",
KEY_SF() => "Scroll 1 line forward",
KEY_SR() => "Scroll 1 line backward (reverse)",
KEY_NPAGE() => "Next page",
KEY_PPAGE() => "Previous page",
KEY_STAB() => "Set tab",
KEY_CTAB() => "Clear tab",
KEY_CATAB() => "Clear all tabs",
KEY_ENTER() => "Enter or send",
KEY_SRESET() => "Soft (partial) reset",
KEY_RESET() => "Reset or hard reset",
KEY_PRINT() => "Print or copy",
KEY_LL() => "Home down or bottom (lower left)",
KEY_A1() => "Upper left of keypad",
KEY_A3() => "Upper right of keypad",
KEY_B2() => "Center of keypad",
KEY_C1() => "Lower left of keypad",
KEY_C3 () => "Lower right of keypad",
KEY_BTAB() => "Back tab key",
KEY_BEG() => "Beg(inning) key",
KEY_CANCEL() => "Cancel key",
KEY_CLOSE() => "Close key",
KEY_COMMAND() => "Cmd (command) key",
KEY_COPY() => "Copy key",
KEY_CREATE() => "Create key",
KEY_END() => "End key",
KEY_EXIT() => "Exit key",
KEY_FIND() => "Find key",
KEY_HELP() => "Help key",
KEY_MARK() => "Mark key",
KEY_MESSAGE() => "Message key",
KEY_MOUSE() => "Mouse event read",
KEY_MOVE() => "Move key",
KEY_NEXT() => "Next object key",
KEY_OPEN() => "Open key",
KEY_OPTIONS() => "Options key",
KEY_PREVIOUS() => "Previous object key",
KEY_REDO() => "Redo key",
KEY_REFERENCE() => "Ref(erence) key",
KEY_REFRESH() => "Refresh key",
KEY_REPLACE() => "Replace key",
KEY_RESIZE() => "Screen resized",
KEY_RESTART() => "Restart key",
KEY_RESUME() => "Resume key",
KEY_SAVE() => "Save key",
KEY_SBEG() => "Shifted beginning key",
KEY_SCANCEL() => "Shifted cancel key",
KEY_SCOMMAND() => "Shifted command key",
KEY_SCOPY() => "Shifted copy key",
KEY_SCREATE() => "Shifted create key",
KEY_SDC() => "Shifted delete char key",
KEY_SDL() => "Shifted delete line key",
KEY_SELECT() => "Select key",
KEY_SEND() => "Shifted end key",
KEY_SEOL() => "Shifted clear line key",
KEY_SEXIT() => "Shifted exit key",
KEY_SFIND() => "Shifted find key",
KEY_SHELP() => "Shifted help key",
KEY_SHOME() => "Shifted home key",
KEY_SIC() => "Shifted input key",
KEY_SLEFT() => "Shifted left arrow key",
KEY_SMESSAGE() => "Shifted message key",
KEY_SMOVE() => "Shifted move key",
KEY_SNEXT() => "Shifted next key",
KEY_SOPTIONS() => "Shifted options key",
KEY_SPREVIOUS() => "Shifted prev key",
KEY_SPRINT() => "Shifted print key",
KEY_SREDO() => "Shifted redo key",
KEY_SREPLACE() => "Shifted replace key",
KEY_SRIGHT() => "Shifted right arrow",
KEY_SRSUME() => "Shifted resume key",
KEY_SSAVE() => "Shifted save key",
KEY_SSUSPEND() => "Shifted suspend key",
KEY_SUNDO() => "Shifted undo key",
KEY_SUSPEND() => "Suspend key",
KEY_UNDO() => "Undo key"
};
for (my $f = 1; $f <= 64; $f++) {
$$res{KEY_F($f)} = "KEY_F($f)"
}
return $res
}
实际上它看起来是正确的。
运行 你的带有 strace 的脚本可以提供帮助...我这样做是为了查看系统调用:
strace -fo strace.out -s 1024 ./foo
并且可以看到读取、消息等。可以使用调试库为 ncurses 获取类似的跟踪,尽管打包者在提供启用跟踪的方面并不一致。
ü
在UTF-8中是34
(八进制),它的Unicode值为252
(十进制),或 0xfc
(十六进制)。这部分问题好像漏了一点:
That's just a fat NOPE, when pressing ü one sees:
Obtained win_t 0x00fc
So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.
wget_wch
returns(出于实用目的)一个 Unicode 值(不是 UTF-8 字节序列)。 ISO-8859-1 代码 160-255 碰巧(并非巧合)匹配 Unicode 代码点,尽管后者在 UTF-8 中肯定 encoded 不同。
wgetch
会 return UTF-8 字节,但 Perl 脚本只会将其用作后备(因为这会导致 Perl 脚本将 UTF-8 字符串转换为 Unicode值)。
Thomas Dickey 正确地注意到 收到了正确的数据。
我花了一些时间才真正确定。
混淆是因为 Perl 的 sprintf
无法处理 UTF-8 而 Perl Curses printw
无法处理区域 0x80
到 0x7F
.
这需要更长的时间才能确定。
其实我新开了一个关于这个的问题:
Are there one (or two) solid bugs in the `curses` shim for Perl?
[ 此答案假定 libncursesw 可用且正在使用。在没有宽字符支持的情况下尝试输出 "wide characters" 是没有意义的:) ]
简答
getchar
工作正常。它 returns 一串 Unicode 代码点(又名解码文本),非常理想。
printw
已损坏,但可以通过在程序中添加以下内容使其接受一串 Unicode 代码点(也称为解码文本):
{
# Add wide character support to printw.
# This only modifies the current package (main),
# so it won't affect any code by ours.
no warnings qw( redefine );
sub printw { addstring(sprintf shift, @_) }
}
getchar
有问题吗?
所以您认为 getchar
有问题。让我们尝试通过检查 getchar
returns 来确认这一点。我们将通过添加以下内容来做到这一点:
printw("String received from getchar: %vX\n", $ch);
(%vX
将以十六进制打印字符串的每个字符的值,由句点连接。)
当按下
e
(U+0065),一个7位的字符,可以看到:String received from getchar: 65
当按
é
(U+00E9)时,一个8位字符,可以看到:String received from getchar: E9
当按
ē
(U+0113)时,一个9位字符,可以看到:String received from getchar: 113
在这三种情况下,我们得到的字符串正好是一个字符长,并且该字符由输入的 Unicode 代码点组成。[1] 这正是我们想要。应用和删除字符编码应该在外围完成,这样程序的主要逻辑就不必担心编码,并且正在这样做。
结论:getchar
没有问题。
printw
有问题吗?
所以问题一定出在输出上。为了确认这一点,我在您的程序中添加了以下内容:
sub _d { utf8::downgrade( my $s = shift ); $s }
sub _u { utf8::upgrade( my $s = shift ); $s }
for (
[ "7-bit, UTF8=0" => _d(chr(0x65)) ], # Expect e
[ "7-bit, UTF8=1" => _u(chr(0x65)) ], # Expect e
[ "8-bit, UTF8=0" => _d(chr(0xE9)) ], # Expect é
[ "8-bit, UTF8=1" => _u(chr(0xE9)) ], # Expect é
[ "9-bit, UTF8=1" => chr(0x113) ], # Expect ē
) {
my ($name, $chr) = @$_;
printw("%s: %s\n", $name, $chr);
}
输出:
7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0:
8-bit, UTF8=1: é
9-bit, UTF8=1: S
从上面我们观察到:
- 我们看到
_d(chr(0xE9))
和_u(chr(0xE9))
的结果之间存在差异,即使两个标量包含相同的字符串(_d(chr(0xE9)) eq _u(chr(0xE9))
为真)。因此,该函数存在 The Unicode Bug。 - 根据 8 位测试,它似乎接受 Unicode 代码点(解码文本)而不是 UTF-8。这是理想的。
- 根据 9 位测试,它似乎不接受 Unicode 代码点。随后的测试表明它也不接受
chr(0x113)
的 UTF-8 编码。
结论:printw
存在重大问题。
解决 printw
解决 Unicode Bug 很容易,但缺乏对 0xFF 以上字符的支持是一个阻碍。让我们深入研究代码。
好的,我们不必寻找问题。我们看到 printw
是根据 addstr
定义的,而 addstr
早于宽字符支持。 addstring
是支持宽字符的对应项,所以让我们让 printw
使用 addstring
而不是 addstr
。
{
# Add wide character support to printw.
# This only modifies the current package (main),
# so it won't affect any code by ours.
no warnings qw( redefine );
sub printw { addstring(sprintf shift, @_) }
}
输出:
7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0: é
8-bit, UTF8=1: é
9-bit, UTF8=1: ē
宾果!
从上面我们观察到:
- 我们发现
UTF8=0
测试的结果与其对应的UTF8=1
测试的结果没有差异。因此,此函数不会受到 Unicode Bug 的影响。 - 它始终接受 Unicode 代码点(解码文本)字符串。值得注意的是,它不需要 UTF-8 或语言环境的编码。
这正是我们expect/desire。
- 具体来说,
getchar
没有像您认为的那样返回输入的 iso-8859-1 编码。这种混淆是可以理解的,因为 Unicode 是 iso-8859-1 的扩展。