Create UTF-16LE with BOM and CRLF line separator on Windows
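I need to generate some UTF-16LE encoded files with CRLF line separators on a Windows 7 box (currently using Strawberry Perl 5.20.1).
It took me a long time of fiddling to get correct output, and I wonder whether my solution is the right way to do it, because it seems overcomplicated in Perl compared with other languages. In particular:
- Why does Perl produce valid big-endian UTF-16 with a correct BOM for encoding(UTF-16), while there is no BOM if I use either UTF-16LE or UTF-16BE without the additional package File::BOM?
- Why does the out-of-the-box CRLF handling seem buggy (a line ending is output as 0D 0A 00 instead of 0D 00 0A 00) unless I add some filters? I doubt this is a real bug in a language with so many users...
Here are my attempts, with comments. What I consider correct is the last statement.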
use strict;
use warnings;
use utf8;
use File::BOM;
use feature 'say';
my $UTF;
my $data = "Hello, héhé, 中文.\nsecond line : my 2€"; # 中文 = zhong wen = chinese
# UTF16 BE + BOM but incorrect CRLF: "0D 0A 00" instead of "0D 00 0A 00"
open $UTF, ">:encoding(UTF-16)", "utf-16-std-be.txt" or die $!;
say $UTF $data;
close $UTF;
# same as UTF-16BE (no BOM, incorrect CRLF)
open $UTF, ">:encoding(ucs2)", "utf-ucs2.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 BE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16BE)", "utf-16-be-nobom.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16LE)", "utf-16-le-nobom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE, BOM OK but still incorrect CRLF
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE non raw incorrect
# (crlf by default on windows) -> 0A => 0D 0A
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf2.txt" or die $!;
print $UTF $data, "\x0a"; # 0A is magically expanded to 0D 0A but wrong
close $UTF;
# UTF16 LE + BOM + LF
# raw -> 0A => 0A
# could be correct on UNIX but I need CRLF
open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf3.txt" or die $!;
say $UTF $data;
close $UTF;
# manual BOM, but CRLF OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf", "utf-16-le-bommanual-crlfok.txt" or die $!;
print $UTF "\x{FEFF}";
say $UTF $data;
close $UTF;
#auto BOM, CRLF OK ?
#incorrect, says utf8 "\xA9" does not map to Unicode at c:/perl/Dwimperl-5.14/perl/lib/Encode.pm line 176.
# But I cannot see where the A9 comes from ??!
#~ open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM):crlf", "utf-16-le-autobom-crlfok1.txt" or die $!;
#~ print $UTF $data;
#~ say $UTF $data;
#~ close $UTF;
# WTF? \n becomes 0D 00 0D 0A 00
open $UTF, ">:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlf2.txt" or die $!;
say $UTF $data;
close $UTF;
#CORRECT WAY?? : Automatic BOM, CRLF is OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlfok3.txt" or die $!;
say $UTF $data;
close $UTF;
> manual BOM, but CRLF OK
Yes, the following is indeed correct:
:raw:encoding(UTF-16LE):crlf + manual BOM
- :raw "clears" the existing :crlf and :encoding layers.
- :encoding converts between bytes and code points.
- :crlf converts between CRLF and LF.
So,
                          Read
  ===================================================>
                                 Code                 Code
  +------+   bytes   +------+   Points   +-------+   Points   +------+
  | File |-----------| :enc |------------| :crlf |------------| Code |
  +------+           +------+    CRLF    +-------+     LF     +------+
  <===================================================
                          Write
You want to perform the CRLF⇔LF conversion on the code points (not on the bytes), as this setup does.
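To make the effect of layer order concrete, here is a minimal sketch that writes the same two characters through both stacks and dumps the resulting bytes. The helper name hex_written_with and the use of File::Temp are my own choices for illustration, not something from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text through the given layer stack and return the file's bytes as hex.
# (hex_written_with is a hypothetical helper, named for this example only.)
sub hex_written_with {
    my ($layers, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layers", $path or die "Cannot open $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # Read the file back without any translation to see the raw bytes.
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# :crlf below :encoding sees encoded bytes, so it only patches the lone 0A byte.
my $wrong = hex_written_with(':raw:crlf:encoding(UTF-16LE)', "a\n");

# :crlf above :encoding sees code points, so CR becomes a full 16-bit code unit.
my $right = hex_written_with(':raw:encoding(UTF-16LE):crlf', "a\n");

print "crlf under encoding: $wrong\n";   # 61 00 0D 0A 00
print "crlf over encoding:  $right\n";   # 61 00 0D 00 0A 00
```

The first stack mimics what a default Windows handle looks like once :encoding is added to it, which is why the question's early attempts produced 0D 0A 00.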
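> CORRECT WAY?? : Automatic BOM, CRLF is OK
While :raw:encoding(UTF-16LE):crlf:via(File::BOM) might work for write handles, it doesn't look right (I would expect :raw:via(File::BOM,UTF-16LE):crlf), and it fails miserably for read handles (at least for me with Perl 5.16.3).
I just had a look, and the code behind :via(File::BOM) does some really questionable things. I wouldn't use it.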
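If you go the manual-BOM route instead, it fits in a small helper. This is only a sketch: write_utf16le_crlf is a name I made up, and the sample text uses \x{...} escapes so that the script does not depend on how its own source file is encoded.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-16LE with a BOM and CRLF line endings.
sub write_utf16le_crlf {
    my ($path, $text) = @_;
    open my $fh, '>:raw:encoding(UTF-16LE):crlf', $path
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";   # the BOM is just U+FEFF, encoded like any other character
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";
    return;
}

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# "h\x{E9}h\x{E9}" is "héhé" and "\x{20AC}" is the euro sign.
write_utf16le_crlf($path, "h\x{E9}h\x{E9}\n2\x{20AC}\n");

# Read the raw bytes back to inspect what actually landed on disk.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";
# FF FE 68 00 E9 00 68 00 E9 00 0D 00 0A 00 32 00 AC 20 0D 00 0A 00
```

The dump starts with FF FE (the little-endian BOM) and every line ends in 0D 00 0A 00, which is what the question was after.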
> why Perl is making a valid UTF-16 big-endian with correct BOM with encoding(UTF-16) while there is no BOM if I use either UTF-16LE or UTF-16BE without using an additional package File::BOM
Because you probably don't want a BOM.
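You can see the difference without touching a file by calling Encode directly. As far as I know this follows the documented behaviour of Encode::Unicode: plain UTF-16 has to signal its byte order, so a BOM is prepended, whereas UTF-16LE and UTF-16BE already name the byte order and get none. The hex_of helper below is just something I wrote for the dump.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_of { join ' ', map { sprintf '%02X', ord } split //, shift }

my $utf16   = hex_of(encode('UTF-16',   'A'));          # BOM + big-endian
my $utf16le = hex_of(encode('UTF-16LE', 'A'));          # no BOM
my $utf16be = hex_of(encode('UTF-16BE', 'A'));          # no BOM
my $le_bom  = hex_of(encode('UTF-16LE', "\x{FEFF}A"));  # BOM supplied by hand

print "UTF-16:         $utf16\n";     # FE FF 00 41
print "UTF-16LE:       $utf16le\n";   # 41 00
print "UTF-16BE:       $utf16be\n";   # 00 41
print "UTF-16LE + BOM: $le_bom\n";    # FF FE 41 00
```

So if a consumer insists on a BOM in a little-endian file, you write U+FEFF yourself as the first character.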
> why out-of-the-box the CRLF handling seems buggy
Adding layers adds them to the end of the list. If you want to add a layer elsewhere (as is the case here), you need to rebuild the list.
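A rough way to watch this happen is PerlIO::get_layers. In the sketch below I open a handle with an explicit :crlf to imitate a default Windows handle, add an encoding with binmode, and then name the whole stack again. I have only reasoned this through for a handle I opened myself; the exact layer names printed vary by platform, and I make no claim about STDIN/STDOUT/STDERR.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Start from a handle that already has :crlf, as a default Windows handle does.
open my $fh, '>:raw:crlf', $path or die "Cannot open $path: $!";

# binmode pushes the new layer on top of the stack, so it lands after :crlf.
binmode $fh, ':encoding(UTF-16LE)' or die "binmode failed: $!";
my @appended = PerlIO::get_layers($fh);

# Starting the list with :raw and naming every layer again rebuilds the stack.
binmode $fh, ':raw:encoding(UTF-16LE):crlf' or die "binmode failed: $!";
my @rebuilt = PerlIO::get_layers($fh);

close $fh or die "Cannot close $path: $!";

print "appended: @appended\n";   # crlf appears before the encoding layer
print "rebuilt:  @rebuilt\n";    # the encoding layer appears before crlf
```

In the first list :crlf sits below the encoding, which is the arrangement that writes 0D 0A 00. In the second it sits above, so it works on code points.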
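It has been suggested on Perl's development list that there should be a way to distinguish byte layers (e.g. :unix) from text layers (e.g. :crlf), and that adding a byte or encoding layer should dig down and place it at the appropriate position. But no one has acted on that yet.
Besides simplifying your code, that would also allow a UTF-16* encoding layer [1] to be added to STDIN/STDOUT/STDERR (or another existing handle). I believe this is currently impossible.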
[1] Technically, any encoding where CR != 13 or LF != 10 has this problem, so EBCDIC would be affected too.