perl: utf8 <something> does not map to Unicode while <something> doesn't seem to be present

Question

I'm using MARC::Lint to lint some MARC records, but every now and then (on about 1% of the files) I get this error:

utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.

The problem is that I've tried different methods but cannot find "\xCA" in the file...

My script is:

#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;

use open OUT => ':utf8';

my $lint = MARC::Lint->new;
my $filename = shift;

my $file = MARC::File::USMARC->in( $filename );
while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );
    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while

The file can be downloaded here: http://eroux.fr/I14376.mrc

Is the "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?

Answer 1

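The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.

The problem is a bad data file.

The file contains a "directory" that records where each field is located in the file. The following is a human-readable rendition of the directory for the file you provided: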

tagno|offset|len   # Offsets are from the start of the data portion.
001|00000|0017     # Lengths include the single-byte field terminator.
006|00017|0019     # Offsets and lengths are in bytes.
007|00036|0015
008|00051|0041
035|00092|0021
035|00113|0021
040|00134|0018
050|00152|0022
066|00174|0009
245|00183|0101
246|00284|0135
264|00419|0086
300|00505|0034
336|00539|0026
337|00565|0026
338|00591|0036
546|00627|0016
500|00643|0112
505|00755|9999  <--
506|29349|0051
520|29400|0087
533|29487|0115
542|29602|0070
588|29672|0070
653|29742|0013
710|29755|0038
720|29793|0130
776|29923|0066
856|29989|0061
880|30050|0181
880|30231|0262
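
For reference, a listing like the one above can be produced with a few lines of Perl. This is only a minimal sketch, assuming the standard MARC 21 layout (a 24-byte leader with the base address of data at leader positions 12-16, followed by 12-byte directory entries). The `parse_directory` function and the tiny two-field record are made up for illustration; they are not part of MARC::Record.

```perl
use strict;
use warnings;

# Return the directory entries of a raw MARC 21 record.
# Each entry is 12 bytes: 3-character tag, 4-digit length, 5-digit offset.
sub parse_directory {
    my ($raw) = @_;
    my $base = substr($raw, 12, 5) + 0;        # leader 12-16: base address of data
    my $dir  = substr($raw, 24, $base - 25);   # between leader and directory terminator
    my @entries;
    while ($dir =~ /\G(.{3})(\d{4})(\d{5})/gs) {
        push @entries, { tag => $1, len => $2 + 0, offset => $3 + 0 };
    }
    return @entries;
}

# Made-up two-field record, just to exercise the parser.
my @fields = ( [ '001', "abc\x1E" ], [ '245', "10\x1Fahello\x1E" ] );
my ($dir_bytes, $data_bytes, $next_offset) = ('', '', 0);
for my $f (@fields) {
    my ($tag, $body) = @$f;
    $dir_bytes   .= sprintf('%s%04d%05d', $tag, length($body), $next_offset);
    $data_bytes  .= $body;
    $next_offset += length($body);
}
$dir_bytes  .= "\x1E";   # end of directory
$data_bytes .= "\x1D";   # end of record
my $base_addr = 24 + length($dir_bytes);
my $leader    = sprintf('%05dnam a22%05d   4500',
                        $base_addr + length($data_bytes), $base_addr);
my $sample    = $leader . $dir_bytes . $data_bytes;

# Print in the same tagno|offset|len form as the listing.
printf "%s|%05d|%04d\n", @{$_}{qw(tag offset len)} for parse_directory($sample);
```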

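Note the length of the field tagged 505: 9999. This is the maximum value supported (because the length is stored as four decimal digits). The problem is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes in size.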
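
The 28,594 figure can be checked from the directory itself: the 505 field starts at offset 755 and the next field (506) starts at 29349, so 29349 - 755 = 28594. Here is a small sketch of that check, assuming the fields are stored back to back in directory order (which the offsets in the listing suggest for this file). The helper name `actual_lengths` is made up for illustration.

```perl
use strict;
use warnings;

# Directory rows copied from the listing: [tag, offset, declared length].
my @rows = (
    [ '500',   643,  112 ],
    [ '505',   755, 9999 ],
    [ '506', 29349,   51 ],
);

# Assumption: fields are contiguous, so the space a field really occupies
# is the distance to the next field's offset.
sub actual_lengths {
    my @r = @_;
    my @out;
    for my $i (0 .. $#r - 1) {
        my ($tag, $offset, $declared) = @{ $r[$i] };
        my $actual = $r[ $i + 1 ][1] - $offset;
        push @out, { tag => $tag, declared => $declared, actual => $actual };
    }
    return @out;
}

for my $f (actual_lengths(@rows)) {
    my $flag = $f->{actual} == $f->{declared} ? '' : '  <-- mismatch';
    printf "%s declared=%d actual=%d%s\n",
        @{$f}{qw(tag declared actual)}, $flag;
}
```

Only the 505 row is flagged; the 500 row's declared length (112) matches the gap to the next offset.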

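What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BA, the UTF-8 encoding of ʺ, U+02BA.) Later, when the module attempts to decode that text, an error is thrown. (CA must be followed by another byte to be valid.)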
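
The effect is easy to reproduce in isolation with the core Encode module. The following is a sketch, not what MARC::Record does internally: it decodes a complete CA BA sequence, then the same bytes cut one byte short. The `try_decode` wrapper is a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "a" followed by U+02BA, as UTF-8 bytes: 61 CA BA.
my $bytes = "a\xCA\xBA";

# Cutting one byte short leaves a dangling lead byte (CA).
my $cut = substr($bytes, 0, 2);

# Strict decoding: returns the decoded string, or undef plus the error text.
sub try_decode {
    my ($octets) = @_;
    my $copy = $octets;   # with a CHECK value set, decode may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($text, $@);
}

my ($ok_text)        = try_decode($bytes);
my ($bad_text, $err) = try_decode($cut);

print defined $ok_text  ? "full sequence decodes\n" : "unexpected failure\n";
print defined $bad_text ? "unexpected success\n"    : "cut sequence fails: $err";
```

This also explains why searching the file for a lone "\xCA" finds nothing out of place: the byte is a valid part of a two-byte character in the file, and only becomes invalid once the field is truncated.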

Did you generate these records? If so, you need to make sure that no field requires more than 9,999 bytes.

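That said, the module should handle this better. When it doesn't find an end-of-field marker where it expects one, it could read until it finds one instead of relying on the length, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism for reporting such problems ($marc->warnings).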
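
To make that suggestion concrete, here is a rough sketch of both ideas. It is not MARC::Record's code; `extract_field` is a hypothetical helper. It falls back to scanning for the end-of-field character (0x1E) when the declared length does not land on one, and it decodes with Encode's default CHECK value, which substitutes U+FFFD for malformed bytes instead of dying. It assumes 0x1E never occurs inside field data.

```perl
use strict;
use warnings;
use Encode qw(decode);

use constant END_OF_FIELD => "\x1E";

# Hypothetical helper: extract one field from the data portion of a record.
# Returns the decoded text and a reference to a list of warnings.
sub extract_field {
    my ($data, $offset, $declared_len) = @_;
    my @warnings;
    my $field = substr($data, $offset, $declared_len);

    if (substr($field, -1) ne END_OF_FIELD) {
        # The declared length is wrong; scan forward to the real terminator.
        my $end = index($data, END_OF_FIELD, $offset);
        if ($end >= 0) {
            $field = substr($data, $offset, $end - $offset + 1);
            push @warnings,
                "declared length $declared_len, actual " . length($field);
        }
        else {
            push @warnings, "no end-of-field character found";
        }
    }

    # FB_DEFAULT replaces malformed bytes with U+FFFD rather than croaking.
    my $text = decode('UTF-8', $field, Encode::FB_DEFAULT);
    return ($text, \@warnings);
}

# Three made-up fields; the second one really occupies 5 bytes at offset 3.
my $data = "ab\x1E" . "x\xCA\xBAy\x1E" . "z\x1E";

# Declare only 2 bytes, which would cut the CA BA sequence in half.
my ($text, $warnings) = extract_field($data, 3, 2);
print "recovered ", length($text), " characters\n";
print "warning: $_\n" for @$warnings;
```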

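In fact, if it hadn't died (say, if the cut had happened to fall between characters rather than in the middle of one), $marc->warnings would have returned the following message: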

field does not end in end of field character in tag 505 in record 1
