How to skip characters in a SAX parser in Perl

Question

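I'm having trouble reading special characters in Perl. I have the following XML file and I'm using a SAX parser that loops over each hotel and fetches its values, but when it reads HotelInfo the text is skipped because of the characters in "1000 mï¿½".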

<?xml version="1.0" encoding="UTF-8"?>
<XMLResponse>
    <ResponseType>HotelListResponse</ResponseType>
    <RequestInfo>
        <AffiliateCode>NI9373</AffiliateCode>
        <AffRequestId>2</AffRequestId>
        <AffRequestTime>2015-10-29T15:52:05</AffRequestTime>
    </RequestInfo>
    <TotalNumber>264234</TotalNumber>
    <Hotels>
        <Hotel>
            <HotelCode>AD0BFU</HotelCode>
            <OldHotelId>0</OldHotelId>
            <HotelLocation/>
            <HotelInfo>Renovated in 2001, Hotel Bringue features a 1000 mï¿½ garden and comprises 5 floors with 105 double rooms, 2 suites and 7 single rooms. Hotel Bringue is situated in the picturesque village El Serrat, boasting the most amazing mountain views in the region and just a short drive to the main ski resort of Vallnord.After an exhausting day, you can go for a relaxing swim in the pool, re-energise your body in the jacuzzi or pamper yourself in the sauna. The rooms are beautifully appointed and come with an array of modern amenities for a pleasant stay.</HotelInfo>
            <HotelTheme>Ski Hotels</HotelTheme>
        </Hotel>
    </Hotels>
</XMLResponse>

How can I skip characters like these in the SAX parser?

Answer 1

If you just want to fix the file, I'm not sure why an XML parser is even needed here.

perl -i~ -pe's/\xC3\xAF\xC2\xBF\xC2\xBD//g' file.xml
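In the one-liner above, -i~ edits the file in place and keeps a backup as file.xml~, and the pattern matches the raw bytes C3 AF C2 BF C2 BD, which are the UTF-8 encoding of the three characters "ï¿½". The same substitution can be sketched inside a script; the helper name strip_mojibake is made up for illustration, and the sketch assumes the data is held as raw bytes (no decoding layer applied):

```perl
use strict;
use warnings;

# Hypothetical helper: remove the byte sequence C3 AF C2 BF C2 BD
# (the UTF-8 encoding of "ï¿½") from a string of raw bytes.
sub strip_mojibake {
    my ($bytes) = @_;
    $bytes =~ s/\xC3\xAF\xC2\xBF\xC2\xBD//g;
    return $bytes;
}

my $line = "features a 1000 m\xC3\xAF\xC2\xBF\xC2\xBD garden";
print strip_mojibake($line), "\n";    # prints "features a 1000 m garden"
```

Note that this only removes this one exact sequence; any other damaged characters in the file would be left untouched.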

Answer 2

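How do you define "special characters"? One definition might be: non-ASCII characters. ASCII characters are in the range 0x00 - 0x7f (although not all of them are valid in XML). So you could discard every character that is not in that range, for example: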

$data =~ s/[^\x00-\x7f]//g;

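But that might discard a lot of perfectly good data. All accented characters would be discarded (e.g. the "ü" in "Zürich", leaving "Zrich"). Currency symbols such as €, £ or ¥ (or even ¢) would be lost. You would also lose otherwise harmless characters such as –, —, “, ” or •, as well as invisible characters like the non-breaking space.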
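To make that data loss concrete, here is a small sketch that applies the same regex to a few sample strings. The helper name strip_non_ascii and the sample strings are only for illustration, and the sketch assumes the text has already been decoded into Perl character strings:

```perl
use strict;
use warnings;
use utf8;                                # the string literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');     # print characters as UTF-8

# Hypothetical helper wrapping the regex from the answer.
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7f]//g;
    return $text;
}

for my $sample ('Zürich', 'Price: €20', '1000 m² garden') {
    printf "%s -> %s\n", $sample, strip_non_ascii($sample);
}
# Zürich -> Zrich
# Price: €20 -> Price: 20
# 1000 m² garden -> 1000 m garden
```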

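So the question becomes: why do you want to discard these characters? At what point do they become a problem? I notice you tagged the question 'mysql': did you run into a problem when trying to insert the data into a database? Have you declared the database's encoding correctly? Have you enabled mysql_enable_utf8 on the database connection? Perhaps you could do the insert in an eval block and apply the regex above only if the insert fails.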
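The eval-and-retry idea might look roughly like the sketch below. To keep it runnable without a database, the insert is passed in as a code ref; in real code that code ref would presumably wrap something like $sth->execute on a DBI handle opened with RaiseError => 1 and mysql_enable_utf8 => 1. The names insert_with_fallback and $fake_insert are hypothetical, and the stand-in insert simply rejects non-ASCII input to simulate a failure:

```perl
use strict;
use warnings;

# $insert is a hypothetical code ref that performs the INSERT and dies on
# failure, e.g. sub { $sth->execute(@_) } with RaiseError enabled.
sub insert_with_fallback {
    my ($insert, $text) = @_;

    # First attempt: insert the text unchanged.
    return 'as-is' if eval { $insert->($text); 1 };

    # The insert failed: retry once with non-ASCII characters removed.
    (my $ascii = $text) =~ s/[^\x00-\x7f]//g;
    $insert->($ascii);    # dies again if the failure had some other cause
    return 'stripped';
}

# Stand-in for a database insert (an assumption for this demo only).
my @stored;
my $fake_insert = sub {
    my ($value) = @_;
    die "Incorrect string value\n" if $value =~ /[^\x00-\x7f]/;
    push @stored, $value;
};

print insert_with_fallback($fake_insert, "1000 m\x{B2} garden"), "\n";    # prints "stripped"
print "$stored[0]\n";                                                     # prints "1000 m garden"
```

Stripping only after a failure means rows that the database accepts keep their accented characters and symbols.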

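Another option might be to pass the data through Encoding::FixLatin. That should make the string safe to insert into a UTF-8 database, even if the resulting characters are not exactly what was originally intended.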
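A minimal sketch of that approach, assuming the Encoding::FixLatin module is installed from CPAN (it is not part of core Perl). The sample byte strings are made up for illustration: one contains a Latin-1 byte, the other the same word already in UTF-8.

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

binmode(STDOUT, ':encoding(UTF-8)');

# Two byte strings for the same word: one Latin-1, one already UTF-8.
my $latin1_bytes = "caf\xE9";
my $utf8_bytes   = "caf\xC3\xA9";

# fix_latin accepts mixed ASCII / Latin-1 / UTF-8 input and returns a
# decoded character string.
for my $bytes ($latin1_bytes, $utf8_bytes) {
    my $text = fix_latin($bytes);
    print length($text), " characters: $text\n";
}
```

Sequences that are already valid UTF-8, such as the "ï¿½" bytes in the question's file, would be expected to pass through unchanged, so this helps with mixed encodings rather than with characters that were already damaged.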

Incidentally, I'd guess that in the specific instance above, the data originally said:

Hotel Bringue features a 1000 m² garden

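The SUPERSCRIPT TWO character is Unicode U+00B2, which is encoded in UTF-8 as two bytes: C2 B2. Some process may have read those bytes but decoded them as Latin-1 rather than UTF-8, so that each byte became a character. This kind of double encoding can happen repeatedly when the encoding of the data is declared incorrectly or when people don't understand how to work with Unicode characters, turning one character into many garbage characters.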
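The double-encoding mechanism can be reproduced with the core Encode module. This is only a sketch; the helper names double_encode and hex_bytes are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode a character as UTF-8, misread those bytes as Latin-1,
# then encode the resulting characters as UTF-8 again.
sub double_encode {
    my ($char) = @_;
    my $utf8_bytes = encode('UTF-8', $char);
    my $misread    = decode('ISO-8859-1', $utf8_bytes);    # one character per byte
    return encode('UTF-8', $misread);
}

# Format a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

print hex_bytes(double_encode("\x{B2}")),   "\n";    # C3 82 C2 B2
print hex_bytes(double_encode("\x{FFFD}")), "\n";    # C3 AF C2 BF C2 BD
```

The second line matches the byte sequence removed in Answer 1. That suggests the file may actually contain a double-encoded U+FFFD REPLACEMENT CHARACTER, in which case the original "²" was probably already lost at an earlier step and cannot be recovered from these bytes alone.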

Tags: xml, perl, special-characters, saxparser