Remove unicode characters from Cheerio.js content

Question

I am using cheeriojs to scrape content off a web page that contains the following HTML.

  <p>
     Although the PM's office could neither confirm nor deny this, the spokesperson, John Doe said the meeting took place on Sunday.
  <br>
  <br>
    “The outcome will be made public in due course,” John said in an SMS yesterday.
  <br>
  <br>
 </p>

I can locate the content of interest through its class and id attributes, as follows:

$('.top-stories .line.more').each(function (i, el) {
    // Do something…

    let content = $(this).next().html();
});

Once I have captured the content of interest, I "clean" it with a regular expression, as follows:

let cleanedContent = content.split(/<br>/).join(' \n ');

This inserts a newline wherever the empty tag (<br>) is matched. So far so good, until I looked at the cleaned content below:
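As an aside on the clean-up step: splitting on the literal string <br> would miss other spellings of the tag such as <br/> or <BR />. The following is only a sketch of a more tolerant variant; the helper name brToNewlines is hypothetical, and it additionally trims each piece and drops empty ones, which the original one-liner does not do.

```javascript
// Split on any common spelling of a line-break tag: <br>, <br/>, <br />, <BR>.
// Hypothetical helper, not part of cheerio.
function brToNewlines(html) {
  return html
    .split(/<br\s*\/?>/i)               // case-insensitive, optional slash
    .map((part) => part.trim())         // drop indentation around each piece
    .filter((part) => part.length > 0)  // consecutive <br> tags leave empty pieces
    .join('\n');
}

const sample = '\n  First sentence.\n<br>\n<br />\n  Second sentence.\n<BR/>\n';
console.log(brToNewlines(sample));
```

This only changes how the line breaks are handled; it does not affect the encoded characters described next.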

Although the PM&apos;s office could neither confirm nor deny this, the spokesperson, Saima Shaanika said the meeting took place on Friday. 

&#x201C;The outcome will be made public in due course,&#x201D;

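It appears that punctuation marks, and perhaps some other characters, are stored as their unicode codes. I may be wrong about this, and would welcome any correction to this line of thinking.

Assuming they are stored as unicode codes, is there a module I can pass the cleanedContent variable to that will convert the codes into human-readable punctuation marks/characters?

If that is not possible, is there a better way of using cheeriojs that would avoid the problem altogether? I fully accept that I may not be using cheeriojs correctly, and would appreciate some direction on other approaches I could try.

One approach I can think of is to write a module containing a number of unicode codes and their corresponding characters, look for matches, and replace each matched code with the corresponding human-readable character. I have a hunch that someone has already done this or something similar, and I would rather not reinvent the wheel.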
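For illustration only, the lookup idea might be sketched as below. It assumes that only numeric character references (such as &#x201C; or &#8220;) and a handful of named ones need handling. The function decodeBasicEntities and the NAMED_ENTITIES table are hypothetical names, and the table is deliberately incomplete.

```javascript
// Hypothetical lookup table: only a few named entities are covered here.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0' };

// Replace numeric and known named character references in a single pass.
function decodeBasicEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Values above U+10FFFF are not valid code points; leave them untouched.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown names are left as they were rather than guessed at.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

console.log(decodeBasicEntities('the PM&apos;s office'));
console.log(decodeBasicEntities('&#x201C;The outcome will be made public in due course,&#x201D;'));
```

A hand-rolled table like this will not cover the full set of named entities; existing npm packages such as he or entities already handle that, so this sketch mainly shows what such a module does.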

Thanks in advance.

Answer 1

Cheerio uses htmlparser2 internally.

Therefore, you can use htmlparser2's decodeEntities option when loading the HTML string; it lets you configure how HTML entities should be treated.

Example:

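const cheerio = require('cheerio');
// With decodeEntities off, .html() should return characters as written in the source.
const $ = cheerio.load('<ul id="fruits">...</ul>', {
    decodeEntities: false
});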
