How can I scrape a page that literally contains "\x2d", but save that character as "-" in my item?

Question

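I need to scrape some text from within a script on a page and save that text in a Scrapy item, presumably as a UTF-8 string. However, the literal text I'm scraping has special characters written as what I assume are hex escape sequences. For example, "-" is written as "\x2d". How can I scrape characters represented as "\x2d" and save them as "-" in my Scrapy item?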

Excerpt of the scraped page's contents:

<script type="text/javascript">

[approx 100 various lines of script, omitted]

"author": "Kurt\x20Vonnegut",
"internetPrice": "799",
"inventoryType": "new",
"title": "Slaughterhouse\x2DFive",
"publishedYear": "1999",

[approx 50 additional various lines of script, removed]

</script>

My Scrapy script looks like this:

pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
item['title'] = title_raw[0]

For this item, Scrapy's output returns:

'author': u'Kurt\x20Vonnegut', 'title': u'Slaughterhouse\x2DFive'

Ideally, I would like:

'author': 'Kurt Vonnegut', 'title': 'Slaughterhouse-Five'

Things I've tried that did not change the output:

Changing the last line to: item['title'] = title_raw[0].decode('utf-8')
Changing the last line to: item['title'] = title_raw[0].encode('latin1').decode('utf-8')

Finally, in case it needs to be stated explicitly: I have no control over how this information is presented on the site I'm scraping.

Answer 1

You can use urllib's unquote function.

In Python 3.x:

from urllib.parse import unquote
unquote("Kurt\x20Vonnegut")

In Python 2.7:

from urllib import unquote
unquote("Kurt\x20Vonnegut")
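
One caveat worth making explicit: unquote decodes percent-escapes such as %20. In the snippets above, Python itself turns "\x20" into a space because it sits in an ordinary string literal, whereas text scraped from the page holds the backslash as a literal character. A minimal sketch for Python 3 that rewrites the escapes into percent form first, assuming the scraped text contains no literal "%" that must be preserved (decode_via_unquote is a hypothetical helper name):

```python
from urllib.parse import unquote


def decode_via_unquote(text):
    """Turn literal \\xHH sequences into %HH, then let unquote decode them."""
    # '\\x' is the two-character sequence backslash + x as it appears in scraped text.
    return unquote(text.replace('\\x', '%'))


# A raw string keeps the backslash literal, like text extracted from the page.
scraped = r'Slaughterhouse\x2DFive'

print(unquote(scraped))             # unchanged: there is no percent-escape to decode
print(decode_via_unquote(scraped))  # Slaughterhouse-Five
```

Any "%" already present in the text and followed by two hex digits would also be decoded, so this shortcut only suits input known to be free of such sequences.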

Take a look at Item Loaders and Input Processors so you can do this for all scraped fields.

Answer 2

Inspired by "Converting \x escaped string to UTF-8", I solved this using .decode('string-escape'), as follows:

pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
title_raw[0] = title_raw[0].decode('string-escape')
item['title'] = title_raw[0]
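
The string-escape codec exists only in Python 2. For Python 3, one possible equivalent is a small regular-expression substitution. This is a sketch rather than the answerer's method: decode_hex_escapes is a hypothetical helper name, and it handles only two-digit \xHH escapes.

```python
import re

# Matches a literal backslash, an "x", and exactly two hex digits.
_HEX_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})')


def decode_hex_escapes(text):
    """Replace each literal \\xHH sequence with the character it encodes."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw strings stand in for the values returned by .re() on the response.
print(decode_hex_escapes(r'Kurt\x20Vonnegut'))        # Kurt Vonnegut
print(decode_hex_escapes(r'Slaughterhouse\x2DFive'))  # Slaughterhouse-Five
```

The regex is used here instead of the unicode_escape codec because that codec can mangle non-ASCII characters already present in the text.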

Tags: regex, unicode, unicode-string, scrapy, python-2.7