Scrapy: How to extract only HTML tags from webpages

Question

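I want to extract only the HTML tags and their text from a webpage.

But there is one condition:

meta and script tags should be excluded. In any case, the tags visible on the page and their parent tags must be kept, so that the tree structure is preserved.

Thanks.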

Answer 1

You can most likely use a simple XPath:

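# Inside a spider callback, where `response` is the Scrapy response.
# Select every element except <script> and <meta>.
items = response.xpath("//*[not(self::script)][not(self::meta)]")
for item in items:
    tag_name = item.xpath("name()").extract_first()  # element name, e.g. "div"
    tag_text = item.xpath("text()").extract_first()  # first direct text node, or None
    print(tag_name)
    print(tag_text)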

This extracts every tag other than script and meta, along with the first text node directly inside it.
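The question also asks for the tree structure to be kept, which a flat list of elements does not show. The following is a minimal sketch of one way to do that. It swaps in Python's standard-library `html.parser` rather than Scrapy selectors, and the names `TagTreeBuilder`, `extract_tag_tree`, and `outline` are hypothetical helpers invented for this illustration, not part of Scrapy.

```python
from html.parser import HTMLParser

# Tags to drop, per the question (script and meta).
EXCLUDED = {"script", "meta"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Hypothetical helper: builds a nested dict tree of tags and their text."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "[document]", "text": "", "children": []}
        self.stack = [self.root]   # path from the root to the open element
        self.skip_depth = 0        # > 0 while inside an excluded element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in EXCLUDED:
            if tag not in VOID:
                self.skip_depth += 1
            return
        node = {"tag": tag, "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        # Close the nearest matching open element; this tolerates unclosed
        # tags such as <p> or <li>. Unmatched end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not text:
            return
        node = self.stack[-1]
        node["text"] = (node["text"] + " " + text).strip()


def extract_tag_tree(html):
    """Parse an HTML string and return the root of the tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Render the tree as indented lines, one per tag."""
    lines = []
    for child in node["children"]:
        label = child["tag"] + (": " + child["text"] if child["text"] else "")
        lines.append("  " * depth + label)
        lines.extend(outline(child, depth + 1))
    return lines
```

In a Scrapy callback, this could be fed the page source, for example `print("\n".join(outline(extract_tag_tree(response.text))))`. Whether elements hidden by CSS count as "visible" is outside what this sketch handles.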


Tags: html, tags, tree, scrapy