Scraping HTML source code with Scrapy

Question

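I am using Scrapy to crawl and scrape websites. I need the whole HTML rather than individual components. Components are easy to extract with XPath selectors, but is there a way to extract the entire HTML block for a given class? For example, in the HTML below, I need the exact HTML source of the whole div block with the class prod-basic-info. Is there any way to do this?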

<div class="block prod-basic-info">
 <h2>Product information</h2>
 <p class="product-info-label">Category</p>
 <p>
  <a href="xyz.html"></a>
 </p>
</div>

Answer 1

Simply point an XPath expression or CSS selector at the element and call extract() on it:

response.xpath('//div[contains(@class, "prod-basic-info")]').extract()[0]
response.css('div.prod-basic-info').extract()[0]

html

python

scrapy

python-2.7