Scrapy xpath giving all matching elements

Question

I have an HTML file from which I want to extract the anchor href values under a specific div. The HTML file looks like this:

<html>
<head>
    <title>Test page Vikrant </title>
</head>
<body>
        <div class="mainContainer">
                <a href="https://india.net" class="logoShape">India</a>
                    <nav id="vik1">
                    <a href="https://aarushmay.com" class="closemobilemenu">home</a>
            <ul class="mainNav">
                    <li class="hide-submenu">
                        <a class="comingsoon1" href="https://aarushmay.com/fashion">Fashion </a>
                </li>
            </ul>
        </nav>
                <a href="https://maharashtra.net" class="logoShape">Maharashtra</a>
    </div>
</body>
</html>

The spider code is as follows:

import os
import scrapy
from scrapy import Selector
class QuotesSpider(scrapy.Spider):
  name = "test"
  localfile_folder="localfiles"
  def start_requests(self):
    testFile = f'{self.localfile_folder}/t1.html'
    absoluteFileName = os.path.abspath(testFile)
    yield scrapy.Request(url=f'file:.///{absoluteFileName}', callback=self.parse)
  def parse(self, response):
    hrefElements = response.xpath('//nav[@id="vik1"]').xpath('//a/@href').getall()
    self.log(f'total records = {len(hrefElements)}')

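The output I get is 4 anchor elements, whereas I expected 2. So I used a Selector, stored the extracted element in it, and then extracted the anchor values from that. This works fine.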

import os
import scrapy
from scrapy import Selector
class QuotesSpider(scrapy.Spider):
  name = "test"
  localfile_folder="localfiles"
  def start_requests(self):
    testFile = f'{self.localfile_folder}/t1.html'
    absoluteFileName = os.path.abspath(testFile)
    yield scrapy.Request(url=f'file:.///{absoluteFileName}', callback=self.parse)
  def parse(self, response):
    listingDataSel = response.xpath('//nav[@id="vik1"]')
    exactElement = Selector(text=listingDataSel.get())
    hrefElements = exactElement.xpath('//a/@href').getall()
    self.log(f'total records = {len(hrefElements)}')

My question is: why do I need an intermediate Selector variable to store the extracted Div element?

Answer 1

Have you tried targeting the div by its class name? For example, to get the text of the anchor elements in your HTML, you could do the following.

response.xpath('//div[@class = "mainContainer"]/a/text()').extract()

From there, you only need to target the href instead of the text.

See the documentation here.

Answer 2

When you do this:

exactElement = Selector(text=listingDataSel.get())

you are creating a selector that contains only what you extracted with listingDataSel.get(), wrapped in a new document like this:

<html>
  <body>
    <nav id="vik1">                    
      <a href="https://aarushmay.com" class="closemobilemenu">home
      </a>            
      <ul class="mainNav">                    
        <li class="hide-submenu">                        
          <a class="comingsoon1" href="https://aarushmay.com/fashion">Fashion 
          </a>                
        </li>            
      </ul>        
    </nav>
  </body>
</html>

When you use the text argument, you create a new HTML document, which is why you get only two anchor elements. You can see some examples at this link.

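In your first snippet you got 4 anchor elements because '//a' is an absolute path: it searches the whole original document from the root, even when it is called on the nav selector. You can also try this: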

response.xpath('//div/nav[@id="vik1"]//a/@href').extract()

This gives the same result: only the two hrefs inside the nav.
