Unable to fetch element using scrapy

Question

I wrote a spider to scrape some elements from a website, but I cannot fetch some of them, while others work fine. Please point me in the right direction.

Here is my spider code:

from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ScrapyScraper.items import ScrapyscraperItem

class ScrapyscraperSpider(CrawlSpider) :
    name = "rs"
    allowed_domains = ["mega.pk"]
    start_urls = ["http://www.mega.pk/mobiles/"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ("http://www\.mega\.pk/mobiles_products/[0-9]+\/[a-zA-Z-0-9.]+",)), callback = 'parse_item', follow = True),
    )

    def parse_item(self, response) :
        sel = Selector(response)
        item = ScrapyscraperItem()

        item['Heading'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[1]/h2/span/text()').extract()
        item['Content'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/p/text()').extract()
        item['Price'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[2]/div[1]/div[2]/span/text()').extract()
        item['WiFi'] = sel.xpath('//*[@id="laptop_detail"]/tbody/tr/td[contains(. ,"Wireless")]/text()').extract()

        return item

Now I can get Heading, Content, and Price, but WiFi returns nothing. What completely confuses me is that the same XPath works in Chrome but not in Python (Scrapy).

Answer 1

I am still learning myself, but I think I can see your problem.

I assume you are looking for the WiFi status. In that case, you need the text of the span inside the next sibling element:

import urllib2  # Python 2 only; on Python 3 the equivalent is urllib.request
import lxml.html as LH

url = 'http://www.mega.pk/laptop_products/13242/Apple-MacBook-Pro-with-Retina-Display-Z0RG0000V.html'
response = urllib2.urlopen(url)
html = response.read()
doc = LH.fromstring(html)

heading = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[1]/h2/span/text()')
content = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/p/text()')
price = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[2]/div[1]/div[2]/span/text()')

# Use // rather than /tbody/ so the path matches the raw HTML
wifi_location = doc.xpath('//*[@id="laptop_detail"]//tr/td[contains(. ,"Wireless")]')[0]
# The status is in a span inside the cell that follows the "Wireless" label
wifi_status = wifi_location.getnext().find('span').text

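I only checked one page, but I hope this helps. At first I was not sure why your XPath does not work. I will do some more reading, but I often find that including tbody in the path fails in this kind of setup. I usually skip down to the td with // instead.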
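To make the tbody point concrete, here is a minimal sketch that runs without any network access. It rests on several assumptions: RAW_HTML is a made-up, well-formed stand-in for the page source (not the real mega.pk markup), value_after is a hypothetical helper, and the standard library's xml.etree.ElementTree is used only because it needs nothing installed. It supports just a small XPath subset and requires well-formed input, so for a real page you would still use lxml or Scrapy selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the raw page source.
# Note there is no <tbody> between <table> and <tr>.
RAW_HTML = """
<div>
  <table id="laptop_detail">
    <tr><td>Wireless</td><td><span>Yes</span></td></tr>
    <tr><td>Bluetooth</td><td><span>No</span></td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Path in the style Chrome shows: it requires a <tbody> that is not in the source.
with_tbody = root.findall(".//table[@id='laptop_detail']/tbody/tr/td")

# Same lookup with // in place of the tbody step.
without_tbody = root.findall(".//table[@id='laptop_detail']//tr/td")


def value_after(label_text, table):
    """Return the span text from the cell that follows the labelled cell."""
    for row in table.iter("tr"):
        cells = list(row)
        for i, cell in enumerate(cells[:-1]):
            if label_text in "".join(cell.itertext()):
                span = cells[i + 1].find("span")
                return span.text if span is not None else None
    return None


table = root.find(".//table[@id='laptop_detail']")
wifi_status = value_after("Wireless", table)

print(len(with_tbody), len(without_tbody), wifi_status)
```

With this sample markup, the tbody path matches no cells, the // path matches all four, and the WiFi status comes back as "Yes".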

Edit

Found the reason: it appears that Chrome inserts a tbody element when the original HTML does not contain one. Scrapy parses the raw HTML, which has no tbody.
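If you copy many XPaths out of Chrome, one possible workaround is to relax the tbody steps before handing the path to Scrapy or lxml. The helper below is only a sketch: relax_tbody is a hypothetical name, not part of either library. It also loosens the path slightly, because // matches at any depth and an index such as tbody[2] is dropped.

```python
import re

# Matches a "tbody" step (optionally indexed, e.g. "tbody[1]") together with
# the slashes immediately before and after it.
_TBODY_STEP = re.compile(r'/+tbody(?:\[\d+\])?/+')


def relax_tbody(xpath):
    """Replace each tbody step with '//' so the path matches with or without tbody."""
    return _TBODY_STEP.sub('//', xpath)


chrome_xpath = '//*[@id="laptop_detail"]/tbody/tr/td[contains(. ,"Wireless")]/text()'
print(relax_tbody(chrome_xpath))
```

The resulting path uses // between the table and its rows, so it works whether or not the parser sees a tbody element.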

Extracting lxml xpath for html table

python

scrapy

python-2.7

scrapy-spider