Python, Scrapy problems when parsing tables in earnings reports

Question

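I'm trying to parse some data from the table (balance sheet) under each earnings report. I'm using AMD as an example here, but the problem isn't limited to AMD.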

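The problem I'm having is that I can't get any readings: my spider always returns EMPTY results. I tested the XPath that I copied directly from the Google Chrome Inspector with scrapy shell "http://example.com", and it still doesn't work.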

Here is my XPath (provided by Chrome):

//*[@id="newsroom-copy"]/div[2]/div[8]/table/tbody/tr[9]/td[4]/text()

Here is my code:

import scrapy

class ESItem(scrapy.Item):
    Rev = scrapy.Field()

class ESSpider(scrapy.Spider):
    name = "es"
    start_urls = [
        'http://www.marketwired.com/press-release/amd-reports-2016-second-quarter-results-nasdaq-amd-2144535.htm',
    ]

    def parse(self, response):
        item = ESItem()
        for earning in response.xpath('//*[@id="newsroom-copy"]/div[2]/div[8]/table/tbody'):
            item['Rev'] = earning.xpath('tr[9]/td[4]/text()').extract_first()
            yield item

I want to retrieve the "revenue numbers" from the table at the bottom of the report.

Thanks!

I run my code with this command:

scrapy runspider ***.py -o ***.json

The code runs fine with no errors; it just doesn't return what I actually want.

UPDATE: I kind of figured it out... I had to remove the "tbody" tag from the XPath, but I don't understand why. Can anyone explain this a bit?

Answer 1

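The HTML shown by Chrome's inspector is the browser's interpretation of the actual markup that the server sent to it.

The tbody tag is a good example. If you view the page source of the site, you'll see a structure like this: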

<table>
    <tr>
        <td></td>
    </tr>
</table>

Now, if you inspect the page, this is what you get instead:

<table>
    <tbody>
        <tr>
            <td></td>
        </tr>
    </tbody>
</table>

Scrapy gets the page source, not what the inspector shows. So whenever you try to select something on a page, make sure it exists in the page source.
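The difference can be reproduced offline. This is a minimal sketch, not Scrapy itself: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) as a stand-in for Scrapy's selectors. The two markup strings, the helper name cell_text, and the figure 100 are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Roughly what the server sends: rows sit directly under <table>.
PAGE_SOURCE = """
<table>
    <tr><td>Net revenue</td><td>100</td></tr>
</table>
""".strip()

# Roughly what the browser inspector shows: a <tbody> has been inserted.
INSPECTOR_DOM = """
<table>
    <tbody>
        <tr><td>Net revenue</td><td>100</td></tr>
    </tbody>
</table>
""".strip()

# The same cell addressed with and without the tbody step.
WITH_TBODY = "./tbody/tr[1]/td[2]"
WITHOUT_TBODY = "./tr[1]/td[2]"


def cell_text(markup, path):
    """Return the text of the first element matching path, or None."""
    table = ET.fromstring(markup)  # the root element is <table>
    cell = table.find(path)
    return None if cell is None else cell.text


source_with = cell_text(PAGE_SOURCE, WITH_TBODY)
source_without = cell_text(PAGE_SOURCE, WITHOUT_TBODY)
inspector_with = cell_text(INSPECTOR_DOM, WITH_TBODY)

print(source_with)     # None: the page source has no <tbody>
print(source_without)  # 100
print(inspector_with)  # 100: the path only works against the inspector's DOM
```

A path copied from the inspector matches the inspector's tree, which is why it comes back empty when applied to the markup the spider actually downloads.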

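Another example is when you try to select elements that are generated by JavaScript while the page loads. Scrapy won't get these either, so you need something else to render them, such as scrapy-splash or Selenium.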
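One way to check this before writing a selector is to look for the element in the raw markup. The sketch below is only an illustration using the standard library's html.parser; the page string, the ids "app" and "results", and the helper ids_in_source are hypothetical. It collects the id attributes that exist as real tags in the source, so an element that is only created by a script does not show up.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the id attribute of every start tag in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in_source(html):
    """Return the set of ids present as actual tags in the markup."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical page: the table is only created when the script runs.
RAW_HTML = """
<html><body>
<div id="app"></div>
<script>
document.getElementById("app").innerHTML = '<table id="results"></table>';
</script>
</body></html>
"""

found = ids_in_source(RAW_HTML)
print("app" in found)      # True: present in the page source
print("results" in found)  # False: script text is not parsed as tags
```

If the id you want is missing from the source, no XPath will find it in a plain Scrapy response, and a rendering tool is needed.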

As a side note, take the time to learn XPath and CSS selectors. Knowing how to query for exactly the element you want will save you time.

//*[@id='newsroom-copy']/div[2]/div[8]/table/tr[9]/td[4]/text()

can be written, for this page, as

//table/tr[td/text()='Net revenue']/td[4]/text()

See how much better that looks?
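Besides being easier to read, selecting the row by its label keeps working if rows are added or reordered. Here is a small sketch of that idea, again with xml.etree.ElementTree rather than Scrapy. ElementTree has no text() function, so the predicate is written as td='Net revenue' instead of td/text()='Net revenue'. The tables, the numbers, and the helper cell_by_label are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up table with the label in the first cell of each row.
ORIGINAL = ET.fromstring("""
<table>
    <tr><td>Item</td><td>A</td><td>B</td><td>C</td></tr>
    <tr><td>Net revenue</td><td>100</td><td>200</td><td>300</td></tr>
    <tr><td>Gross margin</td><td>10</td><td>20</td><td>30</td></tr>
</table>
""".strip())

# The same data with the two data rows swapped.
REORDERED = ET.fromstring("""
<table>
    <tr><td>Item</td><td>A</td><td>B</td><td>C</td></tr>
    <tr><td>Gross margin</td><td>10</td><td>20</td><td>30</td></tr>
    <tr><td>Net revenue</td><td>100</td><td>200</td><td>300</td></tr>
</table>
""".strip())


def cell_by_label(table, label, column):
    """Text of the given column in the row that has a cell equal to label."""
    # Assumes the label contains no single quote.
    cell = table.find(f"./tr[td='{label}']/td[{column}]")
    return None if cell is None else cell.text


# Position-based lookup depends on the row order.
by_position_original = ORIGINAL.find("./tr[2]/td[4]").text
by_position_reordered = REORDERED.find("./tr[2]/td[4]").text

# Label-based lookup does not.
by_label_original = cell_by_label(ORIGINAL, "Net revenue", 4)
by_label_reordered = cell_by_label(REORDERED, "Net revenue", 4)

print(by_position_original, by_position_reordered)  # 300 30
print(by_label_original, by_label_reordered)        # 300 300
```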
