Scrapy crawled 0 pages
I am learning Scrapy. I completed the tutorial at https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/ and everything went fine.
Then I started a new, simple project to extract data from Wikipedia, and this is the output:
C:\Users\Leo\Documenti\PROGRAMMAZIONE\SORGENTI\Python\wikiScraper>scrapy crawl wiki
2015-09-07 02:28:59 [scrapy] INFO: Scrapy 1.0.3 started (bot: wikiScraper)
2015-09-07 02:28:59 [scrapy] INFO: Optional features available: ssl, http11
2015-09-07 02:28:59 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wikiScraper.spiders', 'SPIDER_MODULES': ['wikiScraper.spiders'], 'BOT_NAME': 'wikiScraper'}
2015-09-07 02:28:59 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-09-07 02:28:59 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-09-07 02:28:59 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-09-07 02:28:59 [scrapy] INFO: Enabled item pipelines:
2015-09-07 02:28:59 [scrapy] INFO: Spider opened
2015-09-07 02:28:59 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-09-07 02:28:59 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-09-07 02:29:00 [scrapy] DEBUG: Crawled (200) <GET https://it.wikipedia.org/wiki/Serie_A_2015-2016> (referer: None)
[]
2015-09-07 02:29:00 [scrapy] INFO: Closing spider (finished)
2015-09-07 02:29:00 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 55474,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 9, 7, 0, 29, 0, 355000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 9, 7, 0, 28, 59, 671000)}
2015-09-07 02:29:00 [scrapy] INFO: Spider closed (finished)
Here is my wiki_spider.py:
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector

from wikiScraper.items import WikiItem


class WikiSpider(Spider):
    name = "wiki"
    allowed_domains = ["wikipedia.it"]
    start_urls = [
        "http://it.wikipedia.org/wiki/Serie_A_2015-2016",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//*[@id="mw-content-text"]/center/table/tbody/tr')
        print(questions)
        for question in questions:
            item = WikiItem()
            item['position'] = question.xpath('td[2]/text()').extract()
            item['team'] = question.xpath('td[3]/a/text()').extract()
            item['point'] = question.xpath('td[4]/b/text()').extract()
            yield item
In Chrome's developer tools I can successfully select all the data I want to extract with that XPath. But if I try:
response.xpath('//*[@id="mw-content-text"]/center/table/tbody/tr')
at the command prompt and print(questions), it gives me an empty
[]
Thanks! Any help is appreciated!
The actual problem is the tbody in your XPath expression: this element is added by the browser and does not exist in the HTML that Scrapy receives. I would also rely on the "Classifica" heading text to locate the desired table with the current Serie A standings. Updated code:
def parse(self, response):
    # anchor on the "Classifica" section heading and skip the header row
    questions = response.xpath('//h2[span = "Classifica"]/following-sibling::center/table//tr')[1:]
    for question in questions:
        item = WikiItem()
        item['position'] = question.xpath('td[2]/text()').extract()[0]
        item['team'] = question.xpath('td[3]/a/text()').extract()[0]
        item['point'] = question.xpath('td[4]/b/text()').extract()[0]
        yield item
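You can check the tbody problem directly in the Scrapy shell: the selector from the question matches nothing, while stepping over tbody with // should reach the same rows (a quick sanity check, assuming the page structure described above):

scrapy shell "https://it.wikipedia.org/wiki/Serie_A_2015-2016"
>>> # the browser-inserted <tbody> is not in the raw HTML, so this matches nothing
>>> response.xpath('//*[@id="mw-content-text"]/center/table/tbody/tr')
[]
>>> # // skips the missing tbody level and reaches the rows
>>> rows = response.xpath('//*[@id="mw-content-text"]/center/table//tr')

The //h2[span = "Classifica"] anchor works because MediaWiki wraps each section title in a span inside the heading, so the table is found by its section instead of a brittle positional path.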
It produces:
{'position': u'1.', 'point': u'6', 'team': u'Chievo'}
{'position': u'1.', 'point': u'6', 'team': u'Torino'}
{'position': u'1.', 'point': u'6', 'team': u'Inter'}
{'position': u'1.', 'point': u'6', 'team': u'Sassuolo'}
{'position': u'1.', 'point': u'6', 'team': u'Palermo'}
{'position': u'6.', 'point': u'4', 'team': u'Sampdoria'}
{'position': u'6.', 'point': u'4', 'team': u'Roma'}
{'position': u'8.', 'point': u'3', 'team': u'Atalanta'}
{'position': u'8.', 'point': u'3', 'team': u'Genoa'}
{'position': u'8.', 'point': u'3', 'team': u'Fiorentina'}
{'position': u'8.', 'point': u'3', 'team': u'Udinese'}
{'position': u'8.', 'point': u'3', 'team': u'Milan'}
{'position': u'8.', 'point': u'3', 'team': u'Lazio'}
{'position': u'14.', 'point': u'1', 'team': u'Napoli'}
{'position': u'14.', 'point': u'1', 'team': u'Verona'}
{'position': u'16.', 'point': u'0', 'team': u'Bologna'}
{'position': u'16.', 'point': u'0', 'team': u'Juventus'}
{'position': u'16.', 'point': u'0', 'team': u'Empoli'}
{'position': u'16.', 'point': u'0', 'team': u'Frosinone'}
{'position': u'16.', 'point': u'0', 'team': u'Carpi'}
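One caveat about the .extract()[0] calls: they raise an IndexError whenever a row lacks the expected cell (a header or separator row, for example). A slightly more defensive sketch of the same parse method (extract_first() only exists in later Scrapy releases, so this sticks to plain extract() for 1.0.x):

def parse(self, response):
    rows = response.xpath('//h2[span = "Classifica"]/following-sibling::center/table//tr')[1:]
    for row in rows:
        # extract() returns a list of matching strings; it may be empty
        position = row.xpath('td[2]/text()').extract()
        team = row.xpath('td[3]/a/text()').extract()
        point = row.xpath('td[4]/b/text()').extract()
        if not (position and team and point):
            continue  # skip rows that do not look like standings entries
        item = WikiItem()
        item['position'] = position[0]
        item['team'] = team[0]
        item['point'] = point[0]
        yield item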