Scrapy Basic Crawler not working?
So I just recently started trying Scrapy for a project, and I was quite confused by all the legacy syntax (SgmlLinkExtractor and so on), but I somehow managed to piece together what I believe is sensible, readable code. However, it doesn't go through every page on the site; it only visits the start_urls page and doesn't produce an output file. Can someone explain what I'm missing?
import scrapy
import csv
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RLSpider(CrawlSpider):
    name = "RL"
    allowed_domains = 'ralphlauren.com/product/'
    start_urls = [
        'http://www.ralphlauren.com/'
    ]
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        name = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//span[@class="reg-price"]/span/text()').extract_first()
        image = response.xpath('//input[@name="enh_0"]/@value').extract_first()
        print("Rules=", rules)
        tup = (name, price, image)
        csvF = open('data.csv', 'w')
        csvWrite = csv.writer(csvF)
        csvWrite.writerow(tup)
        return []

    def parse(self, response):
        pass
I am trying to extract data from the site and write it to a csv file, from all the pages under /product/.
Here is the log:
2016-12-07 19:46:49 [scrapy] INFO: Scrapy 1.2.2 started (bot: P35Crawler)
2016-12-07 19:46:49 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'P35Crawler', 'NEWSPIDER_MODULE': 'P35Crawler.spiders', 'SPIDER_MODULES': ['P35Crawler.spiders']}
2016-12-07 19:46:49 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-12-07 19:46:50 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-07 19:46:50 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-07 19:46:50 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-07 19:46:50 [scrapy] INFO: Spider opened
2016-12-07 19:46:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-07 19:46:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-07 19:46:51 [scrapy] DEBUG: Redirecting (302) to <GET http://www.ralphlauren.com/home/index.jsp?ab=Geo_iIN_rUS_dUS> from <GET http://www.ralphlauren.com/>
2016-12-07 19:46:51 [scrapy] DEBUG: Crawled (200) <GET http://www.ralphlauren.com/home/index.jsp?ab=Geo_iIN_rUS_dUS> (referer: None)
2016-12-07 19:46:51 [scrapy] INFO: Closing spider (finished)
2016-12-07 19:46:51 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 497,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 20766,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 7, 14, 16, 51, 973406),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 12, 7, 14, 16, 50, 287464)}
2016-12-07 19:46:51 [scrapy] INFO: Spider closed (finished)
You should not override the parse() method with an empty one, so just remove that method declaration. CrawlSpider relies on its built-in parse() to apply the rules; overriding it with pass is why no links get followed. Please let me know if this helps.
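For reference, here is a minimal sketch of the spider with the empty parse() removed. It also fixes two incidental bugs in the posted code: allowed_domains must be a list of bare domain names (a path like /product/ does not belong there; use the link extractor's allow pattern for that), and the print call references rules unqualified, which would raise a NameError:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RLSpider(CrawlSpider):
    name = "RL"
    allowed_domains = ['ralphlauren.com']   # bare domain only
    start_urls = ['http://www.ralphlauren.com/']

    rules = (
        # Scrape product pages; the first matching rule wins.
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_item', follow=True),
        # Follow everything else to discover product links.
        Rule(LinkExtractor(), follow=True),
    )

    # No parse() override: CrawlSpider needs its built-in parse()
    # to apply the rules above.

    def parse_item(self, response):
        yield {
            'name': response.xpath('//h1/text()').extract_first(),
            'price': response.xpath('//span[@class="reg-price"]/span/text()').extract_first(),
            'image': response.xpath('//input[@name="enh_0"]/@value').extract_first(),
        }

Run it with scrapy crawl RL -o data.csv and the feed exporter builds the CSV for you; reopening data.csv in 'w' mode on every page, as the original parse_item does, truncates the file each time.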
Update
Regarding your comment about parsing JS with scrapy, there are different ways to do it. You need a browser to parse the JS. Let's say you want to try Firefox and control it with Selenium.
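A bare-bones example of that (a sketch assuming Firefox and the selenium package are installed; the URL is just the one from the question):

from selenium import webdriver

driver = webdriver.Firefox()              # opens a real Firefox window
driver.get('http://www.ralphlauren.com/')
html = driver.page_source                 # HTML after the JS has executed
driver.quit()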
IMO the best way is to implement a download handler, as I explained in this answer. You could, otherwise, implement a downloader middleware, as explained there. The middleware has some drawbacks compared to the handler, since the download handler allows you to keep the default cache, retry and so on.
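As a rough illustration of the middleware route (a sketch only, with class and module names of my own choosing, not the exact code from that answer):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    """Downloader middleware that fetches pages with a real browser."""

    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # Let the browser render the page, then hand Scrapy the result.
        self.driver.get(request.url)
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

You would enable it via DOWNLOADER_MIDDLEWARES in settings.py, e.g. {'myproject.middlewares.SeleniumMiddleware': 543}, with the path adjusted to your project. Note that returning a response from process_request bypasses Scrapy's downloader entirely, which is exactly the drawback mentioned above: the built-in cache and retry no longer apply.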
Once you have a basic script working with Firefox, switching to PhantomJS is just a matter of changing a few lines. PhantomJS is a headless browser, i.e. it doesn't need to load the whole browser interface, so it is much faster.
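With the script above, the switch is typically just the driver line (assuming the phantomjs executable is on your PATH and a Selenium version that still ships this driver):

driver = webdriver.PhantomJS()            # headless: no browser window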
Other solutions involve using Docker and Splash, but in the end I think that's overkill, since you need to run a VM just to control a browser.
So, to sum up, the best solution is to implement a download handler that makes use of Selenium and PhantomJS.