Error: web scraping page does not reconnect, but the crawl can be restarted
I am scraping a website. Sometimes it sends me this output and never reconnects to the target page:
2020-08-18 22:37:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:38:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 35 pages/min), scraped 116421 items (at 35 items/min)
2020-08-18 22:38:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:38:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:39:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:39:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:39:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:40:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:40:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:40:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:41:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:41:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:41:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:42:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:42:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:42:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:43:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:43:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
I use rotating proxies, refreshed every hour. I tried the proxies with another spider and they work fine on the same page. As the log shows, the crawl rate drops from 35 pages/min to 0 and stays there.
What could the problem be, and how can I rescue the data that has already been scraped?
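One option for not losing progress when the crawl has to be killed, assuming standard Scrapy behaviour, is to run the spider with a job directory so the scheduler state is persisted to disk:

scrapy crawl pool -s JOBDIR=crawls/pool-run1

Pressing Ctrl-C once triggers a graceful shutdown, and running the same command again resumes from where the crawl stopped (the crawls/pool-run1 directory name is arbitrary).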
Code:
import scrapy

class Pool(scrapy.Spider):
    name = 'pool'
    # raw string so the backslash in the Windows path is not treated as an escape
    start_urls = [l.strip() for l in open(r"D:\links.txt").readlines()]

    def parse(self, response):
        # select the sixth child element of <html>
        pool1 = response.xpath("/html/*[6]").get('').strip()
        url = response.url
        yield {
            'Pool1': pool1,
            'Url': url,
        }
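As a side note, here is a sketch of the same spider that issues its requests explicitly with an errback, so timeouts and connection errors get logged instead of disappearing silently (the on_error name is made up for illustration):

import scrapy

class PoolDebug(scrapy.Spider):
    name = 'pool_debug'

    def start_requests(self):
        with open(r"D:\links.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    # errback fires on timeouts, DNS failures and connection errors
                    yield scrapy.Request(url, callback=self.parse,
                                         errback=self.on_error)

    def parse(self, response):
        yield {
            'Pool1': response.xpath("/html/*[6]").get('').strip(),
            'Url': response.url,
        }

    def on_error(self, failure):
        # hypothetical handler: log the failure so hung or failed URLs are visible
        self.logger.error(repr(failure))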
Settings:
BOT_NAME = 'Pool'
SPIDER_MODULES = ['Pool.spiders']
NEWSPIDER_MODULE = 'Pool.spiders'
ROBOTSTXT_OBEY = False
FEED_EXPORTERS = {
'xlsx': 'scrapy_xlsx.XlsxItemExporter',
}
DOWNLOAD_TIMEOUT = 3600
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
COOKIES_ENABLED = False
ROTATING_PROXY_LIST = [
'IPproxyhttp',
]
My guess is that the page or all the proxies drop at the same time, and the spider is waiting out the DOWNLOAD_TIMEOUT.
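If that guess is right, the hour-long DOWNLOAD_TIMEOUT would explain the stall: with a single proxy, each hung connection can block for up to 3600 s before Scrapy gives up, which matches the long 0 pages/min stretches in the log. A sketch of settings that fail fast and retry instead (the exact numbers are assumptions, not tuned for this site):

DOWNLOAD_TIMEOUT = 30   # give up on a hung connection after 30 s
RETRY_ENABLED = True
RETRY_TIMES = 5         # re-queue a failed request a few times
# scrapy-rotating-proxies: retries of the same page through different proxies
ROTATING_PROXY_PAGE_RETRY_TIMES = 5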