Celery with Scrapy doesn't parse CSV file
The task itself starts immediately, but it finishes almost instantly and I never see any result from it: items never reach the pipeline at all. When I run the spider manually with the scrapy crawl <spider_name> command, everything works fine. The problem only appears when the spider is started from Celery.
My Celery worker log:
[2021-02-13 14:25:00,208: INFO/MainProcess] Received task: crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3]
[2021-02-13 16:25:00,867: INFO/ForkPoolWorker-1] Scrapy 2.4.0 started (bot: crawling)
[2021-02-13 16:25:00,869: INFO/ForkPoolWorker-1] Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.7 (default, Jan 12 2021, 17:06:28) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Linux-5.8.0-41-generic-x86_64-with-glibc2.2.5
[2021-02-13 16:25:00,869: DEBUG/ForkPoolWorker-1] Using reactor: twisted.internet.epollreactor.EPollReactor
[2021-02-13 16:25:00,879: INFO/ForkPoolWorker-1] Overridden settings:
{'BOT_NAME': 'crawling',
'DOWNLOAD_TIMEOUT': 600,
'DOWNLOAD_WARNSIZE': 267386880,
'NEWSPIDER_MODULE': 'crawling.crawling.spiders',
'SPIDER_MODULES': ['crawling.crawling.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64)'}
[2021-02-13 16:25:01,018: INFO/ForkPoolWorker-1] Telnet Password: d95c783294fc93df
[2021-02-13 16:25:01,064: INFO/ForkPoolWorker-1] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
[2021-02-13 16:25:01,151: INFO/ForkPoolWorker-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2021-02-13 16:25:01,172: INFO/ForkPoolWorker-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2021-02-13 16:25:01,183: INFO/ForkPoolWorker-1] Task crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3] succeeded in 0.9719750949989248s: None
[2021-02-13 16:25:01,285: INFO/ForkPoolWorker-1] Received SIGTERM, shutting down gracefully. Send again to force
I have the following spider:
from scrapy.spiders import CSVFeedSpider


class CopartSpider(CSVFeedSpider):
    name = '<spider_name>'
    allowed_domains = ['<allowed_domain>']
    start_urls = [
        'file:///code/autotracker/crawling/data/salesdata.cgi'
    ]
Part of my Scrapy settings (nothing else in there is directly related to Scrapy):
BOT_NAME = 'crawling'
SPIDER_MODULES = ['crawling.crawling.spiders']
NEWSPIDER_MODULE = 'crawling.crawling.spiders'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'
ROBOTSTXT_OBEY = False
DOWNLOAD_TIMEOUT = 600 # 10 min
DOWNLOAD_WARNSIZE = 255 * 1024 * 1024 # 255 mb
DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'crawling.pipelines.AutoPipeline': 1,
}
I have two Celery configuration files:
celery.py
from celery import Celery
from celery.schedules import crontab

BROKER_URL = 'redis://redis:6379/0'

app = Celery('crawling', broker=BROKER_URL)
app.conf.beat_schedule = {
    'scrape-every-20-minutes': {
        'task': 'crawling.crawling.tasks.start_crawler_process',
        'schedule': crontab(minute='*/5'),
    },
}
tasks.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


@app.task
def start_crawler_process():
    process = CrawlerProcess(get_project_settings())
    process.crawl('<spider_name>')
    process.start()
Cause: Scrapy does not allow its crawler to be run from inside another already-running process (here, the Celery worker), so the crawl never actually happens.
Workaround: I use my own script - https://github.com/dtalkachou/scrapy-crawler-script
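If you would rather not pull in an extra dependency, another common workaround is to shell out to the Scrapy CLI from the Celery task, so every crawl runs in its own child process with a fresh Twisted reactor. Below is a minimal sketch, assuming the scrapy executable is on the worker's PATH, the worker's working directory is the Scrapy project root (where scrapy.cfg lives), and the app import path matches your celery.py; adjust those to your layout.
import subprocess

from .celery import app  # assumption: import from wherever your Celery app is defined


@app.task
def start_crawler_process():
    # Run the spider in its own child process; the Twisted reactor is created
    # and torn down there, so the Celery worker process is never affected.
    result = subprocess.run(
        ['scrapy', 'crawl', '<spider_name>'],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the spider's stderr in the worker log instead of letting the
        # task report success while nothing was scraped.
        raise RuntimeError(result.stderr)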