How to enable overwriting output files in scrapy settings.py?
As can be found in the docs, it states:
New in version 2.4.0.
overwrite: whether to overwrite the file if it already exists (True)
or append to its content (False).
I inserted the following into the settings.py file of my Scrapy project:
FEEDS = {"overwrite": True}
This leads to the following error output when executing scrapy crawl quotes_splash -o Outputs/quotes_splash.json:
(scrapy_course) andylu@andylu-Lubuntu-PC:~$ scrapy crawl quotes_splash -o Outputs/quotes_splash.json
2020-12-02 18:11:59 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: quotes_spider_splash)
2020-12-02 18:11:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.0 (default, Nov 22 2020, 23:12:14) - [GCC 5.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Linux-5.4.0-56-generic-x86_64-with-glibc2.31
2020-12-02 18:11:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-12-02 18:11:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'quotes_spider_splash',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'quotes_spider_splash.spiders',
'SPIDER_MODULES': ['quotes_spider_splash.spiders']}
2020-12-02 18:11:59 [scrapy.extensions.telnet] INFO: Telnet Password: ...
Traceback (most recent call last):
File "/home/andylu/.virtualenvs/scrapy_course/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/cmdline.py", line 145, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
func(*a, **kw)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/cmdline.py", line 153, in _run_command
cmd.run(args, opts)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 22, in run
crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/crawler.py", line 191, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/crawler.py", line 224, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/crawler.py", line 229, in _create_crawler
return Crawler(spidercls, self.settings)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/crawler.py", line 72, in __init__
self.extensions = ExtensionManager.from_crawler(self)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/middleware.py", line 53, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/middleware.py", line 35, in from_settings
mw = create_instance(mwcls, settings, crawler)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/utils/misc.py", line 167, in create_instance
instance = objcls.from_crawler(crawler, *args, **kwargs)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/extensions/feedexport.py", line 247, in from_crawler
exporter = cls(crawler)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/extensions/feedexport.py", line 277, in __init__
self.feeds[uri] = feed_complete_default_values_from_settings(feed_options, self.settings)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/utils/conf.py", line 118, in feed_complete_default_values_from_settings
out = feed.copy()
AttributeError: 'bool' object has no attribute 'copy'
How do I prevent the output file Outputs/quotes_splash.json from being appended to? I want it to be completely overwritten on every run.
PS:
Inspired by Georgiy's answer below, I found that the command-line help shows the output flags -o and -O do indeed differ. scrapy crawl -h yields:
(scrapy_course) andylu@andylu-Lubuntu-PC:~$ scrapy crawl -h
Usage
=====
scrapy crawl [options] <spider>
Run a spider
Options
=======
--help, -h show this help message and exit
-a NAME=VALUE set spider argument (may be repeated)
--output=FILE, -o FILE append scraped items to the end of FILE (use - for
stdout)
--overwrite-output=FILE, -O FILE
dump scraped items into FILE, overwriting any existing
file
--output-format=FORMAT, -t FORMAT
format to use for dumping items
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
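So, given the help output above, the command I was after is presumably:
scrapy crawl quotes_splash -O Outputs/quotes_splash.json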
I inserted the following into the settings.py file of my Scrapy project:
FEEDS = {"overwrite": True}
According to the docs (v2.4) on the usage of the FEEDS setting, it should look like this (this is not the source of the error here, since it is overridden by the command-line argument, which has higher priority):
FEEDS = {
"quotes_splash.json": {
"format": "json",
"overwrite": True
}
}
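A minimal sketch of the corresponding entry for the asker's path, assuming the same Outputs/ directory relative to where the crawl is started:
FEEDS = {
    # Each key is a feed URI; each value is a dict of per-feed options.
    "Outputs/quotes_splash.json": {
        "format": "json",    # serialization format of the feed
        "overwrite": True,   # replace the file on each run instead of appending (Scrapy >= 2.4)
    },
}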
To enable overwriting from the command line, you need to use the -O (capital letter) argument instead of the lowercase -o (they are different).
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/extensions/feedexport.py", line 277, in __init__
self.feeds[uri] = feed_complete_default_values_from_settings(feed_options, self.settings)
File "/home/andylu/.virtualenvs/scrapy_course/lib/python3.9/site-packages/scrapy/utils/conf.py", line 118, in feed_complete_default_values_from_settings
out = feed.copy()
It looks like the real source of your error is in the parsing of the feed URIs, not in the overwrite functionality itself: Scrapy treats each key of the FEEDS dict as a feed URI (such as Outputs/quotes_splash.json) and each value as a dict of per-feed options, so with FEEDS = {"overwrite": True} the options value is the bool True, which has no .copy() method.
The overwrite command for the Scrapy command-line tool needs to look like this:
scrapy crawl quotes_splash -O quotes_splash.json
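If you would rather not depend on command-line flags at all, here is a minimal sketch of running the spider programmatically with the overwrite option set in code (assuming "quotes_splash" is the spider's name attribute, as used in scrapy crawl, and the script is run from the project directory):
# run_quotes_splash.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py, then force the feed configuration.
settings = get_project_settings()
settings.set("FEEDS", {
    "Outputs/quotes_splash.json": {
        "format": "json",
        "overwrite": True,  # requires Scrapy >= 2.4
    },
})

process = CrawlerProcess(settings)
# The spider name string is resolved through the project's spider loader.
process.crawl("quotes_splash")
process.start()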