How to dynamically change download folder in scrapy?
I'm using scrapy to download some HTML files from a website, but all the downloads are stored under a single folder. I'd rather store them dynamically in separate folders, e.g. the HTML files from page 1 go into folder_1, and so on...
This is what my spider looks like:
import scrapy

class LearnSpider(scrapy.Spider):
    name = "learn"
    start_urls = ["someUrlWithIndexstart=" + chr(i) for i in range(ord('a'), ord('z') + 1)]

    def parse(self, response):
        for song in response.css('.entity-title'):
            songs = song.css('a ::attr(href)').get()
            yield {
                'file_urls': [songs + ".html"]
            }
Ideally, I'd like the HTML scraped for each letter to end up in a subfolder for that letter.
Below is my settings file:
BOT_NAME = 'learn'
SPIDER_MODULES = ['learn.spiders']
NEWSPIDER_MODULE = 'learn.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_files'
Any solution/idea would be helpful, thanks.
Create a custom pipeline:
pipelines.py:
import os
from itemadapter import ItemAdapter
from urllib.parse import unquote
from scrapy.pipelines.files import FilesPipeline
from scrapy.http import Request

class ProcessPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # Schedule a download request for every URL in the item's file_urls field.
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u) for u in urls]

    def file_path(self, request, response=None, info=None, *, item=None):
        # Store each file under the per-item 'path' prefix instead of the default layout.
        file_name = os.path.basename(unquote(request.url))
        return item['path'] + file_name
Then change ITEM_PIPELINES in your settings to point at this class:
ITEM_PIPELINES = {'projectsname.pipelines.ProcessPipeline': 1}
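For reference, a minimal sketch of the updated settings.py, assuming the project is named learn as in the question (so projectsname becomes learn):

# settings.py
BOT_NAME = 'learn'
SPIDER_MODULES = ['learn.spiders']
NEWSPIDER_MODULE = 'learn.spiders'
ROBOTSTXT_OBEY = True

# Use the custom pipeline instead of the stock FilesPipeline.
ITEM_PIPELINES = {'learn.pipelines.ProcessPipeline': 1}
FILES_STORE = 'downloaded_files'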
And when you yield the item, also add the path of the directory you want to download into:
yield {
    'file_urls': [songs + ".html"],
    'path': f'folder{page}/'  # of course you'll need to provide the page variable
}
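Putting it together, a minimal sketch of the spider, assuming the index letter can be read from the tail of response.url (the exact URL format and the folder_ naming are assumptions; adjust them to your site):

import scrapy

class LearnSpider(scrapy.Spider):
    name = "learn"
    start_urls = ["someUrlWithIndexstart=" + chr(i) for i in range(ord('a'), ord('z') + 1)]

    def parse(self, response):
        # Assumption: the start URL ends with the index letter, e.g. "...start=a".
        letter = response.url[-1]
        for song in response.css('.entity-title'):
            songs = song.css('a ::attr(href)').get()
            yield {
                'file_urls': [songs + ".html"],
                # One subfolder per letter; ProcessPipeline.file_path prepends this prefix.
                'path': f'folder_{letter}/'
            }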