How to enable overwriting a file every time in scrapy item export?
I am scraping a website that returns a list of urls.
Example: scrapy crawl xyz_spider -o urls.csv
It works perfectly fine, but what I want now is to create a fresh urls.csv on every run instead of appending data to the existing file. Is there any parameter I can pass to enable that?
Unfortunately, Scrapy cannot do this at the moment.
There is a proposed enhancement on GitHub, though: https://github.com/scrapy/scrapy/issues/547
However, you can easily redirect the output to stdout and then redirect stdout to a file:
scrapy crawl myspider -t json --nolog -o - > output.json
-o - means output to minus, and minus in this case means stdout.
You can also set up an alias that deletes the file before running Scrapy, for example:
alias sc='rm output.csv && scrapy crawl myspider -o output.csv'
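If you would rather keep everything in Python than rely on a shell alias, the same delete-before-crawl idea can go into a small launcher script. This is only a sketch: it reuses the urls.csv path and the xyz_spider name from the question, so adjust both to your project.

import os
import subprocess

OUTPUT = 'urls.csv'  # output path from the question; change as needed

# Remove the previous export so Scrapy starts with a fresh file
# instead of appending to the results of the last run.
if os.path.exists(OUTPUT):
    os.remove(OUTPUT)

# Run the crawl exactly as you would on the command line.
subprocess.check_call(['scrapy', 'crawl', 'xyz_spider', '-o', OUTPUT])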
I usually handle custom file exports by running Scrapy as a Python script and opening a file before calling the spider class. This gives more flexibility for handling and formatting csv files, and even for running them as an extension of a web application or running them in the cloud. Something along the lines of the following:
from scrapy.crawler import CrawlerProcess
import csv

if __name__ == '__main__':
    process = CrawlerProcess()

    # Opening the file in write mode truncates it on every run.
    with open('Output.csv', 'wb') as output_file:
        mywriter = csv.writer(output_file)
        process.crawl(Spider_Class, start_urls=start_urls)
        process.start()  # blocks here until the crawl is finished
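Note that the snippet above creates a csv writer but never actually writes any rows with it. One way to close that gap, sketched here as an assumption rather than the answer's original code, is to hook Scrapy's item_scraped signal and write one row per item; MySpider and the name/url item fields are placeholders.

import csv

from scrapy import signals
from scrapy.crawler import CrawlerProcess

def write_item(item, response, spider):
    # Called once for every scraped item; append it as a csv row.
    mywriter.writerow([item.get('name'), item.get('url')])

if __name__ == '__main__':
    process = CrawlerProcess()

    # 'w' truncates Output.csv on every run instead of appending.
    with open('Output.csv', 'w', newline='') as output_file:
        mywriter = csv.writer(output_file)
        crawler = process.create_crawler(MySpider)  # MySpider is a placeholder spider class
        crawler.signals.connect(write_item, signal=signals.item_scraped)
        process.crawl(crawler)
        process.start()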
You can open the file and then close it, which wipes the file's contents:
class RestaurantDetailSpider(scrapy.Spider):
    # Opening the link file in write mode and closing it immediately wipes its previous contents.
    file = open('./restaurantsLink.csv', 'w')
    file.close()

    urls = list(open('./restaurantsLink.csv'))
    urls = urls[1:]  # skip the csv header row
    print "Url List Found : " + str(len(urls))

    name = "RestaurantDetailSpider"
    start_urls = urls

    def safeStr(self, obj):
        try:
            if obj is None:
                return obj
            return str(obj)
        except UnicodeEncodeError as e:
            return obj.encode('utf8', 'ignore').decode('utf8')
        return ""

    def parse(self, response):
        try:
            detail = RestaurantDetailItem()

            HEADING = self.safeStr(response.css('#HEADING::text').extract_first())
            if HEADING is not None:
                if ',' in HEADING:
                    HEADING = "'" + HEADING + "'"
                detail['Name'] = HEADING

            CONTACT_INFO = self.safeStr(response.css('.directContactInfo *::text').extract_first())
            if CONTACT_INFO is not None:
                if ',' in CONTACT_INFO:
                    CONTACT_INFO = "'" + CONTACT_INFO + "'"
                detail['Phone'] = CONTACT_INFO

            ADDRESS_LIST = response.css('.headerBL .address *::text').extract()
            if ADDRESS_LIST is not None:
                ADDRESS = ', '.join([self.safeStr(x) for x in ADDRESS_LIST])
                ADDRESS = ADDRESS.replace(',', '')
                detail['Address'] = ADDRESS

            EMAIL = self.safeStr(response.css('#RESTAURANT_DETAILS .detailsContent a::attr(href)').extract_first())
            if EMAIL is not None:
                EMAIL = EMAIL.replace('mailto:', '')
                detail['Email'] = EMAIL

            TYPE_LIST = response.css('.rating_and_popularity .header_links *::text').extract()
            if TYPE_LIST is not None:
                TYPE = ', '.join([self.safeStr(x) for x in TYPE_LIST])
                TYPE = TYPE.replace(',', '')
                detail['Type'] = TYPE

            yield detail
        except Exception as e:
            print "Error occurred"
            yield None
scrapy crawl RestaurantMainSpider -t csv -o restaurantsLink.csv
This creates the restaurantsLink.csv file, which I then use in my next spider, RestaurantDetailSpider.
So you can run the command below: it deletes the old restaurantsLink.csv and creates a new one, which the spider above reads and which gets populated when we run the spider:
rm restaurantsLink.csv && scrapy crawl RestaurantMainSpider -o restaurantsLink.csv -t csv
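For completeness, here is a minimal sketch of what the link-collecting RestaurantMainSpider might look like. Only the spider name and the single-column csv it feeds come from the answer above; the start URL and the CSS selector are placeholders.

import scrapy

class RestaurantMainSpider(scrapy.Spider):
    name = "RestaurantMainSpider"
    # Placeholder start URL; point this at the real listing page.
    start_urls = ['https://example.com/restaurants']

    def parse(self, response):
        # Placeholder selector for the restaurant detail links on the listing page.
        for href in response.css('a.restaurant::attr(href)').extract():
            # Each yielded dict becomes one row of restaurantsLink.csv
            # when the spider runs with -o restaurantsLink.csv -t csv.
            yield {'url': response.urljoin(href)}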