Ruby: CSV parser tripping over double quotation in my data
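I'm working on a daily scheduled rake task that downloads a CSV which is automatically sent to Dropbox every day, parses it, and saves it to the database. I have no control over how data is entered into the program that generates the CSV report, so I can't prevent double quotes from appearing in some of the data. Is there a way, inside the rake task, to strip them or replace them with single quotes, or to tell the parser about them somehow so that it doesn't throw this error?
Rake task code: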
require 'net/http'
require 'csv'
require 'open-uri'

namespace :fp_import do
  desc "download abc_relations from dropbox, save as csv, create or update record in db"
  task :fp => :environment do
    data = URI.parse("<<file's dropbox link>>").read
    File.open(Rails.root.join('lib/assets', 'fp_relation.csv'), 'w') do |file|
      file.write(data)
    end
    file = Rails.root.join('lib/assets', 'fp_relation.csv')
    CSV.foreach(file) do |row|
      div, fg_style, fg_color, factory, part_style, part_color, comp_code, vendor, design_no, comp_type = row
      fg_sku = fg_style + "-" + fg_color
      part_sku = part_style + "-" + part_color
      relation = FgPart.where('part_sku LIKE ? AND fg_sku LIKE ?', "%#{part_sku}%", "%#{fg_sku}%").exists?
      if relation == false
        FgPart.create(fg_style: fg_style, fg_color: fg_color, fg_sku: fg_sku, factory: factory, part_style: part_style, part_color: part_color, part_sku: part_sku, comp_code: comp_code, comp_type: comp_type, design_no: design_no)
      end
    end
  end
end
There are about 35,000 rows in this CSV file. Below is a sample. You can see the stray double quotes on line 4 of the sample.
Sample data:
"01","502210","018","ZH","5931","001","M","","UPHOLSTERED GLIDER A","RM"
"01","502310","053","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","502310","065","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","502312","424","ZH","25332","NO","O","","UPHOLSTERED GLIDER"AUS"","BAG"
"01","503210","277","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","503310","076","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","506210","018","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","506210","467","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","507610","932","AZ","25332","NO","O","","GLIDER","BAG"
"01","507610","932","AZ","5936","001","M","","GLIDER","RM"
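The CSV is invalid; the quotes should have been escaped. If no other special handling is needed, you can read the file line by line, split each line on commas, and remove the leading and trailing double quote from each value: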
File.foreach(path) do |line|
  # chomp first, otherwise the last column loses its final letter and keeps its closing quote
  columns = line.chomp.split(',').map do |column|
    column[1...-1]
  end
  do_something_with_data(columns)
end
Updated version:
path = File.join(__dir__, 'input.almost_csv')
# File.foreach closes the file when the block finishes
File.foreach(path) do |line|
  # chomp so the last value ends with its closing quote rather than a newline
  values = line.chomp.split(',')
  values = values.map do |value|
    value[1...-1] # Remove leading and trailing double-quote
  end
  div, fg_style, fg_color, factory, part_style, part_color, comp_code, vendor, design_no, comp_type = values
  fg_sku = fg_style + "-" + fg_color
  part_sku = part_style + "-" + part_color
  unless FgPart.where('part_sku LIKE ? AND fg_sku LIKE ?', "%#{part_sku}%", "%#{fg_sku}%").exists?
    FgPart.create(fg_style: fg_style, fg_color: fg_color, fg_sku: fg_sku, factory: factory, part_style: part_style, part_color: part_color, part_sku: part_sku, comp_code: comp_code, comp_type: comp_type, design_no: design_no)
  end
end
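Notes:
- You don't need the @ (instance variables); locally scoped variables are enough.
- If you also want to remove the quotes inside the strings, you can manipulate the value in the map block.
- This only works if the values do not contain the column separator (,).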
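The last note can be relaxed somewhat. As a minimal sketch, assuming every field in the file is wrapped in double quotes and that no value contains the exact three-character sequence quote-comma-quote, the line can be split on that boundary instead of on bare commas. The helper name split_quoted_line is made up for illustration and is not part of the answer above.

```ruby
# Hypothetical helper (not from the original answer): split one raw line into
# values, assuming every field is double-quoted and fields are separated by
# exactly the sequence ",". Commas and stray quotes inside a value survive.
def split_quoted_line(line)
  stripped = line.chomp
  return [] if stripped.empty?

  # Drop the outermost quotes, then split on the quote-comma-quote boundary.
  # The -1 limit keeps trailing empty fields instead of discarding them.
  stripped[1...-1].split('","', -1)
end

row = split_quoted_line(%("01","A, B","","GLIDER"AUS"","BAG"\n))
p row
# => ["01", "A, B", "", "GLIDER\"AUS\"", "BAG"]
```

A value that itself contains quote-comma-quote would still be split in the wrong place, so this is only as reliable as that assumption.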
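The source CSV is malformed; the quotes should have been escaped beforehand.
I would edit the file before parsing it with CSV: remove the quotes between commas and replace the remaining double quotes with single quotes. If you don't want to edit the original file, you can create a new one.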
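def fix_csv(file)
  # Build the output name from the basename so paths with directories also work
  fixed = File.join(File.dirname(file), "fixed_" + File.basename(file))
  File.open(fixed, 'w') do |out|
    File.foreach(file) do |line|
      next if line.strip.empty?    # skip blank lines
      line = line.chomp[1...-1]    # remove beginning and end quotes
      line.gsub!(/","/, ",")       # remove all quotes between commas
      line.gsub!(/"/, "'")         # replace double quotes with single quotes
      out << line + "\n"           # add the line plus a newline to the output
    end
  end
  fixed
end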
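To see what this cleanup does to the problem row from the sample data, the same three steps can be applied to a single string. This is only an illustrative sketch; clean_line is a hypothetical name, not something defined in the answer.

```ruby
# Hypothetical single-line version of the cleanup steps, for illustration only.
def clean_line(line)
  inner = line.chomp[1...-1] || ''        # drop the first and last quote
  inner.gsub('","', ',').gsub('"', "'")   # bare commas between fields, then " becomes '
end

sample = %("01","502312","424","ZH","25332","NO","O","","UPHOLSTERED GLIDER"AUS"","BAG"\n)
cleaned = clean_line(sample)
puts cleaned
# prints: 01,502312,424,ZH,25332,NO,O,,UPHOLSTERED GLIDER'AUS',BAG
```

Because the fields are no longer quoted afterwards, a value containing a comma would be read as two columns, so this approach assumes the data has no commas inside values.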
If you want to modify the same CSV file, you can do it like this:
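require 'tempfile'
require 'fileutils'

def modify_csv(file)
  temp_file = Tempfile.new('temp')
  begin
    File.foreach(file) do |line|
      next if line.strip.empty?    # skip blank lines
      line = line.chomp[1...-1]    # remove beginning and end quotes
      line.gsub!(/","/, ",")       # remove all quotes between commas
      line.gsub!(/"/, "'")         # replace double quotes with single quotes
      temp_file << line + "\n"
    end
    temp_file.close
    # Replace the original file with the cleaned copy
    FileUtils.mv(temp_file.path, file)
  ensure
    temp_file.close
    temp_file.unlink
  end
end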
This is explained here, if you want to take a look. It will fix or clean up your original CSV file.