Ruby: CSV parser tripping over double quotation in my data

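I'm working on a daily scheduled rake task that downloads a CSV which is automatically sent to Dropbox every day, parses it, and saves it to the database. I have no control over how the data is entered into the program that generates this CSV report, so I can't avoid double quotes appearing in some of the data. Is there a way to strip them or replace them with single quotes in the rake task, or to tell the parser about them somehow so that it doesn't throw this error?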

Rake task code:

require 'net/http'
require 'csv'
require 'open-uri'

namespace :fp_import do
  desc "download abc_relations from dropbox, save as csv, create or update record in db"
  task :fp => :environment do
    data = URI.parse("<<file's dropbox link>>").read

    File.open(Rails.root.join('lib/assets', 'fp_relation.csv'), 'w') do |file|
      file.write(data)
    end

    file = Rails.root.join('lib/assets', 'fp_relation.csv')

    CSV.foreach(file) do |row|
      div, fg_style, fg_color, factory, part_style, part_color, comp_code, vendor, design_no, comp_type = row
      fg_sku = fg_style + "-" + fg_color
      part_sku = part_style + "-" + part_color

      relation = FgPart.where('part_sku LIKE ? AND fg_sku LIKE ?', "%#{part_sku}%", "%#{fg_sku}%").exists?
      if relation == false
        FgPart.create(fg_style: fg_style, fg_color: fg_color, fg_sku: fg_sku, factory: factory, part_style: part_style, part_color: part_color, part_sku: part_sku, comp_code: comp_code, comp_type: comp_type, design_no: design_no)
      end
    end
  end
end

There are about 35,000 rows in this CSV file. Below is a sample. You can see the unescaped double quotes on line 4 of the sample.

Sample data:

"01","502210","018","ZH","5931","001","M","","UPHOLSTERED GLIDER A","RM"
"01","502310","053","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","502310","065","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","502312","424","ZH","25332","NO","O","","UPHOLSTERED GLIDER"AUS"","BAG"
"01","503210","277","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","503310","076","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","506210","018","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","506210","467","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"
"01","507610","932","AZ","25332","NO","O","","GLIDER","BAG"
"01","507610","932","AZ","5936","001","M","","GLIDER","RM"

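For reference, the failure can be reproduced outside the rake task with two of the sample rows. This is only a minimal sketch: `try_parse` is a hypothetical helper name, and the exact error message depends on the version of the csv library, so it is printed rather than assumed.

```ruby
require 'csv'

good = '"01","502310","053","ZH","25332","NO","O","","UPHOLSTERED GLIDER","BAG"'
bad  = '"01","502312","424","ZH","25332","NO","O","","UPHOLSTERED GLIDER"AUS"","BAG"'

# Hypothetical helper: returns the parsed row, or nil when the line is not valid CSV.
def try_parse(line)
  CSV.parse_line(line)
rescue CSV::MalformedCSVError => e
  warn "Skipping malformed line: #{e.message}"
  nil
end

p try_parse(good) # an array of 10 strings
p try_parse(bad)  # nil, because the quote before AUS closes the field too early
```

Rescuing the error like this only skips the bad row; it does not recover its data.
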
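The CSV is invalid; the quotes should be escaped. If no other special handling is needed, you can read the file line by line, split on , and remove the leading/trailing ":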

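File.foreach(path) do |line|
  # chomp first, otherwise the trailing newline is removed instead of the last quote
  columns = line.chomp.split(',').map do |column|
    column[1...-1] # drop the leading and trailing double quote
  end
  do_something_with_data(columns)
end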

Updated version:

path = File.join(__dir__, 'input.almost_csv')
File.foreach(path) do |line| # File.foreach closes the file when the block finishes
  values = line.chomp.split(',') # chomp so the newline does not stick to the last value
  values = values.map do |value|
    value[1...-1] # Remove leading and trailing double-quote
  end

  div, fg_style, fg_color, factory, part_style, part_color, comp_code, vendor, design_no, comp_type = values
  fg_sku = fg_style + "-" + fg_color
  part_sku = part_style + "-" + part_color

  if !FgPart.where('part_sku LIKE ? AND fg_sku LIKE ?', "%#{part_sku}%", "%#{fg_sku}%").exists?
    FgPart.create(fg_style: fg_style, fg_color: fg_color, fg_sku: fg_sku, factory: factory, part_style: part_style, part_color: part_color, part_sku: part_sku, comp_code: comp_code, comp_type: comp_type, design_no: design_no)
  end

end

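Notes:

  • You don't need @; locally scoped variables are enough.
  • If you also want to remove the quotes inside the strings, you can manipulate the values in the map block.
  • This only works if the values contain no column separator (,).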
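
The limitation in the last note can be loosened a little. As a sketch only, and assuming every field in the file is wrapped in double quotes (as in the sample data), one could split on the three-character sequence `","` instead of on the bare comma. `split_quoted_line` is a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: split a fully quoted line on the "," sequence.
# Assumption: every field is quoted, so "," only occurs between fields.
def split_quoted_line(line)
  inner = line.chomp[1...-1].to_s # drop the outermost pair of quotes
  inner.split('","', -1)          # -1 keeps trailing empty fields
end

line = %q("01","502312","424","ZH","25332","NO","O","","UPHOLSTERED GLIDER"AUS"","BAG") + "\n"
fields = split_quoted_line(line)

p fields.length # 10
p fields[8]     # the description, with its inner quotes still in place
```

A comma inside a value no longer splits the field, and the inner quotes survive untouched. A value that itself contains the exact sequence `","` would still be split in the wrong place.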

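The source CSV is malformed; the quotes should have been escaped beforehand.

I would edit the file before parsing it with CSV: remove the quotes between the commas and replace the remaining double quotes with single quotes. If you don't want to edit the original file, you can create a new one.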

def fix_csv(file)
  fixed = "fixed_" + file # assumes file is a bare file name, not a path with directories
  File.open(fixed, 'w') do |out|
    File.foreach(file) do |line|
      line = line.chomp[1...-1].to_s # remove beginning and end quotes
      line = line.gsub(/","/, ",")   # remove all quotes between commas
      line = line.gsub(/"/, "'")     # replace remaining double quotes with single quotes
      out << line + "\n"             # add the line plus a newline to the output
    end
  end

  fixed
end

If you want to modify the same CSV file, you can do this:

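require 'tempfile'
require 'fileutils'

def modify_csv(file)
  temp_file = Tempfile.new('temp')
  begin
    File.foreach(file) do |line|
      line = line.chomp[1...-1].to_s # remove beginning and end quotes
      line = line.gsub(/","/, ",")   # remove all quotes between commas
      line = line.gsub(/"/, "'")     # replace remaining double quotes with single quotes
      temp_file << line + "\n"
    end
    temp_file.close
    FileUtils.mv(temp_file.path, file) # replace the original with the cleaned copy
  ensure
    temp_file.close
    temp_file.unlink
  end
end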

This is explained here if you want to take a look; it fixes, or cleans up, your original CSV file in place.
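
As a quick check of the cleanup rules used by both functions, the sketch below applies them to the malformed sample row and hands the result to the standard CSV parser. `clean_line` is a hypothetical helper name introduced only for this illustration.

```ruby
require 'csv'

# Hypothetical helper applying the same three cleanup steps to a single line.
def clean_line(line)
  line = line.chomp[1...-1].to_s # drop the outermost quotes
  line = line.gsub('","', ',')   # drop the quotes around each separator
  line.gsub('"', "'")            # turn any remaining double quote into a single quote
end

raw = '"01","502312","424","ZH","25332","NO","O","","UPHOLSTERED GLIDER"AUS"","BAG"'

cleaned = clean_line(raw)
row = CSV.parse_line(cleaned)

puts cleaned # 01,502312,424,ZH,25332,NO,O,,UPHOLSTERED GLIDER'AUS',BAG
p row[8]     # "UPHOLSTERED GLIDER'AUS'"
p row[7]     # nil
```

Two things to keep in mind. Once the quotes are gone, an empty field is parsed as nil rather than "", so string concatenation on such a column needs a `to_s`. And because the fields are no longer quoted, a comma inside a value would shift the columns.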