How do I parse a tab-delimited line that contains a quote?

I'm using Ruby 2.4. How do I parse a tab-delimited line that contains a quote character? This is what currently happens:

2.4.0 :003 > line = "11\tDave\tO\"malley"
 => "11\tDave\tO\"malley" 
2.4.0 :004 > CSV.parse(line, col_sep: "\t")
CSV::MalformedCSVError: Illegal quoting in line 1.
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1912:in `block (2 levels) in shift'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `each'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `block in shift'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `loop'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `shift'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1770:in `each'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `to_a'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `read'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1324:in `parse'
    from (irb):4
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>'

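This example illustrates my point, but I can't easily control the input. So although the answer might be "Remove all quotes from the string before parsing," I'd like to preserve as much of the data as possible.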

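If you're trying to adhere to the CSV standard, then this is a malformed document. Instead you might just brute-force it and pray that the data itself doesn't contain tabs: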

line.split(/\t/)
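
A minimal sketch of that brute-force route, assuming no field ever contains a literal tab. `split_tsv` is a hypothetical helper name, not something from the question. One detail worth handling: `String#split` drops trailing empty fields unless a negative limit is passed.

```ruby
# Hypothetical helper: split one tab-delimited line with no quote handling at all.
def split_tsv(line)
  # chomp removes a trailing newline; the -1 limit keeps trailing empty fields.
  line.chomp.split("\t", -1)
end

fields = split_tsv("11\tDave\tO\"malley")
p fields                  # => ["11", "Dave", "O\"malley"]
p split_tsv("11\t\t\n")   # => ["11", "", ""]
```

The stray quote is kept as ordinary data, which is what the question asks for, but a quoted field containing a tab would be cut in two.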

Where a CSV parsing library comes in handy is when you're dealing with data like this:

"1\t2\t\"3a\t3b\"\t4"

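To make the difference concrete, here is a small comparison on that same string. It is only an illustration: the variable names are made up, and it uses `CSV.parse_line`, which returns a single row rather than an array of rows.

```ruby
require 'csv'

line = "1\t2\t\"3a\t3b\"\t4"

# A plain split knows nothing about quotes, so the quoted field is cut in two.
naive = line.split("\t")

# The CSV library treats the quoted section as one field and strips the quotes.
parsed = CSV.parse_line(line, col_sep: "\t")

p naive    # => ["1", "2", "\"3a", "3b\"", "4"]
p parsed   # => ["1", "2", "3a\t3b", "4"]
```
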
Update: If you're prepared to abuse the CSV library a little, then you can do this:

CSV.parse("11\tDave\tO\"malley", col_sep: "\t", quote_char: "\0")

This basically kills quote detection, so if other data depends on quotes being handled properly, this might not work.
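
A short sketch of that trade-off, assuming the input never contains a NUL byte (the character used here as a stand-in quote character). The variable names are illustrative only.

```ruby
require 'csv'

# Assumption: no field contains "\0", so no quote character is ever matched.
opts = { col_sep: "\t", quote_char: "\0" }

# The stray double quote is now ordinary data.
stray = CSV.parse_line("11\tDave\tO\"malley", **opts)

# The cost: properly quoted fields are no longer recognized either.
quoted = CSV.parse_line("1\t2\t\"3a\t3b\"\t4", **opts)

p stray    # => ["11", "Dave", "O\"malley"]
p quoted   # => ["1", "2", "\"3a", "3b\"", "4"]
```

With quoting disabled the result is effectively the same as a plain split on tabs, so this only helps when the input never relies on quoted fields.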

"11\tDave\tO\"malley" is not valid CSV data. Oddly enough, the answer is to double the embedded double quote and to wrap each element in double quotes:

2.3.1 :001 > require 'csv'
 => true 
2.3.1 :002 > line = "\"11\"\t\"Dave\"\t\"O\"\"malley\""
 => "\"11\"\t\"Dave\"\t\"O\"\"malley\"" 
2.3.1 :003 > puts line # for clarity
"11"    "Dave"  "O""malley"
 => nil 
2.3.1 :004 > CSV.parse(line, col_sep: "\t")
 => [["11", "Dave", "O\"malley"]]
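
If you do control the writing side, a sketch like the following lets the CSV library apply that escaping for you rather than building the string by hand. It assumes the default row separator, and the variable names are illustrative. Note that `CSV.generate_line` only quotes the fields that need it, so its output differs slightly from the hand-written line above while parsing back to the same row.

```ruby
require 'csv'

row = ["11", "Dave", "O\"malley"]

# Only the field containing a quote is wrapped, and its quote is doubled.
line = CSV.generate_line(row, col_sep: "\t")

# Parsing the generated line returns the original values.
round_trip = CSV.parse_line(line, col_sep: "\t")

p line         # => "11\tDave\t\"O\"\"malley\"\n"
p round_trip   # => ["11", "Dave", "O\"malley"]
```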