Multiple double quotes inside quoted strings in a CSV file

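I have a CSV file in which every field is enclosed in quotes.

Some fields can contain multiple double quotes of their own. I want to escape each of those with an additional double quote.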

","ABC "XYZ" PQRS","
","ABC "XYZ"","
","ABC "A" "B" TEST","
","ABC 2.5" "C" Test","

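I got help from link and was able to cover the case where the content contains a single double quote, using the regex [regex]$r='(","[^"]+"[^"]+?",")'. However, I am stuck on the case where the content contains multiple double quotes.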

[regex]$r='(","[^"]+"[^"]+"",")' # Not working
get-content C:\Projects\MyProject\testRegexFordoublequotes.csv | foreach {
  #save each line to a variable to make it easier to track
  $line=$_
  #look for a regex match
  $find=$r.matches($line)
  if ($find[0].Success) {
      foreach ($match in $find) {
        #the original string we matched on
        $found=$match.value
        #replace the substring
        $replace= '","'+  $found.Trim('","').Replace('""','"').Replace('"','""')+ '","'
        #replace the full string and write to the pipeline
        $line -replace $found,$replace
      } #foreach
  } #if
  else {
        #no match so write the line to pipeline
        $line
  }
} | Set-Content C:\Projects\MyProject\modified.csv -Force

Could you help me define a regex that handles multiple double quotes inside a field?

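It is probably easier to search for the valid delimiters (e.g. "\s*,\s*") and split each line into fields than to correct each (invalid) single double quote in place. Once the line is split, replace every double quote inside a field with two double quotes.
Then rebuild the fields into a record by enclosing each field in double quotes and joining them with the CSV (comma) delimiter.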

Input

$Csv = @'
"Field","ABC "XYZ" PQRS","Field"
"Field","ABC "XYZ"","Field"
"Field","ABC "A" "B" TEST","Field"
"Field","ABC 2.5" "C" Test","Field"
'@ -Split '[\r\n]+'

Script

$Csv | # replace with: get-content .\testRegexFordoublequotes.csv |
Foreach-Object {
    $Line = $_ -Replace '^\s*"' -Replace '"\s*$' # Strip outer double quotes
    $Fields = $Line -Split '"\s*,\s*"'           # Split line into fields
    $Fields = $Fields -Replace '"', '""'         # Escape each " in each field
    '"' + ($Fields -Join '","') + '"'            # Rejoin the fields to line
} # append: | Set-Content .\modified.csv -Force
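
For readers who prefer JavaScript, the same split, escape, and rejoin steps can be sketched as follows. This is only an illustration, assuming every field is quoted and that the sequence quote, comma, quote never occurs inside a field; the name escapeBySplitting is made up for this example.

```javascript
// Hypothetical helper: repairs one line of a CSV file in which every field is quoted.
// Assumption: "," (quote, comma, quote) only ever appears between fields.
function escapeBySplitting(line) {
  const inner = line
    .replace(/^\s*"/, "")   // strip the opening outer quote
    .replace(/"\s*$/, "");  // strip the closing outer quote
  const fields = inner
    .split(/"\s*,\s*"/)                          // split on the valid delimiters
    .map((field) => field.replace(/"/g, '""'));  // double each embedded quote
  return '"' + fields.join('","') + '"';         // rebuild the record
}

console.log(escapeBySplitting('"Field","ABC 2.5" "C" Test","Field"'));
// prints: "Field","ABC 2.5"" ""C"" Test","Field"
```

Applied to the four input lines above, this sketch yields the same records as the PowerShell script.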

Output

"Field","ABC ""XYZ"" PQRS","Field"
"Field","ABC ""XYZ""","Field"
"Field","ABC ""A"" ""B"" TEST","Field"
"Field","ABC 2.5"" ""C"" Test","Field"

Based on our conversation in the comments on the post, these files are inconsistent CSV files, so a CSV parser will not help.

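Note that if a single cell happens to contain something like some text","more text, you have an undefined case. Because the quotes are not escaped, that cell will be treated as two cells.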
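
A small JavaScript sketch of that undefined case, under the assumption that cells are separated by quote, comma, quote. The sample record is made up for illustration: it is meant to have three cells, but nothing in the text lets a program recover that.

```javascript
// A record intended to have three cells; the middle cell is meant to
// contain the literal text: some text","more text
const line = '"Start","some text","more text","End"';

// After dropping the outer quotes, splitting on the delimiter cannot tell
// the embedded "," apart from a real cell boundary.
const cells = line.slice(1, -1).split('","');

console.log(cells.length); // prints: 4
console.log(cells);
```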

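Now to the regex. You could come up with a regex that uses lookahead and lookbehind, but I think it is easier to blindly double all quotes and then clean up the unwanted ones, i.e. at the start and end of the line and between cells.

I am not familiar with PowerShell, but here is some JavaScript/pseudo code that you can easily convert to PowerShell syntax. I am using a single line that contains all the test cases you listed; you would iterate over the lines in the file: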

/* assume $line is:
"Start","ABC "XYZ" PQRS","ABC "XYZ"","ABC "A" "B" TEST","ABC 2.5" "C" Test","End"
*/

$fixed = $line.replace(/"/g, '""')
              .replace(/"",""/g, '","')
              .replace(/^""/, '"')
              .replace(/""$/, '"')

/* $fixed is:
"Start","ABC ""XYZ"" PQRS","ABC ""XYZ""","ABC ""A"" ""B"" TEST","ABC 2.5"" ""C"" Test","End"
*/

Explanation:

  • .replace(/"/g, '""') - blindly doubles all quotes
  • .replace(/"",""/g, '","') - restores "","" to ","
  • .replace(/^""/, '"') - restores "" at the start of the line to "
  • .replace(/""$/, '"') - restores "" at the end of the line to "
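
To apply these four replacements to a file line by line, they could be wrapped in a function along these lines. This is a rough sketch: the name doubleThenCleanUp and the lines array (standing in for the lines read from the file) are made up for this example.

```javascript
// Hypothetical helper wrapping the four replacements explained above.
function doubleThenCleanUp(line) {
  return line
    .replace(/"/g, '""')       // blindly double every quote
    .replace(/"",""/g, '","')  // restore the separators between cells
    .replace(/^""/, '"')       // restore the opening quote of the line
    .replace(/""$/, '"');      // restore the closing quote of the line
}

// Made-up stand-in for the lines read from the file.
const lines = [
  '"Field","ABC "XYZ" PQRS","Field"',
  '"Field","ABC "XYZ"","Field"',
];

const fixedLines = lines.map(doubleThenCleanUp);
console.log(fixedLines.join("\n"));
// prints:
// "Field","ABC ""XYZ"" PQRS","Field"
// "Field","ABC ""XYZ""","Field"
```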

You can run the following to preview the changes:

(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""'

To save the new content, simply pipe the result to Set-Content:

(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""' |
    Set-Content file.csv

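Explanation:

(?<!^|",) is a negative lookbehind: it asserts that the quote is not at the start of the line (^) and is not preceded by ",. (?!,"|$) is a negative lookahead: it asserts that the quote is not followed by ," or by the end of the line ($). When both lookaround conditions hold, the " is replaced with "".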
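
The same pattern can be tried outside PowerShell. Below is a minimal JavaScript sketch, assuming a runtime that supports lookbehind assertions (ES2018 or later) and that each line is processed separately, so that ^ and $ refer to the start and end of that line. The variable names are made up for this example.

```javascript
// Same pattern as in the PowerShell one-liner, written as a JavaScript regex
// literal with the global flag so every qualifying quote is replaced.
const embeddedQuote = /(?<!^|",)"(?!,"|$)/g;

const line = '"Field","ABC "A" "B" TEST","Field"';
console.log(line.replace(embeddedQuote, '""'));
// prints: "Field","ABC ""A"" ""B"" TEST","Field"
```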