CSV formatting - strip qualifier from specific fields
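I apologize if this has been asked before, but I could not find a similar question.
The CSV output I receive uses " as a text qualifier around every field. I am looking for an elegant way to reformat these files so that only specific (alphanumeric) fields keep the qualifier.
An example of what I receive: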
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"
The output I want looks like this:
"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25
Any suggestions or help would be greatly appreciated!
As requested below, here are the first five lines of a sample file:
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","Recruit-Navy,XL#28-75","13.25","13.25"
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","High Peak-Navy,XL#21-18","36.75","36.75"
"TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384","","10/15/14","","1","Patriot-Red,L#26-35","25.50","25.50"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-Red/Gray,S#23-52","19.75","19.75"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-White/Gray,XL#23-56","19.75","19.75"
Note that this is only a sample; not all files will be for Tri-Mountain.
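This question raises the difficulty of splitting quoted, comma-separated fields when a field itself contains an embedded comma (e.g. "Recruit-Navy,XL#28-75"). From the shell (while read, awk, etc.) there are many ways to approach it, but most of them eventually stumble over the embedded comma.
One approach that works is a brute-force character-by-character parse of the line (below). It is not an elegant solution, but it will get you started. An alternative to a shell program is a compiled language such as C, where character handling is a bit more robust. Leave a comment if you have any questions.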
#!/bin/bash

declare -a arr
declare -i ct=0

line='"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"'

## fill array with separated fields (preserving commas inside fields)
## a comma next to a double quote is treated as a field separator and is
## emitted as a newline, so fields that contain spaces survive the split
## (mapfile needs bash 4 or later)
mapfile -t arr < <(
    for ((i = 0; i < ${#line}; i++)); do
        c="${line:i:1}"
        if [[ $c == ',' ]] && { [[ ${line:i+1:1} == '"' ]] || [[ ${line:i-1:1} == '"' ]]; }; then
            printf '\n'
        else
            printf '%s' "$c"
        fi
    done
    printf '\n'
)

## remove quotes from numeric fields (quoted fields that start with a digit)
for i in "${arr[@]}"; do
    if [[ "${i:0:1}" == '"' ]] && [[ ${i:1:1} == [0123456789] ]]; then
        arr[ct]="${i//\"/}"
    else
        arr[ct]="$i"
    fi
    if test "$ct" -eq 0 ; then
        printf "%s" "${arr[ct]}"
    else
        printf ",%s" "${arr[ct]}"
    fi
    ((ct++))
done

printf "\n"
exit 0
Output
$ bash sepquoted.sh
"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25
Original
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"
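Since you did not specify an OS or language, here is a PowerShell version.
Because your CSV files are non-standard, I abandoned my earlier attempt with Import-CSV and switched to raw file processing. It should also be considerably faster.
The regex for splitting the CSV comes from this question: How to split a string by comma ignoring comma in double quotes
Save this script as StripQuotes.ps1. It accepts the following parameters:
- InPath: folder to read CSVs from. If not specified, the current directory is used.
- OutPath: folder to save processed CSVs to. It is created if it does not exist.
- Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list other valid encodings for your system in the PowerShell console like this: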
[System.Text.Encoding]::GetEncodings()
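- Verbose: the script tells you what is going on via Write-Verbose messages.
Example (run from the PowerShell console).
Process all CSVs in the folder C:\CSVs_are_here, save the processed CSVs to the folder C:\Processed_CSVs, and be verbose: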
.\StripQuotes.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Verbose
StripQuotes.ps1 script:
Param
(
[Parameter(ValueFromPipelineByPropertyName = $true)]
[ValidateScript({
if(!(Test-Path -LiteralPath $_ -PathType Container))
{
throw "Input folder doesn't exist: $_"
}
$true
})]
[ValidateNotNullOrEmpty()]
[string]$InPath = (Get-Location -PSProvider FileSystem).Path,
[Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
[ValidateScript({
if(!(Test-Path -LiteralPath $_ -PathType Container))
{
try
{
New-Item -ItemType Directory -Path $_ -Force -ErrorAction Stop
}
catch
{
throw "Can't create output folder: $_"
}
}
$true
})]
[ValidateNotNullOrEmpty()]
[string]$OutPath,
[Parameter(ValueFromPipelineByPropertyName = $true)]
[string]$Encoding = 'Default'
)
if($Encoding -eq 'Default')
{
# Set default encoding
$FileEncoding = [System.Text.Encoding]::Default
}
else
{
# Try to set user-specified encoding
try
{
$FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
}
catch
{
throw "Not valid encoding: $Encoding"
}
}
$DQuotes = '"'
$Separator = ','
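# Matches a separator only when the rest of the line contains balanced
# double quotes, i.e. the separator is outside a quoted field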
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Matches a single code point in the category "letter".
$AlphaNumRegex = '\p{L}'
Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"
# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
ForEach-Object {
Write-Verbose "Current file: $($_.FullName)"
$InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
$_.FullName,
$FileEncoding
) -ErrorAction Stop
Write-Verbose 'Created new StreamReader'
$OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
(Join-Path -Path $OutPath -ChildPath $_.Name),
$false,
$FileEncoding
) -ErrorAction Stop
Write-Verbose 'Created new StreamWriter'
Write-Verbose 'Processing file...'
while(($line = $InFile.ReadLine()) -ne $null)
{
$tmp = $line -split $SplitRegex |
ForEach-Object {
# Strip double quotes, if any
$item = $_.Trim($DQuotes)
if($_ -match $AlphaNumRegex)
{
# If field has at least one letter - wrap in quotes
$DQuotes + $item + $DQuotes
}
else
{
# Else, pass it as is
$item
}
}
# Write line to the new CSV file
$OutFile.WriteLine($tmp -join $Separator)
}
Write-Verbose "Finished processing file: $($_.FullName)"
Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"
# Close open files and cleanup objects
$OutFile.Flush()
$OutFile.Close()
$OutFile.Dispose()
$InFile.Close()
$InFile.Dispose()
}
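For readers on a Unix shell, the rule the script applies (keep the qualifier only on fields that contain a letter) can be roughly approximated with a single sed substitution. This is a sketch and not a port of the script above: the function name strip_nonalpha_quotes is made up for illustration, [[:alpha:]] depends on the locale rather than matching every Unicode letter like \p{L}, and it assumes fields do not contain doubled quotes. Unlike the PowerShell script, it deliberately leaves quoted fields that contain a comma alone, so a value such as "1,234" stays quoted.

```shell
#!/bin/sh
# Hypothetical helper: removes the qualifiers around quoted fields that
# contain no letter, no comma and no embedded double quote.
strip_nonalpha_quotes() {
    sed -E 's/"([^",[:alpha:]]*)"/\1/g'
}

# Example call with one of the sample lines from the question
printf '%s\n' '"TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384","","10/15/14","","1","Patriot-Red,L#26-35","25.50","25.50"' |
    strip_nonalpha_quotes
```

As in the PowerShell script, empty quoted fields come out as empty unquoted fields, so the sample line becomes "TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384",,10/15/14,,1,"Patriot-Red,L#26-35",25.50,25.50.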