More-efficient way to modify a CSV file's content
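I'm trying to remove some of the cruft that SSMS 2012 generates when a query's results are exported as CSV. For example, it writes the word 'NULL' for null values and adds milliseconds to datetime values: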
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456
NULL,NULL,NULL,0
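Unfortunately, Excel doesn't automatically format datetime values with fractional seconds correctly, which leads to confusion among customers ('What happened to the date field that I requested?') and more work for me (I have to convert the CSV to XLSX and format the columns correctly before distributing it).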
The goal is to strip the NULL and .000 values from the CSV file:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
Excel will open this file and format it correctly, without needing further technical assistance.
To this end, I wrote:
Function Invoke-CsvCleanser {
    [CmdletBinding()]
    Param(
        [parameter(Mandatory=$true)]
        [String]
        $Path,
        [switch]
        $Nulls,
        [switch]
        $Milliseconds
    )
    PROCESS {
        # open the file
        $data = Import-Csv $Path
        # process each row
        $data | Foreach-Object {
            # process each column
            Foreach ($property in $_.PSObject.Properties) {
                # if column contains 'NULL', replace it with ''
                if ($Nulls -and ($property.Value -eq 'NULL')) {
                    $property.Value = ''
                }
                # if column contains a date/time value, remove the trailing milliseconds
                elseif ( $Milliseconds -and (IsDate $property.Value) ) {
                    # escape the dot so that only a literal '.000' suffix is removed
                    $property.Value = $property.Value -replace '\.000$', ''
                }
            }
        }
        # save file
        $data | Export-Csv -Path $Path -NoTypeInformation
    }
}
function IsDate($object) {
    [Boolean]($object -as [DateTime])
}
PS> Invoke-CsvCleanser 'C:\Users\Foobar\Desktop[=12=]00.csv' -Nulls -Milliseconds
This works fine for small files, but it is inefficient for large ones. Ideally, Invoke-CsvCleanser would make use of the pipeline.
Is there a better way to do this?
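Import-Csv creates an object for every row, and the function above collects them all in memory before exporting, so it is slow on large files. Here is a script adapted from my answer to another question. It processes the files as raw text, so it should be significantly faster. NULLs and milliseconds are matched and replaced using regular expressions. The script can also convert CSVs in bulk.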
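To make the matching and replacing concrete, here is a minimal sketch that applies the two patterns to a handful of field values. The sample values are made up, and the '$1' replacement (which keeps the captured hh:mm:ss part) is an assumption of this sketch.

```powershell
# Patterns for whole-field NULLs and for a time followed by three fractional digits
$NullRegex = '^NULL$'
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

# Made-up sample fields; 'NULLABLE' shows that only an exact 'NULL' is blanked
$fields = 'NULL', '2015-05-01 23:00:00.000', 'NULLABLE', '10.3456'

$cleaned = $fields | ForEach-Object {
    # '$1' (single-quoted) re-inserts the first capture group, i.e. the time without milliseconds
    ($_ -replace $NullRegex, '') -replace $MillisecondsRegex, '$1'
}

$cleaned -join ','
# ,2015-05-01 23:00:00,NULLABLE,10.3456
```

The anchors in '^NULL$' matter: without them, text that merely contains NULL would be altered too.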
The regex used to split the CSV lines comes from this question: How to split a string by comma ignoring comma in double quotes
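As a quick illustration of what that split regex does, the sketch below builds it the same way the script does and runs it against a made-up row with a comma inside a quoted field. It is only a demonstration on one line, not a full CSV parser (fields containing line breaks, for instance, are out of scope).

```powershell
$DQuotes = '"'
$Separator = ','
# Split on a separator only if the rest of the line contains balanced double quotes,
# which means the separator itself is not inside a quoted field
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

# Made-up sample row: the second field contains a comma inside quotes
$line = '2015-05-01,"LOREM, IPSUM",10.3456'
$parts = $line -split $SplitRegex

$parts.Count   # 3
$parts[1]      # "LOREM, IPSUM"
```

A plain `$line -split ','` would return four pieces here, cutting the quoted field in two.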
Save this script as Invoke-CsvCleanser.ps1. It accepts the following parameters:
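- InPath: folder to read CSVs from. If not specified, the current directory is used.
- OutPath: folder to save processed CSVs to. It is created if it doesn't exist.
- Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list the other valid encodings for your system in the PowerShell console like this:
[System.Text.Encoding]::GetEncodings()
- DoubleQuotes: switch; if specified, surrounding double quotes are stripped from values.
- Nulls: switch; if specified, NULL strings are stripped from values.
- Milliseconds: switch; if specified, .000 strings are stripped from values.
- Verbose: the script tells you what's going on via Write-Verbose messages.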
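For the Encoding parameter, the following sketch shows how to look up a name that the script can resolve. The value 'utf-8' is only an example. The set of encodings returned depends on the system and on the .NET runtime that PowerShell runs on.

```powershell
# List the encodings known to this system, with the names that GetEncoding() accepts
[System.Text.Encoding]::GetEncodings() | Select-Object -Property Name, CodePage

# Resolve an encoding by name, as the script does with the value of -Encoding
$FileEncoding = [System.Text.Encoding]::GetEncoding('utf-8')
$FileEncoding.WebName   # utf-8
```

Any value from the Name column can then be passed to the script, for example -Encoding 'utf-8'.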
Example:
Process all CSVs in the folder C:\CSVs_are_here, strip NULLs and milliseconds, save the processed CSVs to the folder C:\Processed_CSVs, and be verbose:
.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Nulls -Milliseconds -Verbose
Invoke-CsvCleanser.ps1 script:
Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,
    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                # -ErrorAction Stop makes a failure catchable; Out-Null keeps the output to $true only
                New-Item -ItemType Directory -Path $_ -Force -ErrorAction Stop | Out-Null
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',
    [switch]$Nulls,
    [switch]$Milliseconds,
    [switch]$DoubleQuotes
)
if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}
$DQuotes = '"'
$Separator = ','
# Regex to split a line on separators that are not inside double quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Regex to match NULL
$NullRegex = '^NULL$'
# Regex to match milliseconds: 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'
Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"
# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'
        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'
        Write-Verbose 'Processing file...'
        while($null -ne ($line = $InFile.ReadLine()))
        {
            $tmp = $line -split $SplitRegex |
                ForEach-Object {
                    # Strip surrounding quotes
                    if($DoubleQuotes)
                    {
                        $_ = $_.Trim($DQuotes)
                    }
                    # Strip NULL strings
                    if($Nulls)
                    {
                        $_ = $_ -replace $NullRegex, ''
                    }
                    # Strip milliseconds, keeping the captured hh:mm:ss part
                    if($Milliseconds)
                    {
                        $_ = $_ -replace $MillisecondsRegex, '$1'
                    }
                    # Output current object to pipeline
                    $_
                }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }
        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"
        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()
        $InFile.Close()
        $InFile.Dispose()
    }
Result:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
It would be interesting to see if one could pass lambdas as a way
to make the file processing more flexible. Each lambda would perform a
specific activity (removing NULLs, upper-casing, normalizing text,
etc.)
This version gives you full control over the CSV processing. Just pass scriptblocks to the Action parameter in the order you want them to be executed.
Example: strip NULLs, strip milliseconds, and then strip double quotes.
.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Action {$_ = $_ -replace '^NULL$', '' }, {$_ = $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1'}, {$_ = $_.Trim('"')}
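Since the actions run in the order given, the order can change the result. The sketch below shows this on a single value. Note that Invoke-FieldAction and the param($v) style are hypothetical names and a simplification for illustration: here each scriptblock takes the value as an argument and returns the new one, whereas the script itself dot-sources scriptblocks that assign to $_.

```powershell
# Hypothetical helper, not part of Invoke-CsvCleanser.ps1
function Invoke-FieldAction {
    param(
        [string]$Value,
        [scriptblock[]]$Action
    )
    foreach ($scriptblock in $Action) {
        # Each scriptblock receives the current value and returns the new one
        $Value = & $scriptblock $Value
    }
    $Value
}

$stripNull   = { param($v) $v -replace '^NULL$', '' }
$stripQuotes = { param($v) $v.Trim('"') }

# NULL check first: '"NULL"' still has its quotes, so '^NULL$' does not match
$first = Invoke-FieldAction -Value '"NULL"' -Action $stripNull, $stripQuotes
# Quotes first: the value becomes 'NULL' and is then blanked
$second = Invoke-FieldAction -Value '"NULL"' -Action $stripQuotes, $stripNull

"[$first]"    # [NULL]
"[$second]"   # []
```

So if the exported files quote their values, stripping the double quotes before the NULL check is probably the order you want.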
Invoke-CsvCleanser.ps1 with "lambdas":
Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,
    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',
    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [scriptblock[]]$Action
)
if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}
$DQuotes = '"'
$Separator = ','
# Regex to split a line on separators that are not inside double quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"
# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'
        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'
        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                ForEach-Object {
                    # Process each item
                    foreach($scriptblock in $Action) {
                        . $scriptblock
                    }
                    # Output current object to pipeline
                    $_
                }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }
        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"
        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()
        $InFile.Close()
        $InFile.Dispose()
    }