How can I optimize this PowerShell script that converts JSON to CSV?
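I have a very large JSON Lines file with 4,000,000 rows, and I need to convert several events from every row. The resulting CSV file contains 15,000,000 rows. How can I optimize this script?
I'm using PowerShell Core 7, and it takes around 50 hours to complete the conversion.
My PowerShell script: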
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$output = @()
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
    Remove-Item -path $Exportfile
}
foreach ($line in [System.IO.File]::ReadLines($Importfile, $encoding)) {
    $json = $line | ConvertFrom-Json
    foreach ($item in $json.events.items) {
        $CSVLine = [pscustomobject]@{
            Key = $json.Register.Key
            CompanyID = $json.id
            Eventtype = $item.type
            Eventdate = $item.date
            Eventdescription = $item.description
        }
        $output += $CSVLine
    }
    $i++
    $ig++
    if ($i -ge 30000) {
        $output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
        $i = 0
        $output = @()
        $minutes = $stopwatch.elapsed.TotalMinutes
        $percentage = $ig / $totalrows * 100
        $totalestimatedtime = $minutes * (100/$percentage)
        $timeremaining = $totalestimatedtime - $minutes
        Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
    }
}
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
This is the structure of the JSON:
{
  "id": "111111111",
  "name": {
    "name": "Test Company GmbH",
    "legalForm": "GmbH"
  },
  "address": {
    "street": "Berlinstr.",
    "postalCode": "11111",
    "city": "Berlin"
  },
  "status": "liquidation",
  "events": {
    "items": [{
      "type": "Liquidation",
      "date": "2001-01-01",
      "description": "Liquidation"
    }, {
      "type": "NewCompany",
      "date": "2000-01-01",
      "description": "Neueintragung"
    }, {
      "type": "ControlChange",
      "date": "2002-01-01",
      "description": "Tested Company GmbH"
    }]
  },
  "relatedCompanies": {
    "items": [{
      "company": {
        "id": "2222222",
        "name": {
          "name": "Test GmbH",
          "legalForm": "GmbH"
        },
        "address": {
          "city": "Berlin",
          "country": "DE",
          "formattedValue": "Berlin, Deutschland"
        },
        "status": "active"
      },
      "roles": [{
        "date": "2002-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "demotion": true,
        "group": "Control",
        "dir": "Source"
      }, {
        "date": "2001-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "group": "Control",
        "dir": "Source"
      }]
    }, {
      "company": {
        "id": "33333",
        "name": {
          "name": "Test2 GmbH",
          "legalForm": "GmbH"
        },
        "address": {
          "city": "Berlin",
          "country": "DE",
          "formattedValue": "Berlin, Deutschland"
        },
        "status": "active"
      },
      "roles": [{
        "date": "2002-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "demotion": true,
        "group": "Control",
        "dir": "Source"
      }, {
        "date": "2001-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "group": "Control",
        "dir": "Source"
      }]
    }]
  }
}
As per the comments:
Use the PowerShell pipeline instead, for example:
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
    Remove-Item -path $Exportfile
}
Get-Content $Importfile -Encoding $encoding | Foreach-Object {
    $json = $_ | ConvertFrom-Json
    # Iterate over the events of the current line; $json is already a parsed object
    $json.events.items | Foreach-Object {
        [pscustomobject]@{
            Key = $json.Register.Key
            CompanyID = $json.id
            Eventtype = $_.type
            Eventdate = $_.date
            Eventdescription = $_.description
        }
    }
    $i++
    $ig++
    if ($i -ge 30000) {
        $i = 0
        $minutes = $stopwatch.elapsed.TotalMinutes
        $percentage = $ig / $totalrows * 100
        $totalestimatedtime = $minutes * (100/$percentage)
        $timeremaining = $totalestimatedtime - $minutes
        Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
    }
} | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
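The pipeline helps mainly because `$output += $CSVLine` has to build a new, larger array on every addition, while the pipeline hands each object downstream as soon as it is created. A minimal sketch of the two styles follows. The `$lines` sample data and the variable names `$accumulated` and `$streamed` are made up for illustration and are not part of the original script.

```powershell
# Hypothetical sample: three JSON lines shaped like the question's data.
$lines = 1..3 | ForEach-Object {
    '{"id":"' + $_ + '","events":{"items":[{"type":"A","description":"x"},{"type":"B","description":"y"}]}}'
}

# Accumulating with += : every addition allocates a new array and copies the old one.
$accumulated = @()
foreach ($line in $lines) {
    $json = $line | ConvertFrom-Json
    foreach ($item in $json.events.items) {
        $accumulated += [pscustomobject]@{ CompanyID = $json.id; Eventtype = $item.type }
    }
}

# Streaming: each object is emitted as soon as it is built.
# It is captured in a variable here only so the result can be inspected;
# in the real script the pipeline would end in Export-Csv instead.
$streamed = $lines | ForEach-Object {
    $json = $_ | ConvertFrom-Json
    $json.events.items | ForEach-Object {
        [pscustomobject]@{ CompanyID = $json.id; Eventtype = $_.type }
    }
}
```

Both variants produce the same rows; only the way they are collected differs.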
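Update 2020-05-07
Based on the comments and the extra information in the question, I have written a small reusable cmdlet that uses the PowerShell pipeline to read through the .jsonl (JSON Lines) file. It collects each line until it finds a closing '}' character, then checks whether the collected text is a valid JSON string (using Test-Json), as there might be embedded objects. If it is valid, it releases the extracted object into the pipeline straight away and starts collecting lines again: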
Function ConvertFrom-JsonLines {
    [CmdletBinding()][OutputType([Object[]])]Param (
        [Parameter(ValueFromPipeLine = $True, Mandatory = $True)][String]$Line
    )
    Begin { $JsonLines = [System.Collections.Generic.List[String]]@() }
    Process {
        $JsonLines.Add($Line)
        If ( $Line.Trim().EndsWith('}') ) {
            $Json = $JsonLines -Join [Environment]::NewLine
            If ( Test-Json $Json -ErrorAction SilentlyContinue ) {
                $Json | ConvertFrom-Json
                $JsonLines.Clear()
            }
        }
    }
}
You can use it like this:
Get-Content .\file.jsonl | ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8
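The Test-Json check in ConvertFrom-JsonLines matters when an object is spread over several lines: a line ending in '}' may only close an embedded object, not the whole record. A small sketch of that situation, using an invented fragment modelled on the question's data (the variable names are assumptions for illustration):

```powershell
# The text collected so far ends with '}', but that brace only closes the
# embedded "name" object; the outer object is still open.
$partial = @'
{
  "id": "111111111",
  "name": {
    "legalForm": "GmbH"
  }
'@

# Once the outer closing brace arrives, the collected text is complete.
$complete = $partial + [Environment]::NewLine + '}'

# Test-Json reports an error for invalid input, hence SilentlyContinue.
$partialIsValid  = Test-Json -Json $partial -ErrorAction SilentlyContinue
$completeIsValid = Test-Json -Json $complete
```

Under these assumptions `$partialIsValid` is false and `$completeIsValid` is true, which is why the function keeps collecting lines after the first '}' and emits the object only after the second.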
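I was able to make it about 40% faster with two small changes: 1. use Get-Content -ReadCount and unpack the buffered lines, and 2. change the pipeline so that it 'flows' more, avoiding the per-line `$json = ... | ConvertFrom-Json` plus foreach part.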
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$Importfile = "$psscriptroot\input2.jsonl"
$Exportfile = "$psscriptroot\output.csv"
if (Test-Path $Exportfile) {
    Remove-Item -Path $Exportfile
}
# Changed the next few lines
Get-Content $Importfile -Encoding $encoding -ReadCount 10000 |
    ForEach-Object {
        $_
    } | ConvertFrom-Json | ForEach-Object {
        $json = $_
        $json.events.items | ForEach-Object {
            [pscustomobject]@{
                Key = $json.Register.Key
                CompanyID = $json.id
                Eventtype = $_.type
                Eventdate = $_.date
                Eventdescription = $_.description
            }
        }
        $i++
        $ig++
        if ($i -ge 10000) {
            $i = 0
            $minutes = $stopwatch.elapsed.TotalMinutes
            $percentage = $ig / $totalrows * 100
            $totalestimatedtime = $minutes * (100 / $percentage)
            $timeremaining = $totalestimatedtime - $minutes
            Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
        }
    } | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ';' -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
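To see what "unpacking the buffered lines" means: with -ReadCount, Get-Content sends arrays of lines down the pipeline rather than single lines, and a ForEach-Object that just emits `$_` turns each array back into individual lines. A minimal sketch with a throwaway five-line file follows. The file content and the variable names are invented for illustration.

```powershell
# Create a small temporary file with five lines (hypothetical sample data).
$tempFile = (New-TemporaryFile).FullName
1..5 | ForEach-Object { "line $_" } | Set-Content -Path $tempFile

# With -ReadCount 2, each pipeline object is an array holding up to two lines.
$batches = @(Get-Content -Path $tempFile -ReadCount 2)

# Emitting $_ from ForEach-Object enumerates each batch, so single lines come out again.
$unpacked = @(Get-Content -Path $tempFile -ReadCount 2 | ForEach-Object { $_ })

Remove-Item -Path $tempFile
```

Here `$batches` holds three objects (two, two and one line), while `$unpacked` holds the five original lines. Fewer, larger pipeline objects on the reading side is where the saving would come from.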