PowerShell CSV row-column transpose and manipulation
I am new to PowerShell. I am trying to transpose rows into columns for a medium-sized CSV data set of about 10,000 rows. The original CSV has about 10,000 rows and 3 columns ("Time","Id","IOT"), as shown below:
"Time","Id","IOT"
"00:03:56","23","26"
"00:03:56","24","0"
"00:03:56","25","0"
"00:03:56","26","1"
"00:03:56","27","0"
"00:03:56","28","0"
"00:03:56","29","0"
"00:03:56","30","1953"
"00:03:56","31","22"
"00:03:56","32","39"
"00:03:56","33","8"
"00:03:56","34","5"
"00:03:56","35","269"
"00:03:56","36","5"
"00:03:56","37","0"
"00:03:56","38","0"
"00:03:56","39","0"
"00:03:56","40","1251"
"00:03:56","41","103"
"00:03:56","42","0"
"00:03:56","43","0"
"00:03:56","44","0"
"00:03:56","45","0"
"00:03:56","46","38"
"00:03:56","47","14"
"00:03:56","48","0"
"00:03:56","49","0"
"00:03:56","2013","0"
"00:03:56","2378","0"
"00:03:56","2380","32"
"00:03:56","2758","0"
"00:03:56","3127","0"
"00:03:56","3128","0"
"00:09:16","23","22"
"00:09:16","24","0"
"00:09:16","25","0"
"00:09:16","26","2"
"00:09:16","27","0"
"00:09:16","28","0"
"00:09:16","29","21"
"00:09:16","30","48"
"00:09:16","31","0"
"00:09:16","32","4"
"00:09:16","33","4"
"00:09:16","34","7"
"00:09:16","35","382"
"00:09:16","36","12"
"00:09:16","37","0"
"00:09:16","38","0"
"00:09:16","39","0"
"00:09:16","40","1882"
"00:09:16","41","42"
"00:09:16","42","0"
"00:09:16","43","3"
"00:09:16","44","0"
"00:09:16","45","0"
"00:09:16","46","24"
"00:09:16","47","22"
"00:09:16","48","0"
"00:09:16","49","0"
"00:09:16","2013","0"
"00:09:16","2378","0"
"00:09:16","2380","19"
"00:09:16","2758","0"
"00:09:16","3127","0"
"00:09:16","3128","0"
...
...
...
I tried to do the transpose with code based on a PowerShell script downloaded from https://gallery.technet.microsoft.com/scriptcenter/Powershell-Script-to-7c8368be
Basically, my PowerShell code is as follows:
$b = @()
foreach ($Time in $a.Time | Select -Unique) {
    $Props = [ordered]@{ Time = $time }
    foreach ($Id in $a.Id | Select -Unique){
        $IOT = ($a.where({ $_.Id -eq $Id -and $_.time -eq $time })).IOT
        $Props += @{ $Id = $IOT }
    }
    $b += New-Object -TypeName PSObject -Property $Props
}
$b | FT -AutoSize
$b | Out-GridView
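The code above gives me the expected result: every "Id" value becomes a column header, every "Time" value becomes a unique row, and each "IOT" value sits at the intersection of "Id" x "Time", as shown below: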
"Time","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","2013","2378","2380","2758","3127","3128"
"00:03:56","26","0","0","1","0","0","0","1953","22","39","8","5","269","5","0","0","0","1251","103","0","0","0","0","38","14","0","0","0","0","32","0","0","0"
"00:09:16","22","0","0","2","0","0","21","48","0","4","4","7","382","12","0","0","0","1882","42","0","3","0","0","24","22","0","0","0","0","19","0","0","0"
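With only a few hundred rows, the result comes back as quickly as expected. The problem is that when I process the whole 10,000-row CSV file, the script above keeps executing, does not seem to finish for a very long time (hours), and produces no output.
So could some PowerShell experts from Stack Overflow help assess the code above and perhaps suggest changes to speed it up?
Thanks in advance for your suggestions.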
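10,000 records is a lot, but I don't think it is enough to justify recommending a StreamReader* and parsing the CSV by hand. The biggest thing working against you, though, is the following line:
$b += New-Object -TypeName PSObject -Property $Props
What PowerShell does here is create a new array and copy every existing element plus the new one into it. That is a very memory-intensive operation, and you are repeating it over and over. In this case it is better to use the pipeline to your advantage.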
$data = Import-Csv -Path "D:\temp\data.csv"
$headers = $data.ID | Sort-Object {[int]$_} -Unique
$data | Group-Object Time | ForEach-Object{
    $props = [ordered]@{Time = $_.Name}
    foreach($header in $headers){
        $props."$header" = ($_.Group | Where-Object{$_.ID -eq $header}).IOT
    }
    [pscustomobject]$props
} | Export-Csv d:\temp\testing.csv -NoTypeInformation
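$data will hold the entire file in memory as objects. That is needed to collect all the $headers that will become the column headers.
The data is then grouped by each Time. Inside each time group we look up the value for every ID. If an ID does not exist for that time, the entry is left empty.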
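This is not the best approach, but it should be faster than yours. I ran it against 10,000 records in under a minute (51 seconds on average over 3 passes). I will post benchmarks if I can.
I ran your code once against my own data and it took 13 minutes. I think it is safe to say mine performs faster.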
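If the per-group Where-Object scan is still too slow, one possible variation is to build each output row in a single pass over the records, using an ordered dictionary keyed by Time. This is only a sketch under some assumptions, not a benchmarked replacement: the function name ConvertTo-WideTable is hypothetical, the input is assumed to have the Time, Id and IOT columns shown in the question, and if the same Time/Id pair occurs more than once the last value wins (the code above would return all of them).

```powershell
# Hypothetical helper: pivots Time/Id/IOT records into one row per Time.
function ConvertTo-WideTable {
    param([Parameter(Mandatory)][object[]]$Records)

    # Column headers: every distinct Id, sorted numerically.
    $headers = $Records.Id | Sort-Object { [int]$_ } -Unique

    # One ordered dictionary per Time value; insertion order keeps file order.
    $rows = [ordered]@{}
    foreach ($record in $Records) {
        if (-not $rows.Contains($record.Time)) {
            # Pre-create every column so all rows share the same properties.
            $row = [ordered]@{ Time = $record.Time }
            foreach ($header in $headers) { $row[[string]$header] = $null }
            $rows[$record.Time] = $row
        }
        # Direct key lookup instead of filtering the whole group per Id.
        $rows[$record.Time][$record.Id] = $record.IOT
    }

    # Emit the rows to the pipeline instead of appending to an array.
    foreach ($row in $rows.Values) { [pscustomobject]$row }
}

# Usage with the paths from the answer above:
# ConvertTo-WideTable -Records (Import-Csv -Path "D:\temp\data.csv") |
#     Export-Csv -Path "D:\temp\testing.csv" -NoTypeInformation
```

The keys are kept as strings on purpose: indexing an ordered dictionary with an integer selects by position rather than by key, and Import-Csv already returns every field as a string.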
For reference, the dummy data was generated with this logic:
1..100 | %{
    $time = get-date -Format "hh:mm:ss"
    sleep -Seconds 1
    1..100 | % {
        [pscustomobject][ordered]@{
            time = $time
            id = $_
            iot = Get-Random -Minimum 0 -Maximum 7
        }
    }
} | Export-Csv d:\temp\data.csv -notypeinformation
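* Your case is not a great example of where a StreamReader is needed. I only mention it because it is a better way to read large files; you would just need to parse each line's string yourself.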
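For completeness, reading the file line by line with a StreamReader could look roughly like the sketch below. The function name Read-QuotedCsvFile is hypothetical, and the parsing is deliberately naive: it assumes every field is wrapped in double quotes exactly as in the sample data, with no embedded quotes or line breaks, and it hard-codes the three column names. Import-Csv handles all of those cases for you, which is why it remains the simpler choice at this file size.

```powershell
# Hypothetical helper: streams a fully quoted 3-column CSV one line at a time.
function Read-QuotedCsvFile {
    param([Parameter(Mandatory)][string]$Path)

    # Pass a full path: .NET resolves relative paths against the process
    # working directory, which may differ from the PowerShell location.
    $reader = New-Object System.IO.StreamReader($Path)
    try {
        $null = $reader.ReadLine()    # skip the header line
        while ($null -ne ($line = $reader.ReadLine())) {
            $line = $line.Trim()
            if ($line.Length -lt 2) { continue }    # skip blank lines

            # Drop the outer quotes, then split on the "," separators.
            $fields = $line.Substring(1, $line.Length - 2) -split '","'
            [pscustomobject]@{
                Time = $fields[0]
                Id   = $fields[1]
                IOT  = $fields[2]
            }
        }
    }
    finally {
        $reader.Dispose()    # always release the file handle
    }
}

# Usage with the path from the answer above:
# Read-QuotedCsvFile -Path "D:\temp\data.csv" | Group-Object Time
```

Because each record is emitted as soon as its line is read, only one line needs to be held in memory at a time unless a later pipeline stage, such as Group-Object, collects the records.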