Faster way to merge 2 CSV files based on a match in the first column
Currently, I am trying to merge two CSV files. The first file has a little over 3,000 rows. The second file has more than 400,000 rows.
To test this, I used these two files.
First CSV file:
Csv1ColumnOne,Csv1ColumnTwo,Csv1ColumnThree,Csv1ColumnFour
1234,Value1,Value1,Value1
2345,Value2,Value1,Value1
3456,Value1,Value2,Value1
4567,Value1,Value1,Value2
7645,Value3,Value3,Value3
Second CSV file:
Csv2ColumnOne,Csv2ColumnTwo,Csv2ColumnThree
1234,abc,Value1
2345,asd,Value1
3456,qwe,Value1
4567,mnb,Value1
The final results file should look like this:
"Csv1ColumnOne","Csv1ColumnTwo","Csv1ColumnThree","Csv1ColumnFour","Csv2ColumnOne"
"1234","Value1","Value1","Value1","abc"
"2345","Value2","Value1","Value1","asd"
"3456","Value1","Value2","Value1","qwe"
"4567","Value1","Value1","Value2","mnb"
"7645","Value3","Value3","Value3","Not Found"
Here is the code I have now (it currently works):
Function GetFirstColumnNameFromFile
{
    Param ($CsvFileWithPath)
    $FirstFileFirstColumnTitle = ((Get-Content $CsvFileWithPath -TotalCount 2 | ConvertFrom-Csv).psobject.properties | ForEach-Object {$_.name})[0]
    Write-Output $FirstFileFirstColumnTitle
}
Function CreateMergedFileWithCsv2ColumnOneColumn
{
    Param ($firstColumnFirstFile, $FirstFileFirstColumnTitle, $firstFile, $secondFile, $resultsFile)
    Write-Host "Creating hash table with columns values `"Csv2ColumnOne`" `"Csv2ColumnTwo`" From $secondFile"
    $hashColumnOneColumnTwo2ndFile = @{}
    Import-Csv $secondFile | Where-Object {$firstColumnFirstFile -contains $_.'Csv2ColumnOne'} | ForEach-Object {$hashColumnOneColumnTwo2ndFile[$_.'Csv2ColumnOne'] = $_.Csv2ColumnTwo}
    Write-Host "Complete."
    Write-Host "Creating Merge file with file $firstFile
and column `"Csv2ColumnTwo`" from file $secondFile"
    Import-Csv $firstFile | Select-Object *, @{n='Csv2ColumnOne'; e={
        if ($hashColumnOneColumnTwo2ndFile.ContainsKey($_.$FirstFileFirstColumnTitle)) {
            $hashColumnOneColumnTwo2ndFile[$_.$FirstFileFirstColumnTitle]
        } Else {
            'Not Found'
        }}} | Export-Csv $resultsFile -NoType
    Write-Host "Complete."
}
Function MatchFirstTwoColumnsTwoFilesAndCombineOtherColumnsOneFile
{
    Param ($firstFile, $secondFile, $resultsFile)
    [string]$FirstFileFirstColumnTitle = GetFirstColumnNameFromFile $firstFile
    $FirstFileFirstColumn = Import-Csv $firstFile | Where-Object {$_.$FirstFileFirstColumnTitle} | Select-Object -ExpandProperty $FirstFileFirstColumnTitle
    CreateMergedFileWithCsv2ColumnOneColumn $FirstFileFirstColumn $FirstFileFirstColumnTitle $firstFile $secondFile $resultsFile
}
Function Main
{
    $firstFile = 'C:\Scripts\Tests\test1.csv'
    $secondFile = 'C:\Scripts\Tests\test2.csv'
    $resultsFile = 'C:\Scripts\Tests\testResults.csv'
    MatchFirstTwoColumnsTwoFilesAndCombineOtherColumnsOneFile $firstFile $secondFile $resultsFile
}
Main
For the following line:
Import-Csv $secondFile | Where-Object {$firstColumnFirstFile -contains $_.'Csv2ColumnOne'} | ForEach-Object {$hashColumnOneColumnTwo2ndFile[$_.'Csv2ColumnOne'] = $_.Csv2ColumnTwo}
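This takes about 30 minutes (per column, with 10 columns in total). That means it would take about 5-7 hours just to merge 3,000 rows between the 2 CSV files (once I add the code that brings the other columns into the final results file). Is there a faster way to create the hash table from the second file, which has more than 400,000 rows?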
I'm not 100% sure I'm following your question, but I ran the following against your test files:
$file1 = Import-Csv .\file1.csv
$file2 = Import-Csv .\file2.csv
$file1 | ForEach-Object {
    $f1 = $_
    $f1 | Add-Member -MemberType NoteProperty -Name csv2columnone -Value ""
    $file2 | ForEach-Object {
        if($f1.csv1columnone -eq $_.csv2columnone) {
            if($_.csv2columntwo -ne $null) {
                $f1.csv2columnone = $_.csv2columntwo
            }
        }
    }
    if([String]::IsNullOrEmpty($f1.csv2columnone)) {
        $f1.csv2columnone = "Not found"
    }
    Write-Output $f1
} | ft
This gave the following result:
Csv1ColumnOne Csv1ColumnTwo Csv1ColumnThree Csv1ColumnFour csv2columnone
------------- ------------- --------------- -------------- -------------
1234          Value1        Value1          Value1         abc
2345          Value2        Value1          Value1         asd
3456          Value1        Value2          Value1         qwe
4567          Value1        Value1          Value2         mnb
7645          Value3        Value3          Value3         Not found
Running Measure-Command on it (to get the run time) reported a run time of 20 milliseconds.
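Note that the 20 ms figure comes from the 5-row and 4-row test files. The nested loop compares every row of the first file against every row of the second, so at roughly 3,000 by 400,000 rows it would perform on the order of 1.2 billion comparisons. A minimal sketch of the hash-lookup alternative follows, assuming the column names from the question; the sample data is inlined so it runs without any files, and `$lookup`, `$value`, and `$merged` are illustrative names that do not appear in the original posts.

```powershell
# Sample data from the question, inlined so the sketch runs without any files.
$file1 = @'
Csv1ColumnOne,Csv1ColumnTwo,Csv1ColumnThree,Csv1ColumnFour
1234,Value1,Value1,Value1
2345,Value2,Value1,Value1
3456,Value1,Value2,Value1
4567,Value1,Value1,Value2
7645,Value3,Value3,Value3
'@ | ConvertFrom-Csv

$file2 = @'
Csv2ColumnOne,Csv2ColumnTwo,Csv2ColumnThree
1234,abc,Value1
2345,asd,Value1
3456,qwe,Value1
4567,mnb,Value1
'@ | ConvertFrom-Csv

# One pass over the second file: map its first column to the value to carry over.
$lookup = @{}
foreach ($row in $file2) {
    $lookup[$row.Csv2ColumnOne] = $row.Csv2ColumnTwo
}

# One pass over the first file: each row costs a single hash lookup
# instead of a scan over every row of the second file.
$merged = foreach ($row in $file1) {
    $value = if ($lookup.ContainsKey($row.Csv1ColumnOne)) {
        $lookup[$row.Csv1ColumnOne]
    } else {
        'Not Found'
    }
    $row | Select-Object *, @{ n = 'Csv2ColumnOne'; e = { $value } }
}

$merged | Format-Table
```

With real files, the two here-strings would be replaced by `Import-Csv` calls and the final line by `Export-Csv`. The total work is then proportional to the sum of the two row counts rather than their product.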
See if this builds your hash table any faster:
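$ht = @{}
# Read the file in batches of 1000 lines; each batch arrives as an array of strings.
Get-Content test1.csv -ReadCount 1000 |
    ForEach-Object {
        # Rewrite the first two fields of each line as "key=value", then let ConvertFrom-StringData turn the batch into a hashtable.
        $ht += ConvertFrom-StringData $($_ -replace '"?(.+?)"?,"?(.+?)"?,.+','$1=$2' | Out-String)
    }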
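Another likely contributor to the 30 minutes is the `Where-Object {$firstColumnFirstFile -contains ...}` filter in the question: `-contains` scans the array of about 3,000 keys linearly for each of the 400,000 rows. The following is a minimal sketch, not the asker's code, that keeps the filter but tests membership with a `System.Collections.Generic.HashSet[string]`. The names `$wanted`, `$secondFileRows`, and `$hash` are illustrative, and the `9999` row is added sample data so the filter has something to exclude.

```powershell
# Stand-in for the first-column values of the first file (values from the question).
$firstColumnFirstFile = '1234', '2345', '3456', '4567', '7645'

# Stand-in for Import-Csv $secondFile; the 9999 row is added only for illustration.
$secondFileRows = @'
Csv2ColumnOne,Csv2ColumnTwo,Csv2ColumnThree
1234,abc,Value1
2345,asd,Value1
3456,qwe,Value1
4567,mnb,Value1
9999,zzz,Value1
'@ | ConvertFrom-Csv

# Build the set of wanted keys once. Unlike -contains, the default
# HashSet[string] comparison is case-sensitive, which is fine for numeric keys.
$wanted = New-Object 'System.Collections.Generic.HashSet[string]'
foreach ($key in $firstColumnFirstFile) {
    [void]$wanted.Add($key)
}

# Single pass over the large file; membership is a hash probe, not an array scan.
$hash = @{}
foreach ($row in $secondFileRows) {
    if ($wanted.Contains($row.Csv2ColumnOne)) {
        $hash[$row.Csv2ColumnOne] = $row.Csv2ColumnTwo
    }
}

$hash
```

If memory allows, dropping the filter entirely and loading every row of the second file into the hashtable is simpler still, since the merge step already handles missing keys with 'Not Found'.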