PowerShell: compare 2 large CSV files to find users that don't exist in one of them

Question

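I have 2 csv files, each with about 10,000 users. I need to count how many users appear in csv1 but not in csv2. At the moment I have the code below. However, I know this is probably very inefficient, since it can loop through up to 10,000 users 10,000 times. The code takes forever to run, and I'm sure there must be a more efficient way. Any help or suggestions are appreciated; I'm still fairly new to PowerShell.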

foreach ($csv1User in $csv1) {
    $found = $false
    foreach ($csv2User in $csv2) {
        if ($csv1User.identifier -eq $csv2User.identifier) {
            $found = $true
            break
        }
    }
    if ($found -ne $true) {
        $count++
    }
}

Answer 1

If you're just looking for the count, this should be much faster.

$csv2 = Import-Csv $csvfile2

# -notin keeps the csv1 users whose identifier is absent from csv2
Import-Csv $csvfile1 |
    Where-Object identifier -notin $csv2.identifier |
        Measure-Object | Select-Object -ExpandProperty Count

Here's a small example

$csvfile1 = New-TemporaryFile
$csvfile2 = New-TemporaryFile

@'
identifier
bob
sally
john
sue
'@ | Set-Content $csvfile1 -Encoding UTF8

@'
identifier
bill
sally
john
stan
'@ | Set-Content $csvfile2 -Encoding UTF8

$csv2 = Import-Csv $csvfile2

# bob and sue are in the first file only
Import-Csv $csvfile1 |
    Where-Object identifier -notin $csv2.identifier |
        Measure-Object | Select-Object -ExpandProperty Count

The output is simply

2

Answer 2

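If you replace the nested loops with 2 HashSets, you'll have two ways of calculating the exceptions between the two:

Using `SymmetricExceptWith()`

The HashSet<T>.SymmetricExceptWith() function allows us to calculate the subset of terms that exist in either collection but not in both: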

# Create hashset from one list
$userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)

# Pass the other list to `SymmetricExceptWith`
$userIDs.SymmetricExceptWith([string[]]$csv2.identifier)

# Now we have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier
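
Note that the symmetric difference covers both directions. If only the users in csv1 that are missing from csv2 matter, as in the question, `ExceptWith()` is the one-sided counterpart on the same type. Below is a minimal sketch contrasting the two; the inline sample data and the variable names `$eitherOnly` and `$onlyInCsv1` are illustrative assumptions, and real input would come from `Import-Csv`.

```powershell
# Illustrative sample data; real input would come from Import-Csv
$csv1 = @'
identifier
bob
sally
john
sue
'@ | ConvertFrom-Csv

$csv2 = @'
identifier
bill
sally
john
stan
'@ | ConvertFrom-Csv

# Symmetric difference: identifiers found in exactly one of the two files
$eitherOnly = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
$eitherOnly.SymmetricExceptWith([string[]]$csv2.identifier)

# One-sided difference: identifiers found in csv1 but not in csv2
$onlyInCsv1 = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
$onlyInCsv1.ExceptWith([string[]]$csv2.identifier)

"Unique to either file: $($eitherOnly.Count)"   # 4 with this sample data
"Only in csv1: $($onlyInCsv1.Count)"            # 2 with this sample data
```

One design point to keep in mind: HashSet[string] compares strings case-sensitively by default, whereas the `-eq` operator in the question does not. If identifiers may differ only in case, passing `[System.StringComparer]::OrdinalIgnoreCase` as a second constructor argument should bring the behavior in line.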

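Using sets to track duplicates

Similarly, we can use hash sets to keep track of which terms have been observed at least once, and which have been observed more than once: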

# Create sets for tracking
$seenOnce = [System.Collections.Generic.HashSet[string]]::new()
$seenTwice = [System.Collections.Generic.HashSet[string]]::new()

# Loop through whole superset of records
foreach($record in @($csv1;$csv2)){
  # Always attempt to add to the $seenOnce set
  if(!$seenOnce.Add($record.identifier)){
    # We've already seen this identifier once, add it to $seenTwice
    [void]$seenTwice.Add($record.identifier)
  }
}

# Just like the previous example, we now have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } |Sort-Object -Unique identifier
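
One edge case is worth checking before relying on this: the tracking sets do not record which file an identifier came from, so an identifier repeated within a single file also ends up in `$seenTwice`. The sketch below illustrates that with made-up sample data in which `bob` appears twice in csv1 and never in csv2; `$uniqueIds` is a hypothetical helper variable.

```powershell
# Illustrative data: 'bob' appears twice in csv1 and never in csv2
$csv1 = @'
identifier
bob
bob
sally
'@ | ConvertFrom-Csv

$csv2 = @'
identifier
sally
stan
'@ | ConvertFrom-Csv

$seenOnce = [System.Collections.Generic.HashSet[string]]::new()
$seenTwice = [System.Collections.Generic.HashSet[string]]::new()

foreach ($record in @($csv1; $csv2)) {
  # Add() returns $false when the identifier is already in the set
  if (-not $seenOnce.Add($record.identifier)) {
    [void]$seenTwice.Add($record.identifier)
  }
}

# 'bob' is absent from csv2, yet it is recorded as seen twice
$seenTwice.Contains('bob')

# Identifiers that were observed exactly once across both files
$uniqueIds = @($seenOnce | Where-Object { -not $seenTwice.Contains($_) })
$uniqueIds
```

If identifiers can repeat within one file, de-duplicating each file first (for example with `Sort-Object -Unique identifier`) would be one way to keep the result meaningful.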

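Using a hash table as a grouping construct

You can also use a dictionary type (such as [hashtable]) to group the records from both csv files by their identifier, and then filter on the number of record values in each dictionary entry: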

# Groups records on their identifier value
$groupsById = @{}
foreach($record in @($csv1;$csv2)){
  if(-not $groupsById.ContainsKey($record.identifier)){
    $groupsById[$record.identifier] = @()
  }
  $groupsById[$record.identifier] += $record
}

# Filter based on number of records with a distinct identifier
$relevantRecords = $groupsById.GetEnumerator() |Where-Object { $_.Value.Count -eq 1 } |Select-Object -Expand Value
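
For comparison, the built-in `Group-Object` cmdlet can express the same grouping as a pipeline. This is a swapped-in alternative rather than the hashtable loop above, shown as a sketch with illustrative inline sample data; no claim is made here about its relative speed.

```powershell
# Illustrative sample data; real input would come from Import-Csv
$csv1 = @'
identifier
bob
sally
john
sue
'@ | ConvertFrom-Csv

$csv2 = @'
identifier
bill
sally
john
stan
'@ | ConvertFrom-Csv

# Group all records by identifier, then keep groups holding a single record
$relevantRecords = @($csv1; $csv2) |
  Group-Object -Property identifier |
  Where-Object Count -eq 1 |
  Select-Object -ExpandProperty Group

# Identifiers that occur in only one of the two files
$relevantRecords.identifier
```

As with the hashtable version, a group of size 1 only means the identifier occurred once overall, so this assumes identifiers are unique within each file.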
