PowerShell: Delete similar lines from file

Question

Consider the file tbl.txt (1.5 million lines), structured as follows:

Num1 ; Num2 ; 'Value' ; 'Attribute'

So tbl.txt looks like:

  63; 193; 'Green'; 'Color'
 152; 162; 'Tall'; 'Size'
 230; 164; '130lbs' ; 'Weight'
 249; 175; 'Green'; 'Color' *duplicate on 'Value' and 'Attribute'*
 420; 178; '8' ; 'Shoesize'
 438; 172; 'Tall'; 'Size' *duplicate on 'Value' and 'Attribute'*

How can I keep the first line for each unique combination of 'Value' and 'Attribute' and delete the subsequent lines that duplicate it on 'Value' and 'Attribute'?

The result should look like this:

  63; 193; 'Green'; 'Color'
 152; 162; 'Tall'; 'Size'
 230; 164; '130lbs' ; 'Weight'
 420; 178; '8' ; 'Shoesize'

Any help is greatly appreciated.

Answer 1

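Iterate over the text file with Get-Content, isolate the 'Value' ; 'Attribute' columns with string operations, then use a hash table to check whether a similar line has already been processed. If it has not, output the line once. In code: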

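$map = @{}
Get-Content tbl.txt | ForEach-Object {
    # Key = everything after the second ';', i.e. the 'Value' ; 'Attribute' part
    $key = $_.Substring($_.IndexOf(';', $_.IndexOf(';') + 1) + 1)
    # Emit the line only the first time its key is seen
    if (-not $map.ContainsKey($key)) { $_; $map[$key] = 1 }
}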

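Alternatively, as mentioned in the comments, you can use group with the same substring as the grouping criterion, then take the first element of each group: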

Get-Content tbl.txt | group {$_.Substring($_.IndexOf(';',$_.IndexOf(';')+1)+1)} `
                    | %{$_.Group[0]}
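
Both snippets above compare the raw text after the second ';', so two lines that differ only in the spacing around a separator (the sample has both `'8' ; 'Shoesize'` and `'Green'; 'Color'`) would not count as duplicates. The following is a minimal sketch of a whitespace-tolerant variant of the hash-based approach, not part of the original answer. The function name Select-FirstByValueAttribute is hypothetical, and it assumes every data line has four ';'-separated fields with no ';' inside a value.

```powershell
function Select-FirstByValueAttribute {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line
    )
    begin {
        # Keys seen so far. Note: unlike @{}, this set compares case-sensitively.
        $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    }
    process {
        # Split on ';' and swallow the padding around each separator
        $fields = $Line.Trim() -split '\s*;\s*'

        # Assumption: lines with fewer than four fields are passed through unchanged
        if ($fields.Count -lt 4) { $Line; return }

        # Key built from the third and fourth fields (Value and Attribute)
        $key = $fields[2] + ';' + $fields[3]

        # HashSet.Add returns $true only when the key was not present yet
        if ($seen.Add($key)) { $Line }
    }
}
```

It could be called as `Get-Content tbl.txt | Select-FirstByValueAttribute | Set-Content tbl_dedup.txt`, where tbl_dedup.txt is a placeholder output name. Lines are streamed one at a time, so only the set of distinct keys is held in memory.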

Answer 2

Assuming your data has no headers:

Import-CSV "C:\folder\data.txt" -Delimiter ";" -Header Num1,Num2,Value,Attribute | Sort-Object -Property Value,Attribute -Unique

This gives your desired output:

Num1 Num2 Value     Attribute 
---- ---- -----     --------- 
230  164  '130lbs'  'Weight'
420  178  '8'       'Shoesize'
63   193  'Green'   'Color'
152  162  'Tall'    'Size'

You can export your results with Export-CSV:

Import-CSV "C:\folder\data.txt" -Delimiter ";" -Header Num1,Num2,Value,Attribute | Sort-Object -Property Value,Attribute -Unique | Export-CSV "C:\folder\data2.txt" -Delimiter ";" -NoTypeInformation
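
Sort-Object returns the rows in sorted order, as the output above shows, whereas the question lists the result in the original file order. A minimal sketch of an order-preserving variant follows. It is not part of the original answer: it filters the parsed rows through a HashSet instead of sorting them. The in-memory $sample array is an assumption that stands in for the file so the snippet runs on its own.

```powershell
# Sample rows from the question, used here in place of the real file
$sample = @(
    "  63; 193; 'Green'; 'Color'",
    " 152; 162; 'Tall'; 'Size'",
    " 230; 164; '130lbs' ; 'Weight'",
    " 249; 175; 'Green'; 'Color'",
    " 420; 178; '8' ; 'Shoesize'",
    " 438; 172; 'Tall'; 'Size'"
)

# Keys seen so far; HashSet.Add returns $true only for a new key
$seen = New-Object 'System.Collections.Generic.HashSet[string]'

$unique = $sample |
    ConvertFrom-Csv -Delimiter ';' -Header Num1,Num2,Value,Attribute |
    Where-Object {
        # Trim the padding so spacing differences do not create distinct keys
        $key = "$($_.Value)".Trim() + ';' + "$($_.Attribute)".Trim()
        $seen.Add($key)
    }

$unique | Format-Table
```

For the real file, the `$sample | ConvertFrom-Csv` stage would be replaced by the Import-CSV call shown above, and the filtered rows can be piped to Export-CSV in the same way.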
