Compare and print 2 columns from 2 files in awk or perl

Question

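I have 2 files with 2 million lines each. I need to compare 2 columns from the 2 files, and I want to print the lines that have the same item in both files. This awk code works, but it does not print the lines from both files: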

awk 'NR == FNR {a[$3]; next} $3 in a' file1.txt file2.txt

file1.txt

0001 00000001 084010800001080
0001 00000010 041140000100004

file2.txt

2451 00000009 401208008004000
2451 00000010 084010800001080

Desired output:

file1[$1]-file2[$1] file1[$2]-file2[$2] $3 (same in both files)

0001-2451 00000001-00000010 084010800001080

How can I do this in awk or perl?

Answer 1

With your shown samples, please try the following awk code. Fair warning: I haven't tested it with millions of lines.

awk '
FNR == NR{
  arr1[$3]=$0
  next
}
($3 in arr1){
  split(arr1[$3],arr2)
  print (arr2[1]"-"$1,arr2[2]"-"$2,$3)
  delete arr2
}
' file1.txt file2.txt

Explanation: Adding a detailed explanation for the above.

awk '                                  ##Starting awk program from here.
FNR == NR{                             ##checking condition which will be TRUE when first Input_file is being read.
  arr1[$3]=$0                          ##Creating arr1 array indexed by $3 with the whole line ($0) as value.
  next                                 ##next will skip all further statements from here.
}
($3 in arr1){                          ##checking if $3 is present in arr1 then do following.
  split(arr1[$3],arr2)                 ##Splitting value of arr1 into arr2.
  print (arr2[1]"-"$1,arr2[2]"-"$2,$3) ##printing values as per requirement of OP.
  delete arr2                          ##Deleting arr2 array here.
}
' file1.txt file2.txt                  ##Mentioning Input_file names here.

Answer 2

Assuming your $3 values are unique within each input file, as shown in your sample input/output:

$ cat tst.awk
NR==FNR {
    foos[$3] = $1
    bars[$3] = $2
    next
}
$3 in foos {
    print foos[$3] "-" $1, bars[$3] "-" $2, $3
}

$ awk -f tst.awk file1.txt file2.txt
0001-2451 00000001-00000010 084010800001080

I named the arrays foos[] and bars[] because I don't know what the first two columns of your input actually represent; choose more meaningful names.
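The uniqueness assumption matters because a plain key-to-value array keeps only the last line seen for each key. As a rough sketch of what changes if column 3 can repeat in the first file, the same lookup can keep a list per key. This is Python rather than awk, and the function name join_on_third_column is made up for illustration:

```python
from collections import defaultdict

def join_on_third_column(lines1, lines2):
    """Pair every line of lines1 with every line of lines2 that shares column 3."""
    by_key = defaultdict(list)  # column 3 -> list of (col1, col2) seen in file 1
    for line in lines1:
        c1, c2, key = line.split()
        by_key[key].append((c1, c2))

    out = []
    for line in lines2:
        c1, c2, key = line.split()
        # .get() avoids creating empty entries for keys missing from file 1
        for f1, f2 in by_key.get(key, ()):
            out.append(f"{f1}-{c1} {f2}-{c2} {key}")
    return out

file1 = ["0001 00000001 084010800001080", "0001 00000010 041140000100004"]
file2 = ["2451 00000009 401208008004000", "2451 00000010 084010800001080"]
print("\n".join(join_on_third_column(file1, file2)))
```

With unique keys this gives the same result as the awk script; with repeated keys it emits one output line per matching pair instead of silently dropping earlier lines.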

Answer 3

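If your files are that big, you might want to avoid storing the data in memory. That is a whole lot of comparisons, though: 2 million lines times 2 million lines = 4 * 10¹² comparisons.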

use strict;
use warnings;
use feature 'say';

my $file1 = shift;
my $file2 = shift;

open my $fh1, "<", $file1 or die "Cannot open '$file1': $!";

while (<$fh1>) {
    my @F = split;
    open my $fh2, "<", $file2 or die "Cannot open '$file2': $!";
    # for each line of file1 file2 is reopened and read again
    while (my $cmp = <$fh2>) {
        my @C = split ' ', $cmp;
        if ($F[2] eq $C[2]) {       # check string equality
            say "$F[0]-$C[0] $F[1]-$C[1] $F[2]";
        }
    }
}

With your rather limited test set, I get the following output:

0001-2451 00000001-00000010 084010800001080
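To put the trade-off in numbers, here is a back-of-the-envelope sketch in Python. The two function names are made up for illustration, and the hash-lookup count assumes one insert per line of the first file plus one lookup per line of the second, which means the first file has to fit in memory:

```python
def nested_loop_comparisons(n1, n2):
    """Every line of file 1 is compared with every line of file 2."""
    return n1 * n2

def hash_lookup_operations(n1, n2):
    """One insert per line of file 1 plus one lookup per line of file 2."""
    return n1 + n2

n = 2_000_000
print(f"nested loop: {nested_loop_comparisons(n, n):.1e} comparisons")
print(f"hash lookup: {hash_lookup_operations(n, n):.1e} operations")
```

So the nested loop saves memory at the cost of roughly a million times more work for files of this size.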

Answer 4

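If you have two big files, you may want to use sort, join and awk to produce the output without having to store most of the first file in memory.

Based on your example, this pipeline would do it: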

join -1 3 -2 3 <(sort -k3 -n file1) <(sort -k3 -n file2) | awk '{printf("%s-%s %s-%s %s\n",$2,$4,$3,$5,$1)}'

Prints:

0001-2451 00000001-00000010 084010800001080
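The reason this needs so little memory is that, once both inputs are sorted on the key, they can be walked in lockstep with only the current line of each held at a time. A minimal Python sketch of that merge step follows. It is not how join itself is implemented, the name merge_join is made up, and it assumes column 3 is unique within each input and fixed-width, so that plain string comparison matches the sort order:

```python
def merge_join(sorted1, sorted2):
    """Yield joined lines from two iterables already sorted on column 3.

    Assumes column 3 is unique within each input, as in the sample data.
    """
    it1, it2 = iter(sorted1), iter(sorted2)
    a, b = next(it1, None), next(it2, None)
    while a is not None and b is not None:
        a1, a2, ka = a.split()
        b1, b2, kb = b.split()
        if ka == kb:
            yield f"{a1}-{b1} {a2}-{b2} {ka}"
            a, b = next(it1, None), next(it2, None)
        elif ka < kb:
            a = next(it1, None)  # file 1 is behind: advance it
        else:
            b = next(it2, None)  # file 2 is behind: advance it

def third_column(line):
    return line.split()[2]

# The sample data is sorted in memory here; for real files an external
# sort (as in the pipeline above) would produce the sorted inputs.
file1 = sorted(["0001 00000001 084010800001080",
                "0001 00000010 041140000100004"], key=third_column)
file2 = sorted(["2451 00000009 401208008004000",
                "2451 00000010 084010800001080"], key=third_column)
print("\n".join(merge_join(file1, file2)))
```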

Answer 5

Python: tested with 2,000,000 lines per file

d = {}
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
  for line in f1:
    if not line: break
    c0,c1,c2 = line.split()
    d[(c2)] = (c0,c1)

  for line in f2:
    if not line: break
    c0,c1,c2 = line.split()
    if (c2) in d: print("{}-{} {}-{} {}".format(d[(c2)][0], c0, d[(c2)][1], c1, c2))

$ time python3 comapre.py
1001-2001 10000001-20000001 224010800001084
1042-2013 10000042-20000013 224010800001096

real    0m3.555s
user    0m3.234s
sys     0m0.321s
