Combining two columns from different files by common strings

Question

I have two tab-separated files.

file-1


NODE_1_length_59711_cov_84.026979_g0_i0_1	K02377
NODE_1_length_59711_cov_84.026979_g0_i0_2
NODE_2_length_39753_cov_84.026979_g0_i0_1	K02377
NODE_2_length_49771_cov_84.026979_g0_i0_2	K16554

................................ ...................... 391443 lines in total

file2


NODE_1_length_59711_cov_84.026979_g0_i0_1	56.54
NODE_1_length_59711_cov_84.026979_g0_i0_2	51.0
NODE_2_length_39753_cov_84.026979_g0_i0_1	12.6
NODE_2_length_49771_cov_84.026979_g0_i0_2	18.9

................................ ...................... 391249 lines in total

I want to merge these two files, keeping the first column identical.


NODE_1_length_59711_cov_84.026979_g0_i0_1	K02377	56.54
NODE_1_length_59711_cov_84.026979_g0_i0_2		51.0
NODE_2_length_39753_cov_84.026979_g0_i0_1	K02377	12.6
NODE_2_length_49771_cov_84.026979_g0_i0_2	K16554	18.9

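The problem is that the first file has nearly 190 more lines than the second (391443 vs. 391249), so I cannot merge them directly because that would give wrong output. Is there any way to combine these files by the common strings in the first column?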

Answer 1

This might work for you (GNU join):

join -t$'\t' file1 file2

N.B. Some shells may not accept $'\t'. In that case use a literal tab, which can be entered in the terminal with the Ctrl-V Tab key sequence.
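A hedged sketch of how the join approach behaves when the input is unsorted and some rows have no partner: join expects both files to be sorted on the join field, and by default it drops lines that do not pair up. The example below uses made-up sample rows (the NODE_*_a names and K00001 are illustrative, not taken from the real data), sorts into separate copies, and relies on the standard -a 1 and -o options so that every line of file-1 is kept with three columns.

```shell
# Work in a scratch directory with hypothetical sample data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')   # portable way to get a literal tab

printf 'NODE_2_a\tK16554\nNODE_1_a\tK02377\nNODE_1_b\nNODE_3_a\tK00001\n' > file-1
printf 'NODE_1_a\t56.54\nNODE_1_b\t51.0\nNODE_2_a\t18.9\n' > file2

# join needs both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -t "$tab" -k1,1 file-1 > file-1.sorted
LC_ALL=C sort -t "$tab" -k1,1 file2 > file2.sorted

# -a 1 keeps lines of the first file that have no match in the second;
# -o 0,1.2,2.2 always prints key, column 2 of file-1, column 2 of file2,
# leaving a field empty when it is missing.
LC_ALL=C join -t "$tab" -a 1 -o 0,1.2,2.2 file-1.sorted file2.sorted > merged.tsv
cat merged.tsv
```

On the real data, skip the two printf lines and run the sort and join commands against the actual file-1 and file2.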

Answer 2

Using GNU awk:

awk 'NR==FNR { map[$1]=$2;next } { map1[$1]=$2 } END { PROCINFO["sorted_in"]="@ind_str_asc";for (i in map) { print i"\t"map[i]"\t"map1[i] } }' file-1 file2

Explanation:

awk 'NR==FNR { 
               map[$1]=$2;                                   # Process the first file only and set up an array called map with the first whitespace-separated field as the index and the second as the value
               next 
             } 
             { 
               map1[$1]=$2                                   # When processing the second file, set up a second array called map1 and use the first field as the index and the second as the value.
             } 
         END { 
               PROCINFO["sorted_in"]="@ind_str_asc";         # Set the index ordering (GNU awk only)
               for (i in map) { 
                 print i"\t"map[i]"\t"map1[i]                # Loop through the map array and print its values along with the matching values in map1.
               } 
              }' file-1 file2
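PROCINFO["sorted_in"] is specific to GNU awk; other awk implementations do not honor it and walk the array in an unspecified order. A minimal sketch of a portable variant, assuming the fields are tab-delimited as in the question, drops that setting and pipes the result through sort instead. The array names ko and cov and the sample rows are hypothetical.

```shell
# Work in a scratch directory with hypothetical sample data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'NODE_2_a\tK16554\nNODE_1_a\tK02377\nNODE_1_b\nNODE_3_a\tK00001\n' > file-1
printf 'NODE_1_a\t56.54\nNODE_1_b\t51.0\nNODE_2_a\t18.9\n' > file2

# First file: remember column 2 per ID. Second file: remember its column 2.
# At the end, print every ID seen in file-1; a missing value prints as empty.
awk -F '\t' '
  NR == FNR { ko[$1] = $2; next }
            { cov[$1] = $2 }
  END       { for (id in ko) print id "\t" ko[id] "\t" cov[id] }
' file-1 file2 | LC_ALL=C sort > merged.tsv
cat merged.tsv
```

As with the original command, IDs that appear only in file2 are not printed, because the loop runs over the IDs collected from file-1.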
