Combining two columns from different files by common strings
I have two tab-delimited files.
file-1
NODE_1_length_59711_cov_84.026979_g0_i0_1 | K02377 |
NODE_1_length_59711_cov_84.026979_g0_i0_2 | |
NODE_2_length_39753_cov_84.026979_g0_i0_1 | K02377 |
NODE_2_length_49771_cov_84.026979_g0_i0_2 | K16554 |
... (391443 lines in total)
file2
NODE_1_length_59711_cov_84.026979_g0_i0_1 | 56.54 |
NODE_1_length_59711_cov_84.026979_g0_i0_2 | 51.0 |
NODE_2_length_39753_cov_84.026979_g0_i0_1 | 12.6 |
NODE_2_length_49771_cov_84.026979_g0_i0_2 | 18.9 |
... (391249 lines in total)
I want to merge these two files, keeping the first column constant:
NODE_1_length_59711_cov_84.026979_g0_i0_1 | K02377 | 56.54 |
NODE_1_length_59711_cov_84.026979_g0_i0_2 | | 51.0 |
NODE_2_length_39753_cov_84.026979_g0_i0_1 | K02377 | 12.6 |
NODE_2_length_49771_cov_84.026979_g0_i0_2 | K16554 | 18.9 |
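The problem is that the first file has about 190 more lines than the second, so I can't simply paste the two files side by side; the rows would no longer line up and the output would be wrong. Is there a way to combine these files by matching the common strings in the first column?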
This might work for you (GNU join):
join -t$'\t' file1 file2
N.B. Some shells may not accept $'\t'. In that case use a literal tab, which can be entered in a terminal with the Ctrl-V Tab key sequence.
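Two details of join matter for the files in the question: both inputs must be sorted on the join field, and by default lines without a partner in the other file are dropped. The following is a minimal sketch of one way to handle both, assuming small hypothetical sample files (the names under $tmp and the shortened NODE ids are made up for illustration). It uses printf to produce the tab so that it does not rely on $'\t', and -a 1 to keep lines that appear only in the first file.

```shell
# Hypothetical sample data mirroring the layout in the question (tab-separated).
tmp=$(mktemp -d)
printf 'NODE_2_1\tK02377\nNODE_1_1\tK02377\nNODE_1_2\t\nNODE_3_1\tK16554\n' > "$tmp/file1"
printf 'NODE_1_1\t56.54\nNODE_1_2\t51.0\nNODE_2_1\t12.6\n' > "$tmp/file2"

# A literal tab, built without relying on $'\t'.
tab=$(printf '\t')

# join expects both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# -a 1 also prints lines from the first file that have no match in the second.
joined=$(LC_ALL=C join -t "$tab" -a 1 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"
```

Without -a 1, the NODE_3_1 line, which exists only in the first sample file, would be missing from the output.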
Using GNU awk:
awk 'NR==FNR { map[$1]=$2;next } { map1[$1]=$2 } END { PROCINFO["sorted_in"]="@ind_str_asc";for (i in map) { print i"\t"map[i]"\t"map1[i] } }' file-1 file2
Explanation:
awk 'NR==FNR {
       map[$1]=$2;   # First file only: build an array called map, indexed by the first whitespace-separated field, with the second field as the value
       next
     }
     {
       map1[$1]=$2   # Second file: build a second array called map1, again indexed by the first field, with the second field as the value
     }
     END {
       PROCINFO["sorted_in"]="@ind_str_asc";   # gawk only: traverse array indices in ascending string order
       for (i in map) {
         print i"\t"map[i]"\t"map1[i]   # For each key from the first file, print the key, its value from map, and its value from map1 (empty if the key is not in file2)
       }
     }' file-1 file2
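PROCINFO["sorted_in"] is specific to GNU awk; in other awk implementations the traversal order of for (i in map) is unspecified. As a rough sketch of a portable variant, assuming hypothetical sample files under $tmp and the made-up array names ko and cov, the same two-array idea can be combined with an external sort. It also sets the field separator to a tab explicitly, which is an assumption based on the question describing the files as tab-delimited.

```shell
# Hypothetical sample data mirroring the layout in the question (tab-separated).
tmp=$(mktemp -d)
printf 'NODE_2_1\tK02377\nNODE_1_1\tK02377\nNODE_1_2\t\nNODE_3_1\tK16554\n' > "$tmp/file-1"
printf 'NODE_1_1\t56.54\nNODE_1_2\t51.0\nNODE_2_1\t12.6\n' > "$tmp/file2"

merged=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { ko[$1] = $2; next }   # first file: remember column 2 under its key
    { cov[$1] = $2 }                  # second file: remember column 2 under its key
    END {
        # Print every key seen in the first file; a missing value prints as empty.
        for (k in ko) print k, ko[k], cov[k]
    }
' "$tmp/file-1" "$tmp/file2" | LC_ALL=C sort)
printf '%s\n' "$merged"
```

Keys that occur only in the second file are not printed, which matches the behaviour of the gawk command above.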