How to substitute data in one field with data from a different file, based on a common identifier (in another field) that occurs multiple times

Question

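I am still a beginner, and every thread I found that sounds like my problem is either SQL-related or answered with join or merge, which does not work here because the identifiers occur multiple times in file 2. I have two tab-delimited files. The first has two columns: a unique numeric identifier and a classification. It has thousands of rows. The second contains my data, with one field holding an identifier. It has a few hundred rows, so many identifiers do not appear at all while others appear several times. I need to combine the data stored in file2 with the classifications stored in file1, based on the identifier present in both files.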

File 1:

12345   kitchen; furniture; table
12346   kitchen; furniture; chair
12347   living room; furniture; sofa
12348   living room; furniture; table
12349   bed room; furniture; bed

File 2:

stuff1  mo_restuff  somenumbers anotherfield    12348
stuff2  morestuff   othernumbers    anotherfield    12346
stuff3  more_stuff  somenumbers anotherfield    12347
stuff4  morestuff   somenumbers yetanotherfield 12347
stuff5  morest.uff  alsonumbers anotherfield    12345

The result should look like this:

stuff1  mo_restuff  somenumbers anotherfield    living room; furniture; table
stuff2  morestuff   othernumbers    anotherfield    kitchen; furniture; chair
stuff3  more_stuff  somenumbers anotherfield    living room; furniture; sofa
stuff4  morestuff   somenumbers yetanotherfield living room; furniture; sofa
stuff5  morest.uff  alsonumbers anotherfield    kitchen; furniture; table

I tried this (among many other things):

awk -F "\t" 'BEGIN { OFS=FS } NR==FNR { a[$1]=$0 ; next } ($5) in a  { print a,$0 } ' file1 file2 > out

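But this only printed file2.

I am working on Unix, so a solution in bash would be best, but Python is fine too.

Thank you also for all the answers to other questions that have helped me before!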

Answer 1

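Your script is close, but the action { print a,$0 } is odd, because a is an array and is printed without an index. Would you please try the following: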

awk '
    BEGIN {FS = OFS = "\t"}
    NR==FNR {a[$1] = $2; next}
    {$5 = a[$5]}
1' file1 file2 > out

Output:

stuff1  mo_restuff      somenumbers     anotherfield    living room; furniture; table
stuff2  morestuff       othernumbers    anotherfield    kitchen; furniture; chair
stuff3  more_stuff      somenumbers     anotherfield    living room; furniture; sofa
stuff4  morestuff       somenumbers     yetanotherfield living room; furniture; sofa
stuff5  morest.uff      alsonumbers     anotherfield    kitchen; furniture; table

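In the first block, under the condition NR==FNR (true only while file1 is being read), it is enough to store $2 rather than $0.
In the next block, which runs for file2, just modify the 5th field, replacing it with the value of a indexed by $5.
The final 1 tells awk to print $0, with the 5th field modified as above.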
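
As a minimal sketch of the same approach, the script below recreates small sample files and runs a guarded variant. It rests on assumptions that are not part of the question: the temporary directory, the shortened sample rows, and the identifier 99999 are made up for illustration, and the ($5 in a) guard is an addition for the case where file2 contains an identifier that is missing from file1. Without the guard, such a row would end up with an empty 5th field.

```shell
#!/bin/sh
# Illustrative sample data; paths and rows are assumptions, not the asker's real files.
tmp=$(mktemp -d)
printf '12345\tkitchen; furniture; table\n12346\tkitchen; furniture; chair\n12347\tliving room; furniture; sofa\n' > "$tmp/file1"
printf 'stuff1\tmorestuff\tsomenumbers\tanotherfield\t12347\nstuff2\tmorestuff\tothernumbers\tanotherfield\t12346\nstuff3\tmorestuff\tsomenumbers\tanotherfield\t12347\nstuff4\tmorestuff\tsomenumbers\tanotherfield\t99999\n' > "$tmp/file2"

awk '
    BEGIN { FS = OFS = "\t" }
    # First file: remember identifier -> classification.
    NR == FNR { a[$1] = $2; next }
    # Second file: replace field 5 only when the identifier is known.
    ($5 in a) { $5 = a[$5] }
    # Print every line of the second file, modified or not.
    1
' "$tmp/file1" "$tmp/file2" > "$tmp/out"

cat "$tmp/out"
```

Because the lookup table is keyed by identifier, it does not matter how often an identifier repeats in file2: every occurrence is replaced independently.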
