Join files on a field that has redundant/missing values

Question

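I have two tab-delimited text files that I want to join on a particular field (e.g. field1). In one of the files, that field contains redundant (repeated) values, for example: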

field1  field2  field3
A   gene1   0.01
A   gene2   0.001
A   gene3   0.02
B   gene4   0.01
B   gene5   0.03
C   gene6   0.004

while in the other file there is no redundancy:

field1  name    pathway
A   A_name  A_pathway
B   B_name  B_pathway
C   C_name  C_pathway
D   D_name  D_pathway
E   E_name  E_pathway

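The second file also contains values in the join field that do not appear in the first file. Is it possible to join these files with the join command so that the resulting file is: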

field1  field2  field3  name    pathway
A   gene1   0.01    A_name  A_pathway
A   gene2   0.001   A_name  A_pathway
A   gene3   0.02    A_name  A_pathway
B   gene4   0.01    B_name  B_pathway
B   gene5   0.03    B_name  B_pathway
C   gene6   0.004   C_name  C_pathway

I read the man page for join and experimented with it, but I could not get it to work.

Answer 1

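Since you have some familiarity with SQLite, using this SQL tool to handle your problem probably makes the most sense. First, import your two tab-separated files (named table1.csv and table2.csv here) into SQLite with the following commands. Because each table is created before the import, .import treats every line as data, so remove the header line from each file first: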

sqlite> create table table1 (field1 text, field2 text, field3 real);
sqlite> .separator "\t"
sqlite> .import table1.csv table1

Do the same for the second table:

sqlite> create table table2 (field1 text, name text, pathway text);
sqlite> .separator "\t"
sqlite> .import table2.csv table2
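The interactive steps above can also be scripted. The following is a minimal sketch, assuming the sqlite3 command-line shell is installed. The database name genes.db and the *.body temporary files are hypothetical names chosen for illustration, and the sample data is copied from the question. It strips the header line from each file before importing, since .import loads every line as data when the target table already exists.

```shell
#!/bin/sh
# Skip quietly when the sqlite3 CLI is not available.
command -v sqlite3 >/dev/null 2>&1 || { echo "sqlite3 not found; skipping" >&2; exit 0; }

# Recreate the question's sample files (tab-separated despite the .csv name).
printf 'field1\tfield2\tfield3\n' > table1.csv
printf 'A\tgene1\t0.01\nA\tgene2\t0.001\nA\tgene3\t0.02\n' >> table1.csv
printf 'B\tgene4\t0.01\nB\tgene5\t0.03\nC\tgene6\t0.004\n' >> table1.csv
printf 'field1\tname\tpathway\n' > table2.csv
for k in A B C D E; do
  printf '%s\t%s_name\t%s_pathway\n' "$k" "$k" "$k" >> table2.csv
done

# Drop the header lines so they are not imported as data rows.
# (table1.body / table2.body are hypothetical temporary file names.)
tail -n +2 table1.csv > table1.body
tail -n +2 table2.csv > table2.body

# Start from a fresh database (genes.db is a hypothetical name).
rm -f genes.db
sqlite3 genes.db <<'EOF'
create table table1 (field1 text, field2 text, field3 real);
create table table2 (field1 text, name text, pathway text);
.separator "\t"
.import table1.body table1
.import table2.body table2
EOF

# Sanity check: print the number of imported rows per table.
sqlite3 genes.db 'select count(*) from table1; select count(*) from table2;'
```

With the sample data, the check should report 6 rows for table1 and 5 rows for table2.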

Now that your data is in SQLite, you can run the following simple join to get the desired result set:

SELECT t1.field1,
       t1.field2,
       t1.field3,
       t2.name,
       t2.pathway
FROM table1 t1
INNER JOIN table2 t2
    ON t1.field1 = t2.field1;
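For comparison, the join command mentioned in the question can produce the same result without a database. This is only a sketch under a few assumptions: the files are named table1.csv and table2.csv as above, each has a single header line, and the join field is the first column of both. The *.sorted, header1.tmp, and joined.tsv names are hypothetical. The header is assembled separately so that it is not sorted in with the data.

```shell
#!/bin/sh
# Use byte-wise collation so that sort and join agree on ordering.
LC_ALL=C
export LC_ALL
TAB=$(printf '\t')

# Recreate the question's sample files (tab-separated).
printf 'field1\tfield2\tfield3\n' > table1.csv
printf 'A\tgene1\t0.01\nA\tgene2\t0.001\nA\tgene3\t0.02\n' >> table1.csv
printf 'B\tgene4\t0.01\nB\tgene5\t0.03\nC\tgene6\t0.004\n' >> table1.csv
printf 'field1\tname\tpathway\n' > table2.csv
for k in A B C D E; do
  printf '%s\t%s_name\t%s_pathway\n' "$k" "$k" "$k" >> table2.csv
done

# Header: table1's header followed by table2's header minus its join column.
head -n 1 table1.csv > header1.tmp
head -n 1 table2.csv | cut -f 2- | paste header1.tmp - > joined.tsv

# join requires both inputs to be sorted on the join field.
tail -n +2 table1.csv | sort -t "$TAB" -k 1,1 > table1.sorted
tail -n +2 table2.csv | sort -t "$TAB" -k 1,1 > table2.sorted

# By default join prints only keys present in both files (an inner join),
# so D and E are dropped, and each repeated key in table1 is paired with
# its single matching line from table2.
join -t "$TAB" table1.sorted table2.sorted >> joined.tsv

cat joined.tsv
```

If unmatched lines from the first file should be kept as well, the -a 1 option of join adds them to the output.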
