How to delete a row in R if it isn't matched in another data frame's columns

I have two data frames that look like this:

SNP             A1      A2      EFF                     FRQ
rs12565286      C       G       -0.00225985777786465    .04354
rs11804171      A       T       -0.00530020318295282    .04485
rs3094315       A       G       -0.0042551489236695     .8364
rs12562034      A       G       -0.00911972489527125    .09763
rs12124819      A       G       0.0250148724382224      .7744
rs2980319       A       T       0.0178927256033542      .1306
rs4040617       A       G       -0.0173263263037023     .8707
rs2905062       A       G       -0.00799024279381536    .8668

SNP     CLST    A1      A2      FRQ     IMP     POS     CHR     BVAL
rs12565286      Brahui  C       G       0       1       711153  1       982
rs12565286      Balochi C       G       0       1       711153  1       982
rs12565286      Hazara  C       G       0       1       711153  1       982
rs12565286      Makrani C       G       0       1       711153  1       982
rs11804171      Brahui  G       C       0.02    1       713682  1       982
rs11804171      Balochi G       C       0       1       713682  1       982
rs11804171      Hazara  G       C       0.0227273       1       713682  1       982
rs11804171      Makrani G       C       0       1       713682  1       982
rs3094315       Brahui  G       A       0.26    0       742429  1       976
rs3094315       Balochi G       A       0.166667        0       742429  1       976
rs3094315       Hazara  G       A       0.181818        0       742429  1       976
rs3094315       Makrani G       A       0.28    0       742429  1       976
rs12562034      Brahui  G       T       0.76    0       758311  1       976
rs12562034      Balochi G       T       0.75    0       758311  1       976
rs12562034      Hazara  G       T       0.795455        0       758311  1       976
rs12562034      Makrani G       T       0.8     0       758311  1       976

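I want the A1 and A2 columns for a given SNP in the first data frame to match the A1 and A2 columns for the same SNP in the second data frame. The order doesn't matter to me. For example, SNP rs3094315 has A then G in the first data frame but G then A in the second; that's fine. I just want to delete the rows in the first data frame that have no matching pair. For example, SNP rs11804171 has A then T in data frame one, but G then C in data frame two. Another example: SNP rs12562034 has A then G in data frame one but G then T in data frame two, so they don't match. I want to delete all the rows that don't match. My desired output is: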

SNP             A1      A2      EFF                     FRQ
rs12565286      C       G       -0.00225985777786465    .04354
rs3094315       A       G       -0.0042551489236695     .8364

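A straightforward way to solve this is with dplyr. The difficulty is that the order of A1 and A2 doesn't matter, so a match may also pair A1 in one data frame with A2 in the other. In the code below I do two separate semi_joins on df1: one where the column names are the same in both data frames, and one where A1 is matched to A2 and A2 to A1.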

Data

df1 <- read.table(text="
SNP,A1,A2,EFF,FRQ
rs12565286,C,G,-0.00225985777786465,.04354
rs11804171,A,T,-0.00530020318295282,.04485
rs3094315,A,G,-0.0042551489236695,.8364
rs12562034,A,G,-0.00911972489527125,.09763
rs12124819,A,G,0.0250148724382224,.7744
rs2980319,A,T,0.0178927256033542,.1306
rs4040617,A,G,-0.0173263263037023,.8707
rs2905062,A,G,-0.00799024279381536,.8668
", header = TRUE, sep=",", as.is=TRUE)

df2 <- read.table(text="
SNP,CLST,A1,A2,FRQ,IMP,POS,CHR,BVAL
rs12565286,Brahui,C,G,0,1,711153,1,982
rs12565286,Balochi,C,G,0,1,711153,1,982
rs12565286,Hazara,C,G,0,1,711153,1,982
rs12565286,Makrani,C,G,0,1,711153,1,982
rs11804171,Brahui,G,C,0.02,1,713682,1,982
rs11804171,Balochi,G,C,0,1,713682,1,982
rs11804171,Hazara,G,C,0.0227273,1,713682,1,982
rs11804171,Makrani,G,C,0,1,713682,1,982
rs3094315,Brahui,G,A,0.26,0,742429,1,976
rs3094315,Balochi,G,A,0.166667,0,742429,1,976
rs3094315,Hazara,G,A,0.181818,0,742429,1,976
rs3094315,Makrani,G,A,0.28,0,742429,1,976
rs12562034,Brahui,G,T,0.76,0,758311,1,976
rs12562034,Balochi,G,T,0.75,0,758311,1,976
rs12562034,Hazara,G,T,0.795455,0,758311,1,976
rs12562034,Makrani,G,T,0.8,0,758311,1,976
", header = TRUE, sep=",", as.is=TRUE)

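library(dplyr)

# Rows of df1 whose SNP, A1 and A2 appear in df2 in the same allele order
order1 <- semi_join(df1, df2, by = c("SNP", "A1", "A2"))
# Rows of df1 that match df2 with the alleles swapped (A1 <-> A2)
order2 <- semi_join(df1, df2, by = c("SNP", "A1" = "A2", "A2" = "A1"))
# Stack both sets of matches into one result
rbind(order1, order2)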

         SNP A1 A2          EFF     FRQ
1 rs12565286  C  G -0.002259858 0.04354
2  rs3094315  A  G -0.004255149 0.83640
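
As an alternative sketch (not part of the answer above), the same filter can be written in base R by building an order-insensitive key for each row, so that "A, G" and "G, A" compare as equal. The helper name allele_key is hypothetical, and the data frames here are trimmed copies of the question's data holding only the columns needed for matching. It assumes the allele columns are character vectors, which as.is = TRUE gives for this data.

```r
# Trimmed copies of the question's data (assumption: only SNP, A1, A2 are needed).
df1 <- data.frame(
  SNP = c("rs12565286", "rs11804171", "rs3094315", "rs12562034", "rs12124819"),
  A1  = c("C", "A", "A", "A", "A"),
  A2  = c("G", "T", "G", "G", "G"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  SNP = c("rs12565286", "rs11804171", "rs3094315", "rs12562034"),
  A1  = c("C", "G", "G", "G"),
  A2  = c("G", "C", "A", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: put the two alleles in a fixed order before pasting,
# so the key is the same whichever column each allele came from.
allele_key <- function(snp, a1, a2) {
  paste(snp, pmin(a1, a2), pmax(a1, a2), sep = ":")
}

# Keep the rows of df1 whose key also occurs somewhere in df2.
keep <- allele_key(df1$SNP, df1$A1, df1$A2) %in%
  allele_key(df2$SNP, df2$A1, df2$A2)
result <- df1[keep, ]
print(result)
```

Because this filters df1 in place, the surviving rows keep their original order and each row appears at most once, whereas the two-join version stacks the same-order matches ahead of the swapped-order ones.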