How to delete a row in R if it isn't matched in another data frame's columns
I have two data frames. The first looks like this:
SNP A1 A2 EFF FRQ
rs12565286 C G -0.00225985777786465 .04354
rs11804171 A T -0.00530020318295282 .04485
rs3094315 A G -0.0042551489236695 .8364
rs12562034 A G -0.00911972489527125 .09763
rs12124819 A G 0.0250148724382224 .7744
rs2980319 A T 0.0178927256033542 .1306
rs4040617 A G -0.0173263263037023 .8707
rs2905062 A G -0.00799024279381536 .8668
The second data frame looks like this:
SNP CLST A1 A2 FRQ IMP POS CHR BVAL
rs12565286 Brahui C G 0 1 711153 1 982
rs12565286 Balochi C G 0 1 711153 1 982
rs12565286 Hazara C G 0 1 711153 1 982
rs12565286 Makrani C G 0 1 711153 1 982
rs11804171 Brahui G C 0.02 1 713682 1 982
rs11804171 Balochi G C 0 1 713682 1 982
rs11804171 Hazara G C 0.0227273 1 713682 1 982
rs11804171 Makrani G C 0 1 713682 1 982
rs3094315 Brahui G A 0.26 0 742429 1 976
rs3094315 Balochi G A 0.166667 0 742429 1 976
rs3094315 Hazara G A 0.181818 0 742429 1 976
rs3094315 Makrani G A 0.28 0 742429 1 976
rs12562034 Brahui G T 0.76 0 758311 1 976
rs12562034 Balochi G T 0.75 0 758311 1 976
rs12562034 Hazara G T 0.795455 0 758311 1 976
rs12562034 Makrani G T 0.8 0 758311 1 976
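I want the A1 and A2 columns for a given SNP in the first data frame to match the A1 and A2 columns for the same SNP in the second data frame. The order doesn't matter to me. For example, SNP rs3094315 has A then G in the first data frame but G then A in the second; that is fine. I only want to delete rows from the first data frame that have no matching pair. For example, SNP rs11804171 has A then T in data frame one but G then C in data frame two. Another example: SNP rs12562034 has A then G in data frame one but G then T in data frame two, so they do not match. I want to delete all the non-matching rows. My desired output is: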
SNP A1 A2 EFF FRQ
rs12565286 C G -0.00225985777786465 .04354
rs3094315 A G -0.0042551489236695 .8364
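A convenient way to solve this is with dplyr. The difficulty is that the order of A1 and A2 doesn't matter, so the join also has to match columns that have different names. In the code below I do two separate semi_joins: one where the column names are the same in both data frames, and one with A1 = A2 and A2 = A1.
Data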
df1 <- read.table(text = "
SNP,A1,A2,EFF,FRQ
rs12565286,C,G,-0.00225985777786465,.04354
rs11804171,A,T,-0.00530020318295282,.04485
rs3094315,A,G,-0.0042551489236695,.8364
rs12562034,A,G,-0.00911972489527125,.09763
rs12124819,A,G,0.0250148724382224,.7744
rs2980319,A,T,0.0178927256033542,.1306
rs4040617,A,G,-0.0173263263037023,.8707
rs2905062,A,G,-0.00799024279381536,.8668
", header = TRUE, sep = ",", as.is = TRUE)
df2 <- read.table(text = "
SNP,CLST,A1,A2,FRQ,IMP,POS,CHR,BVAL
rs12565286,Brahui,C,G,0,1,711153,1,982
rs12565286,Balochi,C,G,0,1,711153,1,982
rs12565286,Hazara,C,G,0,1,711153,1,982
rs12565286,Makrani,C,G,0,1,711153,1,982
rs11804171,Brahui,G,C,0.02,1,713682,1,982
rs11804171,Balochi,G,C,0,1,713682,1,982
rs11804171,Hazara,G,C,0.0227273,1,713682,1,982
rs11804171,Makrani,G,C,0,1,713682,1,982
rs3094315,Brahui,G,A,0.26,0,742429,1,976
rs3094315,Balochi,G,A,0.166667,0,742429,1,976
rs3094315,Hazara,G,A,0.181818,0,742429,1,976
rs3094315,Makrani,G,A,0.28,0,742429,1,976
rs12562034,Brahui,G,T,0.76,0,758311,1,976
rs12562034,Balochi,G,T,0.75,0,758311,1,976
rs12562034,Hazara,G,T,0.795455,0,758311,1,976
rs12562034,Makrani,G,T,0.8,0,758311,1,976
", header = TRUE, sep = ",", as.is = TRUE)
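library(dplyr)
# Rows of df1 whose alleles appear in df2 in the same order
order1 <- semi_join(df1, df2, by = c("SNP", "A1", "A2"))
# Rows of df1 whose alleles appear in df2 in the swapped order
order2 <- semi_join(df1, df2, by = c("SNP", "A1" = "A2", "A2" = "A1"))
# Combine both sets of matches
rbind(order1, order2)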
SNP A1 A2 EFF FRQ
1 rs12565286 C G -0.002259858 0.04354
2 rs3094315 A G -0.004255149 0.83640
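As a possible alternative, the same order-independent match can be sketched in base R by building one key per row in which the two alleles are sorted first. The helper name allele_key is hypothetical (it is not part of dplyr or base R), and the data frames below are trimmed copies of the ones above that keep only the SNP, A1 and A2 columns. This is a rough sketch under those assumptions, not a replacement for the dplyr answer.

```r
# Trimmed copies of the example data (only the columns used for matching).
df1 <- data.frame(
  SNP = c("rs12565286", "rs11804171", "rs3094315", "rs12562034",
          "rs12124819", "rs2980319", "rs4040617", "rs2905062"),
  A1  = c("C", "A", "A", "A", "A", "A", "A", "A"),
  A2  = c("G", "T", "G", "G", "G", "T", "G", "G"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  SNP = c("rs12565286", "rs11804171", "rs3094315", "rs12562034"),
  A1  = c("C", "G", "G", "G"),
  A2  = c("G", "C", "A", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one string per row, with the two alleles put in
# alphabetical order so that (A, G) and (G, A) produce the same key.
allele_key <- function(df) {
  lo <- ifelse(df$A1 <= df$A2, df$A1, df$A2)
  hi <- ifelse(df$A1 <= df$A2, df$A2, df$A1)
  paste(df$SNP, lo, hi, sep = ":")
}

# Keep only the rows of df1 whose key also occurs in df2.
kept <- df1[allele_key(df1) %in% allele_key(df2), ]
print(kept)
```

Because this filters df1 in place, the kept rows stay in their original order, and a row cannot appear twice even if df2 happens to list the same SNP in both allele orders.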