Getting the elements not in common from two columns of two files

I have two files, normal and cancer, with sequences from T cells in blood. The cancer file looks like this:

> head(cancer[1:2,])
  cloneId cloneCount cloneFraction                         targetSequences
1       0         64    0.02273535 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC
2       1         64    0.02273535       TGTCAACACAGTTACTCTATTCCGTGGACGTTC
                          targetQualities                  allVHitsWithScore
1 EEEEEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN                 IGLV1-51*00(117.6)
2       NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGKV1-39*00(152),IGKV1D-39*00(152)
  allDHitsWithScore             allJHitsWithScore             allCHitsWithScore
1                   IGLJ2*00(42.3),IGLJ3*00(42.3) IGLC3*00(118),IGLC2*00(117.3)
2                                  IGKJ1*00(65.4)                   IGKC*00(75)
                                             allVAlignments allDAlignments
1                        421|446|473|0|25|SG425CSA427T|93.0               
2 427|442|471|0|15|SG435C|59.0;349|364|395|0|15|SG357C|59.0               
                             allJAlignments allCAlignments nSeqFR1 minQualFR1
1 27|30|58|36|39||15.0;27|30|58|36|39||15.0              ;      NA         NA
2                      19|30|58|22|33||55.0                     NA         NA
  nSeqCDR1 minQualCDR1 nSeqFR2 minQualFR2 nSeqCDR2 minQualCDR2 nSeqFR3 minQualFR3
1       NA          NA      NA         NA       NA          NA      NA         NA
2       NA          NA      NA         NA       NA          NA      NA         NA
                                 nSeqCDR3 minQualCDR3 nSeqFR4 minQualFR4 aaSeqFR1
1 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC          36      NA         NA       NA
2       TGTCAACACAGTTACTCTATTCCGTGGACGTTC          45      NA         NA       NA
  aaSeqCDR1 aaSeqFR2 aaSeqCDR2 aaSeqFR3     aaSeqCDR3 aaSeqFR4
1        NA       NA        NA       NA CASWDSSLKIVLF       NA
2        NA       NA        NA       NA   CQHSYSIPWTF       NA
                         refPoints
1 :::::::::0:-7:25:::::36:-7:39:::
2  :::::::::0:-9:15:::::22:1:33:::
> names(cancer)
 [1] "cloneId"           "cloneCount"        "cloneFraction"     "targetSequences"  
 [5] "targetQualities"   "allVHitsWithScore" "allDHitsWithScore" "allJHitsWithScore"
 [9] "allCHitsWithScore" "allVAlignments"    "allDAlignments"    "allJAlignments"   
[13] "allCAlignments"    "nSeqFR1"           "minQualFR1"        "nSeqCDR1"         
[17] "minQualCDR1"       "nSeqFR2"           "minQualFR2"        "nSeqCDR2"         
[21] "minQualCDR2"       "nSeqFR3"           "minQualFR3"        "nSeqCDR3"         
[25] "minQualCDR3"       "nSeqFR4"           "minQualFR4"        "aaSeqFR1"         
[29] "aaSeqCDR1"         "aaSeqFR2"          "aaSeqCDR2"         "aaSeqFR3"         
[33] "aaSeqCDR3"         "aaSeqFR4"          "refPoints"        
> 

and the normal file looks like this:

> head(normal[1:2,])
  cloneId cloneCount cloneFraction                         targetSequences
1       0        100    0.03745318 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC
2       1         53    0.01985019       TGTCAACACAGTTACTCTATTCCGTGGACGTTC
                          targetQualities                      allVHitsWithScore
1 EEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN                     IGLV1-51*00(115.8)
2       NNNNNNNNNNNNNNNNNNNNNNNNNNNNNEEEE IGKV1-39*00(124.4),IGKV1D-39*00(124.4)
  allDHitsWithScore             allJHitsWithScore               allCHitsWithScore
1                   IGLJ2*00(44.8),IGLJ3*00(44.8) IGLC2*00(103.3),IGLC3*00(103.3)
2                                  IGKJ1*00(61.2)                  IGKC*00(114.2)
                                             allVAlignments allDAlignments
1                        421|446|473|0|25|SG425CSA427T|93.0               
2 427|442|471|0|15|SG435C|59.0;349|364|395|0|15|SG357C|59.0               
                             allJAlignments allCAlignments nSeqFR1 minQualFR1
1 27|30|58|36|39||15.0;27|30|58|36|39||15.0              ;      NA         NA
2                      19|30|58|22|33||55.0                     NA         NA
  nSeqCDR1 minQualCDR1 nSeqFR2 minQualFR2 nSeqCDR2 minQualCDR2 nSeqFR3 minQualFR3
1       NA          NA      NA         NA       NA          NA      NA         NA
2       NA          NA      NA         NA       NA          NA      NA         NA
                                 nSeqCDR3 minQualCDR3 nSeqFR4 minQualFR4 aaSeqFR1
1 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC          36      NA         NA       NA
2       TGTCAACACAGTTACTCTATTCCGTGGACGTTC          36      NA         NA       NA
  aaSeqCDR1 aaSeqFR2 aaSeqCDR2 aaSeqFR3     aaSeqCDR3 aaSeqFR4
1        NA       NA        NA       NA CASWDSSLKIVLF       NA
2        NA       NA        NA       NA   CQHSYSIPWTF       NA
                         refPoints
1 :::::::::0:-7:25:::::36:-7:39:::
2  :::::::::0:-9:15:::::22:1:33:::
> 

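How can I get the subset of the cancer file whose elements in the aaSeqCDR3 and nSeqCDR3 columns are not in common with the normal file?

I mean that I want the rows of the cancer file where the values in these two columns are unique to cancer and do not appear in the normal file.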

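If we want to subset based on the elements that are not present in 'normal', use anti_join. It keeps the rows of 'cancer' that have no match in 'normal' on the columns given in 'by':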

library(dplyr)
anti_join(cancer, normal[ c("aaSeqCDR3", "nSeqCDR3")],
          by = c("aaSeqCDR3", "nSeqCDR3"))
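
To check the behaviour on something small, here is a minimal sketch with toy data. The data frames cancer_toy and normal_toy and all of their values are made up for illustration and are not taken from the real files. The base R version at the end is an alternative that assumes the two columns contain no "_" character, because it pastes them into a single key.

```r
library(dplyr)

# Hypothetical toy tables standing in for the real 'cancer' and 'normal' data
cancer_toy <- data.frame(
  cloneId   = 0:2,
  aaSeqCDR3 = c("CASWDSSLKIVLF", "CQHSYSIPWTF", "CAAAF"),
  nSeqCDR3  = c("TGCGCATCA", "TGTCAACAC", "TGTGCAGCA"),
  stringsAsFactors = FALSE
)
normal_toy <- data.frame(
  cloneId   = 0:1,
  aaSeqCDR3 = c("CASWDSSLKIVLF", "CQHSYSIPWTF"),
  nSeqCDR3  = c("TGCGCATCA", "TGTCAACAC"),
  stringsAsFactors = FALSE
)

key_cols <- c("aaSeqCDR3", "nSeqCDR3")

# Rows of cancer_toy whose (aaSeqCDR3, nSeqCDR3) pair never occurs in normal_toy
cancer_only <- anti_join(cancer_toy, normal_toy[key_cols], by = key_cols)

# Base R alternative: build one key per row and drop the keys seen in normal_toy
cancer_key <- paste(cancer_toy$aaSeqCDR3, cancer_toy$nSeqCDR3, sep = "_")
normal_key <- paste(normal_toy$aaSeqCDR3, normal_toy$nSeqCDR3, sep = "_")
cancer_only_base <- cancer_toy[!cancer_key %in% normal_key, ]

print(cancer_only)
```

Note that a row is dropped only when both columns match the same row of the normal table. If a match on either column alone should be enough to exclude a row, the two columns would have to be filtered separately.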