Getting uncommon elements from two columns of two files
I have two files, normal and cancer, containing T-cell sequences from blood. The cancer file looks like this:
> head(cancer[1:2,])
cloneId cloneCount cloneFraction targetSequences
1 0 64 0.02273535 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC
2 1 64 0.02273535 TGTCAACACAGTTACTCTATTCCGTGGACGTTC
targetQualities allVHitsWithScore
1 EEEEEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGLV1-51*00(117.6)
2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGKV1-39*00(152),IGKV1D-39*00(152)
allDHitsWithScore allJHitsWithScore allCHitsWithScore
1 IGLJ2*00(42.3),IGLJ3*00(42.3) IGLC3*00(118),IGLC2*00(117.3)
2 IGKJ1*00(65.4) IGKC*00(75)
allVAlignments allDAlignments
1 421|446|473|0|25|SG425CSA427T|93.0
2 427|442|471|0|15|SG435C|59.0;349|364|395|0|15|SG357C|59.0
allJAlignments allCAlignments nSeqFR1 minQualFR1
1 27|30|58|36|39||15.0;27|30|58|36|39||15.0 ; NA NA
2 19|30|58|22|33||55.0 NA NA
nSeqCDR1 minQualCDR1 nSeqFR2 minQualFR2 nSeqCDR2 minQualCDR2 nSeqFR3 minQualFR3
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
nSeqCDR3 minQualCDR3 nSeqFR4 minQualFR4 aaSeqFR1
1 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC 36 NA NA NA
2 TGTCAACACAGTTACTCTATTCCGTGGACGTTC 45 NA NA NA
aaSeqCDR1 aaSeqFR2 aaSeqCDR2 aaSeqFR3 aaSeqCDR3 aaSeqFR4
1 NA NA NA NA CASWDSSLKIVLF NA
2 NA NA NA NA CQHSYSIPWTF NA
refPoints
1 :::::::::0:-7:25:::::36:-7:39:::
2 :::::::::0:-9:15:::::22:1:33:::
> names(cancer)
[1] "cloneId" "cloneCount" "cloneFraction" "targetSequences"
[5] "targetQualities" "allVHitsWithScore" "allDHitsWithScore" "allJHitsWithScore"
[9] "allCHitsWithScore" "allVAlignments" "allDAlignments" "allJAlignments"
[13] "allCAlignments" "nSeqFR1" "minQualFR1" "nSeqCDR1"
[17] "minQualCDR1" "nSeqFR2" "minQualFR2" "nSeqCDR2"
[21] "minQualCDR2" "nSeqFR3" "minQualFR3" "nSeqCDR3"
[25] "minQualCDR3" "nSeqFR4" "minQualFR4" "aaSeqFR1"
[29] "aaSeqCDR1" "aaSeqFR2" "aaSeqCDR2" "aaSeqFR3"
[33] "aaSeqCDR3" "aaSeqFR4" "refPoints"
>
and the normal file:
> head(normal[1:2,])
cloneId cloneCount cloneFraction targetSequences
1 0 100 0.03745318 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC
2 1 53 0.01985019 TGTCAACACAGTTACTCTATTCCGTGGACGTTC
targetQualities allVHitsWithScore
1 EEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGLV1-51*00(115.8)
2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNEEEE IGKV1-39*00(124.4),IGKV1D-39*00(124.4)
allDHitsWithScore allJHitsWithScore allCHitsWithScore
1 IGLJ2*00(44.8),IGLJ3*00(44.8) IGLC2*00(103.3),IGLC3*00(103.3)
2 IGKJ1*00(61.2) IGKC*00(114.2)
allVAlignments allDAlignments
1 421|446|473|0|25|SG425CSA427T|93.0
2 427|442|471|0|15|SG435C|59.0;349|364|395|0|15|SG357C|59.0
allJAlignments allCAlignments nSeqFR1 minQualFR1
1 27|30|58|36|39||15.0;27|30|58|36|39||15.0 ; NA NA
2 19|30|58|22|33||55.0 NA NA
nSeqCDR1 minQualCDR1 nSeqFR2 minQualFR2 nSeqCDR2 minQualCDR2 nSeqFR3 minQualFR3
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
nSeqCDR3 minQualCDR3 nSeqFR4 minQualFR4 aaSeqFR1
1 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC 36 NA NA NA
2 TGTCAACACAGTTACTCTATTCCGTGGACGTTC 36 NA NA NA
aaSeqCDR1 aaSeqFR2 aaSeqCDR2 aaSeqFR3 aaSeqCDR3 aaSeqFR4
1 NA NA NA NA CASWDSSLKIVLF NA
2 NA NA NA NA CQHSYSIPWTF NA
refPoints
1 :::::::::0:-7:25:::::36:-7:39:::
2 :::::::::0:-9:15:::::22:1:33:::
>
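How can I get the subset of the cancer file whose elements in the aaSeqCDR3 and nSeqCDR3 columns are not in common with the normal file?
I mean, I want the rows of the cancer file whose values in these two columns are unique to cancer and do not appear in the normal file.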
If we want to subset based on elements that are not present in 'normal', use anti_join:
library(dplyr)
# keep the rows of 'cancer' whose (aaSeqCDR3, nSeqCDR3) pair has no match in 'normal'
anti_join(cancer, normal[c("aaSeqCDR3", "nSeqCDR3")],
          by = c("aaSeqCDR3", "nSeqCDR3"))
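
For anyone who wants to check the idea without dplyr, here is a minimal base R sketch of the same filter. The two small data frames are hypothetical toy data, not values from the real files, and make_key is a helper name made up for this illustration. It builds one key per row from the two columns and keeps the cancer rows whose key does not occur in normal.

```r
# Toy data (hypothetical values, for illustration only)
cancer <- data.frame(
  cloneId   = 0:2,
  aaSeqCDR3 = c("CASWDSSLKIVLF", "CQHSYSIPWTF", "CAAAAF"),
  nSeqCDR3  = c("TGCGCA", "TGTCAA", "TGTGCC"),
  stringsAsFactors = FALSE
)
normal <- data.frame(
  cloneId   = 0:1,
  aaSeqCDR3 = c("CASWDSSLKIVLF", "CQHSYSIPWTF"),
  nSeqCDR3  = c("TGCGCA", "TGTCAA"),
  stringsAsFactors = FALSE
)

key_cols <- c("aaSeqCDR3", "nSeqCDR3")

# Hypothetical helper: paste the key columns into one string per row.
# A tab is used as separator on the assumption that it never occurs
# inside a sequence.
make_key <- function(df) {
  do.call(paste, c(unname(df[key_cols]), sep = "\t"))
}

# Keep the rows of 'cancer' whose key pair is absent from 'normal'
cancer_only <- cancer[!(make_key(cancer) %in% make_key(normal)), ]
print(cancer_only)
```

With the toy data only the third cancer row is kept, because its pair of values is the only one that does not occur in normal. Note that paste turns NA into the string "NA", so rows with NA in both files count as matching each other here.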