data.table in R: a list of duplicated rows does not consistently show row duplications
I have a data.table of gene expression data containing the top 5 genes from 44 clusters, as shown below:
> cluster.top5gene
Cluster Genes aveLog2FC FDR
1: 1 Cd79a 5.125957 0.000000e+00
2: 1 Ly6d 3.918639 0.000000e+00
3: 1 Cd79b 3.532945 0.000000e+00
4: 1 Iglc2 3.523255 0.000000e+00
5: 1 Ebf1 3.322775 0.000000e+00
---
216: 44 Hba-a2 3.881978 4.074726e-31
217: 44 Hba-a1 3.892339 1.432746e-30
218: 44 Hbb-bs 3.971035 1.178994e-28
219: 44 Cd79a 2.629973 2.261226e-19
220: 44 Hbb-bt 3.139013 1.221915e-17
> str(cluster.top5gene)
Classes ‘data.table’ and 'data.frame': 220 obs. of 4 variables:
$ Cluster : int 1 1 1 1 1 2 2 2 2 2 ...
$ Genes : chr "Cd79a" "Ly6d" "Cd79b" "Iglc2" ...
$ aveLog2FC: num 5.13 3.92 3.53 3.52 3.32 ...
$ FDR : num 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "index")= int(0)
..- attr(*, "__Genes")= int [1:220] 185 212 161 52 86 120 135 110 103 56 ...
There are duplicated gene names (in the Genes column):
> cluster.top5gene[duplicated(cluster.top5gene, by="Genes"), Genes]
 [1] "Ptprb" "Gsn" "C1qa" "C1qb" "C1qc" "Apoe" "Nkg7" "Ccl5" "Cd3g" "Gsn" "Mgp" "Nkg7" "C1qa"
[14] "C1qb" "Ccl8" "C1qc" "Apoe" "Car4" "Kdr" "Icam2" "Emp2" "Ly6c1" "Car4" "Ybx1" "Sftpc" "Sftpa1"
[27] "Cxcl15" "Sftpb" "F13a1" "Cxcl2" "Ptprb" "Tm4sf1" "Hba-a2" "Hba-a1" "Hbb-bs" "Cd79a" "Hbb-bt"
And their corresponding row numbers:
> cluster.top5gene[, .I[duplicated(Genes)]
+ ]
[1] 20 65 72 73 74 75 76 77 79 81 85 91 111 112 113 114 115 121 122 125 126 128 129 130 156 157 158 159 193
[30] 195 198 199 216 217 218 219 220
I made a list of the duplicated gene names with their corresponding Cluster numbers, as follows:
cluster.top5gene[duplicated(Genes, fromLast=F, by=Genes), Cluster, Genes]
Genes Cluster
1: Ptprb 4
2: Ptprb 40
3: Gsn 13
4: Gsn 17
5: C1qa 15
6: C1qa 23
7: C1qb 15
8: C1qb 23
9: C1qc 15
10: C1qc 23
11: Apoe 15
12: Apoe 23
13: Nkg7 16
14: Nkg7 19
15: Ccl5 16
16: Cd3g 16
17: Mgp 17
18: Ccl8 23
19: Car4 25
20: Car4 26
21: Kdr 25
22: Icam2 25
23: Emp2 26
24: Ly6c1 26
25: Ybx1 26
26: Sftpc 32
27: Sftpa1 32
28: Cxcl15 32
29: Sftpb 32
30: F13a1 39
31: Cxcl2 39
32: Tm4sf1 40
33: Hba-a2 44
34: Hba-a1 44
35: Hbb-bs 44
36: Cd79a 44
37: Hbb-bt 44
Genes Cluster
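As you can see, some genes are shown as duplicated across different clusters while others are not, even though they really are duplicated in different clusters, as in the example below: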
> cluster.top5gene[Genes=="Ccl5",]
Cluster Genes aveLog2FC FDR
1: 9 Ccl5 4.076985 0
2: 16 Ccl5 3.724350 0
I would really appreciate any help with this issue.
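Again, without access to your data I may be on the wrong track, but if you want a list of the duplicated genes and their clusters, it might be better to do this: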
cluster.top5gene[, .SD[.N>1], by=Genes][, .(Genes, Cluster)]
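A likely reason for the inconsistent listing is that duplicated() returns FALSE for the first occurrence of each value and TRUE only for later ones, so a gene that occurs twice (such as Ccl5) shows up once, and only a gene that occurs three or more times shows up more than once. Below is a minimal sketch of that behaviour. It assumes a small made-up table called toy rather than your real data; the result names later_only, all_occurrences and by_count are also just illustrative.

```r
library(data.table)

# Toy data (hypothetical, not the real table): Ccl5 occurs twice, Ptprb three times.
toy <- data.table(
  Cluster = c(4L, 9L, 16L, 20L, 40L, 41L),
  Genes   = c("Ptprb", "Ccl5", "Ccl5", "Ptprb", "Ptprb", "Mgp")
)

# duplicated() flags only the second and later occurrences, so a gene seen
# n times is listed n - 1 times and its first cluster is missing.
later_only <- toy[duplicated(Genes), .(Genes, Cluster)]

# Combine a forward and a backward scan to flag every occurrence.
all_occurrences <- toy[duplicated(Genes) | duplicated(Genes, fromLast = TRUE),
                       .(Genes, Cluster)]

# Count-based filter: keep only genes whose group has more than one row.
# This returns the same rows, ordered by gene instead of by original row.
by_count <- toy[, if (.N > 1L) .SD, by = Genes][, .(Genes, Cluster)]

print(later_only)
print(all_occurrences)
print(by_count)
```

Either of the last two forms should list every cluster for each repeated gene; the count-based one is the same idea as the one-liner above, written with if (.N > 1L) .SD so that groups of size one are dropped outright.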