How to bin data based on values in one column, and count occurrences from another column excluding duplicates in R?
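I have a file of correlation r-values. I want to bin the r-values and count how many CNVs fall into each bin. Is there a way to do this without counting the same CNV more than once?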
GeneChr SNP SNP_Position CNV start end r-value
1 rs7520551 100716167 1:101161140-101161459 100161140 102161459 0.950231679
1 rs6702766 100997635 1:101161140-101161459 100161140 102161459 0.376573375
1 rs11588568 101426960 1:101161140-101161459 100161140 102161459 0.252772248
1 rs4332900 10236894 1:10405137-10406094 9405137 11406094 0.171113128
1 rs11678947 10307395 1:10405137-10406094 9405137 11406094 0.334359684
1 rs2357468 10341468 1:10405137-10406094 9405137 11406094 0.30932652
1 rs1918705 10693478 1:10405137-10406094 9405137 11406094 0.822784876
1 rs7570190 101528047 1:101161140-101161459 100161140 102161459 0.391963719
1 rs643841 110832827 1:110028467-110029625 109028467 111029625 0.070643341
1 rs7514102 110998854 1:110028467-110029625 109028467 111029625 0.548219745
1 rs4676225 109609765 1:110028467-110029625 109028467 111029625 0.035118621
1 rs7608232 101699063 1:101161140-101161459 100161140 102161459 0.951958567
1 rs1449308 100708996 1:101161140-101161459 100161140 102161459 0.703308687
I have this line to bin the data, but I need to count CNVs without double counting.
xNew <- table(cut(CorTestMatrix$test, breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1)))
I just want to know how many CNVs are in each bin.
Would this work?
df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467")
,r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))
> df # minimal example
CNV r_value
1 1:10405137 0.03511862
2 1:10405137 0.07064334
3 1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168
df1 <- transform(df, group=cut(r_value,
breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1),
labels=c("<0.1","0.1","0.2", "0.3", "0.4", "0.5<")))
res <- do.call(data.frame,aggregate(r_value~group, df1,
FUN=function(x) c(Count=length(x))))
> res # counts of intervals
group r_value
1 <0.1 2
2 0.3 2
3 0.5< 1
dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval","count")
> dNew # count of CNV by interval
interval count
1 <0.1 2
2 0.1 NA
3 0.2 NA
4 0.3 2
5 0.4 NA
6 0.5< 1
Adapted from
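Note that the code above counts rows per bin, so a CNV that appears several times in the same bin is counted several times. A minimal base R sketch of the distinct count, assuming the same minimal `df` and the same breaks, could look like the following. The names `pairs` and `distinct_counts` are illustrative only, and the default `cut()` interval labels are used instead of custom ones.

```r
# Same minimal example data as above
df <- data.frame(
  CNV = c("1:10405137", "1:10405137", "1:10405137", "1:101161140", "1:110028467"),
  r_value = c(0.035118621, 0.070643341, 0.391963719, 0.376573375, 0.950231679)
)

# Assign each r-value to a right-closed interval such as (0,0.1]
breaks <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1)
df$group <- cut(df$r_value, breaks = breaks)

# Drop repeated (CNV, bin) pairs so each CNV contributes at most once per bin
pairs <- unique(df[, c("CNV", "group")])

# table() on a factor reports every level, so empty bins show 0 rather than NA
distinct_counts <- table(pairs$group)
print(distinct_counts)
```

With this data the first bin drops from 2 to 1, because both of its rows belong to the same CNV. A CNV whose r-values fall in different bins is still counted once in each of those bins.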
Here is a dplyr approach. (Note that if you want to count distinct CNVs instead, it is only a small change.)
library(dplyr)
df %>% mutate(binned_r_value = cut(r_value, breaks=c(0,0.1,0.2,0.3,0.4,0.5,1))) %>%
group_by(binned_r_value) %>%
tally()
# A tibble: 3 x 2
binned_r_value n
<fct> <int>
1 (0,0.1] 2
2 (0.3,0.4] 2
3 (0.5,1] 1
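The small change for distinct CNVs mentioned above might be sketched as follows, replacing `tally()` with `summarise()` and `n_distinct()`. The result name `distinct_by_bin` and the column name `n_cnv` are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

# Same minimal example data as above
df <- data.frame(
  CNV = c("1:10405137", "1:10405137", "1:10405137", "1:101161140", "1:110028467"),
  r_value = c(0.035118621, 0.070643341, 0.391963719, 0.376573375, 0.950231679)
)

distinct_by_bin <- df %>%
  # Bin the r-values into right-closed intervals
  mutate(binned_r_value = cut(r_value, breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1))) %>%
  group_by(binned_r_value) %>%
  # n_distinct() counts each CNV once per bin, however many rows it has there
  summarise(n_cnv = n_distinct(CNV))

print(distinct_by_bin)
```

By default, bins with no observations are omitted from the result. In recent dplyr versions, passing `.drop = FALSE` to `group_by()` should keep the empty factor levels.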