How to bin data based on values in one column, and count occurrences from another column excluding duplicates in R?

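I have a file of correlation r-values. I want to split the r-values into bins and count how many CNVs fall into each bin. Is there a way to do this without counting the same CNV more than once?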

GeneChr   SNP   SNP_Position          CNV           start       end         r-value
1   rs7520551   100716167   1:101161140-101161459   100161140   102161459   0.950231679
1   rs6702766   100997635   1:101161140-101161459   100161140   102161459   0.376573375
1   rs11588568  101426960   1:101161140-101161459   100161140   102161459   0.252772248
1   rs4332900   10236894    1:10405137-10406094     9405137     11406094    0.171113128
1   rs11678947  10307395    1:10405137-10406094     9405137     11406094    0.334359684
1   rs2357468   10341468    1:10405137-10406094     9405137     11406094    0.30932652
1   rs1918705   10693478    1:10405137-10406094     9405137     11406094    0.822784876
1   rs7570190   101528047   1:101161140-101161459   100161140   102161459   0.391963719
1   rs643841    110832827   1:110028467-110029625   109028467   111029625   0.070643341
1   rs7514102   110998854   1:110028467-110029625   109028467   111029625   0.548219745
1   rs4676225   109609765   1:110028467-110029625   109028467   111029625   0.035118621
1   rs7608232   101699063   1:101161140-101161459   100161140   102161459   0.951958567
1   rs1449308   100708996   1:101161140-101161459   100161140   102161459   0.703308687

I have this line to bin the data; I just need to count the CNVs without double counting.

xNew <- table(cut(CorTestMatrix$test, breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1)))

I only want to know how many CNVs are in each bin.

Would this work?

df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467")
     ,r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))

> df # minimal example
          CNV    r_value
1  1:10405137 0.03511862
2  1:10405137 0.07064334
3  1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168

df1 <- transform(df, group=cut(r_value, 
                        breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1),
                        labels=c("<0.1","0.1","0.2", "0.3", "0.4", "0.5<")))

res <- do.call(data.frame,aggregate(r_value~group, df1, 
                                    FUN=function(x) c(Count=length(x))))

> res # counts of intervals
  group r_value
1  <0.1       2
2   0.3       2
3  0.5<       1

dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval","count")

> dNew # count of CNV by interval
  interval count
1     <0.1     2
2      0.1    NA
3      0.2    NA
4      0.3     2
5      0.4    NA
6     0.5<     1
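
Note that the counts above are per row, so a CNV that appears twice in the same interval is counted twice. If each CNV should count at most once per bin, one option is to drop repeated (CNV, bin) pairs before tabulating. This is a minimal sketch on the same example data, assuming the default interval labels from cut(); the names bin, uniq and cnv_per_bin are only illustrative.

```r
# Minimal example data, same values as in the answer above
df <- data.frame(
  CNV = c("1:10405137", "1:10405137", "1:10405137", "1:101161140", "1:110028467"),
  r_value = c(0.035118621, 0.070643341, 0.391963719, 0.376573375, 0.950231679)
)

# Assign each r-value to an interval (right-closed by default)
breaks <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1)
df$bin <- cut(df$r_value, breaks = breaks)

# Keep one row per (CNV, bin) pair so a CNV counts at most once per bin
uniq <- unique(df[, c("CNV", "bin")])

# table() on a factor keeps empty bins and reports 0 for them instead of NA
cnv_per_bin <- table(uniq$bin)
print(cnv_per_bin)
```

On this data the first interval drops from 2 to 1, because both rows there belong to the same CNV. Using table() on the factor also avoids the merge step, since unused levels show up with a count of 0.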

Here is a dplyr approach. (Note that if you want to count distinct CNVs instead, it is only a small change.)

library(dplyr)

df %>% mutate(binned_r_value = cut(r_value, breaks=c(0,0.1,0.2,0.3,0.4,0.5,1))) %>%
  group_by(binned_r_value) %>%
  tally()

# A tibble: 3 x 2
  binned_r_value     n
  <fct>          <int>
1 (0,0.1]            2
2 (0.3,0.4]          2
3 (0.5,1]            1
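
For the distinct-CNV variant mentioned above, a possible sketch is to swap tally() for summarise() with n_distinct(). It uses the same example data; the names cnv_counts and n_cnv are made up for illustration.

```r
library(dplyr)

# Minimal example data, same values as in the answers above
df <- data.frame(
  CNV = c("1:10405137", "1:10405137", "1:10405137", "1:101161140", "1:110028467"),
  r_value = c(0.035118621, 0.070643341, 0.391963719, 0.376573375, 0.950231679)
)

cnv_counts <- df %>%
  # Bin the r-values into right-closed intervals
  mutate(binned_r_value = cut(r_value, breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1))) %>%
  group_by(binned_r_value) %>%
  # Count each CNV once per bin, however many rows it has there
  summarise(n_cnv = n_distinct(CNV))

print(cnv_counts)
```

Compared with tally(), the (0,0.1] bin goes from 2 to 1 because both of its rows share one CNV. Bins with no rows are still omitted; in recent dplyr versions, passing .drop = FALSE to group_by() should keep them, though that is worth checking against your installed version.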