Group low-frequency counts into a single 'Other' category

Question

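Apologies if there is a really simple way to solve this. I am new to R and to data handling in general.

I have a data set that contains a number of factors, with a count associated with each. For example,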

A 25
B 1
C 15
D 5
E 2

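My ultimate aim is to create a pie chart from the data frame. I would like to include all of the values, but group those below a certain count/percentage into a new 'Other' category. For example, with a threshold of 5 (values of 5 or less are grouped):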

A 25
C 15
Other 8

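I can use the subset() function to select the data above a certain threshold, but that only returns the higher values I want to keep in the new table, and I don't know how to add the excluded values to an 'Other' category in the new data frame.

I would be very grateful if somebody could help me out. There have been one or two similar posts on this topic in the past, but either the situation was not quite the same or I struggled to follow them.

Thank you for your time!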

[Image: picture of the data]

Answer 1

One option for collapsing factor levels (or character values) is fct_collapse from forcats:

library(dplyr)
library(forcats)
threshold <- 7
out <- df1 %>% 
         count(Col1 = fct_collapse(Col1, Other = unique(Col1[Col2 < threshold])),  
            wt = Col2)
out
# A tibble: 3 x 2
#  Col1      n
#  <fct> <int>
#1 A        25
#2 Other     8
#3 C        15

Then, we can create a pie chart:

library(ggplot2)
out %>% 
  ggplot(aes(x = "", y = n, fill = Col1)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start=0)
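For comparison, the same chart can be sketched with base R graphics, with a percentage added to each slice label. This is only an illustration, not part of the original answer: the `out` data frame below is retyped from the result printed above so that the block runs on its own, and `pct` and `slice_labels` are names introduced here.

```r
# Counts copied from the fct_collapse result printed above
out <- data.frame(Col1 = c("A", "Other", "C"), n = c(25L, 8L, 15L))

# Percentage of the total for each slice, rounded to whole numbers
pct <- round(100 * out$n / sum(out$n))

# Build labels of the form "<category> (<percent>%)"
slice_labels <- paste0(out$Col1, " (", pct, "%)")

# Draw the pie chart with base graphics
pie(out$n, labels = slice_labels,
    main = "Counts with low-frequency groups lumped")
```

Rounding each share separately means the displayed percentages may not always add up to exactly 100.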

Update

Based on the OP's input, we can change the column names to match the OP's data:

df2 %>%
  count(Haplogroup = fct_collapse(as.character(Haplogroup), 
      Other = unique(as.character(Haplogroup)[n < threshold])),
      wt = n, name = "n1")
# A tibble: 6 x 2
#  Haplogroup    n1
#  <fct>      <int>
#1 Other         40
#2 E1b           14
#3 N1a           12
#4 R1            10
#5 R1a           15
#6 R1b           25

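Another option is base R (assuming the column is of character class): create a logical vector by comparing 'Col2' with 'threshold', assign 'Other' to the elements of 'Col1' where 'i1' is TRUE, and then do a group-by sum with aggregate: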

i1 <- df1$Col2 < threshold
df1$Col1[i1] <- "Other"
aggregate(Col2 ~ Col1, df1, sum)
#    Col1 Col2
#1     A   25
#2     C   15
#3 Other    8
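The question also mentions a percentage cutoff. A minimal base R sketch of that variant follows, assuming the same 'Col1'/'Col2' layout. The helper name `lump_by_share` and its `min_share` argument are hypothetical, introduced only for this example. Unlike the assignment above, it leaves the input data frame unchanged.

```r
# Same sample data as in the answer
df1 <- data.frame(Col1 = c("A", "B", "C", "D", "E"),
                  Col2 = c(25L, 1L, 15L, 5L, 2L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: lump rows whose share of the total count
# is below min_share into a single category, then sum by category
lump_by_share <- function(df, min_share, label = "Other") {
  share <- df$Col2 / sum(df$Col2)
  # as.character() guards against Col1 being stored as a factor
  grp <- ifelse(share < min_share, label, as.character(df$Col1))
  tmp <- data.frame(Col1 = grp, Col2 = df$Col2, stringsAsFactors = FALSE)
  aggregate(Col2 ~ Col1, tmp, sum)
}

# Lump every category that makes up less than 15% of the total
res <- lump_by_share(df1, min_share = 0.15)
res
```

With this sample data, a 15% cutoff lumps the same three rows (B, D and E) as the count threshold of 7 used above.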

Data

df1 <- structure(list(Col1 = c("A", "B", "C", "D", "E"), Col2 = c(25L, 
1L, 15L, 5L, 2L)), row.names = c(NA, -5L), class = "data.frame")
