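R: remove category values whose count is less than 10% of the overall data and find the average of the other columns for each remaining category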

Question

I have a dataset in the following format:

Cat     v1  V2
Low     10  1
Low     10  2
Low     10  3
Low     10  1
Low     10  2
Low     10  3
Low     10  1
Low     10  2
Low     10  3
Low     10  1
Low     10  2
Low     10  3
Low     10  1
Low     10  2
Low     10  3
Low     10  1
High    90  8
High    90  9
High    90  19
VeryLow 1   23

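What I want to do is this: if the frequency of any category is less than 10% of the total number of rows in the dataset, ignore that category, then find the average of the two columns for each remaining category (like a group-by on the dataset).

So my final dataset would look like: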

Cat   Avgv1 Avgv2
Low    10   1.9
High   90   12

VeryLow was removed because its row count (1) is less than .1 * nrow(mydataset), which is 2 for these 20 rows.

Is there a way to do this in R? I am hopeful there is!

Thanks

Answer 1

One approach could be:

#split the df according to your categories
categs <- split(df, df$Cat)

#then use lapply on the splits
#if a category has less than .1 * the rows of the original data.frame return NULL
#else calculate the averages.
#Using do.call(rbind... will remove the NULLs
do.call(rbind,
lapply(categs, function(x){
  if(nrow(x) < 0.1*nrow(df)) return(NULL) else aggregate(cbind(v1,V2)~Cat, x, FUN=mean)
}))

Output:

      Cat v1      V2
High High 90 12.0000
Low   Low 10  1.9375
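
The threshold arithmetic behind this output can be checked directly. The following is only a sketch in base R: it rebuilds the example data from the question under the name `df` used above, prints each category's share of the rows, and then does the same filtering with `ave` in place of `split`. The names `shares`, `n_cat`, `kept` and `result` are illustrative assumptions, not part of the original answer.

```r
# Rebuild the example data from the question (20 rows)
df <- data.frame(
  Cat = c(rep("Low", 16), rep("High", 3), "VeryLow"),
  v1  = c(rep(10, 16), rep(90, 3), 1),
  V2  = c(rep(c(1, 2, 3), length.out = 16), 8, 9, 19, 23)
)

# Share of rows per category: High 0.15, Low 0.80, VeryLow 0.05
shares <- prop.table(table(df$Cat))
shares

# n_cat (illustrative name) holds, for every row, the size of its category
n_cat <- ave(seq_len(nrow(df)), df$Cat, FUN = length)

# Keep rows whose category covers at least 10% of all rows, then average
kept   <- df[n_cat >= 0.1 * nrow(df), ]
result <- aggregate(cbind(v1, V2) ~ Cat, data = kept, FUN = mean)
result
```

Only VeryLow falls below the 10% mark (1 row out of 20), so `result` should contain the High and Low rows with the same averages as shown above.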

Answer 2

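You can use data.table as follows:

library(data.table)
DT <- as.data.table(my.data.set)

# threshold: categories below this share of all rows are dropped
my.number <- 0.1

# freq = share of all rows that falls in each row's category
DT[, freq := .N / nrow(DT), by = Cat]

# keep the frequent categories and average both columns per category
DT[freq >= my.number,
   .(Avgv1 = mean(v1), Avgv2 = mean(V2)),
   by = Cat]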

Answer 3

A dplyr approach:

library(dplyr)

low_cat_freqs <- df %>%
  group_by(Cat) %>%
  tally() %>%
  mutate(freq = n / sum(n)) %>%
  filter(freq < 0.10)

low_cat_freqs
# Source: local data frame [1 x 3]
# 
#       Cat     n  freq
#    (fctr) (int) (dbl)
# 1 VeryLow     1  0.05

df %>%
  filter(!Cat %in% low_cat_freqs$Cat) %>%
  # then continue with whatever summary you need
  group_by(Cat) %>%
  summarise(avg_v1 = mean(v1),
            avg_v2 = mean(V2))

#      Cat avg_v1  avg_v2
#   (fctr)  (dbl)   (dbl)
# 1   High     90 12.0000
# 2    Low     10  1.9375
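
The two steps can also be collapsed into a single pipeline by filtering on the group size directly. This is a sketch rather than the answer's own method: it assumes the example data from the question rebuilt as `df`, and the name `result` is illustrative. Inside a grouped `filter`, `n()` is the number of rows in the current group, while `nrow(df)` refers to the full data frame.

```r
library(dplyr)

# Rebuild the example data from the question (20 rows)
df <- data.frame(
  Cat = c(rep("Low", 16), rep("High", 3), "VeryLow"),
  v1  = c(rep(10, 16), rep(90, 3), 1),
  V2  = c(rep(c(1, 2, 3), length.out = 16), 8, 9, 19, 23)
)

result <- df %>%
  group_by(Cat) %>%
  # keep only groups that cover at least 10% of all rows
  filter(n() >= 0.10 * nrow(df)) %>%
  summarise(avg_v1 = mean(v1),
            avg_v2 = mean(V2))
result
```

This avoids the intermediate `low_cat_freqs` table, at the cost of not having the list of dropped categories available for inspection.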

r

data-analysis