R: remove category values whose count is less than 10% of the overall data and find the average of the other columns for each remaining category
I have a dataset in the following format:
Cat v1 V2
Low 10 1
Low 10 2
Low 10 3
Low 10 1
Low 10 2
Low 10 3
Low 10 1
Low 10 2
Low 10 3
Low 10 1
Low 10 2
Low 10 3
Low 10 1
Low 10 2
Low 10 3
Low 10 1
High 90 8
High 90 9
High 90 19
VeryLow 1 23
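What I want to do is this: if the frequency of any category is less than 10% of the total number of rows in the dataset, I ignore that category, and then I find the average of the two columns for each remaining category (like a groupby on the dataset).
So my final dataset would look like: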
Cat Avgv1 Avgv2
Low 10 1.9
High 90 12
VeryLow was removed because its count was less than 0.1 * nrow(mydataset).
Is there any way this can be done in R? I am very hopeful!
Thanks
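To run the answers below, the sample data can be rebuilt as a data frame. This is only a minimal sketch: the name `df` is an assumption chosen to match the answers, and `shares` is a hypothetical helper that shows each category's share of the rows against the 10% cutoff.

```r
# Rebuild the 20-row sample from the question (the name df is assumed)
df <- data.frame(
  Cat = c(rep("Low", 16), rep("High", 3), "VeryLow"),
  v1  = c(rep(10, 16), rep(90, 3), 1),
  V2  = c(rep(c(1, 2, 3), length.out = 16), 8, 9, 19, 23)
)

# Share of rows per category; anything below 0.1 is to be dropped
shares <- prop.table(table(df$Cat))
shares
# High 0.15, Low 0.80, VeryLow 0.05
```

Only VeryLow (1 of 20 rows) falls under the cutoff.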
One way could be:
# Split the data frame by category
categs <- split(df, df$Cat)
# Then use lapply on the splits: if a category has fewer than 0.1 * the rows
# of the original data.frame, return NULL; otherwise calculate the averages.
# do.call(rbind, ...) drops the NULLs.
do.call(rbind,
        lapply(categs, function(x) {
          if (nrow(x) < 0.1 * nrow(df)) return(NULL)
          aggregate(cbind(v1, V2) ~ Cat, x, FUN = mean)
        }))
Output:
Cat v1 V2
High High 90 12.0000
Low Low 10 1.9375
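The same filter-then-average idea can also be written without the split/lapply step, which avoids the duplicated row names in the output. This is a rough sketch rather than the answerer's method; `df` is rebuilt from the question's sample, and `shares`, `keep`, and `result` are hypothetical names.

```r
# Sample data from the question (the name df is assumed)
df <- data.frame(
  Cat = c(rep("Low", 16), rep("High", 3), "VeryLow"),
  v1  = c(rep(10, 16), rep(90, 3), 1),
  V2  = c(rep(c(1, 2, 3), length.out = 16), 8, 9, 19, 23)
)

# Categories holding at least 10% of all rows
shares <- prop.table(table(df$Cat))
keep <- names(shares)[shares >= 0.1]

# Average v1 and V2 over the retained categories only
result <- aggregate(cbind(v1, V2) ~ Cat, data = df[df$Cat %in% keep, ], FUN = mean)
result
```

Using `>= 0.1` to keep is the complement of the question's "less than 10%" rule for removal.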
You can use data.table as follows:
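library('data.table')
DT <- as.data.table(my.data.set)  # my.data.set: the data frame from the question
my.number <- 0.1                  # minimum share of rows a category must reach
# Count rows per category, turn the counts into frequencies,
# and keep the categories that reach the threshold
keep <- DT[,
           .N,
           by = 'Cat'
         ][,
           freq := N / sum(N)
         ][freq >= my.number, Cat]
# Average v1 and V2 within the retained categories
DT[Cat %in% keep,
   .(avg_v1 = mean(v1), avg_v2 = mean(V2)),
   by = 'Cat']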
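For what it's worth, data.table can also do the filtering and the averaging in one grouped call, because a group whose `j` expression evaluates to NULL is left out of the result. A minimal sketch, assuming the data.table package is installed; `DT` is rebuilt from the question's sample, and the `avg_v1`/`avg_v2` column names are my own choice.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  Cat = c(rep("Low", 16), rep("High", 3), "VeryLow"),
  v1  = c(rep(10, 16), rep(90, 3), 1),
  V2  = c(rep(c(1, 2, 3), length.out = 16), 8, 9, 19, 23)
)

# For each category: return the averages only if the group holds
# at least 10% of all rows; otherwise the if yields NULL and the group is dropped
result <- DT[, if (.N / nrow(DT) >= 0.1) .(avg_v1 = mean(v1), avg_v2 = mean(V2)),
             by = Cat]
result
```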
A dplyr approach:
library(dplyr)

# Categories that make up less than 10% of all rows
low_cat_freqs <- df %>%
  group_by(Cat) %>%
  tally() %>%
  mutate(freq = n / sum(n)) %>%
  filter(freq < 0.10)

low_cat_freqs
# Source: local data frame [1 x 3]
#
#       Cat     n  freq
#    (fctr) (int) (dbl)
# 1 VeryLow     1  0.05

# Drop those categories, then continue with whatever you wish to do
df %>%
  filter(!Cat %in% low_cat_freqs$Cat) %>%
  group_by(Cat) %>%
  summarise(avg_v1 = mean(v1),
            avg_v2 = mean(V2))
#      Cat avg_v1  avg_v2
#   (fctr)  (dbl)   (dbl)
# 1   High     90 12.0000
# 2    Low     10  1.9375
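If the list of dropped categories is not needed on its own, the two steps can probably be folded into one pipeline with a grouped `filter()`. This is a sketch under the assumption that dplyr is installed; `df` is rebuilt from the question's sample and `result` is a hypothetical name.

```r
library(dplyr)

# Sample data from the question (the name df is assumed)
df <- data.frame(
  Cat = c(rep("Low", 16), rep("High", 3), "VeryLow"),
  v1  = c(rep(10, 16), rep(90, 3), 1),
  V2  = c(rep(c(1, 2, 3), length.out = 16), 8, 9, 19, 23)
)

result <- df %>%
  group_by(Cat) %>%
  # n() is the size of the current group; keep groups with >= 10% of all rows
  filter(n() / nrow(df) >= 0.1) %>%
  summarise(avg_v1 = mean(v1),
            avg_v2 = mean(V2))
result
```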