Subset and group a data frame by matching columns and values in R
I have two data frames. df1 contains a GroupID and continuous variables, like this:
GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043
And df2 contains a cutoff value (ct) for each variable:
Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294
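What I want to do is, for each variable in df1, count the rows whose value is greater than the cutoff in the corresponding column of df2, and return that count per GroupID, so that the output looks something like this: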
GroupID N-Var1 N-Var2 N-Var3 N-Var4
1 62 78 33 99
2 69 25 77 12
3 55 45 27 62
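df1 has about 2 million rows, distributed unevenly across GroupID, and I need to do this for 30 variable columns. I am just looking for a more efficient way than typing the same function for all 30 variables.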
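To make the counting rule concrete, here is a minimal base R sketch of the repetitive, one-call-per-variable version described above, using the six sample rows from the question. The names `n_var1` and `n_var2` are placeholders introduced only for this illustration.

```r
# Sample data copied from the question
df1 <- read.table(header = TRUE, text = "GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043")
df2 <- read.table(header = TRUE, text = "Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")

# One call per variable: compare against the cutoff, then sum the TRUEs per group.
# n_var1 and n_var2 are hypothetical names used only here.
n_var1 <- tapply(df1$Var1 > df2$Var1ct, df1$GroupID, sum)
n_var2 <- tapply(df1$Var2 > df2$Var2ct, df1$GroupID, sum)

n_var1
n_var2
```

Written this way, 30 variables need 30 nearly identical lines, which is the repetition the question wants to avoid.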
Here is one approach in dplyr:
library(dplyr)
df1 %>%
group_by(GroupID) %>%
summarise(across(everything(), ~ sum(.x > df2[grepl(cur_column(), colnames(df2))][, 1])))
  GroupID  Var1  Var2  Var3  Var4
    <int> <int> <int> <int> <int>
1       1     1     1     0     2
2       2     1     2     0     2
3       3     1     2     0     2
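One thing to watch with 30 variables: `grepl()` does pattern matching, so the pattern "Var1" also matches a column such as "Var10ct", and the result then depends on column order. The sketch below is a variation on the code above, not the original answer. It looks up the cutoff by exact name instead, assuming every cutoff column is named as the variable name plus the suffix "ct", as in the sample data. `res` is a name introduced here.

```r
library(dplyr)

# Sample data copied from the question
df1 <- read.table(header = TRUE, text = "GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043")
df2 <- read.table(header = TRUE, text = "Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")

# Why a pattern match can be ambiguous: "Var1" matches both names
grepl("Var1", c("Var1ct", "Var10ct"))  # TRUE TRUE

# Exact lookup: assumes the cutoff column is always "<variable>ct"
res <- df1 %>%
  group_by(GroupID) %>%
  summarise(across(everything(),
                   ~ sum(.x > df2[[paste0(cur_column(), "ct")]])))
res
```

On the sample data this should give the same counts as the version above; the difference only matters once names like Var1 and Var10 coexist.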
Data
df1 <- read.table(header = T, text = "GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043 ")
df2 <- read.table(header = T, text = "Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")
A data.table approach that should scale well:
library(data.table)
# if df1 and df2 are not data.tables, use
# setDT(df1); setDT(df2)
# we need matching column names in df1 and df2 to join easily
setnames(df2, names(df1)[2:5])
# melt df1 and df2 to long format
df1.long <- melt(df1, id.vars = "GroupID")
df2.long <- melt(df2, measure.vars = names(df2))
# join ct-values
df1.long[df2.long, ct := i.value, on = .(variable)]
# summarise
ans <- df1.long[, sum(value > ct), by = .(GroupID, variable)]
# cast to wide
dcast(ans, GroupID ~ variable, value.var = "V1")
# GroupID Var1 Var2 Var3 Var4
# 1: 1 1 1 0 2
# 2: 2 1 2 0 2
# 3: 3 1 2 0 2
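If reshaping 2 million rows by 30 columns into long format turns out to use too much memory, a possible alternative is to stay in wide format and pair each column with its cutoff directly. This is a sketch rather than part of the original answer. It assumes the cutoff columns are named as the variable name plus "ct", and `vars`, `cutoffs` and `res` are names introduced here. Note that it expects df2 with its original column names, that is, before the `setnames()` call above.

```r
library(data.table)

# Sample data copied from the question
df1 <- fread("GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043")
df2 <- fread("Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")

# All measured columns, and their cutoffs in the same order
vars <- setdiff(names(df1), "GroupID")
cutoffs <- unlist(df2[, paste0(vars, "ct"), with = FALSE])

# For each group, walk over the columns and cutoffs in parallel
res <- df1[, Map(function(x, ct) sum(x > ct), .SD, cutoffs),
           by = GroupID, .SDcols = vars]
res
```

Whether this is faster than the melt and join route on the real data is something to benchmark rather than assume.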
Sample data
df1 <- fread("GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043 ")
df2 <- fread("Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")