Is there a more efficient way to handle facts that are duplicated in an R data frame?
I have a data frame that looks like this:
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
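The dimensions in the data frame work like this:
- There is ALWAYS an ID/key that uniquely identifies a submitted fact.
- There is ALWAYS a dimension for a given fact that defines the Overall_Category the submitted fact belongs to.
- Most of the time, but NOT always, there is a "Descriptor" dimension.
- If a given fact has a "Descriptor" dimension, there is another "Members" dimension that shows the possible members within "Descriptor".
The problem is that a single submitted fact is repeated for a given ID, once for each dimension that applies to it. What I would like is a way to show the fact only once per ID and store the applicable dimensions against that single ID.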
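The description above assumes that every ID carries exactly one Fact value, repeated once per dimension row. A minimal sketch of how that assumption could be checked before collapsing the rows; the names `df_check` and `fact_counts` are made up for this illustration and only the ID and Fact columns of the sample data are used:

```r
library(dplyr)

# Same ID and Fact values as the sample data in the question
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
df_check <- data.frame(ID, Fact)  # hypothetical name, illustration only

# Count rows and distinct Fact values for each ID
fact_counts <- df_check %>%
  group_by(ID) %>%
  summarise(n_rows = n(), n_facts = n_distinct(Fact), .groups = "drop")

# TRUE when every ID maps to exactly one Fact value
all(fact_counts$n_facts == 1)
```

If this check returns FALSE for real data, grouping by ID alone would mix different facts together, so the grouping key would need to be reconsidered.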
I got the result I want by doing this:
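library(dplyr)
library(tidyr)
library(stringr)

# One column per Category-Descriptor-Member combination, holding the Fact value
df1 <- pivot_wider(df,
                   id_cols = ID,
                   names_from = c(Overall_Category, Descriptor, Members),
                   names_prefix = "zzzz",
                   values_from = Fact,
                   names_sep = "-",
                   names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
# Recover the single Fact value per ID from the wide columns
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., all_of(ColumnNames)), na.rm = TRUE))
# Replace each non-missing value with its column name, then glue the names together
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', all_of(ColumnNames), na.rm = TRUE, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")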
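But because of pivot_wider, it does not seem to scale well to facts with many dimensions, and in general it does not look like a very efficient approach.
Is there a better way?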
I think you want a simple paste with both the sep and collapse arguments:
library(dplyr, warn.conflicts = F)
df %>% group_by(ID, Fact) %>%
summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')
# A tibble: 3 x 3
#      ID  Fact Descriptor
#   <dbl> <dbl> <chr>
# 1     1   233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
# 2     2    50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
# 3     3    15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
You can unite the columns, group by ID, and take the mean of the Fact values.
library(dplyr)
library(tidyr)
df %>%
unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
group_by(ID) %>%
summarise(Descriptor = paste0(Descriptor, collapse = '_'),
mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
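The question mentions that a fact does not always have a Descriptor. A small sketch of how that case might behave, assuming the missing dimensions are stored as NA; the data frame `df_na` and the ID 4 row are invented for this illustration and are not part of the original data:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: ID 4 has an Overall_Category but no Descriptor or Members
df_na <- data.frame(
  ID = c(1, 1, 4),
  Fact = c(233, 233, 99),
  Overall_Category = c("Purchaser", "Purchaser", "Car"),
  Descriptor = c("Country", "Gender", NA),
  Members = c("America", "Male", NA)
)

# unite() with na.rm = TRUE drops the missing pieces before joining
with_unite <- df_na %>%
  unite(Descriptor, Overall_Category:Members, sep = "-", na.rm = TRUE) %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = paste(Descriptor, collapse = "_"), .groups = "drop")

# paste() turns NA into the literal text "NA"
with_paste <- df_na %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = "-"),
                               collapse = "_"),
            .groups = "drop")

with_unite$Descriptor[with_unite$ID == 4]  # "Car"
with_paste$Descriptor[with_paste$ID == 4]  # "Car-NA-NA"
```

So if some facts lack a Descriptor, the unite route with na.rm = TRUE is probably the more convenient of the two, since a plain paste would need the NA values handled separately.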
Another option, using str_c:
library(dplyr)
library(stringr)
df %>%
group_by(ID, Fact) %>%
summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')