How to automate adding factors to variables in a large data frame in R
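I have a large data frame in R with more than 200 variables, most of them character, that I want to convert to factors. I have already prepared all the levels and labels in a separate data frame. For a given variable Var1, the corresponding levels and labels are named Var1_v and Var1_b; for example, for the variable Gender they are named Gender_v and Gender_b.
Here is an example of my data: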
df <- data.frame(Gender = c("2","2","1","2"),
                 AgeG = c("3","1","4","2"))
fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))
df$Gender <- factor(df$Gender, levels = fct$Gender_v, labels = fct$Gender_b, exclude = NULL)
df$AgeG <- factor(df$AgeG, levels = fct$AgeG_v, labels = fct$AgeG_b, exclude = NULL)
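Is there a way to automate this so that the factors (levels and labels) are applied to the corresponding variables without my having to handle each variable individually?
I assume this could be done with a function such as pmap.
My goal is to minimize the work this process requires. Is there a better way to prepare the labels and levels?
Thank you very much for your help.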
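I solved this with a simple refactoring of your code that automates the process with a loop. The more data you add, the longer it will take. I believe the expression fct[[paste0(names(df[i]),"_v")]] could be factored out into a small function to make it look nicer.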
> df <- data.frame (Gender = c("2","2","1","2"),
+ AgeG = c("3","1","4","2"))
>
> fct <- data.frame (Gender_v = c("1", "2"),
+ Gender_b = c("Male", "Female"),
+ AgeG_v = c("1","2","3","4"),
+ AgeG_b = c("<25","25-60","65-80",">80"))
>
> for(i in 1:ncol(df)){
+
+ le <- fct[[paste0(names(df[i]),"_v")]]
+
+ la <- fct[[paste0(names(df[i]),"_b")]]
+
+ df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+
+ }
>
> df
Gender AgeG
1 Female 65-80
2 Female <25
3 Male >80
4 Female 25-60
>
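The small function mentioned above could look something like the following. This is only a sketch: the name get_def and its arguments are made up for illustration and are not part of the original code. It assumes every column of df has matching _v and _b columns in fct. Note also that data.frame() recycles the two-row Gender columns to four rows here, which only works because 2 divides 4.

```r
# Hypothetical helper (not in the original answer): return the column of
# `fct` whose name is the variable name plus a suffix such as "_v" or "_b".
get_def <- function(fct, var, suffix) {
  fct[[paste0(var, suffix)]]
}

df <- data.frame(Gender = c("2", "2", "1", "2"),
                 AgeG   = c("3", "1", "4", "2"))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"))

# Same loop as above, iterating over column names instead of indices.
for (var in names(df)) {
  le <- get_def(fct, var, "_v")  # levels
  la <- get_def(fct, var, "_b")  # labels
  df[[var]] <- factor(df[[var]], levels = le, labels = la, exclude = NULL)
}
```

Looping over names(df) rather than 1:ncol(df) also avoids the surprise that 1:ncol(df) produces c(1, 0) when the data frame has no columns.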
Edit: here is the version with an if condition added, so that only columns whose names end in _f are converted.
> library(stringr)  # needed for str_remove()
> df <- data.frame (Gender_f = c("2","2","1","2"),
+ AgeG_f = c("3","1","4","2"),
+ AgeN = c(70,15,96,30))
>
> fct <- data.frame (Gender_v = c("1", "2"),
+ Gender_b = c("Male", "Female"),
+ AgeG_v = c("1","2","3","4"),
+ AgeG_b = c("<25","25-60","65-80",">80"))
>
> for(i in 1:ncol(df)){
+
+ if(endsWith(names(df[i]),"_f")){
+
+ name <- str_remove(names(df[i]),"_f")
+
+ le <- fct[[paste0(name,"_v")]]
+
+ la <- fct[[paste0(name,"_b")]]
+
+ df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+
+ }
+
+ }
>
> df
Gender_f AgeG_f AgeN
1 Female 65-80 70
2 Female <25 15
3 Male >80 96
4 Female 25-60 30
>
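If you would rather not depend on stringr, the same idea can be sketched in base R. This variant is my own illustration under the same naming assumptions (_f, _v, _b suffixes). It uses sub() with the anchored pattern "_f$", which strips only a trailing _f, whereas str_remove(x, "_f") removes the first _f found anywhere in the name.

```r
df <- data.frame(Gender_f = c("2", "2", "1", "2"),
                 AgeG_f   = c("3", "1", "4", "2"),
                 AgeN     = c(70, 15, 96, 30))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"))

# Columns to convert: those whose name ends in "_f".
cols <- grep("_f$", names(df), value = TRUE)

# Convert all selected columns in one assignment; other columns are untouched.
df[cols] <- lapply(cols, function(col) {
  name <- sub("_f$", "", col)  # strip only the trailing "_f"
  factor(df[[col]],
         levels = fct[[paste0(name, "_v")]],
         labels = fct[[paste0(name, "_b")]],
         exclude = NULL)
})
```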
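A data frame is not really a suitable data structure for storing factor level definitions: there is no reason to expect all factors to have the same number of levels. Instead, I would just use a simple list and store the level information more compactly as named vectors, like this: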
df <- data.frame(
  Gender = c("2", "2", "1", "2"),
  AgeG = c("3", "1", "4", "2")
)
value_labels <- list(
  Gender = c("Male" = 1, "Female" = 2),
  AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
)
You can then write a function that uses this data structure to create the factors in the data frame:
make_factors <- function(data, value_labels) {
  for (var in names(value_labels)) {
    if (var %in% colnames(data)) {
      vl <- value_labels[[var]]
      data[[var]] <- factor(
        data[[var]],
        levels = unname(vl),
        labels = names(vl)
      )
    }
  }
  data
}
make_factors(df, value_labels)
#> Gender AgeG
#> 1 Female 65-80
#> 2 Female <25
#> 3 Male >80
#> 4 Female 25-60
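If the definitions already exist in a data frame like fct from the question, they do not have to be retyped. The sketch below is an assumption-laden illustration, not part of the answer above: it assumes the _v/_b naming convention from the question and drops duplicated or NA rows, which appear when shorter columns were recycled or padded to fit the data frame.

```r
# stringsAsFactors = FALSE keeps the columns as character on R < 4.0 too.
fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"),
                  stringsAsFactors = FALSE)

# Variable names are the "_v" column names with the suffix removed.
vars <- sub("_v$", "", grep("_v$", names(fct), value = TRUE))

# Build one named vector per variable: values are levels, names are labels.
value_labels <- lapply(setNames(vars, vars), function(var) {
  le <- fct[[paste0(var, "_v")]]
  la <- fct[[paste0(var, "_b")]]
  keep <- !duplicated(le) & !is.na(le)  # drop recycled or padded rows
  setNames(le[keep], la[keep])
})
```

The resulting list has the same shape as value_labels above, except that the level values are character rather than numeric, so it can be passed to make_factors() in the same way.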