Problem collapsing levels of a factor in R
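I have a messy factor variable with more levels than it should have. The cases come from an open survey, and many participants misspelled their answer or simply gave the same answer in different ways.
Here is a sample df that represents my problem: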
df <- data.frame(ID = 1:10,
                 Nationality = c("espanol", "spaniol", "ESPANOL",
                                 "spanish", "colombia", "Colombian",
                                 "British", "brit", "ESPanol", "UK")
)
My desired output is this:
> df
ID Nationality
1 1 Spanish
2 2 Spanish
3 3 Spanish
4 4 Spanish
5 5 Colombian
6 6 Colombian
7 7 British
8 8 British
9 9 Spanish
10 10 British
This is what I tried in order to reduce these 10 artificial factor levels to the 3 there should be (Spanish, Colombian, British):
library(forcats)
levels(df$Nationality) <- fct_collapse(df$Nationality, Spanish = c("espanol", "spaniol", "ESPANOL",
"spanish", "ESPanol"),
Colombian = c("colombia", "Colombian"),
British = c("British", "brit", "UK")
)
This does reduce my Nationality factor to 3 levels, but the output looks like this, which bears no resemblance to the desired output above:
> df
ID Nationality
1 1 Colombian
2 2 British
3 3 British
4 4 Spanish
5 5 Spanish
6 6 Spanish
7 7 Spanish
8 8 Spanish
9 9 Colombian
10 10 British
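In the larger dataset I am working with it does not work either, and the output is even worse: every case becomes "Spanish", and I have no idea why this happens.
Thanks in advance for your help!
Best,
Lucas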
Have you tried making Nationality a factor first?
library(dplyr)    # provides %>%, mutate() and across()
library(forcats)  # provides fct_collapse()

df <- data.frame(ID = 1:10,
                 Nationality = c("espanol", "spaniol", "ESPANOL",
                                 "spanish", "colombia", "Colombian",
                                 "British", "brit", "ESPanol", "UK")
)

# Convert to a factor, then assign the collapsed factor back to the column itself
df2 <- df %>%
  mutate(Nationality = factor(Nationality)) %>%
  mutate(Nationality = fct_collapse(Nationality,
                                    Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
                                    Colombian = c("colombia", "Colombian"),
                                    British = c("British", "brit", "UK")))

# more concise
df2 <- df %>%
  mutate(across(Nationality, ~ fct_collapse(factor(.),
                                            Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
                                            Colombian = c("colombia", "Colombian"),
                                            British = c("British", "brit", "UK")
  )))
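As for why the original attempt scrambled the labels, here is one possible reading, sketched in base R. fct_collapse() returns one value per row, whereas `levels(x) <- value` expects one label per level and pairs them up by position with the existing (sorted) levels. The sketch assumes the column really is a factor (the default before R 4.0, forced here with `stringsAsFactors = TRUE`), and the vector `wanted` is only a stand-in for the per-row result of fct_collapse().

```r
# Assumption: the column is a factor, as it was by default before R 4.0
df <- data.frame(ID = 1:10,
                 Nationality = c("espanol", "spaniol", "ESPANOL",
                                 "spanish", "colombia", "Colombian",
                                 "British", "brit", "ESPanol", "UK"),
                 stringsAsFactors = TRUE)

# Stand-in for what fct_collapse() returns: one label per ROW
wanted <- c("Spanish", "Spanish", "Spanish", "Spanish", "Colombian",
            "Colombian", "British", "British", "Spanish", "British")

# levels<- treats `wanted` as one label per LEVEL, matched by position
# against the sorted levels rather than against the rows
bad <- df$Nationality
levels(bad) <- wanted

# Which old level received which new label (order depends on the locale's sort)
print(data.frame(old_level = levels(df$Nationality), relabelled_as = wanted))

nlevels(bad)        # 3 levels, as observed in the question
as.character(bad)   # but the rows no longer line up with `wanted`
```

If this reading is right, it might also explain the larger dataset: only the first few entries of the per-row result get used as level labels, so if those rows all happen to be Spanish, every level is relabelled Spanish. Assigning the result of fct_collapse() to `df$Nationality` itself, as in the answer above, avoids the mismatch.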
Here are some solutions using built-in (base R) functions:
Solution 1
This solution assumes that the column Nationality is a character variable.
cases <- c(espanol = "Spanish", spaniol = "Spanish", ESPANOL = "Spanish", spanish = "Spanish",
British = "British", brit = "British", ESPanol = "Spanish", UK = "British",
colombia = "Colombian", Colombian = "Colombian")
df$Nationality <- factor(cases[df$Nationality])
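Because Solution 1 depends on the column being character, a slightly more defensive variant may be useful. This is only a sketch: the names `key`, `recoded` and `unmapped` are made up for illustration, and the column is deliberately created as a factor to show the case the plain lookup does not cover. Indexing a named vector with a factor uses the integer codes rather than the labels, so the sketch converts to character first, and it reports any spelling that is missing from `cases` instead of silently turning it into NA.

```r
# Assumption: the column arrives as a factor (e.g. data read with R < 4.0 defaults)
df <- data.frame(ID = 1:10,
                 Nationality = c("espanol", "spaniol", "ESPANOL",
                                 "spanish", "colombia", "Colombian",
                                 "British", "brit", "ESPanol", "UK"),
                 stringsAsFactors = TRUE)

cases <- c(espanol = "Spanish", spaniol = "Spanish", ESPANOL = "Spanish",
           spanish = "Spanish", ESPanol = "Spanish",
           colombia = "Colombian", Colombian = "Colombian",
           British = "British", brit = "British", UK = "British")

# Look up by label, not by the factor's integer codes
key <- as.character(df$Nationality)
recoded <- unname(cases[key])

# Spellings absent from `cases` come back as NA; list them before overwriting
unmapped <- unique(key[is.na(recoded)])
if (length(unmapped) > 0) {
  warning("Unmapped values: ", paste(unmapped, collapse = ", "))
}

# Fix the level order explicitly so it does not depend on alphabetical sorting
df$Nationality <- factor(recoded, levels = c("Spanish", "Colombian", "British"))
```

On a real survey, the `unmapped` vector gives a quick list of the spellings that still need an entry in `cases`.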
Solution 2
df$Nationality <- as.factor(df$Nationality)
levels(df$Nationality) <- list(Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
Colombian = c("colombia", "Colombian"),
British = c("British", "brit", "UK"))
Output
# ID Nationality
# 1 1 Spanish
# 2 2 Spanish
# 3 3 Spanish
# 4 4 Spanish
# 5 5 Colombian
# 6 6 Colombian
# 7 7 British
# 8 8 British
# 9 9 Spanish
# 10 10 British