Problem collapsing levels of a factor in R

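I have a messy factor variable with more levels than it should have. The cases come from an open survey, and many participants misspelled their answer or simply gave the same answer in different ways.

Here is a sample df that represents my problem: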


df <- data.frame(ID=seq(1:10),
                 Nationality=c("espanol", "spaniol", "ESPANOL",
                               "spanish", "colombia", "Colombian",
                               "British", "brit", "ESPanol", "UK")
                               )

My desired output looks like this:

> df
   ID Nationality
1   1     Spanish
2   2     Spanish
3   3     Spanish
4   4     Spanish
5   5   Colombian
6   6   Colombian
7   7     British
8   8     British
9   9     Spanish
10 10     British

This is what I tried in order to reduce these 10 factor levels to the 3 there should be (Spanish, Colombian, British):

library(forcats) 
                              
levels(df$Nationality) <- fct_collapse(df$Nationality, Spanish = c("espanol", "spaniol", "ESPANOL",
                                                                  "spanish", "ESPanol"),
                                                       Colombian = c("colombia", "Colombian"),
                                                       British = c("British", "brit", "UK")
                                        )

This does reduce my "Nationality" factor to 3 levels, but the output looks like this, and the new values do not correspond to the original answers:

> df
   ID Nationality
1   1   Colombian
2   2     British
3   3     British
4   4     Spanish
5   5     Spanish
6   6     Spanish
7   7     Spanish
8   8     Spanish
9   9   Colombian
10 10     British

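In the larger dataset I am working with it does not work either, and the output is even worse: every case becomes "Spanish", and I have no idea why this happens.

Thanks in advance for your help! Best, Lucas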

Have you tried making Nationality a factor first?

library(dplyr)    # provides %>%, mutate() and across()
library(forcats)  # provides fct_collapse()

df <- data.frame(ID=seq(1:10),
                 Nationality=c("espanol", "spaniol", "ESPANOL",
                               "spanish", "colombia", "Colombian",
                               "British", "brit", "ESPanol", "UK")
)

# Convert to a factor, then assign the collapsed factor back to the column
df2 <- df %>% 
  mutate(Nationality = factor(Nationality)) %>% 
  mutate(Nationality = fct_collapse(Nationality,
                                    Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
                                    Colombian = c("colombia", "Colombian"),
                                    British = c("British", "brit", "UK")))

# More concise: convert and collapse in a single step
df2 <- df %>% 
  mutate(across(Nationality, ~ fct_collapse(factor(.x), 
    Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"), 
    Colombian = c("colombia", "Colombian"), 
    British = c("British", "brit", "UK")
  )))
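
For context, here is a sketch of what seems to go wrong in the question's attempt. fct_collapse() already returns the recoded factor, with one value per row, so assigning that result to levels() pairs the row values with the stored levels by position. The stringsAsFactors = TRUE argument is an assumption added here so that Nationality is a factor, as the question describes; it is not in the original code.

```r
library(forcats)

# Assumption: stringsAsFactors = TRUE, so that Nationality is a factor as the
# question describes (R >= 4.0 creates a character column by default).
df <- data.frame(ID = 1:10,
                 Nationality = c("espanol", "spaniol", "ESPANOL",
                                 "spanish", "colombia", "Colombian",
                                 "British", "brit", "ESPanol", "UK"),
                 stringsAsFactors = TRUE)

collapsed <- fct_collapse(df$Nationality,
                          Spanish = c("espanol", "spaniol", "ESPANOL",
                                      "spanish", "ESPanol"),
                          Colombian = c("colombia", "Colombian"),
                          British = c("British", "brit", "UK"))

# One recoded value per row, in row order
length(collapsed)
as.character(collapsed)

# The original levels are stored sorted, not in row order, so pairing them
# with `collapsed` by position (what `levels<-` does) mismatches the labels
levels(df$Nationality)

# Assign the collapsed factor to the column itself instead
df$Nationality <- collapsed
table(df$Nationality)
```

In a larger dataset the same positional pairing would take its labels from only the first few rows of the collapsed vector, which may be why every case turned into "Spanish" there.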

Here are some solutions using built-in (base R) functions:

Solution 1

This solution assumes that the column Nationality is a character variable.

cases <- c(espanol = "Spanish", spaniol = "Spanish", ESPANOL = "Spanish", spanish = "Spanish", 
           British = "British", brit = "British", ESPanol = "Spanish", UK = "British",
           colombia = "Colombian", Colombian = "Colombian")

df$Nationality <- factor(cases[df$Nationality])
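
One caveat with the lookup-vector approach, shown as a minimal sketch: any spelling that has no entry in the lookup vector becomes NA, so it can help to list unmapped values first. The vector nat and the value "french" are hypothetical and only serve the illustration. Wrapping the index in as.character() also guards against the column being a factor, in which case R would index by the integer codes instead of the names.

```r
# Hypothetical sample: "french" is deliberately missing from the lookup vector
nat <- c("espanol", "brit", "french", "UK")

cases <- c(espanol = "Spanish", brit = "British", UK = "British")

# Spellings that have no entry in the lookup vector
unmapped <- setdiff(unique(nat), names(cases))
unmapped

# Indexing by name returns NA for those spellings;
# as.character() makes sure names, not factor codes, are used as the index
recoded <- factor(unname(cases[as.character(nat)]))
recoded
```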

Solution 2

df$Nationality <- as.factor(df$Nationality)

levels(df$Nationality) <- list(Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
                               Colombian = c("colombia", "Colombian"),
                               British = c("British", "brit", "UK"))

Output

#    ID Nationality
# 1   1     Spanish
# 2   2     Spanish
# 3   3     Spanish
# 4   4     Spanish
# 5   5   Colombian
# 6   6   Colombian
# 7   7     British
# 8   8     British
# 9   9     Spanish
# 10 10     British
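
To double-check a recoding like the ones above, the original spellings can be cross-tabulated against the collapsed factor. This is a small sketch on the sample data using Solution 2; the names original and collapsed are introduced here only for illustration.

```r
original <- c("espanol", "spaniol", "ESPANOL",
              "spanish", "colombia", "Colombian",
              "British", "brit", "ESPanol", "UK")

# Same recoding as in Solution 2, applied to a copy of the data
collapsed <- factor(original)
levels(collapsed) <- list(Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
                          Colombian = c("colombia", "Colombian"),
                          British = c("British", "brit", "UK"))

# Each original spelling should have its count in exactly one column
table(original, collapsed)

# TRUE here would mean some value did not end up in any of the three groups
anyNA(collapsed)
```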