Changing factor levels -- unknown levels in "f" -- can't change levels

Question

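I have a factor that contains a large number of industry names. I need to collapse them into major categories and industries. Because I let respondents answer in free text, I ended up with an inflated number of levels (e.g. financial services, Financial Services, banking, finance). Since the spelling and capitalization don't match, each variant shows up as its own level, so I am trying to collapse them with forcats: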

library(forcats)

test <- fct_collapse(PrescreenF$Industry, Finance = c("Banking",
  "Corporate Finance", "Finance", "Financial", "financial services",
  "financial services", "Financial Services", "Financial services"),
  NULL = "H")

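I get a warning that "Financial services" is unknown. This is very frustrating, because when I print the vector I can see that it is there. I have tried copying and pasting the exact string from the output, and retyping it, but there seems to be a hidden character that prevents it from being matched.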

How do I collapse these values correctly?

> test$industry
Banking
Corporate Finance 
Finance
Financial 
financial services
financial services 
Financial Services 
Financial services

When I try to "revalue", say, the last level, "Financial services", it tells me it is an unknown string.

Edit: output of dput(x$industry)

structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
4L, 3L, 3L, 3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 14L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 18L, 18L, 18L, 
18L, 19L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 25L, 26L, 27L, 28L
), .Label = c("", "{\"ImportId\":\"QID8_TEXT\"}", "Finance", 
"Financial ", "Financial services ", "Please indicate the industry you work in (e.g. technology, healthcare etc):", 
"Cleantech", "Delivery", "e-commerce/fashion", "Food", "Food & Bev", 
"Retail", "Service", "tech", "technology", "Technology", "IT, technology", 
"Software", "Technology ", "Tehcnology", "Consulting", "Digital advertising", 
"Education", "Higher education", "Technology, management consulting", 
"University professor; teaching, research and service", "Information Technology and Services", 
"mobile technology"), class = "factor")

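Edit: figured it out. Some of the terms have an extra space at the end. For example, when I call Prescreen$Industry it returns names such as "Banking" and "Corporate Finance", but it does not show me that there is a space after the level. Banking is actually "Banking ", with an invisible trailing space that does not show up in R's printed output. How can I make sure it is visible and that this doesn't happen again?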

Can I run a length function on the column? If so, how does that work? Something like PrescreenF$Industry("Banking")?

Answer 1

If "x" is your data frame:

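library(stringr)

# Convert to character so the values can be edited as strings
x$industry <- as.character(x$industry)
# Strip leading and trailing whitespace
x$industry <- str_trim(x$industry)
# Convert back to a factor; values that are now identical share one level
x$industry <- as.factor(x$industry)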
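
A possible variant, if you would rather not round-trip through character: trim the levels in place with base R's trimws(). Assigning duplicate names to levels() merges those levels, so "Financial services " and "Financial services" end up as one. The toy factor `f` below is made up for illustration and only stands in for x$industry.

```r
# Hypothetical toy data standing in for x$industry
f <- factor(c("Finance", "Financial ", "Financial services ",
              "Financial services"))

# Trim whitespace on the levels themselves; levels that become
# identical after trimming are merged into a single level
levels(f) <- trimws(levels(f))

levels(f)
```

This only touches the level labels, so the underlying data are left alone apart from the merge.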

You can then go back to fct_collapse() to simplify your factor.
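
As for making the stray spaces visible: printing a factor does not quote its values, so a trailing space is easy to miss. One way to check, sketched here on a made-up factor `f` (not your real data), is to inspect the levels with base R's dput(), nchar() and grepl():

```r
# Hypothetical toy data with trailing spaces in two levels
f <- factor(c("Banking ", "Finance", "Financial services "))

# dput() prints the levels as quoted strings, so trailing spaces show up
dput(levels(f))

# nchar() is R's string-length function; "Banking " has 8 characters, not 7
lens <- nchar(levels(f))
lens

# List the levels that start or end with whitespace
padded <- levels(f)[grepl("^\\s|\\s$", levels(f))]
padded
```

Running a check like this right after import, before any collapsing, should catch the problem early.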

r

forcats