grepl in R: Replace character/numeric levels

I want to replace my levels dog1 ... dog4 and cat1 ... cat4 with two levels, DOG and CAT, but when I use grepl, my output is all NA.

My code:

x  <- (rep(c("dog1","dog2","dog3","dog4","cat1","cat2","cat3","cat4"),2)) #Levels
y<-rnorm(16)
d<-data.frame(cbind(x,y))
head(d)

     x                 y
1 dog1 0.906357739138289
2 dog2 0.974674552504268
3 dog3 0.664045049199848
4 dog4 0.911777985232099
5 cat1 0.246575548162824
6 cat2 0.758069789161901


d$x[grepl("dog", d$x)] <- "DOG" 

Warning message:
In `[<-.factor`(`*tmp*`, grepl("dog", d$x), value = c(NA, NA, NA,  :
  invalid factor level, NA generated

d$x[grepl("cat", d$x)] <- "CAT"

Warning message:
In `[<-.factor`(`*tmp*`, grepl("cat", d$x), value = c(NA_integer_,  :
  invalid factor level, NA generated

head(d)

     x                 y
1 <NA> 0.906357739138289
2 <NA> 0.974674552504268
3 <NA> 0.664045049199848
4 <NA> 0.911777985232099
5 <NA> 0.246575548162824
6 <NA> 0.758069789161901

If the code ran correctly, the output I want would be:

head(d)

     x                 y
1 DOG  0.906357739138289
2 DOG  0.974674552504268
3 DOG  0.664045049199848
4 DOG  0.911777985232099
5 CAT  0.246575548162824
6 CAT  0.758069789161901

You could try creating the data frame with stringsAsFactors set to FALSE, so that x stays a character column:

# Pass x and y directly: cbind(x, y) would also coerce y to character
d <- data.frame(x, y, stringsAsFactors = FALSE)
d$x[grepl("dog", d$x)] <- "DOG"
d$x[grepl("cat", d$x)] <- "CAT" 

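The key here (as Tim hinted) is to understand that factor variables, although superficially similar, are actually quite different from character variables. A factor stores integer codes that point into a fixed set of levels, so assigning a value that is not already one of those levels produces NA.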
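To make the difference concrete, here is a minimal sketch on a small made-up vector (the names f, g and s are illustrative only and are not part of the question's data):

```r
# A factor stores integer codes plus a fixed set of allowed labels (levels).
f <- factor(c("dog1", "cat1", "dog2"))
levels(f)        # the allowed labels: "cat1" "dog1" "dog2"
as.integer(f)    # the codes that index into levels(f): 2 1 3

# Assigning a value that is not an existing level yields NA.
# suppressWarnings() hides the "invalid factor level" warning seen in the question.
g <- f
suppressWarnings(g[1] <- "DOG")
is.na(g[1])      # TRUE

# A character vector has no such restriction.
s <- as.character(f)
s[grepl("dog", s)] <- "DOG"
s                # "DOG" "cat1" "DOG"
```

This is why the grepl assignment works once x is a character column, and fails while it is still a factor.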

Here is one way to access and update the factor levels:

levels(d$x)
# [1] "cat1" "cat2" "cat3" "cat4" "dog1" "dog2" "dog3" "dog4"

levels(d$x)[grepl("dog", levels(d$x))] <- "DOG"
levels(d$x)[grepl("cat", levels(d$x))] <- "CAT"
head(d)
#     x                   y
# 1 DOG -0.0489713202962167
# 2 DOG  -0.548503649991368
# 3 DOG   0.460493884654479
# 4 DOG   0.143044665735075
# 5 CAT   -2.13008189672678
# 6 CAT  -0.136767747543626

levels(d$x)
# [1] "CAT" "DOG"
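As a possible variation on the same idea, levels<- also accepts a named list that maps each new level to the old levels it should absorb. The sketch below rebuilds the question's x as a standalone factor f (a name chosen here for illustration) rather than touching d:

```r
# Rebuild the question's x as a standalone factor
x <- rep(c("dog1", "dog2", "dog3", "dog4", "cat1", "cat2", "cat3", "cat4"), 2)
f <- factor(x)

# Each list name becomes a new level; its elements are the old levels to merge
levels(f) <- list(DOG = paste0("dog", 1:4), CAT = paste0("cat", 1:4))

levels(f)   # "DOG" "CAT" (the order follows the list, not the alphabet)
table(f)    # 8 observations in each level
```

Unlike the grepl version, this form spells out every old level explicitly, which may be preferable when the level names do not share a convenient pattern.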

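Another version, this time using a regular expression. We capture everything up to the digits and convert it to uppercase with \U (which requires perl = TRUE).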

d$x <- sub("(.*)\\d+", "\\U\\1", d$x, perl = TRUE)
d$x
 #[1] "DOG" "DOG" "DOG" "DOG" "CAT" "CAT" "CAT" "CAT" "DOG" "DOG" "DOG" "DOG" 
 #    "CAT" "CAT" "CAT" "CAT"
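Note that sub() returns a character vector even when it is given a factor, so after this step d$x is no longer a factor. If a factor is still needed, the result can be wrapped in factor() again. A small sketch on a made-up vector (x, res and f are illustrative names):

```r
x <- factor(c("dog1", "dog2", "cat1", "cat4"))

# Backslashes must be doubled inside an R string literal
res <- sub("(.*)\\d+", "\\U\\1", x, perl = TRUE)
class(res)   # "character"
res          # "DOG" "DOG" "CAT" "CAT"

# Convert back to a factor if one is required
f <- factor(res)
levels(f)    # "CAT" "DOG"
```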