Find regex matches in the names of factor levels in a df in R

Question

I have a data frame that contains factors, and these factors have certain levels. I cannot get an exact match on the level names using regex.

    df <- structure(list(age = structure(1:2, .Label = c("18-25", 
                     ">25"), class = "factor"), `M` = c("13.4", 
                     "12.8"), `N` = c("73", "76"), `SD` = c("6.8", 
                     "6.6")), row.names = 51:52, class = "data.frame")

My df

         age    M  N  SD
    51 18-25 13.4 73 6.8
    52   >25 12.8 76 6.6




First try

         regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)


    [1] -1 -1 -1 -1
    attr(,"match.length")
    [1] -1 -1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Second try

     saved_level_name <- structure(list(V1 = structure(1L, .Label = "18-25", class = "factor")), row.names = c(NA, 
     -1L), class = "data.frame") 
     regexpr(pattern = saved_level_name, text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)


    [1]  1  4 -1 -1
    attr(,"match.length")
    [1]  1  1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Third try (compare the two outputs!)

     saved_name_level_2 <- structure(list(V4 = structure(1L, .Label = ">25", class = "factor")), row.names = c(NA, 
     -1L), class = "data.frame")

     regexpr(pattern = saved_level_name, text= df[1], ignore.case = FALSE, perl = FALSE,  fixed = T)

     regexpr(pattern = saved_name_level_2, text= df[1], ignore.case = FALSE, perl = FALSE,  fixed = T)



    [1] 1
    attr(,"match.length")
    [1] 1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

    [1] 1
    attr(,"match.length")
    [1] 1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Fourth try

     regexpr(pattern = as.character( saved_name_level ), text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)

    [1] -1 -1 -1 -1
    attr(,"match.length")
    [1] -1 -1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

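First try: 0 results. Second try: the results make no sense (1, 4?). Third try: inputs with different face values give the same result. Fourth try: no results!

Is it possible that the regex sees the stored values of the factors rather than their face values/names?

How can I use regex to search the factor names instead of their values?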

Answer 1

The reason for the failure can be found with debug:

    debugonce(regexpr)
    regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)
    # debugging in: regexpr(pattern = "18-25", text = df, ignore.case = FALSE, perl = FALSE, 
    #     fixed = T)
    # debug: {
    #     if (!is.character(text)) 
    #         text <- as.character(text)
    #     .Internal(regexpr(as.character(pattern), text, ignore.case, 
    #         perl, fixed, useBytes))
    # }
    # debug: if (!is.character(text)) text <- as.character(text)
    # debug: text <- as.character(text)

OK, let R run that as.character command. It converts "text" (which is really a data frame) into its character version.

    text
    # [1] "1:2"                   "c(\"13.4\", \"12.8\")" "c(\"73\", \"76\")"    
    # [4] "c(\"6.8\", \"6.6\")"

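That last part is the key. When regexpr converts your text argument (which is really intended to be a character vector), it turns the factor in df$age into a character representation of the factor's integer codes, namely "1:2". (The fact that it produces the : sequence is interesting to me ... but that is another matter.) The other elements show the same effect: each whole column has been collapsed into one string of R code, so there are four strings to search, one per column, which is why every attempt returned four values.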
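To make the distinction concrete, here is a minimal sketch that separates what a factor stores from what it displays. The name `age_demo` is made up for illustration and stands in for df$age; it is built with factor() rather than the structure() call from the question.

```r
# Hypothetical stand-in for df$age, built the ordinary way.
age_demo <- factor(c("18-25", ">25"), levels = c("18-25", ">25"))

# A factor stores integer codes plus a vector of level labels.
codes  <- as.integer(age_demo)    # the stored codes
labels <- as.character(age_demo)  # the labels shown when printing

# Matching against the factor itself works on the labels,
# because as.character() on a factor returns them.
hit_factor <- grepl("18-25", age_demo, fixed = TRUE)

print(codes)
print(labels)
print(hit_factor)
```

Passing the column on its own keeps the labels available to the matcher, which is the point of checking columns one at a time.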

Clearly "1:2" does not match your "18-25" test. You should really be checking individual vectors/columns. If you have several of them, then perhaps

    lapply(df, function(v) regexpr(pattern = "18-25", text=v, ignore.case = FALSE, perl = FALSE,  fixed = T))

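or df[,1:3] or df[,-5] or whatever you can use to describe which columns to use or leave out. But checking a whole frame that contains factors in one call will not work.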
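As one possible way to describe the columns to use, the sketch below selects the factor columns by type rather than by position. `demo_df` is a made-up, cut-down stand-in for the question's data frame, and the approach is only an illustration, not the only way to pick columns.

```r
# Small stand-in for the question's data frame (demo_df is a made-up name).
demo_df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  stringsAsFactors = FALSE
)

# Pick out the factor columns by type instead of by position.
is_fac <- vapply(demo_df, is.factor, logical(1))

# Run regexpr() on each selected column separately, on its labels.
res <- lapply(demo_df[is_fac], function(v) {
  regexpr("18-25", as.character(v), fixed = TRUE)
})

print(res)
```

Each element of `res` is an ordinary regexpr result for a single column, so a match position of -1 still means "no match" for that row.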

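If you only want to find the instances in a factor where the pattern matches (rather than extract or replace anything), then grepl may be a better fit: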

    lapply(df, grepl, pattern = "18-25", fixed = TRUE)
    # $age
    # [1]  TRUE FALSE
    # $M
    # [1] FALSE FALSE
    # $N
    # [1] FALSE FALSE
    # $SD
    # [1] FALSE FALSE
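Since the question asks about exact matches on level names, it may also help to search levels() directly. The following is a sketch under the assumption that an "exact" match means the whole label equals the search string; `age_demo` is a made-up factor, not the asker's data.

```r
# Hypothetical factor with a repeated level, for illustration only.
age_demo <- factor(c("18-25", ">25", "18-25"), levels = c("18-25", ">25"))

# Substring search over the level names: "25" occurs in both labels.
partial <- grep("25", levels(age_demo), fixed = TRUE, value = TRUE)

# Exact search: anchor the regex with ^ and $.
# (fixed = TRUE would treat ^ and $ as literal characters.)
exact <- grep("^18-25$", levels(age_demo), value = TRUE)

# Positions of the observations whose label is exactly "18-25".
rows <- which(as.character(age_demo) == "18-25")

print(partial)
print(exact)
print(rows)
```

For a plain equality test no regex is needed at all; comparing the labels with == (or %in%) avoids having to escape regex metacharacters in the level names.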
