How to extract values from string lists in R?

I want to extract the X-squared value and the p-value (numbers only) from three character vectors.

smr.txt1

[1] ""                                                              
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""                                                              
[4] "data:  data$parasite and data$T1"                              
[5] "X-squared = 0.017361, df = 1, p-value = 0.8952"                
[6] ""    

smr.txt2

[1] ""                                                              
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""                                                              
[4] "data:  data$parasite and data$T2"                              
[5] "X-squared = 2.5679e-32, df = 1, p-value = 1"                   
[6] ""  

smr.txt3

[1] ""                                                              
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""                                                              
[4] "data:  data$parasite and data$T3"                              
[5] "X-squared = 0.17857, df = 1, p-value = 0.6726"                
[6] ""  

I can easily extract these values from the first character vector using character positions:

> c1 <- as.numeric(str_sub(smr.txt1[5], 13, 20))

> c1

[1] 0.017361

> p1 <- as.numeric(str_sub(smr.txt1[5], -6))

> p1

[1] 0.8952

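But I can't really do the same for the second character vector, because the value is in scientific notation. I could do the same for the third one, but is there a better way, for example a loop that extracts just these values and puts them in the same data frame? Thanks in advance!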

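Instead of str_sub (which is position-based and will not work when the start/end positions are not constant, as in example 2), we can use a regex lookbehind with str_extract: match the label (for example "p-value = ") and extract the number that follows it, including any decimal point and exponent.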

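library(stringr)

# Return the number that follows "<categ> = " in a test summary line.
f1 <- function(x, categ = "p-value") {
  # Lookbehind anchors on the label; the optional group covers e-notation.
  as.numeric(str_extract(x,
    glue::glue("(?<={categ} = )[0-9.]+(e[-+]?[0-9]+)?")))
}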

Testing:

> f1("X-squared = 0.017361, df = 1, p-value = 0.8952")
[1] 0.8952
> f1("X-squared = 0.017361, df = 1, p-value = 0.8952", "X-squared")
[1] 0.017361
> f1("X-squared = 2.5679e-32, df = 1, p-value = 1")
[1] 1
> f1("X-squared = 2.5679e-32, df = 1, p-value = 1", "X-squared")
[1] 2.5679e-32
> f1("X-squared = 0.17857, df = 1, p-value = 0.6726")
[1] 0.6726
> f1("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
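
To get at the data frame the question asks for, the same lookbehind idea can be applied to all three summary lines at once. The following is a minimal sketch in base R rather than stringr, so it has no package dependencies; the helper name extract_value and the vector stat_lines are made up for illustration and are not part of the answer above.

```r
# Line 5 of each captured summary, copied from the question.
stat_lines <- c(
  T1 = "X-squared = 0.017361, df = 1, p-value = 0.8952",
  T2 = "X-squared = 2.5679e-32, df = 1, p-value = 1",
  T3 = "X-squared = 0.17857, df = 1, p-value = 0.6726"
)

# Hypothetical helper: the number that follows "<label> = ".
# Vectorized over x; elements without a match become NA.
extract_value <- function(x, label) {
  pattern <- paste0("(?<=", label, " = )[0-9.]+(?:e[-+]?[0-9]+)?")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  as.numeric(out)
}

# One row per test, one column per extracted quantity.
res <- data.frame(
  test      = names(stat_lines),
  x_squared = extract_value(stat_lines, "X-squared"),
  p_value   = extract_value(stat_lines, "p-value")
)
res
```

One caveat with any text-based approach: for very small p-values the printed summary reads "p-value < 2.2e-16" rather than "p-value = ...", so this pattern finds no match there and the helper returns NA.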

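Another option is to convert the string into a data.frame with the columns 'X-squared', 'df' and 'p-value', and then extract the value from the column of interest.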

f2 <- function(x, categ = "p-value") {
  # Rewrite "a = 1, b = 2" as DCF-style "a:1\nb:2" so read.dcf can parse it.
  x1 <- gsub(",\\s*", "\n", gsub("\\s*=\\s*", ":", x))
  type.convert(as.data.frame(read.dcf(textConnection(x1))),
    as.is = TRUE)[[categ]]
}

Testing:

> f2("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
> f2("X-squared = 0.017361, df = 1, p-value = 0.8952")
[1] 0.8952
> f2("X-squared = 0.017361, df = 1, p-value = 0.8952", "X-squared")
[1] 0.017361
> f2("X-squared = 2.5679e-32, df = 1, p-value = 1")
[1] 1
> f2("X-squared = 2.5679e-32, df = 1, p-value = 1", "X-squared")
[1] 2.5679e-32
> f2("X-squared = 0.17857, df = 1, p-value = 0.6726")
[1] 0.6726
> f2("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
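
The key-value idea can also be sketched without read.dcf, parsing every "name = value" pair on a line and stacking the results so that all three quantities land in one data.frame. This is an illustrative variant, not the function above; parse_stats and stat_lines are hypothetical names.

```r
# Hypothetical helper: turn "a = 1, b = 2" into a named numeric vector.
parse_stats <- function(x) {
  # Split into "name = value" chunks, then split each chunk at "=".
  pairs <- strsplit(strsplit(x, ",\\s*")[[1]], "\\s*=\\s*")
  vals <- as.numeric(vapply(pairs, `[`, character(1), 2))
  names(vals) <- vapply(pairs, `[`, character(1), 1)
  vals
}

stat_lines <- c(
  "X-squared = 0.017361, df = 1, p-value = 0.8952",
  "X-squared = 2.5679e-32, df = 1, p-value = 1",
  "X-squared = 0.17857, df = 1, p-value = 0.6726"
)

# rbind the named vectors into a matrix, then convert to a data.frame.
res <- as.data.frame(do.call(rbind, lapply(stat_lines, parse_stats)))
res[["p-value"]]
```

Because the column names are taken from the text, they are non-syntactic ("X-squared", "p-value") and need [[ or backticks to access.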

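It is not clear why the list returned by chisq.test needs to be converted to strings before extraction. It is easier to extract the values directly from the output of chisq.test with $ or [[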

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
Xsq <- chisq.test(M)
Xsq$p.value
#[1] 2.953589e-07
Xsq$statistic[["X-squared"]]
#[1] 30.07015

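While this is not what you asked for, it looks like you used capture.output(.) to capture these strings. Instead of trying to extract values from the captured output, I suggest getting the actual numbers from the object itself.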

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
Xsq <- chisq.test(M)
names(Xsq)
# [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"  "expected"  "residuals" "stdres"   
Xsq[c("statistic","p.value")]
# $statistic
# X-squared 
#  30.07015 
# $p.value
# [1] 2.953589e-07

Since you mentioned having a list, this is easy to work with as well. For example, if you have a list of test results such as

Xsq2 <- lapply(list(M, M), chisq.test)
Xsq2
# [[1]]
#   Pearson's Chi-squared test
# data:  X[[i]]
# X-squared = 30.07, df = 2, p-value = 2.954e-07
# [[2]]
#   Pearson's Chi-squared test
# data:  X[[i]]
# X-squared = 30.07, df = 2, p-value = 2.954e-07
lapply(Xsq2, `[`, c("statistic", "p.value"))
# [[1]]
# [[1]]$statistic
# X-squared 
#  30.07015 
# [[1]]$p.value
# [1] 2.953589e-07
# [[2]]
# [[2]]$statistic
# X-squared 
#  30.07015 
# [[2]]$p.value
# [1] 2.953589e-07

This can easily be converted into a data.frame with:

do.call(rbind.data.frame, lapply(Xsq2, `[`, c("statistic", "p.value")))
#   statistic      p.value
# 1  30.07015 2.953589e-07
# 2  30.07015 2.953589e-07
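
Applied to the setup in the question (one test of data$parasite against each of T1, T2 and T3), the same approach might look like the sketch below. The data here are simulated purely for illustration, since the asker's data are not shown, and the column names in the result are arbitrary.

```r
# Simulated stand-in for the asker's data; the values are made up.
set.seed(1)
data <- data.frame(
  parasite = sample(c("yes", "no"), 60, replace = TRUE),
  T1 = sample(c("a", "b"), 60, replace = TRUE),
  T2 = sample(c("a", "b"), 60, replace = TRUE),
  T3 = sample(c("a", "b"), 60, replace = TRUE)
)

vars <- c("T1", "T2", "T3")

# Run one test per column and keep the resulting htest objects.
tests <- lapply(vars, function(v) chisq.test(data$parasite, data[[v]]))

# Pull the numeric components straight from each object.
res <- data.frame(
  variable  = vars,
  x_squared = vapply(tests, function(ht) unname(ht$statistic), numeric(1)),
  df        = vapply(tests, function(ht) unname(ht$parameter), numeric(1)),
  p_value   = vapply(tests, function(ht) ht$p.value, numeric(1))
)
res
```

Working from the objects keeps full numeric precision and avoids the parsing step entirely, including cases where the printed p-value is not shown as an exact number.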