How to extract a number from a string based on a preceding pattern using stringr?

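I want to extract the HBA1C values. They appear in the text variable X2 of the data frame df, right after the pattern "HBA1C = ". The pattern can occur at the start of the string, as in rows 2, 3 and 6, or in the middle, as in row 4.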

df<-data.frame(X1=1:6,X2=c(NA,"HBA1C = 8.9 (09/06/15)","HBA1C = 9.8 (03/08/15)",
                           "JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), 
                           NEHR LOCKED. 18/8/15","SLIDING SCALE FOLLOWED STRICTLY",
                           "HBA1C = 11.7 (17/7/15)"))

# df
#  X1                                                                              X2
#1  1                                                                            <NA>
#2  2                                                          HBA1C = 8.9 (09/06/15)
#3  3                                                          HBA1C = 9.8 (03/08/15)
#4  4 JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15
#5  5                                                 SLIDING SCALE FOLLOWED STRICTLY
#6  6                                                          HBA1C = 11.7 (17/7/15)

The extracted values should be stored in a new variable, X3, like this:

# df
#  X1                                                                              X2   X3
#1  1                                                                            <NA>   NA
#2  2                                                          HBA1C = 8.9 (09/06/15)  8.9
#3  3                                                          HBA1C = 9.8 (03/08/15)  9.8
#4  4 JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15  6.2
#5  5                                                 SLIDING SCALE FOLLOWED STRICTLY   NA
#6  6                                                          HBA1C = 11.7 (17/7/15) 11.7

I tried the code below, but it does not work.

library(stringr)
df1$X3 <- 
str_extract(str_extract(df$X2,pattern = "HBA1C = [0-9].[0-9]"),pattern = "[0-9].[0-9]")

I get this error:

Error in df$X2 : object of type 'closure' is not subsettable

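We can use a single str_extract with a regex lookaround. (The 'closure' error most likely means that no data frame named df existed in the session when the code ran, so R found the built-in function stats::df instead; create df first and assign the result to df, not df1.)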

df$X3 <- as.numeric(str_extract(df$X2,pattern = "(?<=HBA1C = )[0-9]+\\.[0-9]+"))
df$X3
#[1]   NA  8.9  9.8  6.2   NA 11.7

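The pattern matches one or more digits ([0-9]+), followed by a literal ., followed by one or more digits, but only where they are immediately preceded by the word 'HBA1C' followed by a space, an = and another space. The lookbehind (?<=...) checks for that prefix without including it in the match.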
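
For comparison, the same result can be obtained without a lookbehind by using a capture group with str_match, which returns a character matrix whose second column holds the first group. This is an alternative sketch, not part of the approach above; the vector x is a shortened, made-up copy of the X2 column used only for illustration.

```r
library(stringr)

# Illustrative input, abbreviated from the X2 column in the question
x <- c(NA,
       "HBA1C = 8.9 (09/06/15)",
       "WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED",
       "SLIDING SCALE FOLLOWED STRICTLY",
       "HBA1C = 11.7 (17/7/15)")

# Lookbehind: the match itself contains only the number
via_lookbehind <- as.numeric(str_extract(x, "(?<=HBA1C = )[0-9]+\\.[0-9]+"))

# Capture group: the match contains "HBA1C = 8.9"; column 2 keeps only group 1
via_group <- as.numeric(str_match(x, "HBA1C = ([0-9]+\\.[0-9]+)")[, 2])

via_lookbehind
via_group
```

Both vectors should contain NA for the missing value and for the row without "HBA1C = ". The capture-group form may be easier to adapt if the prefix can vary in length, since lookbehinds generally need a bounded-length pattern.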

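Note: some characters are metacharacters, i.e. the regex engine does not read them literally. For example, . means "any character" rather than a literal dot. In these cases we must either escape the character (written as \\. inside an R string, because the backslash itself has to be escaped) or place it inside square brackets, [.].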
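
A small sketch of the difference, using made-up input: the second element of x is a hypothetical malformed entry, included only to show what an unescaped dot lets through. It also shows why the pattern from the question, [0-9].[0-9], misses a two-digit value such as 11.7.

```r
library(stringr)

# Hypothetical input: the second entry is deliberately malformed
x <- c("HBA1C = 8.9", "HBA1C = 8/9")

# Unescaped dot matches any character, so "8/9" is extracted as well
unescaped <- str_extract(x, "(?<=HBA1C = )[0-9]+.[0-9]+")

# Escaped dot or a bracketed dot matches only a literal "."
escaped <- str_extract(x, "(?<=HBA1C = )[0-9]+\\.[0-9]+")
bracket <- str_extract(x, "(?<=HBA1C = )[0-9]+[.][0-9]+")

# The question's pattern allows exactly one digit before the dot,
# so it finds nothing in a value like 11.7
question_pattern <- str_extract("HBA1C = 11.7", "HBA1C = [0-9].[0-9]")

unescaped
escaped
bracket
question_pattern
```

The escaped and bracketed forms behave the same here; which one to use is mostly a matter of readability.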