Extract string and its location using dplyr/tidyr approach

Question

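The input data frame has three id columns and a raw text column (text). u_id identifies the user, doc_id identifies a document belonging to that user, and sent_id identifies a sentence within that document.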

df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),
                 doc_id=c(1,1,1,2,2,1,1,2),
                 sent_id=c(1,2,3,1,2,1,2,1),
                 text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                               "pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                               "method exists and the former because calls to the corresponding",
                        "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                        "pertinent results: 2105-4-16 05:02pm gap-14 
                        2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding"))

Suppose we need to extract all dates and their locations from the raw text. My approach so far:

#define a regex for date
date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"

#libraries (tidyr provides unnest)
library(dplyr)
library(stringr)
library(tidyr)

#extract dates
df_i<-df %>% 
  mutate(i=str_extract_all(text,date)) %>% 
  mutate(date=lapply(i, function(x) if(identical(x, character(0))) NA_character_ else x)) %>% 
  unnest(date)

#extract date locations
df_ii<-str_locate_all(df$text,date)
n<-max(sapply(df_ii, nrow))
date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x) 
  rbind(x, matrix(, n-nrow(x), ncol(x))))))

The extracted dates end up in a data.frame. Is there a way to get the string locations into a data.frame as well, alongside their corresponding id and string? Ideally, the output should be:

output<-data.frame(id=c(1,1,2,2,3),
               text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                      "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 ."),
               date=c("2001-4-19","2002-5-23","2105-4-16","2105-4-16","13.1 2105"),
               date_start=c(17,43,20,74,96),
               date_end=c(25,51,28,82,104))

Answer 1

You can do it like this (the code refers to a single id column, df$id, as in the expected output):

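# word boundaries need a double backslash inside an R string
regex = "\\b[0-9]+[-][0-9]+[-][0-9]+\\b"
df_i = str_extract_all(df$text, regex)   # list of matched dates, one element per row
df_ii = str_locate_all(df$text, regex)   # list of start/end matrices, one element per row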

output1 = Map(function(x, y, z){
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, date = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind,.) %>%
  merge(df, .)

Or, sticking with pipe syntax:

regex = "[0-9]+[-][0-9]+[-][0-9]+"

output1 = df %>%
  {list(.$id, str_extract_all(.$text, regex), 
       str_locate_all(.$text, regex))} %>%
  {Map(function(x, y, z){
    if(length(y) == 0){
      y = NA
    }
    if(nrow(z) == 0){
      z = rbind(z, list(start = NA, end = NA))
    }
    data.frame(id = x, date = y, z)
  }, .[[1]], .[[2]], .[[3]])} %>%
  do.call(rbind, .) %>%
  merge(df, .)

Result:

  id
1  1
2  1
3  2
4  2
5  2
6  3
                                                                                                                 text
1                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
2                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
3 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
4 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
5 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
6                                                     method exists and the former because calls to the corresponding
       date start end
1 2001-4-19    17  25
2 2002-5-23    43  51
3 2105-4-16    20  28
4 2105-4-16    77  85
5 2105-4-16   104 112
6      <NA>    NA  NA

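Notes:

Your regex incorrectly extracts "13.1 2105" from "rdw-13.1 2105-4-16" because you included spaces inside [- . /]. date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})" should work.
mutate lets you use a variable you have just created within the same call, so there is no need for two separate mutate calls when building df_i.
For my piping-only solution, the {} around list() and Map() are needed to override the %>% default of feeding the previous step's output into the first argument of the next function.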

For example:

df %>%
      list(.$id, str_extract_all(.$text, regex), 
                 str_locate_all(.$text, regex))

becomes:

list(df, df$id, str_extract_all(df$text, regex), 
                str_locate_all(df$text, regex))

which is not what we want.
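The same behavior can be checked on a toy vector. This is a minimal sketch, assuming magrittr is installed (dplyr re-exports its %>%); the vector x and the two result names are made up for illustration.

```r
library(magrittr)

x <- c(10, 20, 30)

# Without braces: `.` appears only inside a nested call, so x is still
# inserted as the first argument, i.e. list(x, length(x)).
without_braces <- x %>% list(length(.))

# With braces: nothing is inserted automatically; only the explicit `.` is used,
# i.e. list(length(x)).
with_braces <- x %>% {list(length(.))}

length(without_braces)  # 2 elements: x itself plus its length
length(with_braces)     # 1 element: just the length
```

The braces turn the right-hand side into an expression where only the explicit dot refers to the piped value, which is what the list() and Map() steps rely on.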

Edit:

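The OP updated his df to include rows whose text contains no dates. This would make my original solution fail, because some elements of the lists returned by str_extract_all and str_locate_all will have length 0 and 0 rows, respectively. I fixed this by adding two if statements: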

if(length(y) == 0){
  y = NA
}
if(nrow(z) == 0){
  z = rbind(z, list(start = NA, end = NA))
}

This sets date to NA and adds a row of NA for start and end for the rows that have no dates. That way each id still has one row to bind with in the data.frame step.
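To see the NA padding and the corrected regex from the notes working together, here is a small self-contained sketch. It is not the code above: the toy data frame, the names date_rx, locs and rows, and the choice to build the NA row directly (rather than with rbind) are all assumptions made for illustration.

```r
library(stringr)

# Hypothetical toy data with a single id column, mirroring the expected output
toy <- data.frame(
  id   = c(1, 2, 3),
  text = c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
           "rdw-13.1 2105-4-16 .",
           "method exists and the former"),
  stringsAsFactors = FALSE
)

# Date regex without spaces inside the character class
date_rx <- "[0-9]{2,4}[-./][0-9]{1,4}[-./][0-9]{2,4}"

# One start/end matrix per row; zero rows when the text has no date
locs <- str_locate_all(toy$text, date_rx)

rows <- Map(function(id, txt, loc) {
  if (nrow(loc) == 0) {
    # No match: keep the id with a single all-NA row
    return(data.frame(id = id, date = NA_character_,
                      start = NA_integer_, end = NA_integer_,
                      stringsAsFactors = FALSE))
  }
  starts <- unname(loc[, "start"])
  ends   <- unname(loc[, "end"])
  data.frame(id = id, date = str_sub(txt, starts, ends),
             start = starts, end = ends,
             stringsAsFactors = FALSE)
}, toy$id, toy$text, locs)

result <- do.call(rbind, rows)
result
```

With this toy input, id 2 yields only "2105-4-16" (no "13.1 2105"), and id 3 is kept as a single row of NA values. The result can then be merged back onto the original data frame by id, as in the answer.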

