Extracting multiple strings from poorly defined user input data

Question

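I want to build a lookup table from data in which the entries in one column (user_entry) come in different formats and may contain more than one instance per row.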

# create example dataframe.
id <- c(1111,1112,1113,1114)
user_entry <- c("999/1001","1002;1003","999/1004\n999/1005","9991006 9991007")
df <- data.frame(id,user_entry)

> df
    id         user_entry
1 1111           999/1001
2 1112          1002;1003
3 1113 999/1004\n999/1005
4 1114    9991006 9991007

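I am only interested in the 4-digit codes, which may or may not be preceded by a 3-digit location code and/or a separator such as "/" or a space. Each entry may contain more than one 4-digit code, and I want each code listed separately in the final lookup table (see lookup below).

The code below does what I want, but with loops inside loops and a data frame grown inside them it is really inelegant. Is there a neater way to do this?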

library(dplyr);library(stringr)

# use stringr package to extract only digits
df <- df %>% 
mutate(entries = str_extract_all(user_entry,"[[:digit:]]+")) %>%
select(-user_entry)

# initialise lookup dataframe
lookup <- df[FALSE,]
for (record in 1:nrow(df)){   
  entries <- df$entries[[record]]    
  for (element in 1:length(entries)){
    # only interested in 4 digit codes
    if (nchar(entries[element])>3){
      # remove 3 digit code if it is still attached
      lookup_entry <- gsub('.*?(\\d{4})$','\\1',entries[element])
      lookup <- rbind(lookup,data.frame(id=df$id[[record]],entries=lookup_entry))
    }
  }
}

> lookup
    id entries
1 1111    1001
2 1112    1002
3 1112    1003
4 1113    1004
5 1113    1005
6 1114    1006
7 1114    1007

Answer 1

Using base R,

matches <- regmatches(user_entry, gregexpr("(\\d{4})\\b", user_entry))

data.frame(
  id = rep(id, lengths(matches)),
  entries = unlist(matches),
  stringsAsFactors = FALSE
)
#     id entries
# 1 1111    1001
# 2 1112    1002
# 3 1112    1003
# 4 1113    1004
# 5 1113    1005
# 6 1114    1006
# 7 1114    1007
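
To see why this pattern copes with the mixed formats, here is a small sketch that wraps it in a helper and tries a few inputs. The helper name extract_codes and the two extra test strings are made up for illustration; they are not part of the answer above. The point is that \\d{4}\\b takes the four digits that end at a word boundary, so a leading location code is dropped whether or not a separator follows it.

```r
# Hypothetical helper (not part of the answer above): return every run of
# four digits that ends at a word boundary.
extract_codes <- function(x, pattern = "\\d{4}\\b") {
  regmatches(x, gregexpr(pattern, x))
}

user_entry <- c("999/1001", "1002;1003", "999/1004\n999/1005", "9991006 9991007")
codes <- extract_codes(user_entry)

# The location code is dropped with or without a separator.
codes[[1]]  # "1001"
codes[[4]]  # "1006" "1007"

# Caveats worth checking against real data:
extract_codes("12345")[[1]]  # "2345": a 5-digit run yields its last four digits
extract_codes("1001a")[[1]]  # character(0): a trailing letter means no boundary
```

If entries of other lengths can occur in the real data, it is probably worth filtering on the full run of digits first rather than relying on the boundary alone.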

Answer 2

Not very elegant, but I think it should work for your case:

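library("tidyverse")

# split each entry into one row per piece, using the separators seen in the data
df1 <- df %>%
  separate_rows(user_entry, sep = '(/|;|\\n|\\s)')

# keep the last four digits of each piece (NA when a piece does not end in four digits)
extract <- str_extract(df1$user_entry, "(?=\\d{3})\\d{4}$")
df1$extract <- extract
df2 <- df1[!is.na(df1$extract), ]
df2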


#    id user_entry extract
#  1111       1001    1001
#  1112       1002    1002
#  1112       1003    1003
#  1113       1004    1004
#  1113       1005    1005
#  1114    9991006    1006
#  1114    9991007    1007
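
For anyone without the tidyverse installed, the same split-then-extract idea can be sketched in base R. This is an illustration rather than the answerer's code: the separator set is assumed from the sample data, and the names pieces, long and keep are made up here.

```r
id <- c(1111, 1112, 1113, 1114)
user_entry <- c("999/1001", "1002;1003", "999/1004\n999/1005", "9991006 9991007")

# Split each entry on the separators seen in the sample data (assumed set).
pieces <- strsplit(user_entry, "[/;[:space:]]+")

# One row per piece, repeating the id for each piece of its entry.
long <- data.frame(
  id = rep(id, lengths(pieces)),
  piece = unlist(pieces),
  stringsAsFactors = FALSE
)

# Keep pieces that end in four digits, then strip anything before them.
keep <- grepl("[0-9]{4}$", long$piece)
lookup <- data.frame(
  id = long$id[keep],
  entries = sub(".*([0-9]{4})$", "\\1", long$piece[keep]),
  stringsAsFactors = FALSE
)
lookup
```

Splitting first keeps the intermediate pieces around, which can help when checking which raw entries were discarded.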
