In R, convert/split 1-column dataframe into 4 columns based on splitting content in strings

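This feels like a fairly difficult data-manipulation / data-frame cleanup problem in R. We have the following messy data frame, which currently packs several columns' worth of information into the X2 column. The example below uses fake names, emails, and phone numbers: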

coach_info <- structure(list(X1 = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_), X2 = c("TBA\r\n Head Coach", "Bobby Flowes\r\n Associate Head Women's Basketball Coach", 
"Jimmy Jimm\r\n Assistant Women's Basketball Coach", "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu", 
NA, "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu", 
"A.B. Better\r\n Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associateabbetter@gmail.edu\r\n   111-222-3333", 
"Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234", 
"Name Lasttt\r\n Assistant Coach")), row.names = c(1L, 2L, 3L, 
7L, 12L, 16L, 17L, 25L, 29L), class = "data.frame")

head(coach_info, 4)
    X1                                                                                   X2
1 <NA>                                                                   TBA\r\n Head Coach
2 <NA>                             Bobby Flowes\r\n Associate Head Women's Basketball Coach
3 <NA>                                    Jimmy Jimm\r\n Assistant Women's Basketball Coach
7 <NA> Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu

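We are trying to split the X2 column into 4 columns: Name, Title, Email, and Phone. When we run strsplit(coach_info$X2, '\r\n'), we get a messy nested list, and splitting on \r\n is imperfect because some rows have no \r\n between fields.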

On top of that, the inner lists have different numbers of elements, because many rows are missing one or more of the name, phone number, or email address:

> unlist(lapply(strsplit(coach_info$X2, '\r\n'), length))
 [1] 2 2 2 3 1 2 3 4 2
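
To make the ragged shape concrete, here is a minimal sketch that pads every split result with NA up to the longest length and binds the results into a character matrix. It uses a small subset of the strings above; `x`, `pieces`, `n_max`, and `padded` are names introduced here purely for illustration.

```r
# Illustrative subset of coach_info$X2 (same strings as in the example data)
x <- c("TBA\r\n Head Coach",
       NA,
       "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu",
       "Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234")

# Split on the literal CRLF separator and trim the spaces around each piece
pieces <- lapply(strsplit(x, "\r\n", fixed = TRUE), trimws)

# Indexing past the end of a vector yields NA, which pads short rows
n_max  <- max(lengths(pieces))
padded <- t(vapply(pieces, function(p) p[seq_len(n_max)], character(n_max)))

print(padded)
```

Even after padding, position alone does not identify a field: in the third row the title and the email are still fused in the second piece, so some pattern-based extraction is needed as well.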

Our goal is to get as close as possible to this:

output_df <- data.frame(
    Name = c('TBA', 'Bobby Flowes', 'Jimmy Jimm', 'Rod Barber', NA, 'Gabens Spar', 'A.B. Better', 'Nick Romanov', 'Name Lasttt'),
    Title = c('Head Coach', "Associate Head Women's Basketball Coach", "Assistant Women's Basketball Coach", "Head Men's Basketball Coach",
              NA, " Men's Basketball Graduate Assistant", "Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associate",
              "Head Crew Coach", "Assistant Coach"),
    Email = c(NA, NA, NA, "Tom.Tommy@abc.edu", NA, "Coachgabensspar@gmail.edu", "abbetter@gmail.edu", "nick.nick@school.edu", NA),
    Phone = c(NA, NA, NA, "(123) 456-7890", NA, NA, "111-222-3333", "123-123-1234", NA),
    stringsAsFactors = FALSE
  )
  

>   head(output_df, 4)
          Name                                   Title             Email          Phone
1          TBA                              Head Coach              <NA>           <NA>
2 Bobby Flowes Associate Head Women's Basketball Coach              <NA>           <NA>
3   Jimmy Jimm      Assistant Women's Basketball Coach              <NA>           <NA>
4   Rod Barber             Head Men's Basketball Coach Tom.Tommy@abc.edu (123) 456-7890

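It seems impossible to cleanly split the strings where there is no space or \r\n between the different fields, as in the examples above. At this point we are just trying to get as close as possible...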

How about something like this:

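library(data.table)
setDT(coach_info)

# In an R string literal every regex backslash must be doubled (\\d, \\1).
# Inside [...] a literal hyphen goes last and needs no escape.
re.phone <- '.*(\\d{3}[^[:alnum:]]*\\d{3}[^[:alnum:]]*\\d{4}).*'
re.email <- '.*[^_[:alnum:].-]([_[:alnum:].-]+@[[:alnum:].]+).*'
re.text1 <- '([[:alnum:][:blank:]]+)\r\n([[:alnum:][:blank:][:punct:]]+).*'

# Working copy: each field is blanked out of it once it has been extracted
coach_info[, processed := X2]

# Phone: capture the digits, then remove them from the working copy.
# Note: an opening "(" before the area code is not part of the capture.
coach_info[grepl(re.phone, X2), phone := gsub(re.phone, '\\1', X2)]
coach_info[!is.na(phone), processed := gsub(phone, ' ', X2, fixed = TRUE), by = phone]

# Email: same idea, applied to what is left after the phone is removed
coach_info[grepl(re.email, processed), email := gsub(re.email, '\\1', processed)]
coach_info[!is.na(email), processed := gsub(email, ' ', processed, fixed = TRUE), by = email]

# Name is the text before the first \r\n, Title the text after it
coach_info[, Name := trimws(gsub(re.text1, '\\1', processed))]
coach_info[, Title := trimws(gsub(re.text1, '\\2', processed))]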
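
For comparison, here is a minimal base-R sketch of the same extract-then-remove idea, using regexpr() and regmatches() instead of data.table. It is only one possible variant, not a definitive solution: the helper `extract_first`, the patterns `phone_re` and `email_re`, and the subset `x` are assumptions introduced for illustration, and the phone pattern only covers the formats seen in the sample data.

```r
# Illustrative subset of coach_info$X2
x <- c("TBA\r\n Head Coach",
       "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu",
       NA,
       "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu",
       "Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234")

# Hypothetical helper: first match of `pattern` in each element, NA if none
extract_first <- function(x, pattern) {
  out <- rep(NA_character_, length(x))
  m   <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > -1L
  out[hit] <- regmatches(x, m)   # returns only the elements that matched
  out
}

# Assumed patterns; the optional parentheses keep "(123)" intact
phone_re <- "\\(?\\d{3}\\)?[ .-]?\\d{3}[ .-]?\\d{4}"
email_re <- "[[:alnum:]._-]+@[[:alnum:].-]+"

# Take the phone out first, so its digits cannot be absorbed into the email
phone <- extract_first(x, phone_re)
rest  <- sub(phone_re, " ", x, perl = TRUE)
email <- extract_first(rest, email_re)
rest  <- sub(email_re, " ", rest, perl = TRUE)

# What remains is split on CRLF: first piece is the name, second the title
pieces <- lapply(strsplit(rest, "\r\n", fixed = TRUE), trimws)
out <- data.frame(
  Name  = vapply(pieces, function(p) p[1], character(1)),
  Title = vapply(pieces, function(p) p[2], character(1)),
  Email = email,
  Phone = phone,
  stringsAsFactors = FALSE
)
print(out)
```

Where a title runs straight into an email with no separator, as in "Coachgabensspar@gmail.edu", neither version can tell where one ends and the other begins, so the last word of the title stays attached to the email.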