Extract all words from a string and create a column with the result

Question

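I have a data frame (data3) with a column named "Collector". The values in this column are alphanumeric strings, for example "Ruiz and Galvis 650". I need to extract the alphabetic and the numeric characters separately and create two new columns, one containing the number from the string (ColID) and one containing all the words (Col):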

Input:

Collector                       Times     Sample
Ruiz and Galvis 650             9         SP.1              
Smith et al 469                 8         SP.1

Expected output:

Collector                       Times     Sample     ColID    Col
Ruiz and Galvis 650             9         SP.1        650     Ruiz and Galvis
Smith et al 469                 8         SP.1        469     Smith et al

I tried the following, but I get an error when I try to save the file (Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, : unimplemented type 'list' in 'EncodeElement'):

library(stringr)

# extract the number
regexp <- "[[:digit:]]+"
data3$ColID <- str_extract(data3$Collector, regexp)

# extract the words
regexp <- "[[:alpha:]]+"
data3$Col <- str_extract_all(data3$Collector, regexp)

# this call fails with the error quoted above
write.table(data3, file = "borrar2.csv", quote = TRUE, sep = ",", row.names = FALSE)

Answer 1

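The problem is that str_extract_all does not return a single string per input. It returns a list in which each element is a character vector of all the matches. For example: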

> dput(str_extract_all("Ruiz and Galvis 650", "[[:alpha:]]+"))
list(c("Ruiz", "and", "Galvis"))

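A data frame with nested elements like the one above (a list column) cannot be written to a file by write.table, which is what the 'list' in the error message refers to.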
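If you do want every word as a separate match, one option is to collapse each list element back into a single string before writing. This is a minimal sketch, assuming the stringr package is installed; the data frame `df` is a made-up stand-in for data3, and the temporary output file is only for illustration:

```r
library(stringr)

# Hypothetical stand-in for data3
df <- data.frame(Collector = c("Ruiz and Galvis 650", "Smith et al 469"),
                 stringsAsFactors = FALSE)

# str_extract_all() gives one character vector of matches per row
words <- str_extract_all(df$Collector, "[[:alpha:]]+")

# Paste each vector into one space-separated string,
# so Col is a plain character column instead of a list column
df$Col <- vapply(words, paste, character(1), collapse = " ")

# With no list column left, write.table() can encode every cell
out_file <- tempfile(fileext = ".csv")
write.table(df, file = out_file, quote = TRUE, sep = ",", row.names = FALSE)
```

Because the words are pasted back together with single spaces, there is no trailing space to clean up afterwards.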

However, if you update the regex pattern to match spaces as well as letters, you can go back to using str_extract:

> dput(str_extract("Ruiz and Galvis 650", "[[:alpha:] ]+"))
"Ruiz and Galvis "

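Note the space inside the character class of the second regex. It makes the pattern match the whole run of letters and spaces as one string, which lets you write the data.frame to a file.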
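Applied to the whole column, the approach could look like the sketch below. It assumes stringr is installed and rebuilds data3 from the two sample rows in the question; the str_trim() call is an addition here, used because the match keeps the space that precedes the number (see the output above). The temporary file name is only for illustration:

```r
library(stringr)

# Sample rows from the question, used as a stand-in for the real data3
data3 <- data.frame(Collector = c("Ruiz and Galvis 650", "Smith et al 469"),
                    Times = c(9, 8),
                    Sample = c("SP.1", "SP.1"),
                    stringsAsFactors = FALSE)

# First run of digits in each string (returned as character, not numeric)
data3$ColID <- str_extract(data3$Collector, "[[:digit:]]+")

# First run of letters and spaces, with the trailing space trimmed off
data3$Col <- str_trim(str_extract(data3$Collector, "[[:alpha:] ]+"))

# Both new columns are character vectors, so the write succeeds
out_file <- tempfile(fileext = ".csv")
write.table(data3, file = out_file, quote = TRUE, sep = ",", row.names = FALSE)
```

If ColID is needed as a number, it can be converted afterwards with as.integer().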

Answer 2

If your data is as uniform as the example shows, here is another option:

library(stringi)
library(purrr)
library(dplyr)

df <- data.frame(Collector=c("Ruiz and Galvis 650", "Smith et al 469"),
                 Times=c(9, 8),
                 Sample=c("SP.1", "SP.1"),
                 stringsAsFactors=FALSE)

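# stri_match_first() returns a matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups (words, number)
stri_match_first(df$Collector, regex="([[:alpha:][:space:]]+) ([[:digit:]]+)") %>%
  as.data.frame(stringsAsFactors=FALSE) %>%
  select(Col=V2, ColID=V3) %>%
  bind_cols(df) %>%
  select(-Collector)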
##               Col ColID Times Sample
## 1 Ruiz and Galvis   650     9   SP.1
## 2     Smith et al   469     8   SP.1
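The same split can also be sketched in base R without extra packages, under the same assumption that every value is some words followed by a single trailing number. The `pattern` variable is introduced here for illustration, and unlike the pipeline above this version keeps the Collector column. Rows that do not fit the pattern are returned unchanged by sub(), so this is not a general solution:

```r
df <- data.frame(Collector = c("Ruiz and Galvis 650", "Smith et al 469"),
                 Times = c(9, 8),
                 Sample = c("SP.1", "SP.1"),
                 stringsAsFactors = FALSE)

# Group 1: everything up to the last non-space character before the number
# Group 2: the digits at the end of the string
pattern <- "^(.*[^[:space:]])[[:space:]]+([[:digit:]]+)$"

# Replace the whole string with one capture group at a time
df$Col   <- sub(pattern, "\\1", df$Collector)
df$ColID <- sub(pattern, "\\2", df$Collector)
```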

Tags: string, r, extract, alphanumeric, dataframe