Splitting strings in R

Question

I have the following string:

    x<-"CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

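I want to extract "CUST_Id_8", "Mr. Praveen Kumar", and whatever follows DOB:, Mother's Name:, Contact Num:, and so on, and store each piece in its own variable (customer ID, name, date of birth, etc.).

Please help.

I have tried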

    strsplit(x, ":")

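but the result is just a list of text pieces. If nothing follows a field name, I need a blank for that field.

Can anyone tell me how to extract the string between two words? For example, I want to extract "Mr. Praveen Kumar" from between Name: and DOB.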

Answer 1

If you know the keys in advance, you can extract the values like this:

keys <- c("CUST_Id_8Name", "DOB", "Mother's Name", 
  "Contact Num", "Email address", "Owns Car", "Products held with Bank", 
  "Company Name", "Salary per. month", "Background")
cbind(keys, values = sub("^:", "", strsplit(x, paste0(keys, collapse = "|"))[[1]][-1]))
#       keys                      values
#  [1,] "CUST_Id_8Name"           "Mr.Praveen Kumar"
#  [2,] "DOB"                     ""
#  [3,] "Mother's Name"           ""
#  [4,] "Contact Num"             ""
#  [5,] "Email address"           ""
#  [6,] "Owns Car"                ""
#  [7,] "Products held with Bank" ""
#  [8,] "Company Name"            ""
#  [9,] "Salary per. month"       ""
# [10,] "Background"              ""
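
As a small follow-up sketch (not part of the original answer), the same split can be stored as a named character vector so that each value, including the blank ones, can be looked up by its key. The names `pieces`, `fields`, and `cust_id` are introduced here for illustration, and the last step assumes the customer ID is always numeric.

```r
x <- "CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

keys <- c("CUST_Id_8Name", "DOB", "Mother's Name",
  "Contact Num", "Email address", "Owns Car", "Products held with Bank",
  "Company Name", "Salary per. month", "Background")

# Split on the known keys; the first piece is the empty string
# in front of the first key, so it is dropped.
pieces <- strsplit(x, paste0(keys, collapse = "|"))[[1]][-1]

# Remove the leading ":" from each piece and label each value with its key.
fields <- setNames(sub("^:", "", pieces), keys)

fields[["CUST_Id_8Name"]]  # "Mr.Praveen Kumar"
fields[["DOB"]]            # "" (blank, because nothing follows "DOB:")

# The customer ID is glued to the first key, so it is pulled out separately.
# Assumption: the ID is made of digits only.
cust_id <- sub("^(CUST_Id_[0-9]+).*$", "\\1", x)
```

Because empty fields come back as `""` rather than being dropped, `fields` always has one entry per key.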

Answer 2

You can use regexec and regmatches to extract the individual data items as substrings. Here is a working example:

Sample data:

txt <- c("CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:",
         "CUST_Id_15Name:Mr.Joe JohnsonDOB:01/02/1973Mother's Name:BarbaraContact Num:0123 456789Email address:joe@joesville.comOwns Car:YesProducts held with Bank:Savings, CurrentCompany Name:Joes villeSalary per. month:0000Background:shady")

Pattern to match:

pattern <- "CUST_Id_(.*)Name:(.*)DOB:(.*)Mother's Name:(.*)Contact Num:(.*)Email address:(.*)Owns Car:(.*)Products held with Bank:(.*)Company Name:(.*)Salary per. month:(.*)Background:(.*)"
# Derive the column names by splitting the pattern on each "_(.*)" or ":(.*)" group
var_names <- strsplit(pattern, "[:_]\\(\\.\\*\\)")[[1]]
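
If the field names are already available as a vector, the pattern does not have to be typed by hand. The following is only a sketch of one way to assemble it; `field_names`, `built_pattern`, and `built_names` are hypothetical names that do not appear in the original answer.

```r
# Field labels in the order they appear in the record (taken from the pattern above).
field_names <- c("Name", "DOB", "Mother's Name", "Contact Num", "Email address",
                 "Owns Car", "Products held with Bank", "Company Name",
                 "Salary per. month", "Background")

# One capture group for the ID, then "<label>:(.*)" for every field.
built_pattern <- paste0("CUST_Id_(.*)", paste0(field_names, ":(.*)", collapse = ""))

# Column names: the ID column followed by the field labels.
built_names <- c("CUST_Id", field_names)
```

Building the pattern this way keeps the labels and the column names in a single place, so adding a field means editing one vector.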

Run the match:

data <- as.data.frame(do.call("rbind", regmatches(txt, regexec(pattern, txt))))[, -1]
colnames(data) <- var_names

Output:

#  CUST_Id             Name        DOB Mother's Name Contact Num
#1       8 Mr.Praveen Kumar                                     
#2      15   Mr.Joe Johnson 01/02/1973       Barbara 0123 456789
#      Email address Owns Car Products held with Bank Company Name
#1                                                                
#2 joe@joesville.com      Yes        Savings, Current   Joes ville
#  Salary per. month Background
#1                             
#2           0000      shady
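
For the narrower question of pulling out the text between two known markers, such as Name: and DOB:, a single capture group with `sub` may be enough. This is a minimal sketch rather than part of the original answer: `between` is a hypothetical helper, and it assumes both markers are safe to use as regular expressions and occur in that order.

```r
x <- "CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

# Hypothetical helper: return the text between the first `left` marker and
# the next `right` marker, or NA when the markers are not found in that order.
between <- function(x, left, right) {
  pattern <- paste0("^.*?", left, "(.*?)", right, ".*$")
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

between(x, "Name:", "DOB:")           # "Mr.Praveen Kumar"
between(x, "DOB:", "Mother's Name:")  # "" (the field is present but empty)
```

The lazy quantifiers (`.*?`) make the match stop at the first occurrence of each marker, which matters here because "Name:" also appears inside "Mother's Name:" and "Company Name:".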

r

text-analysis

text-mining

rstudio