Multiple columns data frame

Question

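I have a data frame with a single column that I would like to split in R. It contains dates, text and numbers. I want to keep the text together in one column, so I cannot simply split on spaces. My idea was to add a dash between the words and then split on spaces, but I do not know how to do that without removing the first and last letter of each word.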

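Does anyone have an idea how to:

add a dash between the words while keeping all their letters, or
split the column into multiple columns in any other way?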

This is the kind of data frame I have:

tab <- data.frame(c1 = c("21.03.2016 This amasingly interesting text 2'000.50 3'000.60",
                         "22.03.2016 This other terrific text 5'000.54 6'000.90"))


#This is what I would like to obtain
tab1 <- data.frame(c1 = c("21.03.2016", "22.03.2016"),
                   c2 = c("This amasingly interesting text", "This other terrific text"),
                   c3 = c( "2'000.50", "5'000.54"),
                   c4 = c( "3'000.60", "6'000.90"))


#This is what I did to add dash
tab <- gsub("[A-z] [A-z]","_", tab$c1)
tab <- data.frame(tab)
library(stringr)
tab <- data.frame(str_split_fixed(tab$tab, " ", 4))

#This is pretty much what I want, except that some letters are missing
tab$X2 <- gsub("_"," ",tab$X2)

Answer 1

You can try the tidyr::extract function and supply a regex argument that separates the text into columns the way you expect.

One such attempt could be:

library(tidyverse)

tab %>% extract(col = c1, into = c("C1","C2","C3","C4"), 
                regex = "([0-9.]+)\\s([A-Za-z ]+)\\s([0-9.']+)\\s(.*)")

#           C1                              C2       C3       C4
# 1 21.03.2016 This amasingly interesting text 2'000.50 3'000.60
# 2 22.03.2016        This other terrific text 5'000.54 6'000.90

Regex explanation:

`([0-9.]+)`     - Match one or more of `0-9` or `.`; 1st group, for the 1st column
`\s`            - Match a single whitespace character (written `\\s` inside an R string)
`([A-Za-z ]+)`  - Match one or more letters or spaces; 2nd group, for the 2nd column
`\s`            - Match a single whitespace character
`([0-9.']+)`    - Match one or more of `0-9`, `.` or `'`; 3rd group, for the 3rd column
`\s`            - Match a single whitespace character
`(.*)`          - Match whatever remains; 4th group, for the 4th column
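
As a side note on the attempt in the question: `gsub("[A-z] [A-z]", "_", ...)` loses letters because the pattern matches the letter on each side of the space and replaces all three characters with the underscore. (`[A-z]` also covers a few punctuation characters that sit between `Z` and `a`, so `[A-Za-z]` is safer.) A minimal base R sketch of the original "join the words, then split on spaces" idea, assuming the text column contains only letters and single spaces as in the sample data, could use lookarounds so that only the space itself is replaced. The names `joined`, `parts` and `out` are placeholders chosen for this illustration.

```r
tab <- data.frame(c1 = c("21.03.2016 This amasingly interesting text 2'000.50 3'000.60",
                         "22.03.2016 This other terrific text 5'000.54 6'000.90"),
                  stringsAsFactors = FALSE)

# Replace only those spaces that have a letter on both sides.
# The lookbehind and lookahead check the neighbours without consuming them,
# so no letters are removed. Lookarounds need perl = TRUE.
joined <- gsub("(?<=[A-Za-z]) (?=[A-Za-z])", "_", tab$c1, perl = TRUE)

# Each row now has exactly four space-separated fields (for this sample data).
parts <- strsplit(joined, " ", fixed = TRUE)
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- c("c1", "c2", "c3", "c4")

# Put the spaces back into the text column.
out$c2 <- gsub("_", " ", out$c2, fixed = TRUE)
out
```

This relies on every row splitting into the same number of fields; rows where the text contains digits or punctuation would need a different pattern, which is where the `extract` approach above is more robust.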
