Combining fragmented sentences in an R dataframe

I have a dataframe that contains parts of whole sentences which, in some cases, are spread across multiple rows.

For example, head(mydataframe) returns

#  1 Do you have any idea what
#  2  they were arguing about?
#  3          Do--Do you speak
#  4                  English?
#  5                     yeah.
#  6            No, I'm sorry.

Assume that a sentence can be terminated by "." or "?" or "!" or "...".

Is there any R library function capable of producing the following output:

#  1 Do you have any idea what they were arguing about?
#  2          Do--Do you speak English?
#  3                     yeah.
#  4            No, I'm sorry.

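Here is what I got. I am sure there are better ways of doing this. I used base functions here. I created a sample dataframe called foo. First, I built a single string containing all the text in txt. toString() adds commas, so I removed them in the first gsub(). Then I handled whitespace (runs of two or more spaces) in a second gsub(). Next, I split the string at the delimiters you specified. Thanks to a post by Tyler Rinker, I managed to keep the delimiters in strsplit(). The final job was to remove the whitespace at the start of each sentence, and then unlist the list.

EDIT: Steven Beaupré revised my code. This is the way to go!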

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)

library(magrittr)

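toString(foo$txt) %>%                                     # join all rows, separated by ", "
  gsub(pattern = ",", replacement = "", x = .) %>%        # drop every comma, including commas inside the text
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%  # split after ., ? or ! and keep the delimiter
  lapply(., function(x) {
    gsub(pattern = "^ ", replacement = "", x = x)         # strip the leading space of each sentence
  }) %>%
  unlist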

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah."                                             
#[4] "No I'm sorry." 

This should work for all sentences ending in ., ..., ? or !:

x <- paste0(foo$txt, collapse = " ")
trimws(unlist(strsplit(x, "(?<=[?.!|])(?=\\s)", perl = TRUE)))

Hat tip to @AvinashRaj for the pointer on lookbehind.

Which gives:

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah..."                                           
#[4] "No, I'm sorry." 
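
To see why the lookahead `(?=\\s)` matters, here is a minimal sketch comparing a lookbehind-only pattern with the lookbehind-plus-lookahead pattern on a string that contains an ellipsis. The sample string `s` and the variable names `broken` and `kept` are made up for illustration.

```r
# Hypothetical sample string containing an ellipsis
s <- "yeah... No, I'm sorry."

# Lookbehind only: a split point follows every ".", so the ellipsis is broken apart
broken <- strsplit(s, "(?<=[?.!])", perl = TRUE)[[1]]
print(broken)

# Lookbehind plus lookahead: split only where a terminator is followed by whitespace
kept <- trimws(strsplit(s, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]])
print(kept)
# [1] "yeah..."        "No, I'm sorry."
```

Because the lookahead requires whitespace after the terminator, the last sentence needs no special handling: there is nothing after its final "." to split on.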

Data

I modified the toy dataset to include a case where a string ends with ... (as per the OP's request).

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)
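
Since the question asks for a dataframe rather than a character vector, the same idea could be wrapped in a small helper. This is only a sketch: the function name `combine_sentences` and the output column name `txt` are my own choices, and it assumes every sentence terminator is followed by whitespace once the rows are joined.

```r
# Hypothetical helper: join the fragments with single spaces, split after
# ".", "?" or "!" when followed by whitespace, and return one sentence per row.
combine_sentences <- function(txt) {
  joined <- paste(txt, collapse = " ")
  parts <- strsplit(joined, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]]
  data.frame(txt = trimws(parts), stringsAsFactors = FALSE)
}

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

res <- combine_sentences(foo$txt)
print(res)
```

With the toy data this should return four rows, one per sentence, with the comma in "No, I'm sorry." preserved because the rows are joined with paste() rather than toString().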