Combining fragmented sentences in an R dataframe
I have a data frame that contains parts of whole sentences which are, in some cases, spread across multiple rows.
For example, head(mydataframe) returns
# 1 Do you have any idea what
# 2 they were arguing about?
# 3 Do--Do you speak
# 4 English?
# 5 yeah.
# 6 No, I'm sorry.
Assume a sentence can be terminated by either "." or "?" or "!" or "...".
Is there any R library function capable of outputting the following:
# 1 Do you have any idea what they were arguing about?
# 2 Do--Do you speak English?
# 3 yeah.
# 4 No, I'm sorry.
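Here is what I got. I am sure there are better ways of doing this. Here I used base functions. I created a sample data frame called foo. First, I created a single string containing all the text in txt. toString() adds ", " between elements, so I removed those in the first gsub(). Then I handled white space (more than 2 spaces) in the second gsub(). Next, I split the string on the delimiters you specified. Thanks to Tyler Rinker in this post, I managed to keep the delimiters in strsplit(). The final job was to remove white space at sentence-initial positions and then unlist the list.

EDIT
Steven Beaupré revised my code. That is the way to go!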
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

library(magrittr)

# Join all fragments into one string, then split after each terminator.
# Note: the gsub() on "," removes every comma, including those that
# belong to the text itself (see element 4 of the output).
toString(foo$txt) %>%
    gsub(pattern = ",", replacement = "", x = .) %>%
    strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
    lapply(., function(x) gsub(pattern = "^ ", replacement = "", x = x)) %>%
    unlist
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah."
#[4] "No I'm sorry."
This should work for all sentences ending with ., ..., ? or !
x <- paste0(foo$txt, collapse = " ")
trimws(unlist(strsplit(x, "(?<=[?.!|])(?=\\s)", perl=TRUE)))
Hat tip to @AvinashRaj for the pointer on the lookbehind.
Which gives:
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah..."
#[4] "No, I'm sorry."
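To illustrate why the (?=\\s) lookahead matters once "..." is allowed as a terminator, here is a small sketch comparing the two patterns. The input string and the names naive and guarded are made up for illustration and are not from the original data.

```r
# Illustrative input (an assumption, not taken from the data frame above).
x <- "yeah... No."

# Lookbehind only: a split point follows every single ".", so the
# ellipsis is broken apart into several pieces.
naive <- unlist(strsplit(x, "(?<=[?.!])", perl = TRUE))

# Lookbehind plus lookahead: split only where a terminator is followed
# by whitespace, so "..." stays attached to its sentence.
guarded <- trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))

print(naive)
print(guarded)
```

The lookahead restricts split points to terminators that are followed by whitespace, which is why the second pattern keeps the ellipsis in one piece.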
Data
I modified the toy dataset to include a case where a string ends with ... (as per OP's request)
foo <- data.frame(num = 1:6,
txt = c("Do you have any idea what", "they were arguing about?",
"Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
stringsAsFactors = FALSE)
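
Since the question asks for the result in data frame form, the approach above could be wrapped as follows. This is only a minimal sketch: combine_fragments is a hypothetical helper name (not a function from base R or any package), and the new num column simply renumbers the sentences.

```r
# Hypothetical helper: collapse the fragments into one string, split at
# sentence terminators that are followed by whitespace, and trim each piece.
combine_fragments <- function(fragments) {
  joined <- paste(fragments, collapse = " ")
  pieces <- strsplit(joined, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]]
  trimws(pieces)
}

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

sentences <- combine_fragments(foo$txt)

# One sentence per row, renumbered from 1.
result <- data.frame(num = seq_along(sentences),
                     txt = sentences,
                     stringsAsFactors = FALSE)
print(result)
```

Note that the pattern treats any ".", "?" or "!" followed by whitespace as a sentence boundary, so text containing abbreviations such as "Dr. Smith" would also be split at that point.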