loop or lapply() - string concatenation by factors
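I want to concatenate text entries by factor category (module_components).
Ultimately, I need word frequencies/n-grams for the text of each factor (module_components) in the dataset, so I want to first concatenate all text entries for each factor level.
Data I have: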
Row | module_component | Long_Text |
---|---|---|
1 | Computer123 | Computer retuned due to a battery issue |
2 | Computer 123 | The computer did not power on |
3 | Laptop42 | Screen Broken |
4 | Laptop42 | Keyboard unresponsive |
5 | Lapop62 | Battery chord issues |
Data I want: a list of the categories (module_components) together with the concatenated text field (Long_Text)
module_component | concatenated_Long_Text |
---|---|
Computer123 | Computer retuned due to a battery issue The computer did not power on |
Laptop42 | Screen Broken Keyboard unresponsive |
Lapop62 | Battery chord issues |
Code I have tried:
df_split <- split(df, paste0(df$module_component))
list_by_modules <- lapply(df_split, FUN = paste(df_split$LongText)) #**STUCK HERE**
I'm not sure which function to use for the concatenation step. paste(Long_Text) does not work.
I'm open to any other approach that gets this done. Thanks
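For reference, the attempt above most likely fails for two reasons: `FUN` must be a function, whereas `paste(df_split$LongText)` is a call that is evaluated immediately, and the column is named `Long_Text`, not `LongText`. A minimal base R sketch of the same split-then-lapply idea follows. It assumes the space in "Computer 123" is a typo, and the names `text_by_module` and `result` are only illustrative.

```r
df <- data.frame(
  Row = 1:5,
  module_component = c("Computer123", "Computer 123", "Laptop42",
                       "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on", "Screen Broken",
                "Keyboard unresponsive", "Battery chord issues"),
  stringsAsFactors = FALSE
)

# Assumption: "Computer 123" is a typo, so strip whitespace from the keys
df$module_component <- gsub("\\s", "", df$module_component)

# Split only the text column by group (not the whole data frame)
text_by_module <- split(df$Long_Text, df$module_component)

# Pass the function itself; extra arguments (collapse) are forwarded to paste
list_by_modules <- lapply(text_by_module, paste, collapse = " ")

# Optional: turn the named list back into a two-column data frame
result <- data.frame(
  module_component = names(list_by_modules),
  concatenated_Long_Text = unlist(list_by_modules, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Splitting `df$Long_Text` directly keeps each list element a plain character vector, so `paste(..., collapse = " ")` returns one string per module.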
A possible solution, based on stringr (the mutate step removes the space in "Computer 123", which I assume is a typo):
library(tidyverse)

df <- data.frame(
  Row = c(1L, 2L, 3L, 4L, 5L),
  module_component = c("Computer123",
    "Computer 123", "Laptop42", "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
    "The computer did not power on", "Screen Broken", "Keyboard unresponsive",
    "Battery chord issues")
)

df %>%
  # Strip whitespace so "Computer 123" is grouped with "Computer123"
  mutate(module_component = str_remove_all(module_component, "\\s")) %>%
  group_by(module_component) %>%
  # Join all texts of a group into one space-separated string
  summarise(Long_Text = str_c(Long_Text, collapse = " "))
#> # A tibble: 3 × 2
#>   module_component Long_Text
#>   <chr>            <chr>
#> 1 Computer123      Computer retuned due to a battery issue The computer did not…
#> 2 Lapop62          Battery chord issues
#> 3 Laptop42         Screen Broken Keyboard unresponsive
Use toString in aggregate.
aggregate(Long_Text ~ module_component, dat, toString)
# module_component Long_Text
# 1 Computer123 Computer retuned due to a battery issue, The computer did not power on
# 2 Lapop62 Battery chord issues
# 3 Laptop42 Screen Broken, Keyboard unresponsive
Or paste.
aggregate(Long_Text ~ module_component, dat, paste, collapse=' ')
# module_component Long_Text
# 1 Computer123 Computer retuned due to a battery issue The computer did not power on
# 2 Lapop62 Battery chord issues
# 3 Laptop42 Screen Broken Keyboard unresponsive
I prefer toString, though.
Data:
dat <- structure(list(Row = 1:5, module_component = c("Computer123",
"Computer123", "Laptop42", "Laptop42", "Lapop62"), Long_Text = c("Computer retuned due to a battery issue",
"The computer did not power on", "Screen Broken", "Keyboard unresponsive",
"Battery chord issues")), class = "data.frame", row.names = c(NA,
-5L))
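As a follow-up to the word-frequency/n-gram goal stated in the question, here is a rough base R sketch of one way to count words and bigrams per module once the text is concatenated. The tokenization rule (lower-casing and splitting on non-alphanumeric characters) and the names `tokens`, `word_freq`, and `bigrams` are assumptions for illustration, not something the question specifies.

```r
dat <- data.frame(
  module_component = c("Computer123", "Computer123", "Laptop42",
                       "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on", "Screen Broken",
                "Keyboard unresponsive", "Battery chord issues"),
  stringsAsFactors = FALSE
)

# One concatenated string per module, as in the aggregate/paste answer
joined <- aggregate(Long_Text ~ module_component, dat, paste, collapse = " ")

# Assumed tokenization: lower-case, split on runs of non-alphanumerics
tokens <- strsplit(tolower(joined$Long_Text), "[^a-z0-9]+")
names(tokens) <- joined$module_component
tokens <- lapply(tokens, function(w) w[nzchar(w)])  # drop empty tokens

# Word frequencies per module, most frequent first
word_freq <- lapply(tokens, function(w) sort(table(w), decreasing = TRUE))

# Bigrams per module: each word pasted with its successor
bigrams <- lapply(tokens, function(w) {
  if (length(w) < 2) character(0) else paste(head(w, -1), tail(w, -1))
})

print(word_freq[["Computer123"]])
print(bigrams[["Laptop42"]])
```

Note that building n-grams from the concatenated string also creates n-grams that span two original entries (for example "issue the"). If that is undesirable, the n-grams could be computed per row and only then pooled by module.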