loop or lapply() - string concatenation by factors

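I want to concatenate text entries by factor level (module_component). Ultimately, I need word frequencies/n-grams for the text of each factor level (module_component) in the dataset, so I want to first concatenate all text entries for each factor level.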

Data I have:

Row module_component Long_Text
1 Computer123 Computer retuned due to a battery issue
2 Computer 123 The computer did not power on
3 Laptop42 Screen Broken
4 Laptop42 Keyboard unresponsive
5 Lapop62 Battery chord issues

Data I want: a list of the categories (module_component) along with the concatenated text field (Long_Text)

module_component concatenated_Long_Text
Computer123 Computer retuned due to a battery issue The computer did not power on
Laptop42 Screen Broken Keyboard unresponsive
Lapop62 Battery chord issues

Code I tried:

df_split <- split(df, paste0(df$module_component))
list_by_modules <- lapply(df_split, FUN = paste(df_split$LongText)) #**STUCK HERE**
              

I'm not sure which function to use to concatenate the pieces. paste(Long_Text) does not work.

I'm open to any other approach that gets this done. Thanks.

A possible solution, based on stringr (the mutate step removes the space in "Computer 123", which I guess is a typo):

library(tidyverse)

df <- data.frame(
  Row = c(1L, 2L, 3L, 4L, 5L),
  module_component = c("Computer123",
                       "Computer 123","Laptop42","Laptop42","Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on","Screen Broken","Keyboard unresponsive",
                "Battery chord issues")
)

df %>% 
  mutate(module_component = str_remove_all(module_component, "\\s")) %>% 
  group_by(module_component) %>% 
  summarise(Long_Text = str_c(Long_Text, collapse = " "))

#> # A tibble: 3 × 2
#>   module_component Long_Text                                                    
#>   <chr>            <chr>                                                        
#> 1 Computer123      Computer retuned due to a battery issue The computer did not…
#> 2 Lapop62          Battery chord issues                                         
#> 3 Laptop42         Screen Broken Keyboard unresponsive

Using toString in aggregate:

aggregate(Long_Text ~ module_component, dat, toString)
#   module_component                                                              Long_Text
# 1      Computer123 Computer retuned due to a battery issue, The computer did not power on
# 2          Lapop62                                                   Battery chord issues
# 3         Laptop42                                   Screen Broken, Keyboard unresponsive

Or with paste:

aggregate(Long_Text ~ module_component, dat, paste, collapse=' ')
#   module_component                                                             Long_Text
# 1      Computer123 Computer retuned due to a battery issue The computer did not power on
# 2          Lapop62                                                  Battery chord issues
# 3         Laptop42                                   Screen Broken Keyboard unresponsive

I'd prefer toString, though.


Data:

dat <- structure(list(Row = 1:5, module_component = c("Computer123", 
"Computer123", "Laptop42", "Laptop42", "Lapop62"), Long_Text = c("Computer retuned due to a battery issue", 
"The computer did not power on", "Screen Broken", "Keyboard unresponsive", 
"Battery chord issues")), class = "data.frame", row.names = c(NA, 
-5L))
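
For completeness, here is a minimal base R sketch that stays close to the split()/lapply() attempt in the question. That attempt fails for two reasons: lapply() expects a function as FUN, not the result of a call to paste(), and df_split is a list of data frames, so df_split$LongText is NULL (the column is also named Long_Text). The sketch assumes the space in "Computer 123" is a typo and has already been removed. The names text_split, result and word_freq are introduced here for illustration only, and the word-frequency step is just a rough whitespace tokenizer, not a full n-gram pipeline.

```r
# Sample data mirroring the question (assumes "Computer 123" was a typo)
dat <- data.frame(
  module_component = c("Computer123", "Computer123", "Laptop42",
                       "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on",
                "Screen Broken",
                "Keyboard unresponsive",
                "Battery chord issues")
)

# Split only the text column by module, giving one character vector per level
text_split <- split(dat$Long_Text, dat$module_component)

# Pass paste itself as FUN; collapse = " " is forwarded to every call
list_by_modules <- lapply(text_split, paste, collapse = " ")

# Optionally turn the named list back into a data frame
result <- data.frame(
  module_component = names(list_by_modules),
  concatenated_Long_Text = unlist(list_by_modules, use.names = FALSE)
)

# Rough word counts per module: lower-case, split on whitespace, tabulate
word_freq <- lapply(list_by_modules, function(txt) {
  table(strsplit(tolower(txt), "\\s+")[[1]])
})
```

Splitting the column rather than the whole data frame keeps each list element a plain character vector, so paste() can collapse it directly. For real n-gram work a dedicated text-mining package would likely be a better fit than the table() step shown here.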