loop or lapply() - string concatenation by factors

Question

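I want to concatenate text entries by factor category (module_components). Ultimately, I need word frequencies/n-grams for the text of each factor (module_components) in the dataset, so I first want to concatenate all text entries for each factor level.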

Data I have:

Row	module_component	Long_Text
1	Computer123	Computer retuned due to a battery issue
2	Computer 123	The computer did not power on
3	Laptop42	Screen Broken
4	Laptop42	Keyboard unresponsive
5	Lapop62	Battery chord issues

Data I want: a list of categories (module_components) with the concatenated text field (Long_Text)

module_component	concatenated_Long_Text
Computer123	Computer retuned due to a battery issue The computer did not power on
Laptop42	Screen Broken Keyboard unresponsive
Lapop62	Battery chord issues

Code I have tried:

df_split <- split(df, paste0(df$module_component))
list_by_modules <- lapply(df_split, FUN = paste(df_split$LongText)) #**STUCK HERE**

I'm not sure which function to use to concatenate the split pieces. paste(Long_Text) does not work.

I'm open to any other approach that gets this done. Thanks.

Answer 1

A possible solution, based on stringr and dplyr (the mutate removes the space in "Computer 123", which I guess is a typo):

library(tidyverse)

df <- data.frame(
  Row = c(1L, 2L, 3L, 4L, 5L),
  module_component = c("Computer123",
                       "Computer 123","Laptop42","Laptop42","Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on","Screen Broken","Keyboard unresponsive",
                "Battery chord issues")
)

df %>% 
  mutate(module_component = str_remove_all(module_component, "\\s")) %>% 
  group_by(module_component) %>% 
  summarise(Long_Text = str_c(Long_Text, collapse = " "))

#> # A tibble: 3 × 2
#>   module_component Long_Text                                                    
#>   <chr>            <chr>                                                        
#> 1 Computer123      Computer retuned due to a battery issue The computer did not…
#> 2 Lapop62          Battery chord issues                                         
#> 3 Laptop42         Screen Broken Keyboard unresponsive
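
Since the stated end goal is word frequencies per module, a next step on the concatenated strings might look like the following. This is only a minimal base-R sketch: the `concatenated` vector is typed in by hand to mirror the result above, `word_freq` is a name made up here, and the tokenisation (lower-casing, splitting on anything that is not a letter) is an assumption rather than something the question specifies.

```r
# Hand-typed stand-in for the concatenated result, one string per module
concatenated <- c(
  Computer123 = "Computer retuned due to a battery issue The computer did not power on",
  Laptop42    = "Screen Broken Keyboard unresponsive"
)

# Lower-case each string, split on runs of non-letters, and count the words
word_freq <- lapply(concatenated, function(txt) {
  words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens, if any
  sort(table(words), decreasing = TRUE)  # most frequent words first
})

# Frequency table for one module
word_freq$Computer123
```

For real n-grams a dedicated text-mining package is probably a better fit than this hand-rolled split.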

Answer 2

Use toString in aggregate.

aggregate(Long_Text ~ module_component, dat, toString)
#   module_component                                                              Long_Text
# 1      Computer123 Computer retuned due to a battery issue, The computer did not power on
# 2          Lapop62                                                   Battery chord issues
# 3         Laptop42                                   Screen Broken, Keyboard unresponsive

Or paste.

aggregate(Long_Text ~ module_component, dat, paste, collapse=' ')
#   module_component                                                             Long_Text
# 1      Computer123 Computer retuned due to a battery issue The computer did not power on
# 2          Lapop62                                                  Battery chord issues
# 3         Laptop42                                   Screen Broken Keyboard unresponsive

I prefer toString, though.
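
The two variants differ only in the separator: for a character vector, toString(x) gives the same result as paste(x, collapse = ", "). A small check, using a made-up vector `x` rather than the real data:

```r
x <- c("Screen Broken", "Keyboard unresponsive")

# toString() joins with ", "; paste() joins with whatever `collapse` is set to
toString(x)
paste(x, collapse = ", ")
paste(x, collapse = " ")

# Without `collapse`, paste() returns a vector of the same length (here 2),
# which is why a bare paste(Long_Text) does not concatenate anything
length(paste(x))
```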

Data:

dat <- structure(list(Row = 1:5, module_component = c("Computer123", 
"Computer123", "Laptop42", "Laptop42", "Lapop62"), Long_Text = c("Computer retuned due to a battery issue", 
"The computer did not power on", "Screen Broken", "Keyboard unresponsive", 
"Battery chord issues")), class = "data.frame", row.names = c(NA, 
-5L))
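
The split()/lapply() route from the question can also be made to work. The attempt there passes the result of a paste() call as FUN instead of a function, and indexes the split list with $LongText, which does not exist. One way to fix it, as a minimal sketch: the data frame is rebuilt here so the block runs on its own, and `text_by_module` and `result` are names made up for illustration.

```r
# Same example data as `dat` above, rebuilt so this block is self-contained
dat <- data.frame(
  module_component = c("Computer123", "Computer123", "Laptop42", "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on", "Screen Broken",
                "Keyboard unresponsive", "Battery chord issues")
)

# Split only the text column by module, then collapse each group to one string
text_by_module  <- split(dat$Long_Text, dat$module_component)
list_by_modules <- lapply(text_by_module, paste, collapse = " ")

# Optional: turn the named list back into a two-column data frame
result <- data.frame(
  module_component       = names(list_by_modules),
  concatenated_Long_Text = unlist(list_by_modules, use.names = FALSE)
)
result
```

Extra arguments to lapply() (here collapse = " ") are passed on to FUN, so no anonymous function is needed.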

