Best way to clean free text then turn into a transaction dataset

Question

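I have survey data containing free text that I want to clean and then put into a transaction dataset so I can run it through the arules R package. Right now the text looks like this: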

id | Answers    
1  | John thinks that the product is not worth the price
2  | Amy believes that the functionality is well above expectations

This is what I am trying to get:

1 | John | thinks   | Product       | Not   | Worth | Price    
2 | Amy  | Believes | Functionality | Above | Expectations

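I have already been able to clean the data with the tm package, but I don't know the best way to convert it into a transaction dataset. So far I have converted everything to lowercase and removed the stop words.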

Assume my data is in a data frame named "Questions". After cleaning, I cannot convert the corpus into a transaction dataset.

Answer 1

You can try:

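library(stringr)
# `data` is the data frame holding the survey answers (called "Questions" in the question)
# Split each answer on single spaces; returns one character vector per row
str_split(data$Answers, " ")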

The output is a list:

[[1]]
 [1] "John"    "thinks"  "that"    "the"     "product" "is"      "not"     "worth"   "the"     "price"  

[[2]]
[1] "Amy"           "believes"      "that"          "the"           "functionality" "is"           
[7] "well"          "above"         "expectations"

Edit:

Use the unique function to remove duplicate words within each answer:

my_list <- str_split(data$Answers, " ")
# Apply unique() to each element so every word appears at most once per answer
lapply(my_list, unique)

[[1]]
[1] "John"    "thinks"  "that"    "the"     "product" "is"      "not"     "worth"   "price"  

[[2]]
[1] "Amy"           "believes"      "that"          "the"           "functionality" "is"           
[7] "well"          "above"         "expectations"
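
The list above still has to be turned into a transaction object, which is the step the question asks about. A minimal sketch of one way to do it, assuming the arules package is installed: a list of de-duplicated item vectors can be coerced with as(..., "transactions"). The toy Questions data frame and the hand-written stop_words vector below are illustrative assumptions, not the asker's actual data or the tm stop word list.

```r
library(arules)  # provides the "transactions" class and inspect()

# Toy stand-in for the survey data frame described in the question
Questions <- data.frame(
  id = 1:2,
  Answers = c("John thinks that the product is not worth the price",
              "Amy believes that the functionality is well above expectations"),
  stringsAsFactors = FALSE
)

# Hypothetical stop word list, written by hand for this example only
stop_words <- c("that", "the", "is", "well")

# Lowercase each answer and split it on runs of whitespace
tokens <- strsplit(tolower(Questions$Answers), "\\s+")

# Drop stop words and keep each remaining word once per answer
items <- lapply(tokens, function(w) unique(w[!w %in% stop_words]))

# List names are used as the transaction IDs
names(items) <- Questions$id

# Coerce the list of item vectors to an arules transactions object
trans <- as(items, "transactions")
inspect(trans)
```

Removing duplicates first matters here because each item may appear only once within a transaction. The resulting trans object can then be passed to functions such as apriori().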


Tags: r, text-mining, market-basket-analysis
