Best way to clean free text then turn into a transaction dataset
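I have survey data containing free text that I want to clean and then turn into a transaction dataset to run through the arules R package. Right now the text looks like this: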
id | Answers
1 | John thinks that the product is not worth the price
2 | Amy believes that the functionality is well above expectations
This is what I am trying to get:
1 | John | thinks | Product | Not | Worth | Price
1 | Amy | Believes | Functionality | Above | Expectations
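I have been able to clean the data with the tm package, but I don't know the best way to convert it into a transaction dataset. I have already converted everything to lowercase and removed stop words.
Assume my data is in a data frame called "Questions". After cleaning, I can't convert the corpus into a transaction dataset.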
You can try:
library(stringr)
str_split(data$Answers, " ")
The output is a list:
[[1]]
[1] "John" "thinks" "that" "the" "product" "is" "not" "worth" "the" "price"
[[2]]
[1] "Amy" "believes" "that" "the" "functionality" "is"
[7] "well" "above" "expectations"
Edit:
Use the unique function to remove duplicates:
my_list <- str_split(data$Answers, " ")
lapply(my_list , unique)
[[1]]
[1] "John" "thinks" "that" "the" "product" "is" "not" "worth" "price"
[[2]]
[1] "Amy" "believes" "that" "the" "functionality" "is"
[7] "well" "above" "expectations"
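To get from that list to something arules can use, one option is to coerce the list straight to the "transactions" class. The following is only a sketch under a few assumptions: the data frame is rebuilt here as a toy copy of "Questions", the splitting uses base R's strsplit instead of str_split, and stop_words is a small illustrative vector picked to match the desired output in the question (the stop-word list from tm could be used instead).

```r
library(arules)  # provides the "transactions" class and inspect()

# Toy copy of the data frame described in the question
Questions <- data.frame(
  id = 1:2,
  Answers = c(
    "John thinks that the product is not worth the price",
    "Amy believes that the functionality is well above expectations"
  ),
  stringsAsFactors = FALSE
)

# Lowercase, split on whitespace, and keep each word once per answer
items <- lapply(strsplit(tolower(Questions$Answers), "\\s+"), unique)

# Illustrative stop-word list (an assumption, not taken from tm)
stop_words <- c("that", "the", "is", "well")
items <- lapply(items, function(w) setdiff(w, stop_words))

# List names become the transaction IDs
names(items) <- Questions$id

# Each list element becomes one transaction, each word one item
trans <- as(items, "transactions")
inspect(trans)
```

The coercion expects every list element to be free of duplicate items, which is why the unique step matters here. The resulting object can then be passed to functions such as apriori().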