Remove words from a dataframe that are the same but in a different order
I have words like these in a dfm:
library("quanteda")
Package version: 2.1.2
dfmat <- dfm(c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other"))
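For example, the tokens "hello_text" and "text_hello" contain the same words in a different order. How can I keep only one of the two variants?
Expected output: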
dfmat <- dfm(c("hello_text","test1_test2", "test2_test2_test2", "test2_other", "other"))
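I found an example solution, but it removes identical words.
Split the strings at the underscore and sort the parts alphabetically, then use the resulting list to identify duplicates and apply that to the original vector: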
words <- c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")
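# lapply() always returns a list, so duplicated() compares whole tokens even when every token has the same number of parts
words_sorted <- lapply(strsplit(words, "_", fixed = TRUE), sort)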
words[!duplicated(words_sorted)]
Returns:
[1] "hello_text" "test1_test2" "test2_test2_test2" "test2_other"
[5] "other"
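The same idea can also be written with a plain character key per token, which some may find easier to inspect than a list. This is only a sketch in base R: `canonical_key` is a hypothetical helper name, not part of quanteda, and it assumes the parts are always separated by a single underscore.

```r
# Hypothetical helper: build an order-independent key for each token
# by splitting on the separator, sorting the parts, and re-joining them.
canonical_key <- function(x, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
}

words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")

keys <- canonical_key(words)
keep <- !duplicated(keys)  # TRUE for the first token seen with each key
result <- words[keep]
print(result)
```

For the dfm from the question, the keys could presumably be computed from `featnames(dfmat)` and the logical vector used to subset columns, e.g. `dfmat[, !duplicated(canonical_key(featnames(dfmat)))]`. That drops the later variants together with their counts rather than merging them, so check whether that is what you want.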