Text mining in R - extracting 2-grams for only a few terms and 1-grams for the rest

text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

I want to extract 1-gram tokens for most words, but 2-gram tokens for words like "extremely", "no", "not", etc.

For example, when I get the tokens, they should look like this: the, nurse, was, extremely helpful, she, was, truly, a, gem, helping, no issue, not bad

These are the terms that should appear in the term-document matrix.

Thanks for your help!!

Here's a possible solution (assuming you don't want to split only on c("extremely", "no", "not"), but also want to include words similar to them). The package qdapDictionaries has several dictionaries for this, such as amplification.words (contains "extremely"), negation.words (contains "no" and "not"), etc.
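To get a feel for which words these dictionaries cover, note that each one is just a character vector you can inspect directly:

library(qdapDictionaries)

# each dictionary is a plain character vector of terms
head(amplification.words)   # intensifiers, e.g. "extremely"
head(negation.words)        # negators, e.g. "no", "not"
head(deamplification.words) # de-amplifying / hedging words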

Below is an example of how to split on spaces, except when the space follows a word in a predefined vector (here we use amplification.words, negation.words, & deamplification.words from qdapDictionaries). If you want to use a more customized word list, you can change the definition of no_split_words (a sketch of that follows the output below).

Performing the split

library(stringr)
library(qdapDictionaries)

text <-  c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

# define the list of words after which we don't want to split on the space
no_split_words <- c(amplification.words, negation.words, deamplification.words)
# collapse the words into the form "word1|word2| ... |wordn"
regex_or       <- paste(no_split_words, collapse="|")
# split on a space only if the preceding word is not in no_split_words
# (paste0 keeps the pattern free of stray spaces; "\\s" escapes the whitespace class)
split_regex    <- regex(paste0("((?<!", regex_or, "))\\s"))

# perform split
str_split(text, split_regex)

# output
[[1]]
[1] "the"               "nurse"             "was"               "extremely helpful"

[[2]]
[1] "she"     "was"     "truly a" "gem"    

[[3]]
[1] "helping"

[[4]]
[1] "no issue"

[[5]]
[1] "not bad"

Using tidytext

Creating the dtm

(assuming the code block above has already been run)

library(tidytext)
library(dplyr)

doc_df <- tibble(text) %>% 
  mutate(doc_id = row_number())

# create a document-term matrix (a DocumentTermMatrix from the tm package)
# this one is binary: every (doc, term) pair present gets value 1
# value could instead be term frequency, tf-idf, etc. for a non-binary dtm
# (see the term-frequency sketch below)
tm_dtm <- doc_df %>% 
  unnest_tokens(tokens, text, token="regex", pattern=split_regex) %>% 
  mutate(value = 1) %>%  
  cast_dtm(doc_id, tokens, value)

# can coerce to matrix if desired
matrix_dtm <- as.matrix(tm_dtm)
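As the comments above note, this dtm is binary. For a term-frequency variant, one option is to count token occurrences per document with dplyr::count() instead of hard-coding value = 1; a minimal sketch, reusing doc_df and split_regex from earlier:

# term-frequency dtm: n = how often each token appears in each doc
tf_dtm <- doc_df %>% 
  unnest_tokens(tokens, text, token="regex", pattern=split_regex) %>% 
  count(doc_id, tokens) %>% 
  cast_dtm(doc_id, tokens, n)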