Text Mining in R - Extracting 2-grams for only a few terms and 1-grams for the rest
text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')
I want to extract 1-gram tokens for most words, and 2-gram tokens for words such as extremely, no, not, etc.
For example, when I get the tokens they should look like this:
the,
nurse,
was,
extremely helpful,
she,
truly,
gem,
helping,
no issue,
not bad
These are the terms that should appear in the term-document matrix.
Thanks for your help!!
Here is one possible solution (assuming you don't want to split only on c("extremely", "no", "not"), but also want to include words similar to them).
The package qdapDictionaries has dictionaries such as amplification.words (which contains "extremely"), negation.words (which contains "no" and "not"), and so on.
Below is an example of how to split on a space, except when the space follows a word in a predefined vector (here we use amplification.words, negation.words, & deamplification.words from qdapDictionaries). If you want to use a more customized word list, you can change the definition of no_split_words.
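If you want to see what these dictionaries contain before relying on them, you can inspect the vectors directly; a quick illustrative check (the exact contents depend on your installed version of qdapDictionaries):
library(qdapDictionaries)
# peek at the word lists used below
head(amplification.words)
head(negation.words)
head(deamplification.words)
# confirm that the words from the question are covered
"extremely" %in% amplification.words
c("no", "not") %in% negation.words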
Performing the split
library(stringr)
library(qdapDictionaries)
text <- c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')
# define the list of words after which we don't want to split on a space
no_split_words <- c(amplification.words, negation.words, deamplification.words)
# collapse the words into the form "word1|word2| ... |wordn"
regex_or <- paste(no_split_words, collapse="|")
# split on a space only if the previous word is not in no_split_words
split_regex <- regex(paste0("((?<!", regex_or, "))\\s"))
# perform split
str_split(text, split_regex)
#output
[[1]]
[1] "the" "nurse" "was" "extremely helpful"
[[2]]
[1] "she" "was" "truly a" "gem"
[[3]]
[1] "helping"
[[4]]
[1] "no issue"
[[5]]
[1] "not bad"
Creating the dtm with tidytext (assuming the code block above has already been run)
library(tidytext)
library(dplyr)
doc_df <- tibble(text) %>%
  mutate(doc_id = row_number())
# creates a document-term matrix (from the tm package)
# here the dtm is binary (value = 1 if the term occurs in the document)
# value could instead be a term frequency, tf-idf, etc. for a non-binary dtm
tm_dtm <- doc_df %>%
  unnest_tokens(tokens, text, token = "regex", pattern = split_regex) %>%
  mutate(value = 1) %>%
  cast_dtm(doc_id, tokens, value)
# can coerce to a plain matrix if desired
matrix_dtm <- as.matrix(tm_dtm)
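To check which terms ended up as columns of the matrix, you can inspect the result; a short check, assuming the code above has run:
library(tm)
# list the terms (columns) of the document-term matrix
Terms(tm_dtm)
# or print the full binary matrix, one row per document
matrix_dtm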