Compute chi square value between ngrams and documents with Quanteda
I use the Quanteda R package to extract ngrams (here 1-grams and 2-grams) from the text Data_clean$Review, but I am looking for a way to compute the chi-square value between the documents and the extracted ngrams in R.

Below is the R code I use to clean the text (Review) and generate the n-grams.

Any ideas? Thank you.
#delete rows with empty value columns
Data_clean <- Data[Data$Note!="" & Data$Review!="",]
Data_clean$id <- seq.int(nrow(Data_clean))
train.index <- 1:50000
test.index <- 50001:nrow(Data_clean)
#clean up
# remove grammar/punctuation
Data_clean$Review.clean <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))
train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]
temp.tf <- Data_clean$Raison.Reco.clean %>%
  tokens(ngrams = 1:2) %>% # generate tokens
  dfm()                    # generate dfm
You would not use ngrams for this, but rather a function called textstat_collocations().

Since none of these objects is explained or supplied, it is hard to follow your example exactly, so let's try it with some of quanteda's built-in data. I will take the texts from the inaugural corpus and apply some filters similar to yours above.

So, to score bigrams by chi^2, you could use:
# create the corpus, subset on some conditions (could be Note != "" for instance)
corp_example <- data_corpus_inaugural
corp_example <- corpus_subset(corp_example, Year > 1960)
# this will remove punctuation and numbers
toks_example <- tokens(corp_example, remove_punct = TRUE, remove_numbers = TRUE)
# find and score chi^2 bigrams
coll2 <- textstat_collocations(toks_example, method = "chi2", max_size = 2)
head(coll2, 10)
# collocation count X2
# 1 reverend clergy 2 28614.00
# 2 Majority Leader 2 28614.00
# 3 Information Age 2 28614.00
# 4 Founding Fathers 3 28614.00
# 5 distinguished guests 3 28614.00
# 6 Social Security 3 28614.00
# 7 Chief Justice 9 23409.82
# 8 middle class 4 22890.40
# 9 Abraham Lincoln 2 19075.33
# 10 society's ills 2 19075.33
Added:
# needs to be a list of the collocations as separate character elements
coll2a <- sapply(coll2$collocation, strsplit, " ", USE.NAMES = FALSE)
# compound the tokens using top 100 collocations
toks_example_comp <- tokens_compound(toks_example, coll2a[1:100])
toks_example_comp[[1]][1:20]
# [1] "Vice_President" "Johnson" "Mr_Speaker" "Mr_Chief" "Chief_Justice"
# [6] "President" "Eisenhower" "Vice_President" "Nixon" "President"
# [11] "Truman" "reverend_clergy" "fellow_citizens" "we" "observe"
# [16] "today" "not" "a" "victory" "of"
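If what you want is specifically a chi^2 association between each (compounded) n-gram feature and the documents, one option is textstat_keyness(), which compares a feature's frequency in a target document against its frequency in all the others. A minimal sketch, assuming a dfm is built from the compounded tokens created above (note that in quanteda >= 3 the textstat_* functions moved to the quanteda.textstats package):

```r
library(quanteda)
library(quanteda.textstats)  # textstat_keyness() lives here in quanteda >= 3

# build a document-feature matrix from the compounded tokens above
dfmat <- dfm(toks_example_comp)

# chi^2 keyness of each feature in document 1 versus all other documents
key <- textstat_keyness(dfmat, target = 1, measure = "chi2")
head(key)
```

Looping over documents with each one as the target would give you a per-document chi^2 score for every n-gram feature.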