Create a co-occurrence matrix with bigrams
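I want to create a co-occurrence matrix from a single string using bigrams rather than unigrams. I have been following these links:
http://text2vec.org/glove.html
https://tm4ss.github.io/docs/Tutorial_5_Co-occurrence.html#3_statistical_significance
I want to create the matrix and then traverse it to build a dataset like this:
Term1 Term2 Score
The biggest problem is traversing the sentences with bigrams. Any help with this would be great.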
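Just specify your bigrams and create the co-occurrence matrix. Below are some (very) simple examples. Pick one package and do everything with that package. Both quanteda and text2vec can use multiple cores/threads. Traversing the resulting co-occurrence matrix can be done with reshape2::melt, like this: reshape2::melt(as.matrix(my_cooccurence_matrix)). Note that as.matrix() turns the sparse matrix into a dense one, so this is only practical for small vocabularies.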
txt <- c("The quick brown fox jumped over the lazy dog.",
"The dog jumped and ate the fox.")
Creating a feature co-occurrence matrix with quanteda:
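library(quanteda)
# Lowercase, drop punctuation, then form bigrams; recent quanteda releases
# use tokens_ngrams() rather than the ngrams argument of tokens().
toks <- tokens(char_tolower(txt), remove_punct = TRUE)
toks <- tokens_ngrams(toks, n = 2)
f <- fcm(toks, context = "document")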
Feature co-occurrence matrix of: 14 by 14 features.
14 x 14 sparse Matrix of class "fcm"
features
features the_quick quick_brown brown_fox fox_jumped jumped_over over_the the_lazy lazy_dog the_dog dog_jumped jumped_and and_ate
the_quick 0 1 1 1 1 1 1 1 0 0 0 0
quick_brown 0 0 1 1 1 1 1 1 0 0 0 0
brown_fox 0 0 0 1 1 1 1 1 0 0 0 0
fox_jumped 0 0 0 0 1 1 1 1 0 0 0 0
jumped_over 0 0 0 0 0 1 1 1 0 0 0 0
over_the 0 0 0 0 0 0 1 1 0 0 0 0
the_lazy 0 0 0 0 0 0 0 1 0 0 0 0
lazy_dog 0 0 0 0 0 0 0 0 0 0 0 0
the_dog 0 0 0 0 0 0 0 0 0 1 1 1
dog_jumped 0 0 0 0 0 0 0 0 0 0 1 1
jumped_and 0 0 0 0 0 0 0 0 0 0 0 1
and_ate 0 0 0 0 0 0 0 0 0 0 0 0
ate_the 0 0 0 0 0 0 0 0 0 0 0 0
the_fox 0 0 0 0 0 0 0 0 0 0 0 0
features
features ate_the the_fox
the_quick 0 0
quick_brown 0 0
brown_fox 0 0
fox_jumped 0 0
jumped_over 0 0
over_the 0 0
the_lazy 0 0
lazy_dog 0 0
the_dog 1 1
dog_jumped 1 1
jumped_and 1 1
and_ate 1 1
ate_the 0 1
the_fox 0 0
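To make the "Term1 Term2 Score" step concrete, here is a minimal base R sketch that rebuilds the same document-level counts by hand and then flattens them into a long table. It does not use quanteda; the names `words`, `bigrams`, `dtm`, `cooc` and `pairs` are illustrative only, and the lowercasing and punctuation stripping are an assumption meant to mirror the quanteda call above.

```r
txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")

# Lowercase, strip punctuation, and split each document on whitespace
words <- strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+")

# Pair every word with its successor to form bigrams
bigrams <- lapply(words, function(w) paste(head(w, -1), tail(w, -1), sep = "_"))

# Document-by-bigram incidence matrix: 1 if the bigram occurs in the document
vocab <- unique(unlist(bigrams))
dtm <- t(sapply(bigrams, function(b) as.integer(vocab %in% b)))
colnames(dtm) <- vocab

# Co-occurrence = number of documents in which both bigrams occur
cooc <- crossprod(dtm)
diag(cooc) <- 0

# Keep the upper triangle only, so each unordered pair is listed once
idx <- which(upper.tri(cooc) & cooc > 0, arr.ind = TRUE)
pairs <- data.frame(Term1 = rownames(cooc)[idx[, "row"]],
                    Term2 = colnames(cooc)[idx[, "col"]],
                    Score = cooc[idx],
                    stringsAsFactors = FALSE)
head(pairs)
```

Because only the non-zero cells are extracted, this avoids listing every zero pair the way a full melt of the dense matrix would.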
Creating a term co-occurrence matrix with text2vec:
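library(text2vec)
# itoken() is used with its defaults here, so case and punctuation are kept
# (hence "The_quick" and "lazy_dog." in the output below).
i <- itoken(txt)
v <- create_vocabulary(i, ngram = c(2L, 2L))
vectorizer <- vocab_vectorizer(v)
f2 <- create_tcm(i, vectorizer)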
14 x 14 sparse Matrix of class "dgTMatrix"
[[ suppressing 14 column names ‘the_lazy’, ‘and_ate’, ‘The_quick’ ... ]]
the_lazy . . . 0.25 1.0 . 0.2 0.3333333 . . 1.0000000 . 0.5000000 .
and_ate . . . . . 1 . . 0.5000000 1.0 . 0.3333333 . 0.5000000
The_quick . . . 0.50 . . 1.0 0.3333333 . . 0.2000000 . 0.2500000 .
brown_fox . . . . 0.2 . 1.0 1.0000000 . . 0.3333333 . 0.5000000 .
lazy_dog. . . . . . . . 0.2500000 . . 0.5000000 . 0.3333333 .
jumped_and . . . . . . . . 0.3333333 0.5 . 0.5000000 . 1.0000000
quick_brown . . . . . . . 0.5000000 . . 0.2500000 . 0.3333333 .
fox_jumped . . . . . . . . . . 0.5000000 . 1.0000000 .
the_fox. . . . . . . . . . 1.0 . 0.2000000 . 0.2500000
ate_the . . . . . . . . . . . 0.2500000 . 0.3333333
over_the . . . . . . . . . . . . 1.0000000 .
The_dog . . . . . . . . . . . . . 1.0000000
jumped_over . . . . . . . . . . . . . .
dog_jumped . . . . . . . . . . . . . .
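The fractional scores in the text2vec output (1, 0.5, 0.333, 0.25, 0.2) look like distance-weighted counts inside a sliding window rather than document-level counts. The sketch below reproduces that idea in base R for the bigrams of the first sentence. It is not text2vec's implementation; the window size of 5 and the 1/distance weighting are assumptions inferred from the values shown, and `bigrams`, `window` and `tcm` are hypothetical names.

```r
# Bigrams of the first sentence, in order of appearance
bigrams <- c("the_quick", "quick_brown", "brown_fox", "fox_jumped",
             "jumped_over", "over_the", "the_lazy", "lazy_dog")

window <- 5L  # assumed window size
vocab <- unique(bigrams)
tcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))

n <- length(bigrams)
for (i in seq_len(n - 1)) {
  # Look ahead at most `window` positions; closer neighbours weigh more
  for (j in (i + 1):min(i + window, n)) {
    tcm[bigrams[i], bigrams[j]] <- tcm[bigrams[i], bigrams[j]] + 1 / (j - i)
  }
}

round(tcm["the_quick", ], 3)
```

Under these assumptions a bigram more than five positions away contributes nothing, which is why some pairs from the same sentence have no entry in a windowed matrix even though they co-occur at the document level.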