Computing top n word pair co-occurrences from document term matrix
I created a bag-of-words model using Gensim. Although the actual output is much longer, this is the format it takes when you build a bag-of-words document-term matrix from tokenized text with Gensim:
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]
[[(0, 2),
(1, 1),
(2, 1),
(3, 1),
(4, 11),
(385, 1),
(386, 2),
(387, 3),
(388, 1),
(389, 1),
(390, 1)],
[(4, 31),
(8, 2),
(13, 2),
(16, 2),
(17, 2),
(26, 1),
(28, 4),
(29, 1),
(30, 1)]]
This is a sparse-matrix representation, and as far as I understand, other libraries represent document-term matrices in a similar way. If the document-term matrix were dense (meaning the zero entries were included as well), I know I would just need (A.T * A), since A has shape (number of documents, number of terms), so multiplying the two yields the term co-occurrences. Ultimately, I want to get the top n co-occurrences, i.e. the top n term pairs that appear together in the same text. How would I achieve this? I am not attached to Gensim for building the BOW model; if another library such as sklearn makes this easier, I am open to it. Any advice/help/code on this problem would be greatly appreciated -- thank you!
Edit: here is a way to achieve the matrix multiplication you asked about. Disclaimer: this may not be feasible for a very large corpus.
Sklearn:
from sklearn.feature_extraction.text import CountVectorizer
Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
docs = [Doc1, Doc2]
# Instantiate CountVectorizer and apply it to docs
cv = CountVectorizer()
doc_cv = cv.fit_transform(docs)
# Display tokens (use get_feature_names() on older sklearn versions)
cv.get_feature_names_out()
# Display tokens (dict keys) and their numerical encoding (dict values)
cv.vocabulary_
# Matrix multiplication of the term matrix
token_mat = doc_cv.toarray().T @ doc_cv.toarray()
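If densifying with .toarray() becomes a problem for a larger corpus (per the disclaimer above), a minimal sketch of the same product computed directly on the sparse matrix; doc_cv is already a scipy sparse matrix, so no extra conversion is needed:
# Keep the document-term matrix sparse; the result is a
# (n_terms x n_terms) sparse co-occurrence count matrix
token_mat_sparse = (doc_cv.T @ doc_cv).tocsr()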
Gensim:
import gensim as gs
import numpy as np
cp = [[(0, 2),
(1, 1),
(2, 1),
(3, 1),
(4, 11),
(7, 1),
(11, 2),
(13, 3),
(22, 1),
(26, 1),
(30, 1)],
[(4, 31),
(8, 2),
(13, 2),
(16, 2),
(17, 2),
(26, 1),
(28, 4),
(29, 1),
(30, 1)]]
# Convert each sparse BOW vector to a dense row of the full vocabulary length
# (largest term id across all documents + 1) so the rows can be stacked
vocab_len = max(term_id for doc in cp for term_id, _ in doc) + 1
mat_1 = gs.matutils.sparse2full(cp[0], vocab_len).reshape(1, -1)
mat_2 = gs.matutils.sparse2full(cp[1], vocab_len).reshape(1, -1)
mat = np.append(mat_1, mat_2, axis=0)
mat_product = mat.T @ mat
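To go from the co-occurrence matrix to the top n word pairs, here is a minimal sketch (not part of the original post): zero out the diagonal and the lower triangle so each unordered pair is counted once, then take the largest remaining entries. id2token is a hypothetical index-to-token mapping, built e.g. from cv.vocabulary_ above or from a Gensim dictionary.
import numpy as np

def top_n_pairs(cooc, id2token, n=10):
    # cooc: dense (n_terms x n_terms) co-occurrence matrix, e.g. token_mat or mat_product
    # id2token: dict mapping a column index to its token string
    # Keep only the strict upper triangle so the diagonal (self co-occurrence)
    # and mirrored pairs are ignored
    upper = np.triu(cooc, k=1)
    flat_idx = np.argsort(upper, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(flat_idx, upper.shape)
    return [((id2token[i], id2token[j]), upper[i, j]) for i, j in zip(rows, cols)]

# Example with the sklearn objects above
id2token = {idx: tok for tok, idx in cv.vocabulary_.items()}
top_n_pairs(token_mat, id2token, n=5)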
For words that occur consecutively, you can prepare a list of bigrams for a set of documents and then use Python's Counter to count the bigram occurrences. Here is an example using nltk.
import nltk
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from collections import Counter
stop_words = set(stopwords.words('english'))
# Get the tokens from the built-in collection of presidential inaugural speeches
tokens = nltk.corpus.inaugural.words()
# Further text preprocessing
tokens = [t.lower() for t in tokens if t not in stop_words]
word_l = WordNetLemmatizer()
tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
# Create bigram list and count bigrams
bi_grams = list(ngrams(tokens, 2))
counter = Counter(bi_grams)
# Show the most common bigrams
counter.most_common(5)
Out[36]:
[(('united', 'state'), 153),
(('fellow', 'citizen'), 116),
(('let', 'u'), 99),
(('i', 'shall'), 96),
(('american', 'people'), 40)]
# Query the occurrence of a specific bigram
counter[('great', 'people')]
Out[37]: 7
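If the goal is pairs that co-occur anywhere in the same document rather than consecutively, the same Counter idea can be applied to the unordered pairs of distinct tokens within each document. A minimal sketch, assuming texts is the list of tokenized documents from the question; note that this counts each pair at most once per document, unlike the product of counts you get from (A.T * A).
from itertools import combinations
from collections import Counter

pair_counter = Counter()
for text in texts:
    # Each unordered pair of distinct terms is counted once per document;
    # sorting makes ('a', 'b') and ('b', 'a') the same key
    pair_counter.update(combinations(sorted(set(text)), 2))

# Top n co-occurring term pairs across the corpus
pair_counter.most_common(10)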