Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) term co-occurrence matrix (where rows are words, columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics, and serve as input into vector-semantics methods (GloVe, LSA, LDA).
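
For concreteness, the kind of object I am after is simply a sparse matrix with words on both dimensions, e.g. (toy words and made-up counts):

library(Matrix)
words   <- c("apple", "pie", "recipe") # made-up vocabulary
# each cell = number of times the row word and the column word co-occur in a window
tcm_toy <- sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3), x = c(10, 2, 7),
                        dims = c(3, 3), dimnames = list(words, words))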

For reference, the Google Books (v2) dataset is formatted as follows (tab-separated):

ngram      year    match_count    volume_count
some word  1999    32             12            # example bigram

The problem, of course, is that these data are huge. I only need a subset of the data from certain decades (about 20 years' worth of ngrams), though, and I am happy with a context window of up to 2 (i.e., using the trigram corpus). I have a few ideas, but none of them seem particularly good.

-Idea 1- Initially, it was more or less this:

# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
     # (write the same line x times according to the match_count tag)  
    remove $file
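
A rough, untested R version of that per-file filter step could look like the sketch below (the shard file name and the year range are placeholders; "file2" is the output file used above):

library(data.table)

years_of_interest <- 1980:1999 # placeholder: the ~20 years actually needed

# one downloaded trigram shard (placeholder file name), tab-separated, no header
ng <- fread("trigram-shard.tsv", header = FALSE, sep = "\t",
            col.names = c("ngram", "year", "match_count", "volume_count"))
ng <- ng[year %in% years_of_interest]

# write each ngram match_count times, as in the pseudocode above
writeLines(rep(ng$ngram, ng$match_count), "file2")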

# tcm construction (using R)
library(text2vec)
grams      <- readLines("file2") # the year-filtered, match_count-replicated ngrams, one per line
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it         <- itoken(grams)
vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm        <- create_tcm(it, vectorizer, skip_grams_window = 2) # nice and sparse

However, I have a hunch that this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag giving their frequency (match_count). I feel there should be a more direct way.

-Idea 2- I was also thinking of writing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm, and then looping over the whole (year-filtered) ngram dataset, recording the instances (using the match_count tag) where any two words co-occur, to populate the tcm. But, again, the data is big, and this kind of looping could take a long time.
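
A naive sketch of that loop (the input file name is a placeholder for the year-filtered data; an R-level loop over the full dataset would indeed be very slow):

library(data.table)
library(Matrix)

ng <- fread("filtered-trigrams.tsv", header = FALSE, sep = "\t",
            col.names = c("ngram", "year", "match_count", "volume_count"))

vocab <- unique(unlist(strsplit(ng$ngram, " ", fixed = TRUE)))
tcm   <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                      dims = rep(length(vocab), 2), dimnames = list(vocab, vocab))

for (k in seq_len(nrow(ng))) {
  w <- match(strsplit(ng$ngram[k], " ", fixed = TRUE)[[1]], vocab)
  n <- ng$match_count[k]
  # count each pair inside the window (a 1/distance weight could be applied
  # to the (w[1], w[3]) pair, as text2vec does by default)
  tcm[w[1], w[2]] <- tcm[w[1], w[2]] + n
  tcm[w[2], w[3]] <- tcm[w[2], w[3]] + n
  tcm[w[1], w[3]] <- tcm[w[1], w[3]] + n
}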

-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given that most entries are 0s), and (if I got it right) it simply loops through everything (and I assume Python loops over this much data would be super slow), so it seems to be geared more towards rather smaller subsets of the data.

edit -Idea 4- I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer containing a broken link and a comment about MapReduce (none of which I am familiar with, so I would not know where to start).


But I am thinking that, given the popularity of the Ngram dataset and the popularity of (non-word2vec) distributional semantics methods that operate on tcm or dtm inputs, this must be a fairly common task; hence ->

...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from the Google Books Ngram data? (Be it a variation of the proposed ideas or something completely different; R preferred but not mandatory.)

I will show how you can do this, though it could be improved in several places. I intentionally wrote it in a "spaghetti style" for better readability, but it can be generalized to more than tri-grams.

library(data.table)
library(magrittr) # for %>%

# a toy chunk of (year-filtered) tri-grams with their match counts
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array()
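# tokens_matrix is 3 x n_ngrams: word positions in rows, one column per tri-gram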

# vocab here is the vocabulary of this chunk only; you may want to first
# build a vocabulary over the whole corpus of ngrams and filter out
# uninteresting/rare words

vocab = unique(as.vector(tokens_matrix)) # character vector of unique words
# convert char matrix to integer matrix for faster downstream calculations 
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)

ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
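# token_1 / token_2 / token_3 are the integer ids of the 1st / 2nd / 3rd word of each tri-gram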

dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note the 0.5 weight for the more distant pair (distance 2) - we follow text2vec's 1 / distance discount
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]

# bind by position: col 1 = first word of the pair, col 2 = second word, col 3 = count
dt = rbindlist(list(dt_12, dt_13, dt_23), use.names = FALSE)
# "reduce" by word indices again - sum pair co-occurrences which came from different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]

# assemble the sparse TCM in triplet form (giveCsparse = FALSE; newer Matrix versions use repr = "T")
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt,
                           dims = rep(length(vocab), 2), index1 = TRUE,
                           giveCsparse = FALSE, check = FALSE,
                           dimnames = list(vocab, vocab))
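
The resulting tcm can then go straight into the downstream methods mentioned in the question; for example, with a recent version of text2vec, GloVe training looks roughly like this (rank, x_max and n_iter are just placeholder hyper-parameters):

library(text2vec)
glove        = GlobalVectors$new(rank = 50, x_max = 10)
wv_main      = glove$fit_transform(tcm, n_iter = 10)
word_vectors = wv_main + t(glove$components) # main + context vectors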