How do I use gensim to vectorize these words in my dataframe so I can perform clustering on them?

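I am trying to run a cluster analysis (preferably k-means) on the words of poems stored in a pandas dataframe. I first tried to vectorize the words using the word-to-vector functionality of the gensim package. However, the vectors just come out as 0, so my code is failing to convert the words into vectors. As a result, the clustering doesn't work. Here is my code: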

# create a gensim model 
model = gensim.models.Word2Vec(vector_size=100) 
# copy original pandas dataframe with poems
data = poems.copy(deep=True)
# get data ready for kmeans clustering
final_data = [] # empty list 
for i, row in data.iterrows(): 
    poem_vectorized = [] 
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words: #iterate through list of words 
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)
X = np.asarray(final_data)
print(X)

On closer inspection, this line:

poem_vectorized.append(list(model.wv[poem_w]))

seems to be the problem.

If I understand correctly, you want to use an existing model to get semantic embeddings for the tokens and then cluster the words, right?

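The way you set up your model, you are preparing a new model for training, but you never feed it any training data or train it. As a result, your model doesn't know any words and always throws a KeyError when calling model.wv[poem_w]. Because your except block silently swallows that error, no word vectors are ever collected.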

Use gensim.downloader to load an existing model (see their repository for a list of all available models):

import gensim.downloader as api
import numpy as np
import pandas

poems = pandas.DataFrame({"Main_text": ["This is a sample poem.", "This is another sample poem."]})
model = api.load("glove-wiki-gigaword-100")

Then use it to retrieve the vectors for all the words the model knows:

final_data = []
for poem in poems['Main_text']:
    poem_all_words = poem.split()
    poem_vectorized = []
    for poem_w in poem_all_words:
        if poem_w in model:
            poem_vectorized.append(model[poem_w])
    poem_vectorized_mean = np.mean(poem_vectorized, axis=0)
    final_data.append(poem_vectorized_mean)

Or as a list comprehension:

final_data = []
for poem in poems['Main_text']:
    poem_vectorized_mean = np.mean([model[poem_w] for poem_w in poem.split() if poem_w in model], axis=0)
    final_data.append(poem_vectorized_mean)

Both will give you:

X = np.asarray(final_data)
print(X)
> [[-3.74696642e-01  3.73661995e-01  4.09943342e-01 -2.07784668e-01
    ...
    -1.85739681e-01 -7.07386672e-01  3.31366658e-01  3.31600010e-01]
   [-3.29973340e-01  4.13213342e-01  5.26199996e-01 -2.29261339e-01
    ...
    -1.25366330e-01 -5.87253332e-01  2.80240029e-01  2.56700337e-01]]
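
Once you have a matrix like X, the clustering step itself is short. Here is a minimal sketch, assuming scikit-learn is installed (the question does not say which clustering library is used) and using a tiny made-up matrix X_demo in place of the real X. The values of n_clusters, n_init and random_state are illustrative choices only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for X: four "poem vectors" with 2 dimensions each.
# With the real data, X would have one row per poem and 100 columns.
X_demo = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
])

# n_clusters must be chosen by you; 2 simply matches the toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict returns one cluster label per row of the input matrix.
labels = kmeans.fit_predict(X_demo)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label. To keep the result next to the poems, you could assign the labels back to the dataframe as a new column, provided X has exactly one row per poem in the same order.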

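Note that taking np.mean() of an empty list does not give you a usable vector: NumPy returns nan and emits a RuntimeWarning instead of a 100-dimensional array. You may want to handle that case explicitly for empty poems or poems in which every word is unknown to the model.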
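
One way to guard against that case is to pool each poem in a small helper and skip poems that yield no vectors. The sketch below is only an illustration: mean_vector and toy_vectors are hypothetical names, and a plain dict stands in for the loaded model so the example runs without downloading anything (anything that supports `token in vectors` and `vectors[token]`, as the model is used above, should work the same way). It also lowercases tokens and strips surrounding punctuation, on the assumption that the pretrained vocabulary is lowercased and would not contain tokens such as "This" or "poem.".

```python
import string
import numpy as np

def mean_vector(text, vectors):
    """Average the vectors of all known tokens in `text`.

    Returns None when the text is empty or no token is known,
    so the caller can decide how to handle that poem.
    """
    # Normalize tokens: "Poem." becomes "poem".
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    known = [vectors[t] for t in tokens if t and t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional vocabulary standing in for the real model.
toy_vectors = {
    "sample": np.array([1.0, 0.0]),
    "poem": np.array([0.0, 1.0]),
}

texts = ["A sample poem.", "", "Xyzzy!"]
pooled = [mean_vector(t, toy_vectors) for t in texts]

# Keep only poems that produced a vector, and remember their positions
# so cluster labels can be mapped back to the original rows later.
keep = [i for i, v in enumerate(pooled) if v is not None]
X = np.vstack([pooled[i] for i in keep])
print(keep, X)
```

Dropping such poems, rather than filling in a zero vector, avoids creating an artificial cluster of all-zero rows. Whether that is the right trade-off depends on how many poems are affected.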