How do I use gensim to vectorize these words in my dataframe so I can perform clustering on them?
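I am trying to run a cluster analysis (preferably k-means) on the words of poems stored in a pandas dataframe. I first tried to vectorize the words using the word2vec functionality of the gensim package. However, the vectors all come out as 0, so my code is failing to convert the words to vectors. As a result, the clustering doesn't work. Here is my code: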
# create a gensim model
model = gensim.models.Word2Vec(vector_size=100)
# copy original pandas dataframe with poems
data = poems.copy(deep=True)
# get data ready for kmeans clustering
final_data = []  # empty list
for i, row in data.iterrows():
    poem_vectorized = []
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words:  # iterate through list of words
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)
X = np.asarray(final_data)
print(X)
On closer inspection, this line:
poem_vectorized.append(list(model.wv[poem_w]))
seems to be the problem.
If I understand correctly, you want to use an existing model to get semantic embeddings of the tokens and then cluster the words, right?
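Because of the way you set up the model, you are preparing a new model for training but never feed it any training data or train it, so the model doesn't know any words and always throws a KeyError when calling model.wv[poem_w]. Your try/except blocks swallow that error, which is why every poem ends up as a vector of zeros.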
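The effect can be reproduced without gensim. Below is a minimal sketch that swaps the untrained model for an empty dict, on the assumption that an empty vocabulary behaves the same way for lookups (every lookup raises KeyError). The function name and the simplified fallback are mine, not part of the question's code.

```python
import numpy as np

DIM = 100  # same dimensionality as vector_size=100 in the question

# Stand-in for the untrained model's vocabulary (assumption): it contains
# no words, so every lookup raises KeyError.
empty_vocab = {}

def mean_vector_swallowing_errors(poem, vocab, dim=DIM):
    """Simplified version of the question's loop: lookup errors are ignored."""
    vectors = []
    for word in poem.split(" "):
        try:
            vectors.append(vocab[word])
        except KeyError:
            pass  # the real failure is hidden here
    if not vectors:
        # no word was found, so the zero-vector fallback is used
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

row = mean_vector_swallowing_errors("This is a sample poem.", empty_vocab)
print(row.shape, np.count_nonzero(row))  # (100,) 0
```

Replacing the `pass` with a `print` or a counter of skipped words would have made the missing vocabulary visible right away.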
Use gensim.downloader to load an existing model (check out their repository for a list of all available models):
import gensim.downloader as api
import numpy as np
import pandas
poems = pandas.DataFrame({"Main_text": ["This is a sample poem.", "This is another sample poem."]})
model = api.load("glove-wiki-gigaword-100")
Then use it to retrieve the vectors for all words the model knows:
final_data = []
for poem in poems['Main_text']:
    poem_all_words = poem.split()
    poem_vectorized = []
    for poem_w in poem_all_words:
        if poem_w in model:
            poem_vectorized.append(model[poem_w])
    poem_vectorized_mean = np.mean(poem_vectorized, axis=0)
    final_data.append(poem_vectorized_mean)
Or as a list comprehension:
final_data = []
for poem in poems['Main_text']:
    poem_vectorized_mean = np.mean([model[poem_w] for poem_w in poem.split() if poem_w in model], axis=0)
    final_data.append(poem_vectorized_mean)
Both will give you:
X = np.asarray(final_data)
print(X)
> [[-3.74696642e-01 3.73661995e-01 4.09943342e-01 -2.07784668e-01
...
-1.85739681e-01 -7.07386672e-01 3.31366658e-01 3.31600010e-01]
[-3.29973340e-01 4.13213342e-01 5.26199996e-01 -2.29261339e-01
...
-1.25366330e-01 -5.87253332e-01 2.80240029e-01 2.56700337e-01]]
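Since the goal was k-means: once X is a 2-D array with one row per poem, it can be handed to a clustering implementation. The sketch below assumes scikit-learn is installed (the thread itself does not mention it) and uses a small made-up matrix in place of the real 100-column X; the choice of two clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the matrix of mean poem vectors (one row per poem).
# Made-up 4-dimensional values; the real X has 100 columns.
X = np.array([
    [0.9, 1.0, 0.1, 0.0],
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
])

# n_clusters is a guess you have to make for your own data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster id per row of X
print(labels)
```

The cluster ids themselves are arbitrary; what matters is which rows share an id. The labels can be written back to the dataframe as a new column as long as the rows of X are in the same order as the poems.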
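Note that calling np.mean() on an empty list does not give you a usable vector (NumPy emits a RuntimeWarning and returns nan instead of a 100-dimensional array), so you may want to check for that case, in the event of an empty poem or a poem in which none of the words are known to the model.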
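One way to guard against that case is sketched below. It is an illustration, not the only option: the helper names are hypothetical, and a plain dict with made-up 3-dimensional vectors stands in for the downloaded model, which works here because the code only uses `in` and `[]` on it. The tokenizer also lowercases and strips punctuation, on the assumption that the pretrained vocabulary is lowercased and would otherwise miss tokens such as "This" or "poem.".

```python
import re
import numpy as np

def tokenize(poem):
    # Lowercase and keep only runs of letters/apostrophes, so "poem." -> "poem".
    return re.findall(r"[a-z']+", poem.lower())

def poem_to_vector(poem, model):
    """Return the mean vector of the known words, or None if there are none."""
    known = [model[w] for w in tokenize(poem) if w in model]
    if not known:
        return None  # empty poem, or no word is in the vocabulary
    return np.mean(known, axis=0)

# Toy embeddings standing in for the downloaded model (made-up values).
toy_model = {
    "this": np.array([1.0, 0.0, 0.0]),
    "sample": np.array([0.0, 1.0, 0.0]),
    "poem": np.array([0.0, 0.0, 1.0]),
}

poems_text = ["This is a sample poem.", "", "Zzz qqq"]
vectors = [poem_to_vector(p, toy_model) for p in poems_text]

# Keep track of which poems produced a vector so the rows of X can be
# matched back to the dataframe later.
kept = [i for i, v in enumerate(vectors) if v is not None]
X = np.asarray([vectors[i] for i in kept])
print(kept)     # [0]
print(X.shape)  # (1, 3)
```

Dropping such poems instead of filling in zeros keeps artificial all-zero rows out of the clustering; whether that is appropriate depends on how many poems are affected.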