word2vec vector representation for text classification algorithm

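I am trying to use word2vec in a text classification algorithm. I want to build a vectorizer from word2vec, and I used the script below. However, I cannot get a single row per document; instead I get a matrix with a different number of rows for each document. For example, the matrix is 31X100 for the first document, 163X100 for the second, 73X100 for the third, and so on. What I actually need is a 1X100 vector per document, so that I can use these vectors as input features for training a model.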

Can anyone help me?

import os
import pandas as pd       
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords # Import the stop word list
import gensim
import numpy as np

train = pd.read_csv("Data.csv",encoding='cp1252')
wordnet_lemmatizer = WordNetLemmatizer()

def Description_to_words(raw_Description):
    Description_text = BeautifulSoup(raw_Description).get_text() 
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    words = word_tokenize(letters_only.lower())    
    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if not w in stops]
    return( " ".join(wordnet_lemmatizer.lemmatize(w) for w in meaningful_words))

num_Descriptions = train["Summary"].size
clean_train_Descriptions = []
print("Cleaning and parsing the training set ticket Descriptions...\n")
clean_train_Descriptions = []
for i in range( 0, num_Descriptions ):
    if( (i+1)%1000 == 0 ):
        print("Description %d of %d\n" % ( i+1, num_Descriptions ))
    clean_train_Descriptions.append(Description_to_words( train["Summary"][i] ))

model = gensim.models.Word2Vec(clean_train_Descriptions, size=100)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        #self.dim = len(word2vec.itervalues().next())
        self.dim = 100

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

a=MeanEmbeddingVectorizer(w2v)
clean_train_Descriptions[1]
a.transform(clean_train_Descriptions[1])

train_Descriptions = []
for i in range( 0, num_Descriptions ):
    if( (i+1)%1000 == 0 ):
        print("Description %d of %d\n" % ( i+1, num_Descriptions ))
    train_Descriptions.append(a.transform(" ".join(clean_train_Descriptions[i])))

There are 2 problems in your code that cause this, and both are easy to fix.

First, Word2Vec requires each sentence to be a list of words, not an actual sentence as a single string. So in your Description_to_words, just return the list; don't join it.

return [wordnet_lemmatizer.lemmatize(w) for w in meaningful_words]

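Because word2vec iterates over each sentence to get its words, it was previously iterating over a string, so what you were actually getting from wv were character-level embeddings.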
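To see why a joined string leads to character-level tokens, here is a minimal sketch in plain Python. It does not call gensim; `build_vocab` is a hypothetical helper that only mimics looping over each sentence and then over its items, and the sample documents are made up.

```python
# Hypothetical cleaned documents; the contents are made up for illustration.
docs_as_strings = ["printer not working", "reset password"]
docs_as_tokens = [doc.split() for doc in docs_as_strings]


def build_vocab(sentences):
    """Collect tokens by looping over each sentence, then over its items.

    This only imitates how a sentence iterator is consumed; it is not
    gensim's actual vocabulary code.
    """
    vocab = set()
    for sentence in sentences:
        for token in sentence:  # a str yields characters, a list yields words
            vocab.add(token)
    return vocab


char_vocab = build_vocab(docs_as_strings)  # sentences are plain strings
word_vocab = build_vocab(docs_as_tokens)   # sentences are lists of words

print(sorted(char_vocab))  # single characters, including the space
print(sorted(word_vocab))  # whole words
```

With strings as sentences, every "word" the model sees is a single character, which is why the lookup table ends up keyed by characters.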

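Second, there is a similar problem with the way you call transform: X should be a list of documents, not a single document. So when you do for words in X, you are actually building a list of characters and then iterating over it to create the embeddings. Your output was therefore one single-character embedding for every character in the sentence. The fix is simple: transform all the documents at once!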

train_Descriptions = a.transform(clean_train_Descriptions)

(To transform one document at a time, wrap it in a list ([clean_train_Descriptions[1]]) or select it with a slice (clean_train_Descriptions[1:2]).)

With these two changes, you should get back 1 row per input sentence.
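The shape behaviour described above can be checked with a small sketch. It assumes a toy lookup table, `toy_w2v`, in place of a trained model, uses 4 dimensions instead of 100, and rewrites the averaging as a hypothetical standalone function `mean_embed` rather than the class from the question.

```python
import numpy as np

DIM = 4  # toy dimensionality; the question uses 100

# Hypothetical word vectors standing in for a trained word2vec lookup.
toy_w2v = {
    "printer": np.array([1.0, 0.0, 0.0, 0.0]),
    "broken": np.array([0.0, 1.0, 0.0, 0.0]),
    "password": np.array([0.0, 0.0, 1.0, 0.0]),
}


def mean_embed(docs, word2vec, dim):
    """Return one averaged vector per item in docs.

    Each item is expected to be a list of tokens; unknown tokens are
    skipped, and an item with no known tokens becomes a zero vector.
    """
    rows = []
    for words in docs:
        vectors = [word2vec[w] for w in words if w in word2vec]
        rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.array(rows)


docs = [["printer", "broken"], ["password", "reset"], []]

all_rows = mean_embed(docs, toy_w2v, DIM)       # list of documents
one_row = mean_embed([docs[1]], toy_w2v, DIM)   # single document, wrapped
wrong = mean_embed(docs[1], toy_w2v, DIM)       # single document, not wrapped

print(all_rows.shape)  # one row per document
print(one_row.shape)   # exactly one row
print(wrong.shape)     # one row per token, not per document
```

Passing an unwrapped document makes the function treat each token as if it were a document, which reproduces the "many rows per document" symptom from the question on a small scale.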