Word2Vec dimensions incorrect

The data I am working with is stored in a CSV file:

Sentence #   Word          POS   Tag
Sentence1    YASHAWANTHA   NNP   B-PER
Sentence1    K             NNP   I-PER
Sentence1    S             NNP   I-PER
Sentence1    Mobile        NNP   O
Sentence1    :             :     O
Sentence1    -7353555773   JJ    O

I am trying to take a dataset with the columns Sentence #, Word, POS, and Tag, and convert every entry in the Word column into a Word2Vec vector.

Here I read the dataset and split it into sentences:

from gensim.models import Word2Vec
import pandas as pd

data = pd.read_csv(path_to_csv)

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data

        # group rows by sentence and collect (word, POS, tag) tuples per sentence
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(data)
sentences = getter.sentences 
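
With the sample rows above, each entry in sentences is one sentence represented as a list of (Word, POS, Tag) tuples; a quick check:

print(sentences[0][:3])
# e.g. [('YASHAWANTHA', 'NNP', 'B-PER'), ('K', 'NNP', 'I-PER'), ('S', 'NNP', 'I-PER')]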

Now I convert all the words to their corresponding Word2Vec vectors, where word2idx is a dictionary whose keys are the word strings and whose values are the corresponding Word2Vec vectors:

vec_words= [[i] for i in words]
vec_model= Word2Vec(vec_words, min_count=1, size=30)
word2idx = dict({})
for idx, key in enumerate(vec_model.wv.vocab):
    word2idx[key] = vec_model.wv[key]
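
Each value in word2idx is therefore a 30-dimensional vector (the size passed to Word2Vec); a quick sanity check:

# every value in word2idx is a 30-dim float vector, one per vocabulary word
some_word = next(iter(word2idx))
print(word2idx[some_word].shape)  # (30,)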

Then for the Tag column I use a simple enumeration:

tag2idx = {t: i for i, t in enumerate(tags)}
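
For the sample data this gives something along these lines (a hypothetical illustration; the real tags list comes from the Tag column):

# hypothetical: the unique tags in the sample are B-PER, I-PER and O
tags = ["B-PER", "I-PER", "O"]
tag2idx = {t: i for i, t in enumerate(tags)}
print(tag2idx)  # {'B-PER': 0, 'I-PER': 1, 'O': 2}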

Then I pad the words and tags:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

max_len = 60
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y= [to_categorical(i, num_classes = num_tags) for i in y]

Then I define the model:

from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=max_len, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

Then I fit the model:

import numpy as np

history = model.fit(
    x_train, np.array(y_train),
    validation_split=0.2,
    batch_size=32,
    epochs=1,
    verbose=1,
)

This fit step produces the following error, and I am not sure how to fix it:

Input 0 of layer "spatial_dropout1d_2" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 60, 30, 60)

The shape before padding:

X = [[word2idx[w[0]] for w in s] for s in sentences]
X = np.array(X)
print(X.shape)

For the 3 sentences in the CSV file this prints (3, 6, 30), and after padding it becomes (3, 60, 30), where 30 is the Word2Vec size. The model, however, expects input of shape (3, 60).
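
The extra dimension in the error comes from feeding these vectors through the Embedding layer, which expects integer word indices of shape (batch, 60); a minimal sketch with a placeholder vocabulary size that reproduces the (None, 60, 30, 60) shape:

from tensorflow.keras import Input
from tensorflow.keras.layers import Embedding

# each of the 30 vector components is treated as a separate token index,
# so the embedding output gains an extra axis of size output_dim
emb = Embedding(input_dim=1000, output_dim=60)(Input(shape=(60, 30)))  # 1000 is a placeholder vocab size
print(emb.shape)  # (None, 60, 30, 60) -- the ndim=4 tensor from the error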

The rest stays the same; just modify the network:

wrd2vec_size = 30

# X already contains 30-dim Word2Vec vectors for each time step, so the
# Embedding layer is dropped and the vectors are fed to the network directly
input_word = Input(shape=(max_len, wrd2vec_size))
x = SpatialDropout1D(0.1)(input_word)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(x)

model = Model(input_word, out)
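
One more thing to watch, since X now holds float vectors instead of integer indices: pad_sequences defaults to dtype="int32", which would truncate the Word2Vec values, and padding with num_words-1 no longer makes sense. A sketch of the adjusted padding step, assuming the same variable names as above:

# pad with zero vectors and keep float precision for the Word2Vec values
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post",
                  dtype="float32", value=0.0)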