Word2Vec dimensions incorrect
The data I am using is stored in a CSV file:
Sentence # Word POS Tag
Sentence1 YASHAWANTHA NNP B-PER
Sentence1 K NNP I-PER
Sentence1 S NNP I-PER
Sentence1 Mobile NNP O
Sentence1 : : O
Sentence1 -7353555773 JJ O
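I am trying to take a dataset with the columns Sentence #, Word, POS, and Tag, and convert every entry in the Word column to a Word2Vec vector.
Here I read the dataset and split it into sentences: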
from gensim.models import Word2Vec
import pandas as pd

data = pd.read_csv(path_to_csv)

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        # Collect one list of (word, POS, tag) tuples per sentence.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(), s["POS"].values.tolist(), s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            # The key has to match the labels in the "Sentence #" column ("Sentence1", ...).
            s = self.grouped["Sentence{}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(data)
sentences = getter.sentences
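The grouping done by SentenceGetter can also be sketched with the standard library alone, which makes the resulting structure easy to inspect. The rows are copied from the sample above; `group_sentences` is a hypothetical helper name, not part of the original code. Unlike pandas `groupby`, which sorts the group keys by default, this sketch keeps sentences in order of first appearance.

```python
from collections import OrderedDict

# Rows copied from the sample CSV: (sentence id, word, POS, tag).
rows = [
    ("Sentence1", "YASHAWANTHA", "NNP", "B-PER"),
    ("Sentence1", "K", "NNP", "I-PER"),
    ("Sentence1", "S", "NNP", "I-PER"),
    ("Sentence1", "Mobile", "NNP", "O"),
    ("Sentence1", ":", ":", "O"),
    ("Sentence1", "-7353555773", "JJ", "O"),
]

def group_sentences(rows):
    """Return one list of (word, POS, tag) tuples per sentence id."""
    grouped = OrderedDict()
    for sent_id, word, pos, tag in rows:
        grouped.setdefault(sent_id, []).append((word, pos, tag))
    return list(grouped.values())

sentences = group_sentences(rows)
print(len(sentences), len(sentences[0]))  # 1 6
```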
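Now I convert every word to its corresponding Word2Vec vector. word2idx is a dictionary whose keys are the word strings and whose values are the corresponding Word2Vec vectors:
vec_words = [[i] for i in words]
# gensim < 4.0 API; in gensim >= 4.0, `size` is `vector_size` and `wv.vocab` is `wv.key_to_index`.
vec_model = Word2Vec(vec_words, min_count=1, size=30)
word2idx = {}
for key in vec_model.wv.vocab:
    word2idx[key] = vec_model.wv[key]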
For the tag column, I use a simple enumeration:
tag2idx = {t: i for i, t in enumerate(tags)}
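The post uses `words`, `tags`, `num_words`, and `num_tags` without showing where they come from. The sketch below is one plausible way to derive them from `sentences`; it is an assumption, not the author's code. Sorting the unique values keeps the tag indices stable between runs.

```python
# One sentence from the sample data, as (word, POS, tag) tuples.
sentences = [[
    ("YASHAWANTHA", "NNP", "B-PER"),
    ("K", "NNP", "I-PER"),
    ("S", "NNP", "I-PER"),
    ("Mobile", "NNP", "O"),
    (":", ":", "O"),
    ("-7353555773", "JJ", "O"),
]]

# Assumed definitions: sorted unique words and tags across all sentences.
words = sorted({w for s in sentences for w, _, _ in s})
tags = sorted({t for s in sentences for _, _, t in s})
num_words, num_tags = len(words), len(tags)

tag2idx = {t: i for i, t in enumerate(tags)}
print(tag2idx)  # {'B-PER': 0, 'I-PER': 1, 'O': 2}
```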
Then I pad the words and tags:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
max_len = 60
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y= [to_categorical(i, num_classes = num_tags) for i in y]
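One detail worth checking here: `pad_sequences` defaults to `dtype="int32"`, so float Word2Vec vectors are cast to integers unless a float dtype is passed, and `value=num_words-1` fills every padded vector component with a word count, which is probably not intended. A minimal NumPy sketch of zero-padding sequences of vectors follows; `pad_vector_sequences` is a hypothetical helper and the random vectors only stand in for real embeddings.

```python
import numpy as np

def pad_vector_sequences(seqs, max_len, dim, value=0.0):
    """Post-pad sequences of `dim`-sized vectors to `max_len` as float32.

    Longer sequences are cut at the end (pad_sequences cuts at the start by default).
    """
    out = np.full((len(seqs), max_len, dim), value, dtype="float32")
    for i, seq in enumerate(seqs):
        seq = np.asarray(seq, dtype="float32")[:max_len]
        if len(seq):
            out[i, :len(seq)] = seq
    return out

# Three fake sentences of 6, 4, and 9 words, each word a 30-dimensional vector.
rng = np.random.default_rng(0)
seqs = [rng.normal(size=(n, 30)) for n in (6, 4, 9)]

X = pad_vector_sequences(seqs, max_len=60, dim=30)
print(X.shape, X.dtype)  # (3, 60, 30) float32
```

Passing `dtype="float32"` and `value=0.0` to `pad_sequences` itself should have the same effect for sequences no longer than `max_len`.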
Then I define the model:
from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=max_len, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
Then I fit the model:
import numpy as np

history = model.fit(
    x_train, np.array(y_train),
    validation_split=0.2,
    batch_size=32,
    epochs=1,
    verbose=1,
)
This fitting step raises the following error, and I am not sure how to fix it:
Input 0 of layer "spatial_dropout1d_2" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 60, 30, 60)
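The shape before padding,
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = np.array(X)
print(X.shape)
is (3, 6, 30) for the 3 sentences in the CSV file, and (3, 60, 30) after padding, where 30 is the Word2Vec vector size. The model, however, expects input of shape (3, 60).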
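The extra axis in the error message can be reproduced without Keras. An embedding layer behaves like a lookup table indexed by integer word ids, so it appends one axis of size `output_dim` to whatever it receives. The sketch below imitates that with NumPy indexing; the sizes are illustrative and `num_words = 6` is an assumption.

```python
import numpy as np

num_words, max_len, wrd2vec_size = 6, 60, 30  # illustrative sizes

# Lookup table with one row per word id. The question's Embedding layer
# uses output_dim=max_len, hence 60 columns.
table = np.zeros((num_words, max_len), dtype="float32")

ids = np.zeros((3, max_len), dtype="int64")                 # what the layer expects
vecs = np.zeros((3, max_len, wrd2vec_size), dtype="int64")  # what it was given

print(table[ids].shape)   # (3, 60, 60)
print(table[vecs].shape)  # (3, 60, 30, 60)
```

The second shape matches the `(None, 60, 30, 60)` reported by `SpatialDropout1D`, which only accepts 3-dimensional input.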
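Leave everything else unchanged and modify only the network. The inputs are already Word2Vec vectors, so the Embedding layer is dropped and the Input layer takes the vectors directly: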
wrd2vec_size = 30
input_word = Input(shape=(max_len, wrd2vec_size))
x = SpatialDropout1D(0.1)(input_word)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(x)
model = Model(input_word, out)
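Before calling `fit` on the modified network, it may help to confirm that the arrays line up with the new input and output shapes. The sketch below is a hypothetical sanity check using NumPy only; `check_shapes` and `num_tags = 3` are assumptions for illustration.

```python
import numpy as np

max_len, wrd2vec_size, num_tags = 60, 30, 3  # num_tags is assumed here

def check_shapes(X, y):
    """Assert that X is (n, max_len, wrd2vec_size) and y is (n, max_len, num_tags)."""
    X, y = np.asarray(X), np.asarray(y)
    assert X.ndim == 3 and X.shape[1:] == (max_len, wrd2vec_size), X.shape
    assert y.shape == (X.shape[0], max_len, num_tags), y.shape
    return X.shape, y.shape

X = np.zeros((3, max_len, wrd2vec_size), dtype="float32")
tag_ids = np.zeros((3, max_len), dtype="int64")
y = np.eye(num_tags, dtype="float32")[tag_ids]  # one-hot rows, similar to to_categorical

print(check_shapes(X, y))  # ((3, 60, 30), (3, 60, 3))
```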