Multi-label prediction using DNN
I am trying to predict multiple labels for a given text. This works fine for a single label, but I don't know how to come up with a confidence score for multi-label prediction.
I have the following data in a denormalized format:
┌────┬──────────┬────────┐
│ id │ Topic │ Text │
├────┼──────────┼────────┤
│ 1 │ Apples │ FooBar │
│ 1 │ Oranges │ FooBar │
│ 1 │ Kiwis │ FooBar │
│ 2 │ Potatoes │ BazBak │
│ 3 │ Carrot │ BalBan │
└────┴──────────┴────────┘
Each article can be assigned one or more topics.
This is what I have come up with so far.
First, I prepare my data - tokenize, stem, etc.:
import random
import numpy as np

df = ...  # read data from csv
categories = ["Apples", "Oranges", "Kiwis", "Potatoes", "Carrot"]
words = []
docs = []
for index, row in df.iterrows():
    stems = tokenize_and_stem(row, stemmer)
    words.extend(stems)
    docs.append((stems, row[1]))

# remove duplicates
words = sorted(list(set(words)))

# create training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(categories)

for doc in docs:
    # initialize our bag of words (bow) for each document in the list
    bow = []
    # list of tokenized words for the pattern
    token_words = doc[0]
    # create our bag of words array
    for w in words:
        bow.append(1) if w in token_words else bow.append(0)
    output_row = list(output_empty)
    output_row[categories.index(doc[1])] = 1
    # our training set will contain the bag-of-words vector and the output row
    # that tells which category that bow belongs to
    training.append([bow, output_row])

# shuffle our features and turn into np.array as tensorflow takes in numpy arrays
random.shuffle(training)
training = np.array(training)

# train_x contains the bag of words and train_y contains the label/category
train_x = list(training[:, 0])
train_y = list(training[:, 1])
Next, I build my training model:
import tensorflow as tf
import tflearn

# reset underlying graph data
tf.reset_default_graph()
# Build neural network
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)
# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
# Start training (apply gradient descent algorithm)
model.fit(train_x, train_y, n_epoch=1000, batch_size=8, show_metric=True)
model.save('model.tflearn')
Afterwards I try to predict my topics:
df = ...  # read data from excel
for index, row in df.iterrows():
    prediction = model.predict([get_bag_of_words(row[2])])
    return categories[np.argmax(prediction)]
As you can see, I pick the maximum prediction, which works fine for a single topic. To pick multiple topics I need some confidence score, or something else that tells me when to stop, because I can't just blindly set an arbitrary threshold.
Any suggestions?
Instead of a softmax activation on the output layer, you should use a sigmoid activation. Your loss function should still be cross entropy. This is the key change needed for the multi-label case.
The problem with softmax is that it creates a probability distribution over your outputs. So if classes A and B are both strongly represented, a softmax over 3 classes might give you something like [0.49, 0.49, 0.02], whereas you would prefer something like [0.99, 0.99, 0.01].
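As a quick numeric illustration (my own sketch with made-up logits, not part of the original answer), softmax makes the class scores compete for the same probability mass, while sigmoid scores each class on its own:

import numpy as np

# hypothetical logits where classes A and B are both strongly represented
logits = np.array([5.0, 5.0, -2.0])

softmax = np.exp(logits) / np.sum(np.exp(logits))  # ~[0.50, 0.50, 0.00] - A and B split the mass
sigmoid = 1.0 / (1.0 + np.exp(-logits))            # ~[0.99, 0.99, 0.12] - A and B both score high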
A sigmoid activation does exactly that: it squashes the real-valued logits (the values of the last layer before the transformation is applied) into the [0, 1] range (which is also required for using cross entropy as the loss function), and it does so for each output independently.
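Here is a minimal sketch of the corresponding changes to the TFLearn model from the question (it assumes train_y has been rebuilt as multi-hot vectors - one row per document with a 1 for every topic it has, rather than one one-hot row per (id, topic) pair - and some_text and the 0.5 cutoff are just illustrative placeholders):

import tensorflow as tf
import tflearn

tf.reset_default_graph()

net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
# sigmoid instead of softmax: each topic gets an independent score in [0, 1]
net = tflearn.fully_connected(net, len(train_y[0]), activation='sigmoid')
# binary cross entropy treats every output unit as its own yes/no decision
net = tflearn.regression(net, loss='binary_crossentropy')

model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
model.fit(train_x, train_y, n_epoch=1000, batch_size=8, show_metric=True)

# at prediction time, keep every topic whose score clears the threshold
scores = model.predict([get_bag_of_words(some_text)])[0]
predicted_topics = [categories[i] for i, s in enumerate(scores) if s >= 0.5]

Because the sigmoid scores no longer have to sum to 1, several topics can be close to 1 at the same time, so a per-topic threshold (or taking the top-k scores) becomes a meaningful way to decide how many topics to return.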