Keras: tweets classification
Dear forum members,
I have a dataset of 20 million tweets collected at random from individual users (no two tweets come from the same account). Let me call this the "general" dataset. In addition, I have another, "specific" dataset consisting of 100,000 tweets collected from drug (opioid) abusers. Each tweet has at least one tag associated with it, e.g. opioids, addiction, overdose, hydrocodone, etc. (25 tags at most).
My goal is to use the "specific" dataset to train a model with Keras and then use it to tag tweets in the "general" dataset, in order to identify tweets that may have been written by drug abusers.
Based on the examples in source1 and source2, I managed to build a simple working version of such a model:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.preprocessing import text
# load the opioid-specific data set, where post is a tweet and tags is a single tag associated with it
# how would I include multiple tags to be used in training?
data = pd.read_csv("filename.csv")

# make sure columns are strings before fitting the tokenizer and the encoder
data['post'] = data['post'].astype(str)
data['tags'] = data['tags'].astype(str)

train_size = int(len(data) * .8)
train_posts = data['post'][:train_size]
train_tags = data['tags'][:train_size]
test_posts = data['post'][train_size:]
test_tags = data['tags'][train_size:]
# tokenize tweets
vocab_size = 100000 # what does vocabulary size really mean?
# (num_words keeps only the vocab_size most frequent words; rarer words are dropped)
tokenize = text.Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)
# labeling
# is this where I add more columns with tags for training?
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
# model building
batch_size = 32
# one column per class in the one-hot encoding; the np.max(y_train) + 1 idiom
# from the sources only works for integer class ids, not one-hot labels
num_labels = y_train.shape[1]
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
# one-hot labels need categorical_crossentropy, not sparse_categorical_crossentropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=5, verbose=1, validation_split=0.1)
# test prediction accuracy
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
# make predictions using a test set
text_labels = encoder.classes_
for i in range(1000):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label: ' + test_tags.iloc[i])
    print('Predicted label: ' + predicted_label)
To move forward, I would like to clarify a few things:
- Suppose all of my training tweets carry a single tag, opioids. If I then pass unlabeled tweets through the model, isn't it likely to simply tag all of them as opioids, since it knows nothing else? Should I use a variety of different tweets/tags for training? Are there, perhaps, any general guidelines for selecting tweets/tags for training purposes?
- How do I add more columns with tags to use in training (the code does not use anything like that)? (See the sketch after this list.)
- Once I have trained the model and reached adequate accuracy, how do I pass unlabeled tweets through it to make predictions?
- How do I add a confusion matrix?
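For the multi-tag and prediction questions, here is a minimal sketch of what I have in mind; the comma separator, the unlabeled_tweets list, and the 0.5 threshold are all my assumptions:

# sketch only: assumes each 'tags' field is a comma-separated string, e.g. "opioids,addiction"
from sklearn.preprocessing import MultiLabelBinarizer

# MultiLabelBinarizer turns each list of tags into a multi-hot vector,
# one column per distinct tag (up to 25 here)
mlb = MultiLabelBinarizer()
y_all = mlb.fit_transform(data['tags'].str.split(','))
y_train = y_all[:train_size]
y_test = y_all[train_size:]

# with multi-hot targets the tags are not mutually exclusive, so the output
# layer would use one sigmoid per tag and binary_crossentropy instead:
# model.add(Dense(len(mlb.classes_), activation='sigmoid'))
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# predicting on genuinely unlabeled tweets: reuse the tokenizer fitted on
# the training data, then threshold the per-tag probabilities
x_new = tokenize.texts_to_matrix(unlabeled_tweets) # unlabeled_tweets: a list of strings
probs = model.predict(x_new)
predicted_tags = mlb.classes_[probs[0] > 0.5]

With a sigmoid output each tag gets an independent probability, so a single tweet can come out tagged with both opioids and addiction at once.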
Any other relevant feedback is also greatly appreciated.
Thank you!
"general" 个推文示例:
everybody messages me when im in class but never communicates on the weekends like this when im free. feels like that anyway lol.
i woke up late, and now i look like shit. im the type of person who will still be early to whatever, ill just look like i just woke up.
"specific" 个推文示例:
million grant to educate clinicians who prescribe opioids
early and regular marijuana use is associated with use of other illicit drugs, including opioids
My take on this:
Create a new dataset with tweets from the general data + the specific data. Say 200K-250K, where 100K come from your specific dataset and the rest are general.
Take your 25 keywords/tags and write a rule: if one or more of them are present in a tweet, it is DA (drug abuser), otherwise NDA (non drug abuser). This will be your dependent variable.
Your new dataset will have one column containing all the tweets and another column containing the dependent variable saying whether each one is DA or NDA.
Now split into train/test and feed it to Keras or any other algorithm, so that it can learn.
Then test the model by plotting a confusion matrix.
Pass the remaining tweets from the general dataset through this model and check the results. Even for new tweets whose words fall outside the 25 and do not appear in the specific dataset, the model you built will still try to intelligently guess the right category from the word groups that occur together, the tone, and so on. (A minimal sketch of this pipeline follows.)
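Here is a minimal sketch of that pipeline, assuming general and specific are DataFrames with a post column; KEYWORDS, rest_of_general, and the sampling sizes are illustrative, not prescriptive:

# minimal sketch of the DA/NDA pipeline described above; 'general', 'specific',
# KEYWORDS and rest_of_general are illustrative names, not from the original post
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.preprocessing import text
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

KEYWORDS = ['opioid', 'addiction', 'overdose', 'hydrocodone']  # ... up to 25 tags

# 1) combined dataset: the 100K specific tweets plus a sample of general ones
combined = pd.concat([specific, general.sample(n=150000)], ignore_index=True)

# 2) rule-based dependent variable: DA (1) if any keyword appears, else NDA (0)
pattern = '|'.join(KEYWORDS)
combined['label'] = combined['post'].str.contains(pattern, case=False).astype(int)

# 3) train/test split and tokenization, as in the question
train_posts, test_posts, y_train, y_test = train_test_split(
    combined['post'], combined['label'], test_size=0.2, random_state=42)
tokenize = text.Tokenizer(num_words=10000)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)

# binary classifier: a single sigmoid output for DA vs NDA
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(10000,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.1)

# 4) confusion matrix on the held-out test set
y_pred = (model.predict(x_test).ravel() > 0.5).astype(int)
print(confusion_matrix(y_test, y_pred))

# 5) score the remaining general tweets; the model can still flag tweets that
# use none of the 25 keywords, based on co-occurring words, tone, etc.
x_rest = tokenize.texts_to_matrix(rest_of_general['post'])
da_probability = model.predict(x_rest)

The point of step 5 is that the network scores every tweet by its full bag of words, not just the 25 keywords, which is what lets it generalize beyond the rule used to label the training data.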