使用 keras/python 和 CSV 文件创建顺序模型但准确性较差
Creating sequential model using keras/python and CSV file but getting bad accuracy
我想构建一个分类器,但我无法找到可以清楚地解释 keras 功能以及如何着手做我想做的事情的资源。我想使用以下数据:
0 1 2 3 4 5 6 7
0 Name TRY LOC OUTPUT TYPE_A SIGNAL A-B SPOT
1 inc 1 2 20 TYPE-1 TORPEDO ULTRA A -21
2 inc 2 3 16 TYPE-2 TORPEDO ILH B -14
3 inc 3 2 20 BLACK47 TORPEDO LION A 49
4 inc 4 3 12 TYPE-2 CENTRALPA LION A 25
5 inc 5 3 10 TYPE-2 THREE LION A -21
6 inc 6 2 20 TYPE-2 ATF LION A -48
7 inc 7 4 2 NIVEA-1 ATF LION B -23
8 inc 8 3 16 NIVEA-1 ATF LION B 18
9 inc 9 3 18 BLENDER CENTRALPA LION B 48
10 inc 10 4 20 DELCO ATF LION B -26
11 inc 11 3 20 VE248 ATF LION B 44
12 inc 12 1 20 SILVER CENTRALPA LION B -35
13 inc 13 2 20 CALVIN3 SEVENX LION B -20
14 inc 14 3 14 DECK-BT CENTRALPA LION B -38
15 inc 15 4 4 10-LEVI BERWYEN OWL B -29
16 inc 16 4 14 TYPE-2 ATF NOV B -31
17 inc 17 4 10 NYNY TORPEDO NOV B 21
18 inc 18 2 20 NIVEA-1 CENTRALPA NOV B 45
19 inc 19 3 27 FMRA97 TORPEDO NOV B -26
20 inc 20 4 18 SILVER ATF NOV B -46
我想使用第 1、2、4、5、6、7 列作为输入,输出为 3(输出)。
我目前的密码是:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot
df = pd.read_csv("file.csv")
df.drop('Name', axis=1, inplace=True)
obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)
target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())
X = df4[predictors].values
y = df4[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))
# Compile the model
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)
我不明白为什么这是我得到的结果:
Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04
我似乎也找不到与此问题相关的任何答案。我是否错误地使用了keras函数?这是我将对象类型转换为整数的方式吗?假设有 1250 个输出,我将如何修复图层?任何提示或帮助将不胜感激。谢谢。
正如我在评论中所说,这似乎是一个明显的模型欠拟合案例——对于模型本身的大小,你的数据太少了。与其玩弄层的大小,不如先尝试 SVM 或 RandomForest classifiers,看看是否有可能对您的数据进行任何合理的 classification。同样,对于如此大量的数据,神经网络几乎不是一个好的选择。
所以改为这样做:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
df = blablabla # This is your data
X = df.iloc[:, [i for i in range(8) if i != 3]]
y = df.iloc[:, 3]
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
如果这有效并且可以做出一些预测,那么您可以继续尝试调整您的顺序模型。
编辑: 只需阅读您的评论,您总共有 1250 class 个标签和 5000 个样本。这可能不适用于大多数 classifier。 class 太多,样本支持太少。
我想构建一个分类器,但我无法找到可以清楚地解释 keras 功能以及如何着手做我想做的事情的资源。我想使用以下数据:
0 1 2 3 4 5 6 7
0 Name TRY LOC OUTPUT TYPE_A SIGNAL A-B SPOT
1 inc 1 2 20 TYPE-1 TORPEDO ULTRA A -21
2 inc 2 3 16 TYPE-2 TORPEDO ILH B -14
3 inc 3 2 20 BLACK47 TORPEDO LION A 49
4 inc 4 3 12 TYPE-2 CENTRALPA LION A 25
5 inc 5 3 10 TYPE-2 THREE LION A -21
6 inc 6 2 20 TYPE-2 ATF LION A -48
7 inc 7 4 2 NIVEA-1 ATF LION B -23
8 inc 8 3 16 NIVEA-1 ATF LION B 18
9 inc 9 3 18 BLENDER CENTRALPA LION B 48
10 inc 10 4 20 DELCO ATF LION B -26
11 inc 11 3 20 VE248 ATF LION B 44
12 inc 12 1 20 SILVER CENTRALPA LION B -35
13 inc 13 2 20 CALVIN3 SEVENX LION B -20
14 inc 14 3 14 DECK-BT CENTRALPA LION B -38
15 inc 15 4 4 10-LEVI BERWYEN OWL B -29
16 inc 16 4 14 TYPE-2 ATF NOV B -31
17 inc 17 4 10 NYNY TORPEDO NOV B 21
18 inc 18 2 20 NIVEA-1 CENTRALPA NOV B 45
19 inc 19 3 27 FMRA97 TORPEDO NOV B -26
20 inc 20 4 18 SILVER ATF NOV B -46
我想使用第 1、2、4、5、6、7 列作为输入,输出为 3(输出)。
我目前的密码是:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot
df = pd.read_csv("file.csv")
df.drop('Name', axis=1, inplace=True)
obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)
target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())
X = df4[predictors].values
y = df4[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))
# Compile the model
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)
我不明白为什么这是我得到的结果:
Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04
我似乎也找不到与此问题相关的任何答案。我是否错误地使用了keras函数?这是我将对象类型转换为整数的方式吗?假设有 1250 个输出,我将如何修复图层?任何提示或帮助将不胜感激。谢谢。
正如我在评论中所说,这似乎是一个明显的模型欠拟合案例——对于模型本身的大小,你的数据太少了。与其玩弄层的大小,不如先尝试 SVM 或 RandomForest classifiers,看看是否有可能对您的数据进行任何合理的 classification。同样,对于如此大量的数据,神经网络几乎不是一个好的选择。
所以改为这样做:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
df = blablabla # This is your data
X = df.iloc[:, [i for i in range(8) if i != 3]]
y = df.iloc[:, 3]
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
如果这有效并且可以做出一些预测,那么您可以继续尝试调整您的顺序模型。
编辑: 只需阅读您的评论,您总共有 1250 class 个标签和 5000 个样本。这可能不适用于大多数 classifier。 class 太多,样本支持太少。