Creating sequential model using keras/python and CSV file but getting bad accuracy

Question

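I want to build a classifier, but I can't find a resource that clearly explains the keras functions and how to go about what I want to do. I want to use the following data: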

         0    1    2        3          4       5    6     7
0     Name  TRY  LOC   OUTPUT     TYPE_A   SIGNAL  A-B  SPOT
1    inc 1    2   20   TYPE-1    TORPEDO   ULTRA    A   -21
2    inc 2    3   16   TYPE-2    TORPEDO     ILH    B   -14
3    inc 3    2   20  BLACK47    TORPEDO    LION    A    49
4    inc 4    3   12   TYPE-2  CENTRALPA    LION    A    25
5    inc 5    3   10   TYPE-2      THREE    LION    A   -21
6    inc 6    2   20   TYPE-2        ATF    LION    A   -48
7    inc 7    4    2  NIVEA-1        ATF    LION    B   -23
8    inc 8    3   16  NIVEA-1        ATF    LION    B    18
9    inc 9    3   18  BLENDER  CENTRALPA    LION    B    48
10   inc 10   4   20    DELCO        ATF    LION    B   -26
11   inc 11   3   20    VE248        ATF    LION    B    44
12   inc 12   1   20   SILVER  CENTRALPA    LION    B   -35
13   inc 13   2   20  CALVIN3     SEVENX    LION    B   -20
14   inc 14   3   14  DECK-BT  CENTRALPA    LION    B   -38
15   inc 15   4    4  10-LEVI    BERWYEN     OWL    B   -29
16   inc 16   4   14   TYPE-2        ATF     NOV    B   -31
17   inc 17   4   10     NYNY    TORPEDO     NOV    B    21
18   inc 18   2   20  NIVEA-1  CENTRALPA     NOV    B    45
19   inc 19   3   27   FMRA97    TORPEDO     NOV    B   -26
20   inc 20   4   18   SILVER        ATF     NOV    B   -46

I want to use columns 1, 2, 4, 5, 6 and 7 as inputs, and column 3 (OUTPUT) as the output.

My current code is:

import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot

df = pd.read_csv("file.csv")

df.drop('Name', axis=1, inplace=True)

obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)

target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())

X = df4[predictors].values
y = df4[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)

model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)

I don't understand why this is the result I get:

Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04

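I can't seem to find any answers related to this problem either. Am I using the keras functions incorrectly? Is it the way I converted the object types to integers? How would I fix the layers, given that there are 1250 outputs? Any tips or help would be appreciated. Thank you.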

Answer 1

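As I said in the comments, this looks like a clear case of model underfitting: you have too little data for the size of the model itself. Instead of playing with the layer sizes, try an SVM or RandomForest classifier first to see whether any reasonable classification is possible with your data at all. Also, with this little data, a neural network is almost never a good choice.

So do this instead: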

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import pandas as pd

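df = pd.read_csv("file.csv")  # This is your data (file name taken from the question)
# Inputs are columns 1, 2, 4, 5, 6, 7; column 0 (Name) is only a row label, column 3 (OUTPUT) is the target
X = df.iloc[:, [1, 2, 4, 5, 6, 7]]
y = df.iloc[:, 3]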

X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(accuracy)

If this works and produces some usable predictions, you can go on to tweak your sequential model.
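When you get to that tweaking step, one thing worth checking is the output layer. The question's model ends in `Dense(1, activation='softmax')` trained with `categorical_crossentropy`, and a softmax over a single unit can only ever output 1.0, which would explain a loss of exactly 0. The following is a minimal pure-Python sketch of that arithmetic, not Keras code; the helper names `softmax` and `categorical_crossentropy` are written here for illustration only.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred):
    # Per-sample loss: -sum(y_true * log(y_pred)).
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


# One output unit: whatever the raw score is, softmax normalises it to 1.0.
single_unit_outputs = [softmax([z]) for z in (-3.0, 0.0, 7.5)]

# Feeding the integer class code as a one-element target, as the question does,
# multiplies log(1.0) == 0 by the code, so the loss is 0 for every sample.
losses = [categorical_crossentropy([code], softmax([0.3])) for code in (0, 1, 17)]

# With one unit per class the outputs form a real probability distribution.
three_unit = softmax([1.0, 2.0, 3.0])

print(single_unit_outputs)
print(losses)
print(three_unit)
```

In Keras terms this suggests giving the last layer one unit per class, for example `Dense(num_classes, activation='softmax')`, and compiling with `loss='sparse_categorical_crossentropy'` so that the integer codes from `cat.codes` can be used as targets directly. Here `num_classes` is a placeholder for the number of distinct OUTPUT values, about 1250 according to the question. This addresses the zero loss only; it does not change the point about how little data there is.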

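Edit: I just read your comment that you have 1250 class labels and 5000 samples in total. That probably won't work with most classifiers: there are too many classes and too little sample support for each one.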
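To make the support problem concrete, here is a small sketch that counts samples per class using only the 20 OUTPUT values visible in the question's table. The helper `class_support` and the threshold of 5 are assumptions chosen for illustration (5 matches the `min_samples_leaf` used above); on the real data you would pass in the full OUTPUT column.

```python
from collections import Counter

# OUTPUT column of the 20 sample rows shown in the question.
outputs = [
    "TYPE-1", "TYPE-2", "BLACK47", "TYPE-2", "TYPE-2",
    "TYPE-2", "NIVEA-1", "NIVEA-1", "BLENDER", "DELCO",
    "VE248", "SILVER", "CALVIN3", "DECK-BT", "10-LEVI",
    "TYPE-2", "NYNY", "NIVEA-1", "FMRA97", "SILVER",
]


def class_support(labels, min_count):
    # Count samples per class and collect the classes with fewer than min_count samples.
    counts = Counter(labels)
    rare = {label for label, n in counts.items() if n < min_count}
    return counts, rare


counts, rare = class_support(outputs, min_count=5)
print(len(counts), "classes,", len(rare), "with fewer than 5 samples")

# The same ratio for the full data set described in the comments.
average_support = 5000 / 1250
print("average samples per class:", average_support)
```

In this excerpt only TYPE-2 reaches five samples, and across the full data set the average is four samples per class before any train/test split, so many classes will have few or no examples left on one side of the split.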
