ValueError after attempting to use OneHotEncoder and then normalize values with make_column_transformer
So I was trying to convert my data's timestamps from Unix timestamps to a more readable date format. I created a simple Java program to do that and write to a .csv file, and that went smoothly. I then tried to use the file for my model by one-hot encoding the timestamps into numbers and then normalizing everything. However, after my attempt at one-hot encoding (which I am not sure even worked), the normalization step using make_column_transformer failed.
# model 4
# next model
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras import layers
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
np.set_printoptions(precision=3, suppress=True)
btc_data = pd.read_csv(
    "/content/drive/MyDrive/Science Fair/output2.csv",
    names=["Time", "Open"])
X_btc = btc_data[["Time"]]
y_btc = btc_data["Open"]
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)
X_btc = enc.transform(X_btc)
print(X_btc)
X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)
ct = make_column_transformer(
    (MinMaxScaler(), ["Time"])
)
ct.fit(X_train)
X_train_normal = ct.transform(X_train)
X_test_normal = ct.transform(X_test)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
btc_model_4 = tf.keras.Sequential([
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(1, activation="linear")
])
btc_model_4.compile(loss=tf.losses.MeanSquaredError(),
                    optimizer=tf.optimizers.Adam())
history = btc_model_4.fit(X_train_normal, y_train, batch_size=8192, epochs=100, callbacks=[callback])
btc_model_4.evaluate(X_test_normal, y_test, batch_size=8192)
y_pred = btc_model_4.predict(X_test_normal)
btc_model_4.save("btc_model_4")
btc_model_4.save("btc_model_4.h5")
# plot model
def plot_evaluations(train_data=X_train_normal,
                     train_labels=y_train,
                     test_data=X_test_normal,
                     test_labels=y_test,
                     predictions=y_pred):
    print(test_data.shape)
    print(predictions.shape)
    plt.figure(figsize=(100, 15))
    plt.scatter(train_data, train_labels, c='b', label="Training")
    plt.scatter(test_data, test_labels, c='g', label="Testing")
    plt.scatter(test_data, predictions, c='r', label="Results")
    plt.legend()
plot_evaluations()
# plot loss curve
pd.DataFrame(history.history).plot()
plt.ylabel("loss")
plt.xlabel("epochs")
My regular data format looks like this:
2015-12-05 12:52:00,377.48
2015-12-05 12:53:00,377.5
2015-12-05 12:54:00,377.5
2015-12-05 12:56:00,377.5
2015-12-05 12:57:00,377.5
2015-12-05 12:58:00,377.5
2015-12-05 12:59:00,377.5
2015-12-05 13:00:00,377.5
2015-12-05 13:01:00,377.79
2015-12-05 13:02:00,377.5
2015-12-05 13:03:00,377.79
2015-12-05 13:05:00,377.74
2015-12-05 13:06:00,377.79
2015-12-05 13:07:00,377.64
2015-12-05 13:08:00,377.79
2015-12-05 13:10:00,377.77
2015-12-05 13:11:00,377.7
2015-12-05 13:12:00,377.77
2015-12-05 13:13:00,377.77
2015-12-05 13:14:00,377.79
2015-12-05 13:15:00,377.72
2015-12-05 13:16:00,377.5
2015-12-05 13:17:00,377.49
2015-12-05 13:18:00,377.5
2015-12-05 13:19:00,377.5
2015-12-05 13:20:00,377.8
2015-12-05 13:21:00,377.84
2015-12-05 13:22:00,378.29
2015-12-05 13:23:00,378.3
2015-12-05 13:24:00,378.3
2015-12-05 13:25:00,378.33
2015-12-05 13:26:00,378.33
2015-12-05 13:28:00,378.31
2015-12-05 13:29:00,378.68
The first value is the date, and the second value after the comma is the price of BTC at that time. Now, after the "one-hot encoding", I added a print statement to print those X values, which gave the following:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 3) 1.0
(4, 4) 1.0
(5, 5) 1.0
(6, 6) 1.0
(7, 7) 1.0
(8, 8) 1.0
(9, 9) 1.0
(10, 10) 1.0
(11, 11) 1.0
(12, 12) 1.0
(13, 13) 1.0
(14, 14) 1.0
(15, 15) 1.0
(16, 16) 1.0
(17, 17) 1.0
(18, 18) 1.0
(19, 19) 1.0
(20, 20) 1.0
(21, 21) 1.0
(22, 22) 1.0
(23, 23) 1.0
(24, 24) 1.0
: :
(2526096, 2526096) 1.0
(2526097, 2526097) 1.0
(2526098, 2526098) 1.0
(2526099, 2526099) 1.0
(2526100, 2526100) 1.0
(2526101, 2526101) 1.0
(2526102, 2526102) 1.0
(2526103, 2526103) 1.0
(2526104, 2526104) 1.0
(2526105, 2526105) 1.0
(2526106, 2526106) 1.0
(2526107, 2526107) 1.0
(2526108, 2526108) 1.0
(2526109, 2526109) 1.0
(2526110, 2526110) 1.0
(2526111, 2526111) 1.0
(2526112, 2526112) 1.0
(2526113, 2526113) 1.0
(2526114, 2526114) 1.0
(2526115, 2526115) 1.0
(2526116, 2526116) 1.0
(2526117, 2526117) 1.0
(2526118, 2526118) 1.0
(2526119, 2526119) 1.0
(2526120, 2526120) 1.0
After performing the normalization fit, I receive the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
408 try:
--> 409 all_columns = X.columns
410 except AttributeError:
5 frames
AttributeError: columns not found
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
410 except AttributeError:
411 raise ValueError(
--> 412 "Specifying the columns using strings is only "
413 "supported for pandas DataFrames"
414 )
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Is my one-hot encoding correct? What is the proper way of doing this? Should I implement the one-hot encoder directly in the normalization process?
Using OneHotEncoder is not the way to go here. It's better to extract features from the Time column as separate features, such as year, month, day, hour, minute, etc., and use those columns as inputs to the model:
# the timestamp column is named "Time" in the CSV read above (names=["Time", "Open"])
btc_data['Year'] = btc_data['Time'].astype('datetime64[ns]').dt.year
btc_data['Month'] = btc_data['Time'].astype('datetime64[ns]').dt.month
btc_data['Day'] = btc_data['Time'].astype('datetime64[ns]').dt.day
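Hour and minute can be extracted the same way (a sketch; the Hour/Minute column names and the final feature selection are my own, not part of the original answer):
btc_data['Hour'] = btc_data['Time'].astype('datetime64[ns]').dt.hour
btc_data['Minute'] = btc_data['Time'].astype('datetime64[ns]').dt.minute
X_btc = btc_data[['Year', 'Month', 'Day', 'Hour', 'Minute']]  # numeric features as model inputs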
The problem here comes from the OneHotEncoder, which returns a scipy sparse matrix and drops the "Time" column. To correct this, you have to convert the output back to a pandas DataFrame and add the "Time" column:
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)
X_btc = enc.transform(X_btc)
X_btc = pd.DataFrame(X_btc.todense())
X_btc["Time"] = btc_data["Time"]
One way to work around the memory problem is to:
- Generate two splits with the same random_state, one for the pandas DataFrame and one for the scipy sparse matrix:
X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)
X_train_pd, X_test_pd, y_train_pd, y_test_pd = train_test_split(btc_data, y_btc, test_size=0.2, random_state=62)
- Use the pandas DataFrame for the MinMaxScaler():
ct = make_column_transformer((MinMaxScaler(), ["Time"]))
ct.fit(X_train_pd)
result_train = ct.transform(X_train_pd)
result_test = ct.transform(X_test_pd)
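Note that MinMaxScaler expects numeric input, so the "Time" column must be numeric before this step. A hedged sketch (my assumption, not shown in the original answer) would convert the timestamp strings back to Unix seconds before the splits:
btc_data['Time'] = pd.to_datetime(btc_data['Time']).astype('int64') // 10**9  # datetime -> Unix seconds, so MinMaxScaler can scale it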
- Use a generator to load the data during the training and testing phases (this solves the memory problem), and include the scaled time in the generator:
def nn_batch_generator(X_data, y_data, scaled, batch_size):
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        scaled_array = scaled[index_batch]
        X_batch = X_data[index_batch, :].todense()  # densify only the current batch of the sparse matrix
        y_batch = y_data.iloc[index_batch]
        counter += 1
        yield np.array(np.hstack((np.array(X_batch), scaled_array))), np.array(y_batch)
        if (counter > number_of_batches):
            counter = 0

def nn_batch_generator_test(X_data, scaled, batch_size):
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    index = np.arange(np.shape(X_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        scaled_array = scaled[index_batch]
        X_batch = X_data[index_batch, :].todense()
        counter += 1
        yield np.hstack((X_batch, scaled_array))
        if (counter > number_of_batches):
            counter = 0
Finally, fit the model:
history = btc_model_4.fit(nn_batch_generator(X_train, y_train, scaled=result_train, batch_size=2),
                          steps_per_epoch=#Todetermine,
                          batch_size=2, epochs=10,
                          callbacks=[callback])
btc_model_4.evaluate(nn_batch_generator(X_test, y_test, scaled=result_test, batch_size=2), batch_size=2, steps=#Todetermine)
y_pred = btc_model_4.predict(nn_batch_generator_test(X_test, scaled=result_test, batch_size=2), steps=#Todetermine)
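The #Todetermine placeholders are left open by the answer; one common choice (my assumption, not part of the original) is the ceiling of the sample count over the batch size:
import math

batch_size = 2
steps_per_epoch = math.ceil(X_train.shape[0] / batch_size)  # batches needed to cover the training set once
test_steps = math.ceil(X_test.shape[0] / batch_size)        # likewise for evaluate/predict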
Just to add to the existing answer: if you convert from the Scipy Compressed Sparse Row (CSR) matrix to a Pandas DataFrame and cast the timestamp strings to datetime64, the model will begin training - at least on the small subset provided:
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)
X_btc = enc.transform(X_btc)
X_btc = pd.DataFrame(X_btc.todense())
X_btc["Time"] = btc_data["Time"]
X_btc['Time'] = X_btc['Time'].astype('datetime64[ns]')
Based on your comment about how memory-intensive this is, that is the nature of the problem you are tackling - by one-hot encoding the timestamps, if your feature matrix has n rows each containing a distinct value (which we would expect when dealing with timestamps), applying one-hot encoding produces an n x n matrix, which can be enormous. To verify, if you step through or print out the intermediate matrices generated along the way with the test data, you will observe that X_btc starts as a 34 x 1 matrix and becomes a 34 x 34 matrix after applying the encoder (X_btc = enc.transform(X_btc)).
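A minimal sketch that reproduces the blow-up on synthetic data (the 34-row date range mirrors the sample above; names are illustrative):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({"Time": pd.date_range("2015-12-05 12:52", periods=34, freq="1min").astype(str)})
enc_demo = OneHotEncoder(handle_unknown="ignore")
encoded = enc_demo.fit_transform(demo)
print(demo.shape, encoded.shape)  # (34, 1) -> (34, 34): one column per unique timestamp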
I am not sure what the end objective of this problem is, but if you want to continue with this approach, you may want to bin the samples at a coarser granularity - for example, rather than treating each timestamp down to the millisecond as its own unique category when one-hot encoding, truncate the timestamps to the hour and then apply one-hot encoding:
X_btc['Time'] = X_btc['Time'].astype('datetime64[h]') # truncate to hour precision before one-hot encoding
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)
X_btc = enc.transform(X_btc)
X_btc = pd.DataFrame(X_btc.todense())
X_btc["Time"] = btc_data["Time"].astype('datetime64[ns]') # Use 'ns' here to retain the full timestamp information
In the example data provided, since we have 2 distinct hours (12 and 13), applying one-hot encoding now yields only 2 distinct classes rather than 34. This should alleviate the memory footprint problem, since the number of distinct hours should be far smaller than the total number of records in this data.
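A quick sanity check on the sample data (a sketch, assuming the 34 rows above are loaded in btc_data):
print(pd.to_datetime(btc_data['Time']).dt.floor('h').nunique())  # 2 distinct hours -> 2 one-hot columns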
Along the same lines, rather than truncating to the hour, you could extract the hour (and possibly the minute) from the timestamp into the one-hot encoding:
X_btc['Time'] = X_btc['Time'].astype('datetime64[ns]').dt.hour.astype(str)  # .astype(str) converts element-wise; the built-in str() would stringify the whole Series
# + ":" + X_btc['Time'].astype('datetime64[ns]').dt.minute.astype(str)  # UNCOMMENT TO INCLUDE minute
The benefit of this approach is that, if you save the encoder, you can reuse this logic on new data being introduced to the system, whereas with the current approach of encoding the raw timestamps in the training data, you would not be able to run the model on a stream of data whose dates are not contained in the training set. They would fall into a new category and would require re-fitting the encoder and the model.
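A hedged sketch of that reuse, assuming the fitted encoder from above (the file name is illustrative):
import joblib

joblib.dump(enc, "hour_encoder.joblib")          # persist the fitted encoder
enc_loaded = joblib.load("hour_encoder.joblib")
new_data = pd.DataFrame({"Time": ["14", "15"]})  # hypothetical incoming hours, as strings
encoded_new = enc_loaded.transform(new_data)     # unseen hours become all-zero rows (handle_unknown="ignore")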
If you use only the hour, that means you will get 24 distinct classes out of the one-hot encoder. If you use minutes as well, you will have 24 * 60 = 1440 distinct classes (which should still be far fewer than the number of records you are dealing with).