Using datetime64 feature type in building a model?

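I have a dataframe with about 50 features. My experiment is a classification problem, so I want to train the model with "GradientBoostingClassifier". The dataframe (mydata) serves as the training set. One of the 50 features (feature20) is a date, and I need to include it in my training set as well, so I tried converting it to datetime64 as follows: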

  mydata['feature20']=pd.to_datetime(mydata['feature20'])

Now, when I try to train the model with the classifier, I get the following error:

  float() argument must be a string or a number, not 'Timestamp'

Is there a way to solve this problem?

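You can easily convert the dates to integers (seconds since the Unix epoch): df["feature20"].astype("int64") // 10**9. For a datetime64[ns] column, astype("int64") returns nanoseconds, hence the division by 10**9.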

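Note: it is usually not a good idea to keep the datetime feature as is, unless you are working with time series. You will typically want to extract other information from it instead: day of week, day of month, week of year, month number, etc.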
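As a minimal sketch of the extraction mentioned in the note, assuming the column is already datetime64, the `.dt` accessor exposes these calendar parts. The toy frame and the output column names (`dayofweek`, `day`, `week`, `month`) are illustrative choices, not part of the original question.

```python
import pandas as pd

# Toy frame standing in for the asker's data; only feature20 matters here.
df = pd.DataFrame({"feature20": pd.date_range("2010-01-01", periods=10)})

dates = df["feature20"]

# Calendar parts as plain integer columns that a classifier can consume.
features = pd.DataFrame({
    "dayofweek": dates.dt.dayofweek,                      # Monday=0 ... Sunday=6
    "day": dates.dt.day,                                  # day of month
    "week": dates.dt.isocalendar().week.astype("int64"),  # ISO week of year
    "month": dates.dt.month,                              # 1 ... 12
})

print(features.head())
```

Note that ISO week numbering can assign the first days of January to the last week of the previous year, so check whether that convention suits your data before relying on the `week` column.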


Demo:

In [9]: df = pd.DataFrame({'feature20':pd.date_range('2010-01-01', periods=10)})

In [10]: df["new"] = df["feature20"].astype("int64") // 10**9

In [11]: df
Out[11]:
   feature20         new
0 2010-01-01  1262304000
1 2010-01-02  1262390400
2 2010-01-03  1262476800
3 2010-01-04  1262563200
4 2010-01-05  1262649600
5 2010-01-06  1262736000
6 2010-01-07  1262822400
7 2010-01-08  1262908800
8 2010-01-09  1262995200
9 2010-01-10  1263081600

In [12]: df["date"] = pd.to_datetime(df["new"], unit="s")

In [13]: df
Out[13]:
   feature20         new       date
0 2010-01-01  1262304000 2010-01-01
1 2010-01-02  1262390400 2010-01-02
2 2010-01-03  1262476800 2010-01-03
3 2010-01-04  1262563200 2010-01-04
4 2010-01-05  1262649600 2010-01-05
5 2010-01-06  1262736000 2010-01-06
6 2010-01-07  1262822400 2010-01-07
7 2010-01-08  1262908800 2010-01-08
8 2010-01-09  1262995200 2010-01-09
9 2010-01-10  1263081600 2010-01-10

If you have microsecond precision, divide by 10**3 instead to get microseconds since the epoch:

In [28]: df = pd.DataFrame({'feature20':pd.date_range('2010-01-01 01:01:01.123456', freq="123s", periods=10)})

In [29]: df
Out[29]:
                   feature20
0 2010-01-01 01:01:01.123456
1 2010-01-01 01:03:04.123456
2 2010-01-01 01:05:07.123456
3 2010-01-01 01:07:10.123456
4 2010-01-01 01:09:13.123456
5 2010-01-01 01:11:16.123456
6 2010-01-01 01:13:19.123456
7 2010-01-01 01:15:22.123456
8 2010-01-01 01:17:25.123456
9 2010-01-01 01:19:28.123456

In [30]: df["new"] = df["feature20"].astype("int64") // 10**3

In [31]: df
Out[31]:
                   feature20               new
0 2010-01-01 01:01:01.123456  1262307661123456
1 2010-01-01 01:03:04.123456  1262307784123456
2 2010-01-01 01:05:07.123456  1262307907123456
3 2010-01-01 01:07:10.123456  1262308030123456
4 2010-01-01 01:09:13.123456  1262308153123456
5 2010-01-01 01:11:16.123456  1262308276123456
6 2010-01-01 01:13:19.123456  1262308399123456
7 2010-01-01 01:15:22.123456  1262308522123456
8 2010-01-01 01:17:25.123456  1262308645123456
9 2010-01-01 01:19:28.123456  1262308768123456

In [32]: df["date"] = pd.to_datetime(df["new"], unit="us")

In [33]: df
Out[33]:
                   feature20               new                       date
0 2010-01-01 01:01:01.123456  1262307661123456 2010-01-01 01:01:01.123456
1 2010-01-01 01:03:04.123456  1262307784123456 2010-01-01 01:03:04.123456
2 2010-01-01 01:05:07.123456  1262307907123456 2010-01-01 01:05:07.123456
3 2010-01-01 01:07:10.123456  1262308030123456 2010-01-01 01:07:10.123456
4 2010-01-01 01:09:13.123456  1262308153123456 2010-01-01 01:09:13.123456
5 2010-01-01 01:11:16.123456  1262308276123456 2010-01-01 01:11:16.123456
6 2010-01-01 01:13:19.123456  1262308399123456 2010-01-01 01:13:19.123456
7 2010-01-01 01:15:22.123456  1262308522123456 2010-01-01 01:15:22.123456
8 2010-01-01 01:17:25.123456  1262308645123456 2010-01-01 01:17:25.123456
9 2010-01-01 01:19:28.123456  1262308768123456 2010-01-01 01:19:28.123456
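The integer casts above assume the column is stored with nanosecond resolution. As a hedged alternative sketch, subtracting the epoch and floor-dividing by a `Timedelta` gives the same integers without depending on the internal storage unit. This assumes timezone-naive timestamps; the variable name `epoch` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"feature20": pd.date_range("2010-01-01", periods=3)})

# Reference point: the Unix epoch as a timezone-naive Timestamp (assumption:
# feature20 is timezone-naive as well).
epoch = pd.Timestamp("1970-01-01")

# Floor-dividing the elapsed time by one second yields whole seconds since
# the epoch, whatever resolution the datetime column uses internally.
df["new"] = (df["feature20"] - epoch) // pd.Timedelta("1s")

print(df)
```

Swapping `pd.Timedelta("1s")` for `pd.Timedelta("1us")` would give microseconds since the epoch in the same way.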