Pandas groupby mean not working on datetime column

Question

I have a DataFrame with a date_time column of dtype datetime64[ns]:

 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   filename   235 non-null    object        
 1   date_time  235 non-null    datetime64[ns]
 2   r          235 non-null    float64

df:

    filename           date_time       r
0        01_ 2022-05-24 12:07:06 3.2E+05
1        01_ 2022-05-24 12:08:15 3.1E+05
2        01_ 2022-05-24 12:09:23 2.9E+05
3        02_ 2022-05-24 12:10:43 5.0E+06
4        04_ 2022-05-24 12:38:26 5.6E+06
..       ...                 ...     ...
230      91_ 2022-05-26 09:57:50 8.9E+06
231      91_ 2022-05-26 09:59:06 8.3E+06
232      91_ 2022-05-26 10:00:23 8.5E+06
233      91_ 2022-05-26 10:01:40 9.0E+06
234      91_ 2022-05-26 10:02:57 9.1E+06

Calculating the mean of date_time, grouped by filename, only works with:

df_.groupby(["filename"]).agg(["mean"])

and not with:

df_.groupby(["filename"]).mean()
df_.groupby(["filename"]).agg("mean")

Why does it only work with df_.groupby(["filename"]).agg(["mean"])?

Here is the code with example output:

print("works with:")
print(df_.groupby(["filename"]).agg(["mean"]))
print("doesn't work with: (no date_time column showing)")
print(df_.groupby(["filename"]).mean())
print(df_.groupby(["filename"]).agg("mean"))

OUT: 

works with:
                             date_time       r
                                  mean    mean
filename                                      
01_      2022-05-24 12:08:14.666666752 3.1E+05
02_      2022-05-24 12:10:43.000000000 5.0E+06
04_      2022-05-24 12:39:34.999999744 5.2E+06
05_      2022-05-24 12:42:54.000000000 7.5E+04
06_      2022-05-24 12:47:06.000000000 3.4E+05
...                                ...     ...
87_      2022-05-25 16:44:56.000000000 9.5E+06
88_      2022-05-26 09:15:00.875000064 1.1E+05
89_      2022-05-26 09:29:22.357143040 8.3E+06
90_      2022-05-26 09:45:32.500000000 1.1E+05
91_      2022-05-26 09:55:16.384615424 8.9E+06

[75 rows x 2 columns]
doesn't work with: (no date_time column showing)
               r
filename        
01_      3.1E+05
02_      5.0E+06
04_      5.2E+06
05_      7.5E+04
06_      3.4E+05
...          ...
87_      9.5E+06
88_      1.1E+05
89_      8.3E+06
90_      1.1E+05
91_      8.9E+06

[75 rows x 1 columns]
               r
filename        
01_      3.1E+05
02_      5.0E+06
04_      5.2E+06
05_      7.5E+04
06_      3.4E+05
...          ...
87_      9.5E+06
88_      1.1E+05
89_      8.3E+06
90_      1.1E+05
91_      8.9E+06

[75 rows x 1 columns]

Answer 1

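When the date-time column is the only column besides the grouping column, it works in all cases. But when you have more columns, agg seems to apply only to the numeric ones.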

The docstring says: https://github.com/pandas-dev/pandas/blob/df32e83f36bf485be803be2b87d23135be30540a/pandas/core/base.py#L301

If arg is a string, then try to operate on it:

    - try to find a function (or attribute) on ourselves
    - try to find a numpy function

They also mention: people may try to aggregate on a non-callable attribute, but don't let them think they can pass args to it.

I tried using np.mean and got the same result as with ('mean').

That is as far as I got. Hope this helps.

Answer 2

I posted the issue on GitHub, and this is the solution that was provided:

The problem is the groupby mean parameter numeric_only=True (the default here), which includes only numeric data and therefore drops the datetime64[ns] column.

Docs: GroupBy.mean
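
A minimal sketch of what that suggests, assuming a small made-up frame with the same column layout as in the question (the values below are illustrative, not the real data). Passing numeric_only=False explicitly should keep the date_time column in the result. As far as I know, pandas 2.0 and later already default to numeric_only=False, so there the plain .mean() call may work as-is.

```python
import pandas as pd

# Made-up sample data mirroring the question's columns (assumption, not the real file).
df = pd.DataFrame(
    {
        "filename": ["01_", "01_", "02_"],
        "date_time": pd.to_datetime(
            ["2022-05-24 12:07:06", "2022-05-24 12:09:06", "2022-05-24 12:10:43"]
        ),
        "r": [3.2e5, 3.0e5, 5.0e6],
    }
)

# Explicitly ask groupby().mean() not to restrict itself to numeric columns.
means = df.groupby("filename").mean(numeric_only=False)
print(means)
```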

It works when ["mean"] is passed as a list because, when a list or dict is used in agg, the DataFrame is broken down into Series before each function is applied.
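
To illustrate the per-Series behaviour described above, here is a hedged sketch on made-up data (same column names as the question, illustrative values). It uses the dict form of agg and, alternatively, selects the date_time column before calling mean. Both operate on one column at a time, so the datetime column is not filtered out as non-numeric.

```python
import pandas as pd

# Made-up sample data mirroring the question's columns (assumption, not the real file).
df = pd.DataFrame(
    {
        "filename": ["01_", "01_", "02_"],
        "date_time": pd.to_datetime(
            ["2022-05-24 12:07:06", "2022-05-24 12:09:06", "2022-05-24 12:10:43"]
        ),
        "r": [3.2e5, 3.0e5, 5.0e6],
    }
)

# Dict form: each column is aggregated as its own Series, giving flat column names.
per_column = df.groupby("filename").agg({"date_time": "mean", "r": "mean"})
print(per_column)

# Selecting the column first yields a per-group mean for date_time only.
dt_only = df.groupby("filename")["date_time"].mean()
print(dt_only)
```

Compared with agg(["mean"]), the dict form avoids the two-level column header shown in the question's output.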

python

datetime

pandas

pandas-groupby