Pandas: Is it possible to down-sample categorical column?

Question

Suppose we have a DataFrame log like this one:

>>> log
                           state
date_time                       
2020-01-01 00:00:00            0
2020-01-01 00:01:00            0
2020-01-01 00:02:00            0
2020-01-01 00:03:00            1
2020-01-01 00:04:00            1
2020-01-01 00:05:00            1

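where the state column can be either 0 or 1 (or missing). If it is stored as UInt8 (the smallest numeric dtype that supports missing values), the data can be down-sampled like this: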

>>> log.resample(dt.timedelta(minutes=2)).mean()
                           state
date_time                       
2020-01-01 00:00:00          0.0
2020-01-01 00:02:00          0.5
2020-01-01 00:04:00          1.0
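
For reproducibility, here is a minimal sketch of how such a frame might be built and down-sampled. The construction itself is an assumption (the question does not show how log was created); only the column name, the index name, and the resample call come from the text above.

```python
import datetime as dt

import pandas as pd

# Assumed construction: six one-minute samples starting at 2020-01-01 00:00.
index = pd.date_range("2020-01-01 00:00:00", periods=6, freq="min", name="date_time")
log = pd.DataFrame(
    {"state": pd.array([0, 0, 0, 1, 1, 1], dtype="UInt8")},  # nullable unsigned int
    index=index,
)

# Average every two-minute bin; the middle bin mixes a 0 and a 1.
downsampled = log.resample(dt.timedelta(minutes=2)).mean()
print(downsampled)
```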

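The resampling works well, except that the value 0.5 makes no sense, since state can only be 0 or 1. For the same reason, it would make sense to use category as the dtype for this column. In that case, however, the resampling does not work, because mean() is only applicable to numeric data.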

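That makes perfect sense. Still, I can imagine a down-sampling and "averaging" procedure for categorical data: as long as all the data in a group are identical, the result is that particular value; otherwise the result is <NA>. For example: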

categorical_average(['apple', 'apple']) -> 'apple'
categorical_average(['pear', 'pear']) -> 'pear'
categorical_average(['apple', 'pear']) -> <NA>
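
The pseudocode above could be sketched in plain pandas roughly as follows. categorical_average is the hypothetical name used in the question, not an existing pandas function, and returning pd.NA for an empty input is an assumption made here.

```python
import pandas as pd


def categorical_average(values):
    """Return the value shared by all items, or pd.NA if they differ (or if empty)."""
    series = pd.Series(values, dtype="object")
    # nunique(dropna=False) counts missing values as a distinct value too.
    if series.nunique(dropna=False) == 1:
        return series.iat[0]
    return pd.NA


print(categorical_average(["apple", "apple"]))
print(categorical_average(["pear", "pear"]))
print(categorical_average(["apple", "pear"]))
```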

For the DataFrame log shown above, with state as a category column, this would result in:

>>> log.resample(dt.timedelta(minutes=2)).probably_some_other_method()
                         state
date_time                       
2020-01-01 00:00:00          0
2020-01-01 00:02:00       <NA>
2020-01-01 00:04:00          1

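By the way, I do this via resample(...).mean() because there are many other (numeric) columns for which it makes perfect sense; I just did not mention them explicitly here, for simplicity.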

Answer 1

Use a custom function that tests, with an if-else, whether the group contains only a single unique value:

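import datetime as dt
import pandas as pd

# Keep the group's value only if all of its entries are identical, otherwise return <NA>
f = lambda x: x.iat[0] if x.nunique(dropna=False) == 1 else pd.NA
a = log.resample(dt.timedelta(minutes=2)).agg({'state': f})
print(a)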
                    state
date_time                
2020-01-01 00:00:00     0
2020-01-01 00:02:00  <NA>
2020-01-01 00:04:00     1
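
Since the question mentions other numeric columns that should still be averaged, the same idea can be combined with mean in a single agg call. The following is a sketch under assumptions: the value column and the helper name single_value_or_na are hypothetical, and the frame is rebuilt here only so the example runs on its own.

```python
import datetime as dt

import pandas as pd

index = pd.date_range("2020-01-01 00:00:00", periods=6, freq="min", name="date_time")
log = pd.DataFrame(
    {
        "state": pd.Categorical([0, 0, 0, 1, 1, 1]),
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # hypothetical numeric column
    },
    index=index,
)


def single_value_or_na(group):
    """Return the group's only distinct value, or pd.NA if the group is mixed or empty."""
    return group.iat[0] if group.nunique(dropna=False) == 1 else pd.NA


# Per-column aggregation: custom rule for the categorical column, mean for the numeric one.
result = log.resample(dt.timedelta(minutes=2)).agg(
    {"state": single_value_or_na, "value": "mean"}
)
print(result)
```

Using nunique rather than comparing len(x) with len(set(x)) keeps the rule valid for bins of any size, including bins that hold a single row.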


python

pandas

categorical-data

pandas-resample