Resampling of categorical column in pandas data frame

I need some help with this problem; I have been trying a few things but nothing has worked. I have a pandas data frame (shown at the end of the question). The data arrives at irregular intervals (the frequency is not fixed), and I want to resample it at a fixed frequency, for example every 1 minute. If the column is a float, then taking the mean every 1 minute works fine:

df1.resample('1T',base = 1).mean()

But since the data is categorical, the mean makes no sense; I also tried sum, which is equally meaningless for this kind of sampling. What I essentially need is the most frequent value (the value with the maximum count) of the column within each 1-minute bin. For that I apply the following custom function to the values that fall within each 1-minute bin when resampling:

    def custome_mod(arraylike):
        vals, counts = np.unique(arraylike, return_counts=True)
        return np.argwhere(counts == np.max(counts))

df1.resample('1T',base = 1).apply(custome_mod) 

The output I expect is a data frame with one row per minute, holding the most frequent value among the data points that fall within that minute. For some reason this does not work and gives me an error. I have been trying to debug it for a long time. Could someone provide some inputs / a code check?

The error I get is the following:

ValueError: zero-size array to reduction operation maximum which has no identity

ValueError                                Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    264             try:
--> 265                 return self._python_agg_general(func, *args, **kwargs)
    266             except (ValueError, KeyError):

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
    935 
--> 936             result, counts = self.grouper.agg_series(obj, f)
    937             assert result is not None

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
    862         grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863         return grouper.get_result()
    864 

pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()

pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()

pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()

ValueError: Function does not reduce

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
    358                 # Check if the function is reducing or not.
--> 359                 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
    360             else:

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   1171             try:
-> 1172                 result[item] = colg.aggregate(func, *args, **kwargs)
   1173 

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    268                 #  see see test_groupby.test_basic
--> 269                 result = self._aggregate_named(func, *args, **kwargs)
    270 

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
    453             if isinstance(output, (Series, Index, np.ndarray)):
--> 454                 raise ValueError("Must produce aggregated value")
    455             result[name] = output

ValueError: Must produce aggregated value

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)

/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
    283             how = func
    284             grouper = None
--> 285             result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
    286 
    287         result = self._apply_loffset(result)

/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
    380             # we have a non-reducing function
    381             # try to evaluate
--> 382             result = grouped.apply(how, *args, **kwargs)
    383 
    384         result = self._apply_loffset(result)

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    733         with option_context("mode.chained_assignment", None):
    734             try:
--> 735                 result = self._python_apply_general(f)
    736             except TypeError:
    737                 # gh-20949

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    749 
    750     def _python_apply_general(self, f):
--> 751         keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
    752 
    753         return self._wrap_applied_output(

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    204             # group might be modified
    205             group_axes = group.axes
--> 206             res = f(group)
    207             if not _is_indexed_like(res, group_axes):
    208                 mutated = True

<command-36984414005658> in custome_mod(arraylike)
      1 def custome_mod(arraylike):
      2   vals, counts = np.unique(arraylike, return_counts=True)
----> 3   return (np.argwhere(counts == np.max(counts)))

<__array_function__ internals> in amax(*args, **kwargs)

/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
   2666     """
   2667     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668                           keepdims=keepdims, initial=initial, where=where)
   2669 
   2670 

/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     88                 return reduction(axis=axis, out=out, **passkwargs)
     89 
---> 90     return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
     91 
     92 

ValueError: zero-size array to reduction operation maximum which has no identity

Sample data frame and expected output

Sample df

6/3/2021 1:19:05    0
6/3/2021 1:19:15    1
6/3/2021 1:19:26    1
6/3/2021 1:19:38    1
6/3/2021 1:20:06    0
6/3/2021 1:20:16    0
6/3/2021 1:20:36    1
6/3/2021 1:21:09    1
6/3/2021 1:21:19    1
6/3/2021 1:21:45    0
6/4/2021 1:19:15    0
6/4/2021 1:19:25    0
6/4/2021 1:19:36    0
6/4/2021 1:19:48    1
6/4/2021 1:22:26    1
6/4/2021 1:22:36    0
6/4/2021 1:22:46    0
6/5/2021 2:20:19    0
6/5/2021 2:20:21    1
6/5/2021 2:20:40    0

Expected output

6/3/2021 1:19   1
6/3/2021 1:20   0
6/3/2021 1:21   1
6/4/2021 1:19   0
6/4/2021 1:22   0
6/5/2021 2:20   0

Note that the original data frame has data at irregular intervals (sometimes every 5 seconds, sometimes every 20 seconds, and so on). The expected output is shown above: I need one row per minute (resampled to 1 minute instead of the original irregular seconds), and the categorical column should hold the most frequent value within that minute. For example, minute 19 has four data points in the original data and the most frequent value is 1. Similarly, minute 20 has three data points and the most frequent value is 0, and minute 21 has three data points and the most frequent value is 1. The data I am working with has 20 million rows, so this is also an effort to reduce its size. Hope this helps.
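To make the computation concrete, here is a minimal sketch (not the full solution) that reproduces the first row of the expected output from the four values recorded within the 6/3/2021 1:19 minute; the name minute_values is only for illustration:

import pandas as pd

# the four data points that fall within 6/3/2021 1:19
minute_values = pd.Series([0, 1, 1, 1])

# most frequent value in that minute -> 1, as in the expected output
print(minute_values.value_counts().idxmax())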

After getting the expected output, I will group by the column and count. Since each row represents one minute, this count tells me how long (in minutes) the column was 1.
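As a sketch of that follow-up step, assuming the per-minute result is stored in a Series named out (a hypothetical name here), counting how many minutes the column was 1 could look like this:

# each row of `out` represents one minute, so counting rows equal to 1
# gives the number of minutes the column was at 1
minutes_at_1 = (out == 1).sum()

# or, to see the count for every category value at once
out.value_counts()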

Update after the edit

out = df.set_index(pd.to_datetime(df.index).floor('T')) \
        .groupby(level=0)['category'] \
        .apply(lambda x: x.value_counts().idxmax())
print(out)

# Output
2021-06-03 01:19:00    1
2021-06-03 01:20:00    0
2021-06-03 01:21:00    1
2021-06-04 01:19:00    0
2021-06-04 01:22:00    0
2021-06-05 02:20:00    0
Name: category, dtype: int64
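A note on the design choice: flooring the index to the minute and grouping on it only creates groups for minutes that actually contain data, whereas resample('1T') also emits the empty minutes in between (for this sample, everything from 6/3 1:22 through 6/4 1:18, and so on). Those empty bins are most likely what fed zero-size arrays into np.max inside custome_mod and raised the original error. If you do want the regular 1-minute grid including the empty minutes, a guarded aggregation over resample should work; this is a sketch assuming the column is named category, with empty bins returned as NaN:

import numpy as np
import pandas as pd

out2 = (df.set_index(pd.to_datetime(df.index))['category']
          .resample('1T')
          .agg(lambda s: s.value_counts().idxmax() if len(s) else np.nan))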

Old answer

# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
            category
2021-06-03         6
2021-06-04         2
2021-06-06         1
2021-06-08         1
2021-06-25         1
2021-06-29         6
2021-06-30         3

# OR

>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
            category
2021-06-03         2
2021-06-04         0
2021-06-06         1
2021-06-08         1
2021-06-25         0
2021-06-29         3
2021-06-30         1
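Note on the old answer: count gives the number of observations per day, and because the category column only takes the values 0 and 1, sum gives the number of rows equal to 1 on each day; neither produces the per-minute most frequent value, which is what the updated answer above does.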