Resampling of categorical column in pandas data frame
I need some help with this. I have been trying a few things but nothing has worked. I have a pandas data frame as shown below (at the end).
The data comes in at irregular intervals (the frequency is not fixed). I want to resample the data at a fixed frequency, e.g. every 1 minute. If the column were a float, taking the mean per minute works fine:
df1.resample('1T', base=1).mean()
But since the data is categorical, the mean is meaningless; I also tried sum, which does not make sense for this kind of downsampling either. What I essentially need is the value with the maximum count within each 1-minute bin. For that, I applied the following custom function to the values that fall inside each 1-minute window during resampling:
def custome_mod(arraylike):
    vals, counts = np.unique(arraylike, return_counts=True)
    return (np.argwhere(counts == np.max(counts)))

df1.resample('1T', base=1).apply(custome_mod)
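To make the intent concrete, this is roughly the per-bin computation I am after (a small sketch on made-up values; window stands for the observations that fall inside one 1-minute bin):

import numpy as np

# made-up values for a single 1-minute window
window = np.array([0, 1, 1, 1])
vals, counts = np.unique(window, return_counts=True)
print(vals[np.argmax(counts)])   # -> 1, the most frequent value in this window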
The output I expect is a data frame at 1-minute frequency, holding the value with the maximum count among the observations that fall in each minute.
For some reason it does not seem to work and gives me an error. I have been trying to debug it for a long time. Can someone provide some input or check the code?
The error I get is the following:
ValueError: zero-size array to reduction operation maximum which has no identity
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
264 try:
--> 265 return self._python_agg_general(func, *args, **kwargs)
266 except (ValueError, KeyError):
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
935
--> 936 result, counts = self.grouper.agg_series(obj, f)
937 assert result is not None
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
862 grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863 return grouper.get_result()
864
pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()
pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()
ValueError: Function does not reduce
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
358 # Check if the function is reducing or not.
--> 359 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
360 else:
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
1171 try:
-> 1172 result[item] = colg.aggregate(func, *args, **kwargs)
1173
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
268 # see see test_groupby.test_basic
--> 269 result = self._aggregate_named(func, *args, **kwargs)
270
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
453 if isinstance(output, (Series, Index, np.ndarray)):
--> 454 raise ValueError("Must produce aggregated value")
455 result[name] = output
ValueError: Must produce aggregated value
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
283 how = func
284 grouper = None
--> 285 result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
286
287 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
380 # we have a non-reducing function
381 # try to evaluate
--> 382 result = grouped.apply(how, *args, **kwargs)
383
384 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
733 with option_context("mode.chained_assignment", None):
734 try:
--> 735 result = self._python_apply_general(f)
736 except TypeError:
737 # gh-20949
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
749
750 def _python_apply_general(self, f):
--> 751 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
752
753 return self._wrap_applied_output(
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
204 # group might be modified
205 group_axes = group.axes
--> 206 res = f(group)
207 if not _is_indexed_like(res, group_axes):
208 mutated = True
<command-36984414005658> in custome_mod(arraylike)
1 def custome_mod(arraylike):
2 vals, counts = np.unique(arraylike, return_counts=True)
----> 3 return (np.argwhere(counts == np.max(counts)))
<__array_function__ internals> in amax(*args, **kwargs)
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
2666 """
2667 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668 keepdims=keepdims, initial=initial, where=where)
2669
2670
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
88 return reduction(axis=axis, out=out, **passkwargs)
89
---> 90 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
91
92
ValueError: zero-size array to reduction operation maximum which has no identity
Sample data frame and expected output
Sample df (datetime index, one categorical column):
6/3/2021 1:19:05 0
6/3/2021 1:19:15 1
6/3/2021 1:19:26 1
6/3/2021 1:19:38 1
6/3/2021 1:20:06 0
6/3/2021 1:20:16 0
6/3/2021 1:20:36 1
6/3/2021 1:21:09 1
6/3/2021 1:21:19 1
6/3/2021 1:21:45 0
6/4/2021 1:19:15 0
6/4/2021 1:19:25 0
6/4/2021 1:19:36 0
6/4/2021 1:19:48 1
6/4/2021 1:22:26 1
6/4/2021 1:22:36 0
6/4/2021 1:22:46 0
6/5/2021 2:20:19 0
6/5/2021 2:20:21 1
6/5/2021 2:20:40 0
Expected output:
6/3/2021 1:19 1
6/3/2021 1:20 0
6/3/2021 1:21 1
6/4/2021 1:19 0
6/4/2021 1:22 0
6/5/2021 2:20 0
Please note that the original data frame has data at an irregular frequency (sometimes every 5 seconds, every 20 seconds, and so on). The expected output is shown above: I need the data per 1 minute (resampled to every minute instead of the original irregular seconds), and the categorical column should hold the most frequent value within that minute. For example, in minute 19 of the original data there are four data points and the most frequent value among them is 1; in minute 20 there are three data points and the most frequent value is 0; likewise, in minute 21 there are three data points and the most frequent value is 1. The data I am working with also has 20 million rows; this is an effort to reduce the size of the data. Hope that helps.
After getting the expected output, I will do a groupby on the column and count. Since that count will be in minutes, I will be able to tell how long (in time) this column was 1.
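To illustrate that follow-up step, a sketch assuming out is the per-minute series shown in the expected output above:

# 'out' is assumed to be the resampled per-minute series of 0/1 values
minutes_per_value = out.value_counts()
# e.g. minutes_per_value[1] is the number of minutes during which the column was 1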
Update after the edit:
out = df.set_index(pd.to_datetime(df.index).floor('T')) \
        .groupby(level=0)['category'] \
        .apply(lambda x: x.value_counts().idxmax())
print(out)
# Output
2021-06-03 01:19:00 1
2021-06-03 01:20:00 0
2021-06-03 01:21:00 1
2021-06-04 01:19:00 0
2021-06-04 01:22:00 0
2021-06-05 02:20:00 0
Name: category, dtype: int64
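If you prefer to stay with resample, a variant along these lines might also work (a sketch, not tested on your data). Unlike the floor/groupby approach above, resample emits a bin for every minute in the overall range, so the empty bins created by the gaps between days have to be handled and dropped; that is also why your custom function hit the zero-size-array error.

# sketch: same per-minute mode, but via resample; empty bins are skipped and dropped
out = (
    df['category']
      .resample('1T')
      .agg(lambda x: x.value_counts().idxmax() if len(x) else None)
      .dropna()
)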
Old answer
# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
category
2021-06-03 6
2021-06-04 2
2021-06-06 1
2021-06-08 1
2021-06-25 1
2021-06-29 6
2021-06-30 3
# OR
>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
category
2021-06-03 2
2021-06-04 0
2021-06-06 1
2021-06-08 1
2021-06-25 0
2021-06-29 3
2021-06-30 1