Output both bin and label columns in pandas binning
I have a DataFrame column that I want to bin, for example:
df.head()
X
4.6
2.5
3.1
1.7
I would like one column for the bin range and one column for the label, like this:
df.head()
X bin label
4.6 (4,5] 5
2.5 (2,3] 3
3.1 (3,4] 4
1.7 (1,2] 2
Apparently, setting the labels parameter as follows yields only a column of bin labels, and the ranges are no longer available.
df['bin'] = df.X.apply(pd.cut, labels=np.arange(5))
Is there a more elegant solution than running pd.cut twice to get the two columns?
Thanks
If you let pd.cut set the bin edges dynamically, you can use the retbins flag. From the pd.cut documentation:
retbins: bool, default False
Whether to return the bins or not. Useful when bins is provided as a scalar.
With this flag set, pd.cut returns a second result:
bins: numpy.ndarray or IntervalIndex.
The computed or specified bins. Only returned when
retbins=True. For scalar or sequence bins, this is
an ndarray with the computed bins. If set
duplicates=drop, bins will drop non-unique bin. For
an IntervalIndex bins, this is equal to bins.
You can use it to assign the bin edges to the frame:
assignments, edges = pd.cut(df.X, bins=5, labels=False, retbins=True)
df['label'] = assignments
df['bin_floor'] = edges[assignments]
df['bin_ceil'] = edges[assignments + 1]
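To make this concrete, here is a minimal sketch on the four values from the question. The integer edges 1 to 5 are an assumption chosen only to match the table in the question (the snippet above uses bins=5 instead), and building the interval column with pd.IntervalIndex.from_arrays is one possible way to get a range column, not something the original answer prescribes.

```python
import pandas as pd

df = pd.DataFrame({"X": [4.6, 2.5, 3.1, 1.7]})

# Assumption: explicit integer edges, picked to mirror the question's table.
codes, edges = pd.cut(
    df["X"], bins=[1, 2, 3, 4, 5], labels=False, retbins=True
)
idx = codes.to_numpy()  # integer position of each row's bin

# Lower and upper edge of the bin each row fell into.
df["bin_floor"] = edges[idx]
df["bin_ceil"] = edges[idx + 1]

# One interval-valued column (closed on the right, like pd.cut's default).
df["bin"] = pd.IntervalIndex.from_arrays(
    df["bin_floor"].to_numpy(), df["bin_ceil"].to_numpy()
)

# Label each row by the upper edge of its bin, as in the question.
df["label"] = df["bin_ceil"].astype(int)

print(df[["X", "bin", "label"]])
```

This calls pd.cut only once. Note that labels=False yields NaN for values outside the edges, in which case the integer indexing above would need extra handling.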
Your comment suggests you want to use this inside a groupby operation. In that case, you can wrap the above in a function:
def assign_dynamic_bin_ids_and_labels(
    df,
    value_col,
    nbins,
    label_col='label',
    bin_floor_col='bin_floor',
    bin_ceil_col='bin_ceil',
):
    # Use the nbins argument rather than a hard-coded bin count
    assignments, edges = pd.cut(
        df[value_col], bins=nbins, labels=False, retbins=True
    )
    df[label_col] = assignments
    # Look up the lower and upper edge of each row's bin
    df[bin_floor_col] = edges[assignments]
    df[bin_ceil_col] = edges[assignments + 1]
    return df
df.groupby('id').apply(assign_dynamic_bin_ids_and_labels, 'X', 5)
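How groupby.apply treats the grouping column has differed between pandas releases, so if the call above warns or drops the id column for you, an explicit loop over the groups is one alternative. The following is only a sketch: the helper name bin_within_group, the sample ids, and the choice of two bins per group are all made up for illustration.

```python
import pandas as pd


def bin_within_group(group, value_col, nbins):
    """Hypothetical helper: bin one group's values with its own edges."""
    group = group.copy()  # avoid mutating the caller's frame
    codes, edges = pd.cut(
        group[value_col], bins=nbins, labels=False, retbins=True
    )
    idx = codes.to_numpy()
    group["label"] = idx
    group["bin_floor"] = edges[idx]
    group["bin_ceil"] = edges[idx + 1]
    return group


# Made-up sample data with an 'id' column to group on.
df = pd.DataFrame(
    {
        "id": ["a", "a", "a", "b", "b", "b"],
        "X": [4.6, 2.5, 3.1, 1.7, 0.4, 9.9],
    }
)

# Bin each group separately, then stitch the pieces back together.
pieces = [bin_within_group(g, "X", 2) for _, g in df.groupby("id")]
result = pd.concat(pieces)
print(result)
```

Each group gets its own dynamically computed edges, so the same label can correspond to different ranges in different groups.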