Pandas pd.cut bins with lower bound issue

Question

With the sample data below, I am trying to bin the years into 5-year intervals starting at 1980, using this code with pd.cut:

bins = list(range(1980, 2025, 4))    
final_usage_data['bins'] = pd.cut(final_usage_data.index, bins=bins, include_lowest=True)

In the resulting dataframe, the bin for the 1980 row starts at a value lower than the one I want:

index   col1   col2    col3     bin_col                         
1980    1.0    30.0    980      **(1979.999,** 1984.0]
1981    1.0    34.0    1202     (1979.999, 1984.0]
1982    2.0    35.0    1428     (1979.999, 1984.0]
1983    2.0    37.0    2374     (1979.999, 1984.0]
1984    2.0    46.0    2890     (1979.999, 1984.0]
1985    3.0    63.0    4011     (1984.0, 1988.0]

And removing the include_lowest=True bit leaves 1980 with no bin at all:

index   col1   col2    col3     bin_col                         
1980    1.0    30.0    980      NaN
1981    1.0    34.0    1202     (1980.0, 1984.0]
1982    2.0    35.0    1428     (1980.0, 1984.0]
1983    2.0    37.0    2374     (1980.0, 1984.0]
1984    2.0    46.0    2890     (1980.0, 1984.0]
1985    3.0    63.0    4011     (1984.0, 1988.0]
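For context, both outputs follow from pd.cut building right-closed intervals by default: 1980 sits on the open left edge of (1980, 1984], so it is either dropped (NaN) or, with include_lowest=True, captured by nudging the first left edge slightly below 1980. The sketch below reproduces this on a plain list of years; the variable names are illustrative and not taken from the original dataframe.

```python
import pandas as pd

# Illustrative years only; the real data lives in final_usage_data.index.
years = [1980, 1981, 1982, 1983, 1984, 1985]
bins = list(range(1980, 2025, 4))

# Default: intervals are closed on the right, so 1980 matches no bin.
without_lowest = pd.Series(pd.cut(years, bins=bins))

# include_lowest=True: the first interval's left edge is shifted just below 1980.
with_lowest = pd.Series(pd.cut(years, bins=bins, include_lowest=True))

print(without_lowest.iloc[0])  # missing value for 1980
print(with_lowest.iloc[0])     # interval whose left edge is slightly under 1980
```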

So the quiz question here is: how can pd.cut be used to get this ideal result?

index   col1   col2    col3     bin_col                         
1980    1.0    30.0    980      **(1980.0, 1984.0]**
1981    1.0    34.0    1202     (1980.0, 1984.0]
1982    2.0    35.0    1428     (1980.0, 1984.0]
1983    2.0    37.0    2374     (1980.0, 1984.0]
1984    2.0    46.0    2890     (1980.0, 1984.0]
1985    3.0    63.0    4011     (1984.0, 1988.0]

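I followed the documentation and several examples, and the code above is the best result I got. I am about to start manually converting the bin column values to strings and editing the "1979.999" part to read "1980" so that the bins make sense to humans. But there must be a better way. Hence my question.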

Answer 1

This is a bit tricky, but you can use labels.

labels = ['(%d, %d]'%(bins[i], bins[i+1]) for i in range(len(bins)-1)]
final_usage_data['bins'] = pd.cut(final_usage_data.index, bins=bins, labels=labels, include_lowest=True)
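As a self-contained sketch of the same idea, the snippet below builds a small stand-in dataframe (the `df` name and its single `col1` column are assumptions for illustration, not the original `final_usage_data`) and passes one string label per interval to pd.cut:

```python
import pandas as pd

# Stand-in for final_usage_data: years as the index, one illustrative column.
df = pd.DataFrame(
    {"col1": [1.0, 1.0, 2.0, 2.0, 2.0, 3.0]},
    index=[1980, 1981, 1982, 1983, 1984, 1985],
)

bins = list(range(1980, 2025, 4))

# One label per interval, i.e. len(bins) - 1 labels built from adjacent edges.
labels = [f"({lo}, {hi}]" for lo, hi in zip(bins[:-1], bins[1:])]

# include_lowest=True still puts 1980 in the first bin; labels only change the display.
df["bins"] = pd.cut(df.index, bins=bins, labels=labels, include_lowest=True)
print(df)
```

Note that with labels the column holds string categories rather than Interval objects, and the first label reads "(1980, 1984]" even though 1980 is included in that bin. If that matters, the first label could be written as "[1980, 1984]" instead.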

python

binning

pandas