Column values still shown after .isin()

Question

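As requested, here is a minimal reproducible example of the .isin() problem: values that are not passed to .isin() are not dropped, they are just reported with a count of zero: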

import pandas as pd

df_example = pd.DataFrame({'Requesting as': {0: 'Employee', 1: 'Ex-      Employee', 2: 'Employee', 3: 'Employee', 4: 'Ex-Employee', 5: 'Employee', 6: 'Employee', 7: 'Employee', 8: 'Ex-Employee', 9: 'Ex-Employee', 10: 'Employee', 11: 'Employee', 12: 'Ex-Employee', 13: 'Ex-Employee', 14: 'Employee', 15: 'Employee', 16: 'Employee', 17: 'Ex-Employee', 18: 'Employee', 19: 'Employee', 20: 'Ex-Employee', 21: 'Employee', 22: 'Employee', 23: 'Ex-Employee', 24: 'Employee', 25: 'Employee', 26: 'Ex-Employee', 27: 'Employee', 28: 'Employee', 29: 'Ex-Employee', 30: 'Employee', 31: 'Employee', 32: 'Ex-Employee', 33: 'Employee', 34: 'Employee', 35: 'Ex-Employee', 36: 'Employee', 37: 'Employee', 38: 'Ex-Employee', 39: 'Employee', 40: 'Employee'}, 'Years of service': {0: -0.4, 1: -0.3, 2: -0.2, 3: 1.0, 4: 1.0, 5: 1.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0, 10: 3.0, 11: 3.0, 12: 3.0, 13: 4.0, 14: 4.0, 15: 4.0, 16: 5.0, 17: 5.0, 18: 5.0, 19: 5.0, 20: 6.0, 21: 6.0, 22: 6.0, 23: 11.0, 24: 11.0, 25: 11.0, 26: 16.0, 27: 17.0, 28: 18.0, 29: 21.0, 30: 22.0, 31: 23.0, 32: 26.0, 33: 27.0, 34: 28.0, 35: 31.0, 36: 32.0, 37: 33.0, 38: 35.0, 39: 36.0, 40: 37.0}, 'yos_bins': {0: 0, 1: 0, 2: 0, 3: '0-1', 4: '0-1', 5: '0-1', 6: '1-2', 7: '1-2', 8: '1-2', 9: '1-2', 10: '2-3', 11: '2-3', 12: '2-3', 13: '3-4', 14: '3-4', 15: '3-4', 16: '4-5', 17: '4-5', 18: '4-5', 19: '4-5', 20: '5-6', 21: '5-6', 22: '5-6', 23: '10-15', 24: '10-15', 25: '10-15', 26: '15-20', 27: '15-20', 28: '15-20', 29: '20-40', 30: '20-40', 31: '20-40', 32: '20-40', 33: '20-40', 34: '20-40', 35: '20-40', 36: '20-40', 37: '20-40', 38: '20-40', 39: '20-40', 40: '20-40'}})


cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df_example['yos_bins'] = pd.cut(df_example['Years of service'], bins=cut_bins, labels=cut_labels)

print(df_example['yos_bins'].value_counts())
print(len(df_example['yos_bins']))
print(len(df_example))
print(df_example['yos_bins'].value_counts())

test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])]
print('test dataframe:\n',test)
print('\n')
print('test value counts of yos_bins:\n',     test['yos_bins'].value_counts())
print('\n')
dic_test = test.to_dict()
print(dic_test)
print('\n')
print(test.value_counts())

I created categorical bins for the 'Years of service' column:

cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df['yos_bins'] = pd.cut(df['Years of service'], bins=cut_bins, labels=cut_labels)
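For context, here is a minimal sketch of what that pd.cut call produces, using a small made-up Series (`years` is a hypothetical stand-in for the real column). It suggests two things worth keeping in mind: the result has a categorical dtype that carries all ten labels whether or not they occur, and values outside the bin edges, such as negative years, become NaN.

```python
import pandas as pd

# Hypothetical sample values; the real column is df['Years of service'].
years = pd.Series([-0.4, 1.0, 2.0, 3.0, 11.0])

cut_labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)

# With the default right=True, each bin is open on the left and closed on the right.
bins = pd.cut(years, bins=cut_bins, labels=cut_labels)

print(bins.dtype)                  # categorical dtype
print(list(bins.cat.categories))   # all ten labels, used or not
print(bins.isna().sum())           # values outside the bin edges become NaN
```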

I then applied .isin() to the dataframe column named 'yos_bins' in order to filter for a selection of column values.

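The column I use for slicing is called 'yos_bins' (that is, binned years of service). I only want to select 3 ranges (0-1, 1-2, 2-3 years), but the column obviously contains more ranges than that.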

To my surprise, when I apply value_counts() I still get all of the yos_bins values from the df dataframe (but with counts of 0).

test.yos_bins.value_counts()

The result lists all ten bin labels, seven of them with a count of 0.

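This is not intended: all bins other than the 3 passed to isin() should have been dropped. The resulting problem is that the 0 values show up in sns.countplots, so I end up with unwanted columns that have a count of zero.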
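The behaviour described here can be reproduced on a tiny categorical Series, which suggests it comes from the categorical dtype rather than from .isin() itself. The sketch below uses hypothetical data (`s`, `kept`); the rows are filtered correctly, but the category list travels with the column, so value_counts() still reports the unused label.

```python
import pandas as pd

# Hypothetical three-row column with a categorical dtype, as pd.cut would create.
s = pd.Series(["0-1", "1-2", "10-15"], dtype="category")

# Row filtering works: only the two matching rows remain.
kept = s[s.isin(["0-1", "1-2"])]

# The category list is unchanged, so the filtered-out label is still counted (as 0).
counts = kept.value_counts()
print(len(kept))
print(list(kept.cat.categories))
print(counts)
```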

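When I save the df with to_excel(), all of the '10-15' value fields show a "Text date with 2-digit year" error. I have not loaded that dataframe back into Python, so I am not sure whether this could be causing the problem?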

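Does anyone know how I can create a test dataframe that contains only the 3 yos_bins values, instead of showing all yos_bins values with some of them at zero?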

Answer 1

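An ugly workaround, because in my experience numpy and pandas fall short when it comes to element-wise "is in" checks. I do the comparison manually with numpy arrays instead.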

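import numpy as np

# Compare every value in the column against every selected label via broadcasting.
yos_bins = np.array(df["yos_bins"])
yos_bins_sel = np.array(["0-1", "1-2", "2-3"])
mask = (yos_bins[:, None] == yos_bins_sel[None, :]).any(1)
df[mask]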
   Requesting as  Years of service yos_bins
3       Employee               1.0      0-1
4    Ex-Employee               1.0      0-1
5       Employee               1.0      0-1
6       Employee               2.0      1-2
7       Employee               2.0      1-2
8    Ex-Employee               2.0      1-2
9    Ex-Employee               2.0      1-2
10      Employee               3.0      2-3
11      Employee               3.0      2-3
12   Ex-Employee               3.0      2-3

Explanation (using x for yos_bins and y for yos_bins_sel)

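(x[:, None] == y[None, :]).any(1) is the core of it. x[:, None] reshapes x from (n,) to (n, 1), and y[None, :] reshapes y from (m,) to (1, m). Comparing them with == broadcasts to an element-wise boolean array of shape (n, m). We want an array of shape (n,), so we apply .any(1), which collapses the second dimension to True wherever at least one of the booleans is True (that is, the element is in the yos_bins_sel array). You end up with a boolean array that can be used to mask the original dataframe. Replace x with the array holding the values to test and y with the array of values that x should be contained in, and you can do this for any dataset.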
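As a rough check of the broadcasting steps, the sketch below runs them on a small hypothetical dataframe (five made-up rows, not the asker's data). On this sample the mask selects the same rows as .isin(), and the masked column still carries every category. One way to drop the zero-count bins afterwards, assuming the column has a categorical dtype as pd.cut produces, is Series.cat.remove_unused_categories().

```python
import numpy as np
import pandas as pd

# Hypothetical data: a categorical column similar to the one pd.cut creates.
df = pd.DataFrame({"yos_bins": pd.Categorical(["0-1", "1-2", "2-3", "10-15", "20-40"])})

x = np.array(df["yos_bins"])           # shape (n,)
y = np.array(["0-1", "1-2", "2-3"])    # shape (m,)

# (n, 1) == (1, m) broadcasts to (n, m); any(axis=1) reduces it back to (n,).
mask = (x[:, None] == y[None, :]).any(axis=1)

# On this sample the mask matches what .isin() returns.
same_as_isin = bool((mask == df["yos_bins"].isin(["0-1", "1-2", "2-3"]).to_numpy()).all())

# Filtering rows leaves the category list untouched, so trim it explicitly.
test = df[mask].copy()
untrimmed_counts = test["yos_bins"].value_counts()
test["yos_bins"] = test["yos_bins"].cat.remove_unused_categories()
trimmed_counts = test["yos_bins"].value_counts()

print(same_as_isin)
print(len(untrimmed_counts), len(trimmed_counts))
```

If the zero counts only matter for display, trimming the categories on a copy like this leaves the original dataframe unchanged.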
