Pandas dataframes unique() function produces strange object

Question

So I have a column in my df that is filled with strings (postcodes). I know there are 8147 unique values in my dataset of about 100,000 samples.

I binary-encoded the column and it only created 8 columns, which cannot be right, because 2^8 covers only 256 unique values. It should be 13 columns.
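
The column-count arithmetic above can be checked with a minimal sketch. The helper name bits_needed is hypothetical, and a real binary encoder may reserve extra codes (for example for unknown values), so treat the result as a lower bound rather than as the exact behaviour of any particular library:

```python
def bits_needed(n_values: int) -> int:
    """Smallest number of binary columns that can tell n_values apart."""
    if n_values < 1:
        raise ValueError("n_values must be positive")
    # (n - 1).bit_length() is ceil(log2(n)) without floating-point rounding
    return max(1, (n_values - 1).bit_length())

print(bits_needed(256))   # 8
print(bits_needed(8147))  # 13
```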

Then I ran this code on my column to find the error:

X_train["Postcode"].unique()

The result is this strange thing:

['47623', '26506', '41179', '41063', '42283', ..., '01471', '86922', '47624', '86923', '86941']
Length: 143
Categories (8147, object): ['01067', '01069', '01097', '01099', ..., '99991', '99994', '99996', '99998']

type() shows that this is a pandas.core.arrays.categorical.Categorical

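But what on earth is going on here? I am super confused. What does Length mean? The Categories line shows the correct number of unique objects, but when I call len() on the result, it again returns 143. 143 seems to matter, because it appears to be the value the binary encoder used. But how can something with 8147 objects have a length of 143?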

This is also a column of the df, i.e. a pandas Series. Maybe you can help me.

Answer 1

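If I understand your question correctly, this is because the category definition itself has 8147 entries, but the actual series/df column only has 143 entries (or 143 unique entries).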

I'll create my own category to illustrate:

import pandas as pd

label_type = pd.api.types.CategoricalDtype(categories=["yes", "no", "maybe"], ordered=False)
label_type

CategoricalDtype(categories=['yes', 'no', 'maybe'], ordered=False)

s = pd.Series(['yes','yes','yes','no','no'], dtype=label_type)

print(s)

0    yes
1    yes
2    yes
3     no
4     no
dtype: category
Categories (3, object): ['yes', 'no', 'maybe']

s.unique()

['yes', 'no']
Categories (3, object): ['yes', 'no', 'maybe']

This shows the two unique entries in the series but still lists 3 categories: 'maybe' remains a valid category even though it never appears in the series.
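
The same distinction can be read off as plain numbers. A minimal sketch, reusing the toy series from above (the variable names n_present and n_declared are only illustrative):

```python
import pandas as pd

label_type = pd.api.types.CategoricalDtype(categories=["yes", "no", "maybe"], ordered=False)
s = pd.Series(['yes', 'yes', 'yes', 'no', 'no'], dtype=label_type)

n_present = len(s.unique())         # values that actually occur in the data
n_declared = len(s.cat.categories)  # values the categorical dtype allows

print(n_present, n_declared, s.nunique())  # 2 3 2
```

In the question, 143 plays the role of n_present and 8147 the role of n_declared.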

Another way to demonstrate it:

s2 = pd.Series(['yes','yes','yes','no','no','maybe'], dtype="category")
s2

0      yes
1      yes
2      yes
3       no
4       no
5    maybe
dtype: category
Categories (3, object): ['maybe', 'no', 'yes']

s2[:-1]

0    yes
1    yes
2    yes
3     no
4     no
dtype: category
Categories (3, object): ['maybe', 'no', 'yes']

s2[:-1].unique()

['yes', 'no']
Categories (3, object): ['maybe', 'no', 'yes']
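
If the declared categories should match the values that are actually present, one option is to drop the unused ones with Series.cat.remove_unused_categories(). A minimal sketch, assuming the column is a categorical Series as in the examples above; whether a given binary encoder takes its width from the observed values or from the declared categories is not confirmed here:

```python
import pandas as pd

s2 = pd.Series(['yes', 'yes', 'yes', 'no', 'no', 'maybe'], dtype="category")
subset = s2[:-1]  # 'maybe' no longer occurs, but is still a declared category

# Returns a new Series whose dtype keeps only the categories in use
trimmed = subset.cat.remove_unused_categories()

print(len(subset.cat.categories))   # declared categories before trimming
print(len(trimmed.cat.categories))  # declared categories after trimming
print(trimmed.unique())
```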
