Combining Two Pandas Categorical Columns into One Column

Question

I have a Pandas DataFrame with two categorical columns:

import pandas as pd

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category")

   x  y
0  A  L
1  A  L
2  B  M
3  B  M
4  B  M
5  C  N
6  C  M

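I want to combine the two columns and store the result as a new categorical column, with the values separated by " - ". A naive approach is to convert the columns to strings: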

df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))

   x  y      z
0  A  L  A - L
1  A  L  A - L
2  B  M  B - M
3  B  M  B - M
4  B  M  B - M
5  C  N  C - N
6  C  M  C - M

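This works for a small toy example, but I need z to be of category dtype (not string). My real x and y hold categorical strings (88,903 and 39,132 categories respectively) that can be 50-100 characters long, across roughly 500K rows, so converting these columns to strings first makes memory usage explode.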

Is there a more efficient way to get a categorical result without using a lot of memory or taking too long?

Answer 1

You can try this:

import pandas as pd
from itertools import product

# original data
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category")

# extract unique categories
c1 = df.x.cat.categories
c2 = df.y.cat.categories

# make data frame with all possible category combinations
df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])

# create desired column
df_cats = df_cats.assign(grp=df_cats.x.astype('str') + '-' + df_cats.y.astype('str'))

# join this column to the original data
pd.merge(df, df_cats, how="left", left_on=["x", "y"], right_on=["x", "y"])
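
One caveat with the approach above: product(c1, c2) enumerates every possible pair, which for 88,903 x 39,132 categories is about 3.48 billion rows, and the joined grp column is still a plain string column rather than a category. A possible alternative, offered only as a sketch, is to work on the integer category codes and build label strings just for the pairs that actually occur (at most one per row). The helper name combine_categoricals and its sep parameter are made up for illustration, and the sketch assumes the combined labels come out unique (for example, that " - " does not already appear inside the original categories in a way that makes two pairs collide).

```python
import numpy as np
import pandas as pd


def combine_categoricals(a: pd.Series, b: pd.Series, sep: str = " - ") -> pd.Series:
    """Hypothetical helper: combine two categorical Series into one.

    Label strings are built once per observed (a, b) pair, not once per row.
    """
    # Integer codes; cast to int64 so the multiplication below cannot overflow
    codes_a = a.cat.codes.to_numpy(dtype=np.int64)
    codes_b = b.cat.codes.to_numpy(dtype=np.int64)
    cats_a = a.cat.categories
    cats_b = b.cat.categories
    n_b = len(cats_b)

    # A code of -1 means a missing value in either column
    missing = (codes_a < 0) | (codes_b < 0)

    # Encode each pair of codes as a single integer
    pair = codes_a * n_b + codes_b

    # Unique observed pairs, plus each row's position among them
    observed, inverse = np.unique(pair[~missing], return_inverse=True)

    # Build labels only for the observed pairs
    # (assumption: the resulting labels are unique, else from_codes raises)
    labels = [
        f"{ca}{sep}{cb}"
        for ca, cb in zip(cats_a[observed // n_b], cats_b[observed % n_b])
    ]

    # Rows with a missing value keep code -1, i.e. NaN in the result
    new_codes = np.full(len(a), -1, dtype=np.int64)
    new_codes[~missing] = inverse

    combined = pd.Categorical.from_codes(new_codes, categories=labels)
    return pd.Series(combined, index=a.index)


df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category")

df = df.assign(z=combine_categoricals(df["x"], df["y"]))
print(df)
print(df["z"].dtype)
```

The per-row work here is integer arithmetic only, so the long strings are touched once per distinct pair. Whether this is fast enough on the real data would need to be measured.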

dataframe

pandas

categorical-data