How to create a category column and explode that into new rows

I have this very messy dataframe:

id     letter_1     letter_2    letter_3     number_1    number_2
1      abc                                   123         
2      def           ghi          jkl                    456
3                    mno          pqr        789         

Basically, the dataframe I expect is:

id     letter/number    data
1      letter           abc
1      number           123
2      letter           def
2      letter           ghi
2      letter           jkl
2      number           456
3      letter           mno
3      letter           pqr
3      number           789

I figured I would handle the letters first and then the numbers. So I have my dataframe 'data':

import pandas as pd
data = pd.DataFrame({'id':['1','2','3'],'letter_1':['abc','def',''],'letter_2':['','ghi','mno'],'letter_3':['','jkl','pqr'],'number_1':['123','','789'],'number_2':['','456','']})

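1- Create a column in 'Category' format by concatenating the columns 'letter_1', 'letter_2' and 'letter_3'. *Here I ran into the difficulty that empty values do not belong to a category, but I am using: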

data['new_col_category'] = data.apply(lambda row: row['letter_1'] + "," + row['letter_2'] + "," + row['letter_3'], axis=1).astype('category')

2- Explode that column so that each combination becomes a new row:

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = data['new_col_category'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'id': np.repeat(data['id'], lens),
                    'number_1': np.repeat(data['number_1'], lens),
                    'number_2': np.repeat(data['number_2'], lens),
                    'new_col_category': chainer(data['new_col_category'])})

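After that, I thought of creating the 'Letter/Number' column and assigning 'Letter' to every row. Then I would repeat the whole process with the number columns and finally assign data['Letter/Number'] = 'Number'.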

Does that make sense? I think I am missing something. Any help?

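Here is a way to do it with stack. First remove the _n suffix from the column names, then set_index on the id column and mask the cells that contain an empty string, so that they are dropped when the data is stacked. Finally, use reset_index and rename to match the expected output.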

# to keep original data if needed
res = data.copy()
# remove the _n from columns names
res.columns = [c.split('_')[0] for c in res.columns]
res = (
    res.set_index('id')
       .mask(lambda x: x=='')
       .stack()
       .reset_index(name='data')
       .rename(columns={'level_1':'letter/number'})
)
print(res)
#   id letter/number data
# 0  1        letter  abc
# 1  1        number  123
# 2  2        letter  def
# 3  2        letter  ghi
# 4  2        letter  jkl
# 5  2        number  456
# 6  3        letter  mno
# 7  3        letter  pqr
# 8  3        number  789
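
For comparison, the same reshaping can probably also be written with melt. This is only a sketch of an alternative, not the method used above. It filters out the empty strings explicitly, so it does not depend on how stack treats missing values. The name res_melt is made up for this illustration, and the sample data is copied from the question.

```python
import pandas as pd

# sample data from the question
data = pd.DataFrame({
    'id': ['1', '2', '3'],
    'letter_1': ['abc', 'def', ''],
    'letter_2': ['', 'ghi', 'mno'],
    'letter_3': ['', 'jkl', 'pqr'],
    'number_1': ['123', '', '789'],
    'number_2': ['', '456', ''],
})

res_melt = (
    # wide -> long: one row per (id, original column) pair
    data.melt(id_vars='id', var_name='letter/number', value_name='data')
        # keep only the cells that actually hold a value
        .loc[lambda d: d['data'] != '']
        # strip the _n suffix so only 'letter' / 'number' remains
        .assign(**{'letter/number': lambda d: d['letter/number'].str.split('_').str[0]})
        # melt orders rows column by column; a stable sort regroups them by id
        .sort_values('id', kind='mergesort')
        .reset_index(drop=True)
)
print(res_melt)
```

The stable sort (mergesort) matters here: within each id it keeps the original column order, so the letters still come before the numbers.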