Create a list with repeated terms by adding a multiplier index in a Pandas DataFrame

Question

Given a dataframe like this:

import pandas as pd

row1 = ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']
row2 = ['CCC', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA']
col = {'List': [row1, row2]}
df = pd.DataFrame(col)

which results in:

	List
0	['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']
1	['CCC', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA']

I would like to generate the following dataframe:

	List
0	['AAA', 'BBB x 2', 'CCC', 'AAA x 2']
1	['CCC x 2', 'BBB', 'AAA x 3']

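where the column List contains a multiplier suffix indicating how many times the term occurs consecutively in the list.

Could you suggest a pandas instruction that solves this task?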

Answer 1

itertools.groupby does what you need. The custom function below appends the count to the value whenever a group contains more than one item.

from itertools import groupby
import pandas as pd

row1 = ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']
row2 = ['CCC', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA']
col = {'Lists': [row1, row2]}
df = pd.DataFrame(col)

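def count_items(row):
    output = []
    # groupby yields one (key, group) pair per run of consecutive equal items
    for k, d in groupby(row):
        x = list(d)
        if len(x) > 1:
            # append the run length only when the item repeats
            output.append(' x '.join([k, str(len(x))]))
        else:
            output.append(k)

    return output

df['Lists'] = df['Lists'].apply(count_items)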

print(df)

Output

                          Lists
0  [AAA, BBB x 2, CCC, AAA x 2]
1       [CCC x 2, BBB, AAA x 3]
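
For anyone wondering why a plain groupby is enough here: itertools.groupby only merges consecutive equal items, so the second run of 'AAA' stays separate from the first one. A minimal sketch on the first row (the name runs is just for illustration and is not part of the answer above):

```python
from itertools import groupby

row1 = ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']

# groupby yields (key, iterator) pairs, one per run of consecutive equal items
runs = [(key, len(list(group))) for key, group in groupby(row1)]
print(runs)
# [('AAA', 1), ('BBB', 2), ('CCC', 1), ('AAA', 2)]
```

Sorting the list first would merge both 'AAA' runs into one, which is not what the question asks for.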

Answer 2

In your case, you may want to look at explode; after that we create the sub-groups of consecutive values with shift and cumsum.

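# one row per list element; the original row label is kept in the index
s = df.explode('List')
# a run starts wherever a value differs from the previous one; cumsum turns that into a run id
s = s.groupby([s.index, s['List'].shift().ne(s['List']).cumsum()])['List'].agg(['first', 'count'])
# 'x' is the separator here; use ' x ' instead for the spacing shown in the question
out = s['first'] + 'x' + s['count'].astype(str)
# keep the bare term for runs of length 1, then collect back into one list per original row
out = out.mask(s['count'] == 1, s['first']).groupby(level=0).agg(list)
out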
Out[202]: 
0    [AAA, BBBx2, CCC, AAAx2]
1         [CCCx2, BBB, AAAx3]
dtype: object
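
The shift/ne/cumsum step may be easier to follow on a single flat Series. This is only a sketch for illustration; the names vals, starts and run_id are made up here and do not appear in the answer above:

```python
import pandas as pd

vals = pd.Series(['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA'])

# True wherever a value differs from the one before it (the start of a new run)
starts = vals.shift().ne(vals)
# the cumulative sum of the run starts gives every run its own integer id
run_id = starts.cumsum()
print(run_id.tolist())
# [1, 2, 2, 3, 4, 4]
```

Grouping on this id together with the original row index is what keeps a run at the end of one list from merging with an identical term at the start of the next.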

Tags: python, list, data-manipulation, dataframe, pandas