How to aggregate strings of comma-separated items in a column into a list with Pandas groupby()?

I have the following data:

NAME    ETHNICITY_RECAT TOTAL_LENGTH    3LETTER_SUBSTRINGS
joseph  fr              14              jos, ose, sep, eph
ann     en              16              ann
anne    ir              14              ann, nne
tom     en              18              tom
tommy   fr              16              tom, omm, mmy
ann     ir              19              ann
... more rows

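The 3LETTER_SUBSTRINGS values are strings that capture all 3-letter substrings of the NAME variable. I would like to aggregate them into one list per group, where each comma-separated item from each row is appended to the list and treated as a single list item, as follows: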

ETHNICITY_RECAT TOTAL_LENGTH            3LETTER_SUBSTRINGS
                min max mean            <lambda>
fr              2   26  13.22           [jos, ose, sep, eph, tom, omm, mmy, ...]
en              3   24  11.92           [ann, tom, ...]
ir              4   23  12.03           [ann, nne, ann, ...]

I sort of "did" it with the following code:

aggregations = {
    'TOTAL_LENGTH': [min, max, 'mean'], 
    '3LETTER_SUBSTRINGS': lambda x: list(x),
    }

self.df_agg = self.df.groupby('ETHNICITY_RECAT', as_index=False).agg(aggregations)

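The problem is that the whole string "ann, anne" is treated as one list item in the final list, instead of each substring being treated as its own list item, such as "ann", "anne".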

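I would like to see which substrings have the highest frequency, but when I run the following code I get the frequency of the whole string instead of the individual 3-letter substrings: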

from collections import Counter 
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)

I get:

[('jos, ose, sep, eph', 19), ('ann, nee', 5), ...]

Instead of what I want:

[('jos', 19), ('ose', 19), ('sep', 23), ('eph', 19), ('ann', 15), ('nee', 5), ...]

I tried:

'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(', '),

But it says invalid syntax.

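I think most of your code is fine; you have just misread the error: it has nothing to do with string conversion. Each cell of the 3LETTER_SUBSTRING column holds a list/tuple. When you use the lambda x: list(x) function, you create a list of tuples. So there is no need for something like split(",") to convert to a string and back to a list ...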

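Instead, you only need to un-nest the lists when building the new list. Here is a small reproducible example (note that I focus on your tuple/aggregation problem, as I trust you will quickly work out the rest of the code):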

import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr"),("ann","fr")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity, and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})

Using the Counter you specified, I get:

dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1, 'tom': 1})

To get it as a list of tuples, just use the dictionary built-in methods, such as dict.items().
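Since the question asks for the highest frequencies, Counter.most_common() may be more convenient than dict.items(), because it returns the (item, count) tuples already sorted by count. A minimal sketch, where the substrings list is a made-up stand-in for one group's flattened list:

```python
from collections import Counter

# Hypothetical flattened list, standing in for one group's aggregated substrings
substrings = ["ann", "tom", "ann", "nne", "tom", "ann"]

counts = Counter(substrings)

# most_common() returns (item, count) tuples sorted by descending count
ranked = counts.most_common()
print(ranked)

# most_common(n) keeps only the n most frequent items
top_one = counts.most_common(1)
print(top_one)
```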


UPDATE: Using the pre-formatted strings from the question:

import pandas as pd
# Create some data
names = [("joseph","fr","jos, ose, sep, eph"),("ann","en","ann"),("anne","ir","ann, nne"),("tom","en","tom"),("tommy","fr","tom, omm, mmy"),("ann","fr","ann")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(", ")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))
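The split and the un-nesting can also be done in one step inside the aggregation, which is roughly what the attempt in the question was aiming for. That attempt fails for two reasons: a bare "... for i in ..." after a lambda is not a valid expression without brackets, and the x passed to the function is a whole Series of strings, not a single string, so it has no .split method. The sketch below assumes the column names and sample rows from the question; split_and_flatten is a name invented here for illustration:

```python
import pandas as pd
from collections import Counter

# Sample rows copied from the question's table
df = pd.DataFrame({
    "NAME": ["joseph", "ann", "anne", "tom", "tommy", "ann"],
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "3LETTER_SUBSTRINGS": [
        "jos, ose, sep, eph", "ann", "ann, nne", "tom", "tom, omm, mmy", "ann",
    ],
})

def split_and_flatten(series):
    """Split every comma-separated string in the group and collect the pieces in one list."""
    return [item for cell in series for item in cell.split(", ")]

# One flat list of substrings per group
flat = df.groupby("ETHNICITY_RECAT")["3LETTER_SUBSTRINGS"].agg(split_and_flatten)
print(flat)

# Frequencies for a single group
print(Counter(flat["fr"]).most_common())
```

The same function can be used in place of lambda x: list(x) in the aggregations dictionary from the question.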

The first thing to do is convert the strings into lists; after that it is just a groupby().agg():

df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(', ')

df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH':['min','max','mean'],
                                   '3LETTER_SUBSTRINGS':'sum'})

Output:

                TOTAL_LENGTH                             3LETTER_SUBSTRINGS
                         min max  mean                                  sum
ETHNICITY_RECAT                                                            
en                        16  18  17.0                           [ann, tom]
fr                        14  16  15.0  [jos, ose, sep, eph, tom, omm, mmy]
ir                        14  19  16.5                      [ann, nne, ann]
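If only the per-group frequencies are needed, one possible alternative is to skip building the lists and reshape the data so that each substring gets its own row, then count. This is a sketch assuming the sample rows from the question, not a replacement for the aggregation above; long_df and counts are names chosen here:

```python
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "3LETTER_SUBSTRINGS": [
        "jos, ose, sep, eph", "ann", "ann, nne", "tom", "tom, omm, mmy", "ann",
    ],
})

# Split each string into a list, then give every list element its own row
long_df = df.copy()
long_df["3LETTER_SUBSTRINGS"] = long_df["3LETTER_SUBSTRINGS"].str.split(", ")
long_df = long_df.explode("3LETTER_SUBSTRINGS")

# Count how often each substring occurs within each group
counts = long_df.groupby("ETHNICITY_RECAT")["3LETTER_SUBSTRINGS"].value_counts()
print(counts)
```

The result is a Series indexed by (ETHNICITY_RECAT, substring), so a single frequency can be looked up with counts.loc[("ir", "ann")].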