Get the frequency of individual items in a list of each row of a column in a dataframe

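Problem statement

I have a pandas dataframe in which one column holds list values. I need to get the frequency of each item within each of those lists.

For example: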

import pandas as pd
data = [
    {
        "name": "fruits",
        "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]
    },
    {
        "name": "cars",
        "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]
    },
    {
        "name": "animals",
        "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]
    }
]
df = pd.DataFrame(data)

If we print df here, we see the following dataframe.

    name    values
0   fruits  [apple, banana, cherry, apple, mango, banana, ...
1   cars    [Audi, Ferrari, Ferrari, Audi, honda, Audi]
2   animals [dogs, cats, tiger, tiger, cats, cats, camel]

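Now I want to compute, for each row, how often each item occurs in the values column.

What I have tried

My solution is not efficient, but it is what I have come up with so far, so I am looking for a better way to solve this.

I used Python loops to compute the frequencies and then converted the result back into a dataframe.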

frequency_list = []
for idx, row in df.iterrows():
    frequency = [{ "name": row["name"], "value": x, "frequency": row["values"].count(x)} for x in list(set(row["values"]))]
    # sorting and getting top 5 frequency is optional
    frequency_list.append(sorted(frequency, key=lambda x: x["frequency"], reverse=True)[:5])

Printing frequency_list gives:

[[{'name': 'fruits', 'value': 'apple', 'frequency': 3},
  {'name': 'fruits', 'value': 'banana', 'frequency': 2},
  {'name': 'fruits', 'value': 'cherry', 'frequency': 1},
  {'name': 'fruits', 'value': 'mango', 'frequency': 1}],
 [{'name': 'cars', 'value': 'Audi', 'frequency': 3},
  {'name': 'cars', 'value': 'Ferrari', 'frequency': 2},
  {'name': 'cars', 'value': 'honda', 'frequency': 1}],
 [{'name': 'animals', 'value': 'cats', 'frequency': 3},
  {'name': 'animals', 'value': 'tiger', 'frequency': 2},
  {'name': 'animals', 'value': 'camel', 'frequency': 1},
  {'name': 'animals', 'value': 'dogs', 'frequency': 1}]]

Next, I create a dataframe for each item in frequency_list and concatenate them.

frequency_df = pd.DataFrame()
for each_frequency in frequency_list:
    temp_df = pd.DataFrame(each_frequency)
    if frequency_df.empty:
        frequency_df = temp_df
    else:
        frequency_df = pd.concat((frequency_df, temp_df), axis=0, ignore_index=True)

frequency_df then holds the following data:

    name    value   frequency
0   fruits  apple   3
1   fruits  banana  2
2   fruits  cherry  1
3   fruits  mango   1
4   cars    Audi    3
5   cars    Ferrari 2
6   cars    honda   1
7   animals cats    3
8   animals tiger   2
9   animals camel   1
10  animals dogs    1
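
For comparison, the two loops above can probably be collapsed into a single pass with collections.Counter from the standard library, which counts every item once instead of calling list.count per distinct value. This is only a minimal sketch on the same sample data; TOP_N is a hypothetical name introduced here to mirror the optional [:5] slice, and ties are ordered by first appearance rather than as in the printout above.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame(
    [
        {"name": "fruits", "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]},
        {"name": "cars", "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]},
        {"name": "animals", "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]},
    ]
)

TOP_N = 5  # hypothetical cutoff, mirroring the optional [:5] slice

# Counter.most_common returns (item, count) pairs, highest count first.
rows = [
    {"name": name, "value": value, "frequency": count}
    for name, values in zip(df["name"], df["values"])
    for value, count in Counter(values).most_common(TOP_N)
]

# Build the result once instead of concatenating inside a loop.
frequency_df = pd.DataFrame(rows, columns=["name", "value", "frequency"])
print(frequency_df)
```

Building the list of rows first and constructing the dataframe once avoids the repeated pd.concat calls, which copy the accumulated frame on every iteration.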

Expected output

                frequency
name    value   
animals camel   1
        cats    3
        dogs    1
        tiger   2
cars    Audi    3
        Ferrari 2
        honda   1
fruits  apple   3
        banana  2
        cherry  1
        mango   1

Try:

print(
    df.explode("values")
    .groupby(["name", "values"])
    .size()
    .to_frame(name="frequency")
)

Prints:

                 frequency
name    values            
animals camel            1
        cats             3
        dogs             1
        tiger            2
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1
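
If the optional "top N per name" step from the question is also wanted, one possible variation is to let value_counts sort within each group and then keep the first few rows per name. This is a sketch under assumptions, not part of the answer above: TOP_N is a hypothetical name, and it is set to 2 here only so that the effect is visible on the small sample.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"name": "fruits", "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]},
        {"name": "cars", "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]},
        {"name": "animals", "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]},
    ]
)

TOP_N = 2  # hypothetical cutoff for illustration; the question used 5

counts = (
    df.explode("values")          # one row per list item
    .groupby("name")["values"]
    .value_counts()               # counts per name, highest first within each name
    .rename("frequency")          # the default name differs between pandas versions
)

# head() on the grouped Series keeps the first TOP_N rows of every name.
top = counts.groupby(level="name").head(TOP_N).to_frame()
print(top)
```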

IIUC, you can try:

out = (df.groupby('name')
       .apply(lambda g: g[['values']].explode('values').value_counts())
       .to_frame('frequency'))
print(out)

                 frequency
name    values
animals cats             3
        tiger            2
        camel            1
        dogs             1
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1

Let's explode values and then call value_counts:

df.explode('values').value_counts().sort_index()

name     values 
animals  camel      1
         cats       3
         dogs       1
         tiger      2
cars     Audi       3
         Ferrari    2
         honda      1
fruits   apple      3
         banana     2
         cherry     1
         mango      1
dtype: int64
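
The result above is a Series, whereas the expected output in the question is a frame with a frequency column and an index level called value. A possible way to bridge that gap is sketched below; the renaming steps are an addition for illustration, not part of the one-liner above.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"name": "fruits", "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]},
        {"name": "cars", "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]},
        {"name": "animals", "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]},
    ]
)

out = (
    df.explode("values")
    .value_counts()                             # counts each (name, values) pair
    .sort_index()
    .rename("frequency")                        # name the counts explicitly
    .to_frame()
    .rename_axis(index={"values": "value"})     # match the index name in the expected output
)
print(out)

# A flat version with ordinary columns, similar to frequency_df in the question.
flat = out.reset_index()
print(flat)
```

Renaming the Series explicitly keeps the column name stable, since the default name of the value_counts result is not the same in every pandas version.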