Get the frequency of individual items in the list in each row of a DataFrame column
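Problem statement
I have a pandas DataFrame in which one column holds list values. I need to get the frequency of each item within each of those lists.
For example: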
import pandas as pd

data = [
    {
        "name": "fruits",
        "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]
    },
    {
        "name": "cars",
        "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]
    },
    {
        "name": "animals",
        "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]
    }
]
df = pd.DataFrame(data)
If we print df here, we see the following DataFrame.
      name                                             values
0   fruits  [apple, banana, cherry, apple, mango, banana, ...
1     cars        [Audi, Ferrari, Ferrari, Audi, honda, Audi]
2  animals      [dogs, cats, tiger, tiger, cats, cats, camel]
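Now I want to count how often each item occurs in the values column of each row.
What I have done
My solution is not efficient, but it is what I have come up with so far, so I am looking for a better way to solve this.
I use a Python loop to count the frequencies and then convert the result back into a DataFrame.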
frequency_list = []
for idx, row in df.iterrows():
    frequency = [
        {"name": row["name"], "value": x, "frequency": row["values"].count(x)}
        for x in list(set(row["values"]))
    ]
    # sorting and getting top 5 frequency is optional
    frequency_list.append(sorted(frequency, key=lambda x: x["frequency"], reverse=True)[:5])
Printing frequency_list gives:
[[{'name': 'fruits', 'value': 'apple', 'frequency': 3},
{'name': 'fruits', 'value': 'banana', 'frequency': 2},
{'name': 'fruits', 'value': 'cherry', 'frequency': 1},
{'name': 'fruits', 'value': 'mango', 'frequency': 1}],
[{'name': 'cars', 'value': 'Audi', 'frequency': 3},
{'name': 'cars', 'value': 'Ferrari', 'frequency': 2},
{'name': 'cars', 'value': 'honda', 'frequency': 1}],
[{'name': 'animals', 'value': 'cats', 'frequency': 3},
{'name': 'animals', 'value': 'tiger', 'frequency': 2},
{'name': 'animals', 'value': 'camel', 'frequency': 1},
{'name': 'animals', 'value': 'dogs', 'frequency': 1}]]
Next I create a DataFrame for each item in frequency_list and concatenate them.
frequency_df = pd.DataFrame()
for each_frequency in frequency_list:
    temp_df = pd.DataFrame(each_frequency)
    if frequency_df.empty:
        frequency_df = temp_df
    else:
        frequency_df = pd.concat((frequency_df, temp_df), axis=0, ignore_index=True)
frequency_df then holds the following data:
       name    value  frequency
0    fruits    apple          3
1    fruits   banana          2
2    fruits   cherry          1
3    fruits    mango          1
4      cars     Audi          3
5      cars  Ferrari          2
6      cars    honda          1
7   animals     cats          3
8   animals    tiger          2
9   animals    camel          1
10  animals     dogs          1
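The loop and the repeated pd.concat above can probably be collapsed into one pass. The following is only a sketch of that idea using collections.Counter from the standard library. The TOP_N constant and the records name are made up for illustration, and the order of tied items may differ from the output shown above.

```python
from collections import Counter

import pandas as pd

data = [
    {"name": "fruits", "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]},
    {"name": "cars", "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]},
    {"name": "animals", "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]},
]
df = pd.DataFrame(data)

TOP_N = 5  # hypothetical cap per row, mirroring the optional [:5] slice

# Counter counts each list in a single pass instead of calling
# list.count() once per distinct item.
records = [
    {"name": name, "value": value, "frequency": count}
    for name, values in zip(df["name"], df["values"])
    for value, count in Counter(values).most_common(TOP_N)
]

# Build the DataFrame once rather than concatenating inside a loop.
frequency_df = pd.DataFrame(records)
print(frequency_df)
```

Building the list of records first and constructing the DataFrame a single time avoids copying the accumulated frame on every iteration.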
Expected output
                 frequency
name    value
animals camel            1
        cats             3
        dogs             1
        tiger            2
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1
Try:
print(
    df.explode("values")
    .groupby(["name", "values"])
    .size()
    .to_frame(name="frequency")
)
Prints:
                 frequency
name    values
animals camel            1
        cats             3
        dogs             1
        tiger            2
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1
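The question mentions optionally keeping only the most frequent items per row. One way that might be layered on top of the explode/groupby result is sketched below. TOP_N, counts and top are illustrative names, and TOP_N is set to 2 here only so that the sample data has no ties at the cutoff.

```python
import pandas as pd

data = [
    {"name": "fruits", "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]},
    {"name": "cars", "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]},
    {"name": "animals", "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]},
]
df = pd.DataFrame(data)

TOP_N = 2  # hypothetical cutoff, chosen for illustration

# One row per list element, then count (name, values) pairs
# and flatten the result into ordinary columns.
counts = (
    df.explode("values")
    .groupby(["name", "values"])
    .size()
    .reset_index(name="frequency")
)

# Sort by descending frequency within each name, then keep
# the first TOP_N rows of every group.
top = (
    counts.sort_values(["name", "frequency"], ascending=[True, False])
    .groupby("name")
    .head(TOP_N)
    .reset_index(drop=True)
)
print(top)
```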
IIUC, you can try:
out = (df.groupby('name')
         .apply(lambda g: g[['values']].explode('values').value_counts())
         .to_frame('frequency'))
print(out)
                 frequency
name    values
animals cats             3
        tiger            2
        camel            1
        dogs             1
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1
Let's explode on values and then call value_counts:
df.explode('values').value_counts().sort_index()
name     values
animals  camel      1
         cats       3
         dogs       1
         tiger      2
cars     Audi       3
         Ferrari    2
         honda      1
fruits   apple      3
         banana     2
         cherry     1
         mango      1
dtype: int64
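The result above is a Series, whereas the expected output has a frequency column. A small sketch of one way to bridge that gap follows. It assumes a pandas version that provides DataFrame.value_counts, and it renames the Series explicitly because the default name of the counts may differ between pandas versions.

```python
import pandas as pd

data = [
    {"name": "fruits", "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]},
    {"name": "cars", "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]},
    {"name": "animals", "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]},
]
df = pd.DataFrame(data)

out = (
    df.explode("values")   # one row per list element
    .value_counts()        # count unique (name, values) rows
    .sort_index()          # order by name, then by item
    .rename("frequency")   # set the Series name explicitly
    .to_frame()            # single-column DataFrame
)
print(out)
```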