Python Pandas: count most frequent occurrences
Here is my sample data frame with data about orders:
import pandas as pd
my_dict = {
'status' : ["a", "b", "c", "d", "a","a", "d"],
'city' : ["London","Berlin","Paris", "Berlin", "Boston", "Paris", "Boston"],
'components': ["a01, a02, b01, b07, b08, с03, d07, e05, e06",
"a01, b02, b35, b68, с43, d02, d07, e04, e05, e08",
"a02, a05, b08, с03, d02, d06, e04, e05, e06",
"a03, a26, a28, a53, b08, с03, d02, f01, f24",
"a01, a28, a46, b37, с43, d06, e04, e05, f02",
"a02, a05, b35, b68, с43, d02, d07, e04, e05, e08",
"a02, a03, b08, b68, с43, d06, d07, e04, e05, e08"]
}
df = pd.DataFrame(my_dict)
df
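I need to find the most frequent:
- top-n co-occurring components within orders
- top-n most frequent components (regardless of co-occurrence)
What would be the best way to do this?
I can also see the relation to the market basket analysis problem, but I am not sure how to go about it.
@ScottBoston's answer shows vectorized (and therefore probably faster) ways to achieve this.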
Most frequent components
from collections import Counter
from itertools import chain
n = 3
individual_components = chain.from_iterable(df['components'].str.split(', '))
counter = Counter(individual_components)
print(counter.most_common(n))
# [('e05', 6), ('e04', 5), ('a02', 4)]
Top-n co-occurrences
Note that I use n twice: once for the size of the co-occurrence and once for the "top-n" part. Obviously, you can use two different variables.
from collections import Counter
from itertools import combinations
n = 3
individual_components = []
for components in df['components']:
    # Sort so the same set of components always yields the same tuple
    order_components = sorted(components.split(', '))
    individual_components.extend(combinations(order_components, n))
counter = Counter(individual_components)
print(counter.most_common(n))
# [(('e04', 'e05', 'с43'), 4), (('a02', 'b08', 'e05'), 3), (('a02', 'd07', 'e05'), 3)]
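As a rough sketch of the "two different variables" idea, the loop above can be wrapped in a function. The names `top_cooccurrences`, `combo_size` and `top_n` are made up for illustration and are not from the answer. It also strips whitespace and de-duplicates components within an order, which is an extra assumption on my part. The `orders` list repeats the question's `components` column so the block runs on its own.

```python
from collections import Counter
from itertools import combinations


def top_cooccurrences(orders, combo_size, top_n):
    """Return the top_n most common groups of combo_size components."""
    counter = Counter()
    for order in orders:
        # Assumption: strip whitespace and drop repeats within one order,
        # then sort so a given group always produces the same tuple.
        items = sorted({part.strip() for part in order.split(',')})
        counter.update(combinations(items, combo_size))
    return counter.most_common(top_n)


# Same strings as df['components'] in the question
orders = [
    "a01, a02, b01, b07, b08, с03, d07, e05, e06",
    "a01, b02, b35, b68, с43, d02, d07, e04, e05, e08",
    "a02, a05, b08, с03, d02, d06, e04, e05, e06",
    "a03, a26, a28, a53, b08, с03, d02, f01, f24",
    "a01, a28, a46, b37, с43, d06, e04, e05, f02",
    "a02, a05, b35, b68, с43, d02, d07, e04, e05, e08",
    "a02, a03, b08, b68, с43, d06, d07, e04, e05, e08",
]

top_pairs = top_cooccurrences(orders, combo_size=2, top_n=1)
top_triples = top_cooccurrences(orders, combo_size=3, top_n=1)
print(top_pairs)  # [(('e04', 'e05'), 5)]
print(top_triples)
```

Keeping the group size and the number of results separate makes it easy to ask, for example, for the ten most common pairs without touching the counting logic.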
Here are some more "pandas" ways to do the same thing:
To get the top three components:
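# A list comprehension is usually faster than the .str accessor in pandas.
# Split on ', ' (comma plus space) so 'a02' and ' a02' are not counted as different components.
pd.concat([pd.Series(i.split(', ')) for i in df.components]).value_counts().head(3)
# Or, using "pure" pandas methods:
df.components.str.split(', ', expand=True).stack().value_counts().head(3)
Output:
e05    6
e04    5
d02    4
dtype: int64
(Five components, a02, b08, с43, d02 and d07, are tied at 4, so which of them takes third place depends on how ties are ordered.)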
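Next, find groups of n = 3 components that are reported together:
from itertools import combinations
n = 3
# Split on ', ' so no component carries leading whitespace into the tuples.
pd.concat([pd.Series(list(combinations(i.split(', '), n))) for i in df.components])\
.value_counts().head(3)
Output:
(с43, e04, e05)    4
(a02, e04, e05)    3
(с43, d07, e05)    3
dtype: int64
(Several triples are tied at 3, so the second and third rows may differ from run to run or between pandas versions.)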
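As an optional variation on the single-component count, here is a minimal sketch that strips whitespace around each token before counting, so 'a02' and ' a02' end up in the same bucket however the string is split. It assumes pandas 0.25 or later, where `Series.explode` is available. The variable names `tokens` and `counts` are illustrative only.

```python
import pandas as pd

# Same strings as df['components'] in the question
df = pd.DataFrame({
    'components': [
        "a01, a02, b01, b07, b08, с03, d07, e05, e06",
        "a01, b02, b35, b68, с43, d02, d07, e04, e05, e08",
        "a02, a05, b08, с03, d02, d06, e04, e05, e06",
        "a03, a26, a28, a53, b08, с03, d02, f01, f24",
        "a01, a28, a46, b37, с43, d06, e04, e05, f02",
        "a02, a05, b35, b68, с43, d02, d07, e04, e05, e08",
        "a02, a03, b08, b68, с43, d06, d07, e04, e05, e08",
    ]
})

# One row per component: split into lists, explode, then strip stray spaces
tokens = df['components'].str.split(',').explode().str.strip()
counts = tokens.value_counts()

print(counts.head(2))  # e05 -> 6, e04 -> 5
```

Only the first two rows are printed because third place is a tie, so its order is not something to rely on.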