Grouping by a set in pandas

I have an example df:

import pandas as pd 
import numpy as np


df = pd.DataFrame({'name':['Josh', 'Paul','Ivy','Mark'],
                   'orderId':[1,2,3,4],
                   'purchases':[['sofa','sofa','chair'],
                                ['chair','sofa'],
                                ['sofa','chair'],
                                ['sofa','chair','chair']]})

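Four people bought the same items, sofa and chair, in different quantities. Overall they all bought sofa and chair, so there is only one combination of distinct products if each row is treated as set(purchases).

I want to answer how many times each combination of purchases was bought. We know the answer is 4, because 4 people bought the same set of items.

So I thought of this as pseudocode, where I group by the set of each purchases value: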

df = df.groupby(set('purchases')).agg({'orderId':pd.Series.nunique})

But, as expected, I get an error:

TypeError: 'set' object is not callable

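I would like to know the best way to group by the set of values rather than by the actual values (which in this case are lists).

When I try to simply group by purchases:

df = df.groupby('purchases').agg({'orderId':pd.Series.nunique})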

I get:

TypeError: unhashable type: 'list'

I tried changing the lists to tuples:

df = pd.DataFrame({'name':['Josh', 'Paul','Ivy','Mark'],
                   'orderId':[1,2,3,4],
                   'purchases':[('sofa','sofa','chair'),
                                ('chair','sofa'),
                                ('sofa','chair'),
                                ('sofa','chair','chair')]})

and then

df = df.groupby('purchases').agg({'purchases':lambda x:{y for y in x}}) # or set(x)

but this gives

                        purchases
purchases   
(chair, sofa)           {(chair, sofa)}
(sofa, chair)           {(sofa, chair)}
(sofa, chair, chair)    {(sofa, chair, chair)}
(sofa, sofa, chair)     {(sofa, sofa, chair)}

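There is still a tuple inside each set. Is that because it looks for identical tuples rather than looking inside the tuples?

I also tried: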

df['purchases_unique'] = df['purchases'].apply(lambda x: set(x))
df['# of times bought'] = df.apply(lambda x: x.value_counts())

But I get:

TypeError: unhashable type: 'set'

The Jupyter notebook still produces an answer, along with this log message:

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'set'
{sofa, chair}    4

So, to sum up, I am looking for a way to assign the value 4 to each row, so that the result looks like this:

name        orderId         purchases                   # of times bought
Josh        1               (sofa, sofa, chair)         4
Paul        2               (chair, sofa)               4
Ivy         3               (sofa, chair)               4
Mark        4               (sofa, chair, chair)        4

If Python can evaluate {'chair', 'sofa'} == {'sofa', 'chair'}, why doesn't pandas allow grouping by set?

Use:

df["times_bought"] = df.groupby(df["purchases"].apply(frozenset))["purchases"].transform("count")
print(df)

Output

   name  orderId             purchases  times_bought
0  Josh        1   [sofa, sofa, chair]             4
1  Paul        2         [chair, sofa]             4
2   Ivy        3         [sofa, chair]             4
3  Mark        4  [sofa, chair, chair]             4
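If a summary per combination is wanted instead of a per-row column, the same frozenset key can be aggregated directly. This is only a minimal sketch, assuming the df from the question; the names key and summary are illustrative, and sort=False is passed because frozensets have no meaningful sort order:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Josh", "Paul", "Ivy", "Mark"],
    "orderId": [1, 2, 3, 4],
    "purchases": [["sofa", "sofa", "chair"],
                  ["chair", "sofa"],
                  ["sofa", "chair"],
                  ["sofa", "chair", "chair"]],
})

# Hashable, order-insensitive grouping key: one frozenset per row.
key = df["purchases"].apply(frozenset)

# Number of distinct orders for each combination of products.
summary = df.groupby(key, sort=False)["orderId"].nunique()
print(summary)
```

Here summary has one row per distinct combination, which is close to what the agg({'orderId': pd.Series.nunique}) attempt in the question was aiming for.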

The expression:

df["purchases"].apply(frozenset)

converts each list in purchases into a frozenset:

0    (chair, sofa)
1    (chair, sofa)
2    (chair, sofa)
3    (chair, sofa)

From the documentation (emphasis mine):

The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.

Given that the elements after .apply are immutable and hashable, they can be used in DataFrame.groupby.
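This also speaks to the last question in the post: equality between sets is not enough, because grouping keys need to be hashable (the traceback in the question comes from a pandas hash table). A small pure-Python sketch of the difference, with illustrative variable names:

```python
a = {"chair", "sofa"}
b = {"sofa", "chair"}
print(a == b)  # sets compare equal regardless of element order

# A plain set is mutable, so Python refuses to hash it.
try:
    hash(a)
    set_is_hashable = True
except TypeError as exc:
    set_is_hashable = False
    print(exc)

# frozensets built from lists with different order and duplicates
# compare equal and hash to the same value, so they can act as keys.
fa = frozenset(["sofa", "sofa", "chair"])
fb = frozenset(["chair", "sofa"])
counts = {fa: 1}
counts[fb] += 1  # fb finds the entry stored under fa
print(counts)
```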

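Alternatives

Ultimately, given the constraints of your problem, you need to map every list in purchases that contains the same unique items to the same identifier. So you could directly use the hash function of frozenset, as follows: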

def _hash(lst):
    import sys
    uniques = set(lst)
    # Mix the element hashes in an order-independent way, wrapped to the native word size.
    MAX = sys.maxsize
    MASK = 2 * MAX + 1
    n = len(uniques)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in uniques:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747)  * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h


df["times_bought"] = df.groupby(df["purchases"].apply(_hash))["purchases"].transform("count")
print(df)

Output

   name  orderId             purchases  times_bought
0  Josh        1   [sofa, sofa, chair]             4
1  Paul        2         [chair, sofa]             4
2   Ivy        3         [sofa, chair]             4
3  Mark        4  [sofa, chair, chair]             4

A second alternative is to use (as the argument to groupby):

df["purchases"].apply(lambda x: tuple(sorted(set(x))))

This finds the unique elements, sorts them, and finally converts them into a hashable representation (a tuple).
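Put together with the df from the question, the sorted-tuple key could be used as in this sketch (the name key is illustrative; it assumes the items within each list can be compared with one another, which holds for the strings here):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Josh", "Paul", "Ivy", "Mark"],
    "orderId": [1, 2, 3, 4],
    "purchases": [["sofa", "sofa", "chair"],
                  ["chair", "sofa"],
                  ["sofa", "chair"],
                  ["sofa", "chair", "chair"]],
})

# Deduplicate, sort, and freeze each list into a canonical tuple.
key = df["purchases"].apply(lambda x: tuple(sorted(set(x))))

# Broadcast the size of each group back onto its rows.
df["times_bought"] = df.groupby(key)["purchases"].transform("count")
print(df)
```

Compared with frozenset, the tuple key has a deterministic element order, so it prints the same way on every run and can be kept as a readable column if needed.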