What is the difference between using .loc and not using it with groupby + transform in Pandas?

Question

I am new to Python. Here is my problem, which seems really strange to me.

A simple data frame looks like this:

import pandas as pd
a1=pd.DataFrame({'Hash':[1,1,2,2,2,3,4,4],
                 'Card':[1,1,2,2,3,3,4,4]})

I need to group a1 by Hash, count how many rows are in each group, and then add a column to a1 that holds that row count. So I want to use groupby + transform.

When I use:

a1['CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])

the result is correct:

   Card  Hash  CustomerCount
0     1     1              2
1     1     1              2
2     2     2              3
3     2     2              3
4     3     2              3
5     3     3              1
6     4     4              2
7     4     4              2

But when I use:

a1.loc[:,'CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])

the result is:

   Card  Hash  CustomerCount
0     1     1            NaN
1     1     1            NaN
2     2     2            NaN
3     2     2            NaN
4     3     2            NaN
5     3     3            NaN
6     4     4            NaN
7     4     4            NaN

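So why does this happen?

As far as I know, indexing with loc or iloc (for example a1.loc[:,'CustomerCount']) is generally preferred over plain bracket access (for example a1['CustomerCount']), which is why loc and iloc are usually recommended. So why do they behave differently here?

Also, I have used loc and iloc many times to create a new column in a data frame, and they usually work. So does this have something to do with groupby + transform?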

Answer 1

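The difference lies in how loc handles assigning a DataFrame object to a single column. The groupby + transform call returns a DataFrame whose only column is Card, and when you assign it through loc, pandas tries to line up both the index and the column names. The column names (Card versus CustomerCount) do not line up, so you get NaNs. When you assign through direct column access, pandas determines that it is one column for another and just does it.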
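The point about the DataFrame result can be checked directly. The sketch below only inspects what each groupby call returns; it deliberately does not reproduce the NaN column, since how assignment through .loc treats a DataFrame may vary between pandas versions. The names `as_frame` and `as_series` are illustrative and do not come from the question.

```python
import pandas as pd

a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})

# Transforming the whole groupby object returns a DataFrame
# that keeps the name of the remaining column, 'Card'.
as_frame = a1.groupby('Hash').transform(len)
print(type(as_frame).__name__, list(as_frame.columns))

# Selecting the column first returns a Series, which carries
# no column labels that could fail to align on assignment.
as_series = a1.groupby('Hash')['Card'].transform('size')
print(type(as_series).__name__, as_series.tolist())
```

Because the Series has only an index and no column labels, assigning it to a new column involves no column-name alignment at all.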

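Reduce to a single column

You can fix this by reducing the result of the groupby operation to just one column, so that a Series is assigned instead of a DataFrame.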

a1.loc[:,'CustomerCount'] = a1.groupby(['Hash']).Card.transform('size')
a1

   Hash  Card  CustomerCount
0     1     1              2
1     1     1              2
2     2     2              3
3     2     2              3
4     2     3              3
5     3     3              1
6     4     4              2
7     4     4              2

Rename the column

Don't really do this; the other approach is far simpler.

a1.loc[:, 'CustomerCount'] = a1.groupby('Hash').transform(len).rename(
    columns={'Card': 'CustomerCount'})
a1

`pd.factorize` and `np.bincount`

What I would actually do:

import numpy as np
# f holds an integer group code per row, u holds the unique Hash values
f, u = pd.factorize(a1.Hash)
a1['CustomerCount'] = np.bincount(f)[f]
a1

Or inline, producing a copy:

a1.assign(CustomerCount=(lambda f: np.bincount(f)[f])(pd.factorize(a1.Hash)[0]))

   Hash  Card  CustomerCount
0     1     1              2
1     1     1              2
2     2     2              3
3     2     2              3
4     2     3              3
5     3     3              1
6     4     4              2
7     4     4              2
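For readers unfamiliar with this idiom, here is a step-by-step sketch of the intermediate arrays. The variable names `codes`, `uniques`, `group_sizes` and `per_row` are illustrative and are not part of the answer above.

```python
import numpy as np
import pandas as pd

a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})

# factorize maps each distinct Hash value to an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(a1['Hash'])
print(codes.tolist())        # [0, 0, 1, 1, 1, 2, 3, 3]

# bincount counts how many times each code occurs,
# which is the size of each group.
group_sizes = np.bincount(codes)
print(group_sizes.tolist())  # [2, 3, 1, 2]

# Indexing the sizes by the codes broadcasts each group size
# back onto every row of that group.
per_row = group_sizes[codes]
print(per_row.tolist())      # [2, 2, 3, 3, 3, 1, 2, 2]
```

The result is a plain NumPy array, so assigning it to a column is purely positional and involves no index or column alignment.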
