Multiple modes for multiple accounts in Python

Question

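I have a DataFrame with multiple accounts, each of which has its own mode (most frequent value) of animal category. How can I identify the accounts that have more than one mode?

For example, note that account 3 has only one mode ("dog"), but accounts 1, 2, and 4 each have more than one mode.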

import pandas as pd

test = pd.DataFrame({'account':[1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
             'category':['cat','dog','rabbit','cat','cat','dog','dog','dog','dog','dog','rabbit','rabbit','cat','cat','rabbit']})

The expected output I am looking for is this:

pd.DataFrame({'account':[1,2,4],'modes':[3,2,2]})

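In addition, for every account with multiple modes I am trying to pick one of the tied modes at random. I came up with the following code, but it only returns the first mode (in alphabetical order) for each account. My intuition is that I could write something inside the iloc brackets below, perhaps a random integer between 0 and the total number of modes, but I can't quite get there.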

test.groupby('account')['category'].agg(lambda x: x.mode(dropna=False).iloc[0])

Any suggestions? Thanks a lot.

Answer 1

You can use numpy.random.choice:

import numpy as np

test.groupby('account')['category'].agg(
    lambda x: np.random.choice(x.mode(dropna=False)))
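
If you need the random pick to be reproducible, one option is to draw from a seeded NumPy Generator instead of the global random state. This is only a sketch: the helper name random_mode and the seed value 0 are my own choices for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'account': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'category': ['cat', 'dog', 'rabbit', 'cat', 'cat', 'dog', 'dog', 'dog',
                 'dog', 'dog', 'rabbit', 'rabbit', 'cat', 'cat', 'rabbit'],
})

# Seeded generator; the seed value is arbitrary (assumption for illustration).
rng = np.random.default_rng(0)

def random_mode(values, rng):
    """Return one of the tied modes of `values`, chosen uniformly at random."""
    modes = values.mode(dropna=False).to_numpy()
    return rng.choice(modes)

# One randomly chosen mode per account.
picked = test.groupby('account')['category'].agg(lambda x: random_mode(x, rng))
print(picked)
```

Re-creating rng with the same seed before the groupby should give the same picks again, which is handy for tests.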

Answer 2

I'm not sure exactly what you want, but you can try:

out=test.groupby('account')['category'].apply(lambda x: x.mode(dropna=False).values)

Output of out:

account
1    [cat, dog, rabbit]
2            [cat, dog]
3                 [dog]
4         [cat, rabbit]
Name: category, dtype: object

For a random modal value:

from random import choice

out=test.groupby('account')['category'].agg(
    lambda x: choice(x.mode(dropna=False)))

Output of out (you get a different output each time you run the code):

account
1    rabbit
2       dog
3       dog
4    rabbit
Name: category, dtype: object

For your expected output, use:

out=test.groupby('account')['category'].apply(lambda x: x.mode(dropna=False).count()).reset_index()
out=out[out['category'].ne(1)]

Output of out:

   account  category
0        1         3
1        2         2
3        4         2
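
As an alternative that avoids apply, the same count can be built from groupby sizes alone. This is a sketch under one assumption worth flagging: groupby drops NaN keys by default, so unlike mode(dropna=False) it ignores missing categories. The intermediate names counts, top and n_modes are mine.

```python
import pandas as pd

test = pd.DataFrame({
    'account': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'category': ['cat', 'dog', 'rabbit', 'cat', 'cat', 'dog', 'dog', 'dog',
                 'dog', 'dog', 'rabbit', 'rabbit', 'cat', 'cat', 'rabbit'],
})

# Occurrences of each category within each account (NaN keys are dropped).
counts = test.groupby(['account', 'category']).size()

# Highest count per account, broadcast back to every (account, category) row.
top = counts.groupby(level='account').transform('max')

# Number of categories tied for the highest count in each account.
n_modes = (counts == top).groupby(level='account').sum()

# Keep only the accounts with more than one mode.
result = n_modes[n_modes > 1].rename('modes').reset_index()
print(result)
```

Here the count column is named modes, matching the expected output in the question.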

Answer 3

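Since you just want any one of the modes, you can use groupby + size. (This is essentially a wrapper very similar to @abw333's solution.) It is nice because it avoids any groupby.apply in favor of the built-in groupby.size, which is fast.

We use sort=False in the groupby, so the resulting Series is ordered by the order in which the groups appear in the original DataFrame. Then, because the 'mergesort' sorting algorithm is stable in the case of ties, the function deterministically returns, among tied modes, the one that occurs first (in an earlier row) in the DataFrame. So if you want a random mode, you can call .sample(frac=1) before applying this, so that all rows are shuffled first and the mode is returned afterwards.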

def fast_mode(df, gp_cols, value_col):
    """ 
    Calculate the mode of a column, ignoring null values recognized by pandas. 

    If there is a tie for the mode, the modal value is the modal value that appears **first** in the
    DataFrame. 

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calculate the mode.
    gp_cols : list of str
        Columns to groupby for calculation of mode.
    value_col : str
        Column for which to calculate the mode. 

    Returns
    -------
    pandas.DataFrame
        One row with the modal value per unique combination of gp_cols.
    """

    return ((df.groupby(gp_cols + [value_col], observed=True, sort=False).size() 
               .to_frame('mode_counts').reset_index() 
               .sort_values('mode_counts', ascending=False, kind='mergesort') 
               .drop_duplicates(subset=gp_cols))
             .reset_index(drop=True))

# Will always return the same one that occurs first in DataFrame
fast_mode(test, gp_cols=['account'], value_col='category')
#   account category  mode_counts
#0        3      dog            3
#1        2      cat            2
#2        4   rabbit            2
#3        1      cat            1

# Sampling allows you to select a "random one" in case of ties
fast_mode(test.sample(frac=1, random_state=12), gp_cols=['account'], value_col='category')
#   account category  mode_counts
#0        3      dog            3
#1        4      cat            2
#2        2      dog            2
#3        1      cat            1
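
The tie-breaking described in this answer depends on the sort being stable. A tiny illustration with a made-up frame (the names tied, label and n are hypothetical) shows rows that tie on the sort key keeping their original relative order:

```python
import pandas as pd

# Rows 'first' and 'third' tie on n; 'label' records the original row order.
tied = pd.DataFrame({'label': ['first', 'second', 'third'],
                     'n': [2, 5, 2]})

# A stable sort keeps tied rows in the order they had before sorting.
stable = tied.sort_values('n', ascending=False, kind='mergesort')
order = stable['label'].tolist()
print(order)  # ['second', 'first', 'third']
```

With drop_duplicates applied afterwards, the earlier of the tied rows is the one that survives, which is why shuffling the input first changes which mode gets picked.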

