Sorting an entire CSV by frequency of occurrence in one column

I have a large CSV file that is a log of caller data.

A short snippet of my file:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the whole list by how often each customer occurs, so that it looks like this:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I have already tried groupby, but it only prints the company name and its frequency, not the other columns. I also tried

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

But these give me the error:

ValueError: The wrong number of items passed 1, indices imply 24

I have also seen something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)

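But this only prints two columns, and I want to sort the whole CSV. My output should be the entire CSV sorted by the first column.

Thanks in advance for your help!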

I think there must be a better way, but this should work:

Preparing the data:

import io
import pandas as pd
data = """
CompanyName  HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")

And the transformation:

# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())

# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")

# output the original data frame in the order of the new index.
df.reindex(new_index.index)

Output:

  CompanyName HighPriority QualityIssue
3   Customer3           No    Equipment
5   Customer3           No         User
6   Customer3          Yes         User
7   Customer3          Yes    Equipment
0   Customer1          Yes         User
1   Customer1          Yes         User
4   Customer1           No      Neither
8   Customer4           No         User
2   Customer2           No         User

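What happens here may not be intuitive, but at the moment I cannot think of a better way. I have tried to comment the code as much as possible.

The tricky part is that the index of count_df holds the (unique) customer names. So I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").

The trick is that count_df is already sorted by number of occurrences, which is why no explicit sort is needed. All we have to do is reorder the rows of the original data frame to follow the rows of the joined data frame, and we get the expected result.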
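For comparison, the same idea (look up each row's customer count, then reorder the rows) can be sketched without a merge. This is an alternative, not the method used in the answer above: it maps the value_counts result onto the CompanyName column and sorts on that. The names counts, row_counts and sorted_df are illustrative only.

```python
import io
import pandas as pd

data = """CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Number of rows per customer, indexed by customer name.
counts = df["CompanyName"].value_counts()

# Look up the count for every row; the result shares df's index.
row_counts = df["CompanyName"].map(counts)

# Sort the per-row counts in descending order and reuse that index order on df.
sorted_df = df.loc[row_counts.sort_values(ascending=False).index]
print(sorted_df)
```

Customers with the same count (here Customer2 and Customer4) may come out in either order.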

This seems to do what you want: use groupby and transform with value_counts to add a count column, and then sort on that column:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)

Output:

  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1

You can remove the extraneous column with df.drop:

df.drop('count', axis=1)

Output:

  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
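The three steps above (add a count column, sort on it, drop it) can also be chained so that the original frame is never modified. The following is only a sketch under one assumption: it uses transform("count") in place of pd.Series.value_counts, because the string form is the one I am confident works on current pandas versions. The name sorted_df is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

sorted_df = (
    # Temporary helper column holding each row's group size.
    df.assign(count=df.groupby("CompanyName")["CompanyName"].transform("count"))
      # Most frequent customers first.
      .sort_values("count", ascending=False)
      # Remove the helper column again.
      .drop(columns="count")
)
print(sorted_df)
```

Because assign returns a new frame, df itself keeps its original columns and row order.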

A small addition: sort is deprecated in favor of sort_values and sort_index.

sort_values works like this:

    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
    df['count'] = \
    df.groupby('a')['a']\
    .transform(pd.Series.value_counts)
    df.sort_values('count', inplace=True, ascending=False)
    print('df sorted: \n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1

2021 update

The earlier answers based on pd.Series.value_counts no longer work.


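The function pd.Series.value_counts returns a Series containing the counts of unique values. But the Series we apply pd.Series.value_counts to contains only one unique value, because we applied groupby to the DataFrame earlier and split the CompanyName Series into groups that each hold a single value. So the output of the function for one group looks like this: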

Customer3        4
dtype: int64

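That is a problem: we cannot turn a Series holding a single value into a full Series for the group. What we actually need is just the integer 4, not a whole Series.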
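To see the difference on one group in isolation, here is a small sketch. The Series named group is a hand-built stand-in for what groupby hands to the function for Customer3; the variable names are illustrative only.

```python
import pandas as pd

# One group as groupby would see it: four rows that all hold the same name.
group = pd.Series(["Customer3"] * 4, name="CompanyName")

# value_counts returns a Series with one entry, keyed by the name itself.
per_value = group.value_counts()

# count returns a single number: how many non-null entries the group has.
n_rows = group.count()

print(type(per_value).__name__, len(per_value))
print(n_rows)
```

The first result is a one-element Series indexed by "Customer3", while the second is a scalar that can be repeated for every row of the group.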


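However, we can take advantage of the earlier groupby: count the number of values in each group, replace every row of the group with that number, and combine the results into the final frequency Series.

To do that, replace pd.Series.value_counts with pd.Series.count, or simply pass the function name 'count' as a string: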

import pandas as pd

df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})

df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)

Output:

>>> df

  CompanyName HighPriority QualityIssue  Frequency
3   Customer3           No    Equipment          4
5   Customer3           No         User          4
6   Customer3          Yes         User          4
7   Customer3          Yes    Equipment          4
0   Customer1          Yes         User          3
1   Customer1          Yes         User          3
4   Customer1           No      Neither          3
2   Customer2           No         User          1
8   Customer4           No         User          1
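Since the question starts from a CSV file, here is a hedged end-to-end sketch: read, sort by frequency, write back without the helper column. It uses in-memory text instead of real files so that it runs as is. The comma-separated sample is a shortened version of the question's data, and the file name mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for the real log; with a file on disk you would pass its path
# (for example "calls.csv", a hypothetical name) to read_csv instead.
csv_text = """CompanyName,HighPriority,QualityIssue
Customer1,Yes,User
Customer2,No,User
Customer3,No,Equipment
Customer1,No,Neither
Customer3,Yes,User
Customer3,Yes,Equipment
"""

df = pd.read_csv(io.StringIO(csv_text))

# Per-row frequency of the row's customer.
df["Frequency"] = df.groupby("CompanyName")["CompanyName"].transform("count")

# mergesort is a stable algorithm, so rows with equal Frequency
# should keep their original relative order.
df = df.sort_values("Frequency", ascending=False, kind="mergesort")

# Write the sorted rows back out as CSV, without the helper column or the index.
out = io.StringIO()
df.drop(columns="Frequency").to_csv(out, index=False)
sorted_csv = out.getvalue()
print(sorted_csv)
```

To write to disk instead, pass a path to to_csv in place of the StringIO object.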