Sorting an entire CSV by frequency of occurrence in one column
I have a large CSV file that is a log of caller data.
A short snippet of my file:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the entire list by how frequently each customer occurs, so that it looks like:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I've tried groupby, but that only prints out the company name and the frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
But these give me errors:
ValueError: The wrong number of items passed 1, indices imply 24
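I've seen things like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
But this only prints out two columns, and I want to sort my entire CSV. My output should be the entire CSV, sorted by the first column.
Thanks for the help in advance!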
I think there must be a better way, but this should work:
Preparing the data:
import io
import pandas as pd
data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")
And doing the transformation:
# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())
# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")
# output the original data frame in the order of the new index.
df.reindex(new_index.index)
Output:
CompanyName HighPriority QualityIssue
3 Customer3 No Equipment
5 Customer3 No User
6 Customer3 Yes User
7 Customer3 Yes Equipment
0 Customer1 Yes User
1 Customer1 Yes User
4 Customer1 No Neither
8 Customer4 No User
2 Customer2 No User
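What happens here may not be intuitive, but at the moment I can't think of a better way. I tried to comment as much as possible.
The tricky part is that the index of count_df holds the (unique) customer names. I therefore join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic is that count_df is already sorted by the number of occurrences, which is why no explicit sort is needed. All that remains is to reorder the rows of the original data frame by the rows of the joined data frame, and we get the expected result.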
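The same idea (look up each row's customer in the counts produced by value_counts) can also be sketched without a merge, by mapping the counts onto the column. This is only a minimal sketch and not the method used above; the helper column name `_freq` and the small inline frame are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for the question's data (two columns are enough to show the idea).
df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
})

# value_counts() returns a Series indexed by customer name, holding the row counts.
counts = df["CompanyName"].value_counts()

# map() looks up each row's customer in that Series, giving one count per row.
# "_freq" is a hypothetical helper column that is dropped again after sorting.
sorted_df = (
    df.assign(_freq=df["CompanyName"].map(counts))
      .sort_values("_freq", ascending=False, kind="mergesort")  # mergesort is a stable sort
      .drop(columns="_freq")
)
print(sorted_df)
```

All columns of the original frame survive, because the counts are only used as a temporary sort key.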
This seems to do what you want: perform a groupby and transform with value_counts to add a count column, which you can then sort on:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)
Output:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
You can drop the extraneous column using df.drop:
df.drop('count', axis=1)
Output:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
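If the helper column is only needed for sorting, one way to avoid adding and dropping it is the `key` argument of `sort_values`. The sketch below assumes pandas 1.1 or later (where `key` was introduced) and uses a small inline frame as a stand-in for the real data; it is an alternative, not what the answer above does.

```python
import pandas as pd

# Stand-in for the question's data.
df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# key receives the CompanyName column as a Series and must return a Series of the
# same length; here each name is replaced by how often it occurs in the column.
by_freq = df.sort_values(
    "CompanyName",
    key=lambda names: names.map(names.value_counts()),
    ascending=False,
    kind="mergesort",  # request a stable sort
)
print(by_freq)
```

The frame keeps its original columns, so no drop step is needed afterwards.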
A small addition is needed: sort is deprecated and has been replaced by sort_values and sort_index.
sort_values works like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = \
df.groupby('a')['a']\
.transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted:
a b count
0 1 1 2
2 1 3 2
1 2 2 1
2021 update
The previously proposed answers no longer work.
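The function pd.Series.value_counts returns a Series with the counts of unique values. But the Series we apply pd.Series.value_counts to contains only one unique value, because the groupby applied to the DataFrame has already split the CompanyName Series into groups of a single unique value each. So the output of the function for one group looks like this: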
Customer3 4
dtype: int64
That is no use: we can't turn the single value inside that Series into a complete Series for the group. We just need the integer 4, not the whole Series.
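The difference between the two functions on a single group can be checked directly. In this minimal sketch the variable names and the tiny Series are assumptions for illustration; `group` stands in for what one groupby group of CompanyName would contain.

```python
import pandas as pd

names = pd.Series(
    ["Customer1", "Customer3", "Customer3", "Customer3", "Customer3"],
    name="CompanyName",
)

# Emulate a single group: only the rows for one customer.
group = names[names == "Customer3"]

per_value = group.value_counts()  # a one-element Series indexed by "Customer3"
total = group.count()             # a single integer

print(per_value)
print(total)
```

value_counts hands back a Series keyed by the customer name, while count hands back the bare number that a per-row frequency column needs.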
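However, we can take advantage of the earlier groupby: by counting the number of values in each group, transform replaces every row of a group with that group's size and combines the results into the final frequency Series.
We can replace pd.Series.value_counts with pd.Series.count, or simply pass the function name 'count':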
import pandas as pd
df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})
df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
Output
>>> df
CompanyName HighPriority QualityIssue Frequency
3 Customer3 No Equipment 4
5 Customer3 No User 4
6 Customer3 Yes User 4
7 Customer3 Yes Equipment 4
0 Customer1 Yes User 3
1 Customer1 Yes User 3
4 Customer1 No Neither 3
2 Customer2 No User 1
8 Customer4 No User 1
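Since the question starts from a CSV file, the whole round trip might look roughly like the sketch below. It assumes a comma-separated file with the three columns shown in the question; `csv_text` and the in-memory buffers are stand-ins for real file paths, and breaking ties by customer name is an extra choice that the answers above do not make.

```python
import io
import pandas as pd

# Stand-in for the real file; with a file on disk, pass its path to read_csv instead.
csv_text = """CompanyName,HighPriority,QualityIssue
Customer1,Yes,User
Customer1,Yes,User
Customer2,No,User
Customer3,No,Equipment
Customer1,No,Neither
Customer3,No,User
Customer3,Yes,User
Customer3,Yes,Equipment
Customer4,No,User
"""
df = pd.read_csv(io.StringIO(csv_text))

# One frequency value per row: the size of that row's CompanyName group.
df["Frequency"] = df.groupby("CompanyName")["CompanyName"].transform("count")

# Most frequent customers first; equal frequencies are ordered by name (an assumption).
result = (
    df.sort_values(["Frequency", "CompanyName"], ascending=[False, True])
      .drop(columns="Frequency")
)

# Write the sorted rows back out as CSV text, without the pandas index.
out = io.StringIO()
result.to_csv(out, index=False)
sorted_csv = out.getvalue()
print(sorted_csv)
```

Sorting on the name as a second key keeps each customer's rows together even when two customers happen to have the same frequency.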