Pandas hierarchical sort
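I have a DataFrame of categories and amounts. Categories can be nested into sub-categories to any depth using colon-separated strings. I want to sort it by amount in descending order, but hierarchically, as shown below.
How I need it sorted: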
CATEGORY AMOUNT
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1100
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Cleaning 100
Household : Cleaning : Bathroom 75
Household : Cleaning : Kitchen 25
Household : Rent 400
Living 250
Living : Other 150
Living : Food 100
Edit:
The DataFrame:
df = pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})
Note: this is the order I want. Before sorting, the rows may be in any order.
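Edit 2:
In case anyone is looking for a similar solution, I'm posting the one I settled on here:
Answering my own question: I found a way. It's a bit long-winded, but here it is.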
import numpy as np
import pandas as pd

def sort_tree_df(df, tree_column, sort_column):
    df = df.copy()  # work on a copy so the caller's frame is not modified
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    # One index level per category level, with surrounding whitespace stripped
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    # np.lexsort treats the LAST key as the primary one, so the top-level
    # group maximum ends up as the most significant key
    sort_columns = [df[tree_column].values, df[sort_key].values] + [
        df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
        for x in range(df.index.nlevels - 1, 0, -1)
    ]
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]  # reverse for descending order
    df_sorted = df_sorted.reset_index(drop=True).drop(sort_key, axis=1)
    return df_sorted

sort_tree_df(df, 'category', 'amount')
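The function above depends on how np.lexsort orders its keys, which is easy to get backwards. Here is a minimal sketch with made-up arrays (the names `primary` and `secondary` are illustrative only, not from the post) showing that the last key in the list is the most significant one, and that reversing the result gives a descending sort on every key:

```python
import numpy as np

# Two illustrative keys; the values are invented for this sketch.
primary = np.array([2, 1, 2, 1])
secondary = np.array([10, 20, 30, 40])

# np.lexsort sorts by the LAST key first, so `primary` is the main key here
# and `secondary` only breaks ties.
order = np.lexsort([secondary, primary])

# Reversing the index array turns the ascending order into a descending one.
descending = order[::-1]

print(order.tolist())       # [1, 3, 0, 2]
print(descending.tolist())  # [2, 0, 3, 1]
```

This is why the group maxima are appended at the end of `sort_columns`, with the shallowest level last.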
One way is to first str.split the category column.
df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
0 1 2
0 Transport None None
1 Transport Car None
2 Transport Train None
3 Household None None
4 Household Utilities None
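Then, for the amount column, what you want is the maximum amount per group, grouping by:
- the first column only,
- then the first and second columns,
- then the first, second and third columns, ...
You can do this with groupby.transform and max, then concatenate each of the resulting columns.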
s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv+1)]).transform('max')
                 for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
0 1 2
0 5000 NaN NaN
1 5000 4900.0 NaN
2 5000 100.0 NaN
3 1100 NaN NaN
4 1100 600.0 NaN
5 1100 600.0 400.0
6 1100 600.0 200.0
7 1100 100.0 NaN
8 1100 100.0 75.0
9 1100 100.0 25.0
10 1100 400.0 NaN
11 250 NaN NaN
12 250 150.0 NaN
13 250 100.0 NaN
Now you just need to sort_values on all the columns in the right order (0 first, then 1, then 2, ...), take the resulting index, and use loc to reorder df as expected.
dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
6 Household : Utilities : Electric 200
10 Household : Rent 400 #here is the one difference with this data
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
11 Living 250
12 Living : Other 150
13 Living : Food 100
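To make the idea behind this sort key easier to see, here is a rough pure-Python sketch of the same logic without pandas. It is not the answer's code: the function name `hierarchical_sort` and the list-of-tuples input are assumptions for illustration, and the category name used as a tiebreaker at each level is an addition that the pandas version does not have. A shorter key tuple sorts before a longer one with the same prefix, which plays the role of na_position='first'.

```python
from collections import defaultdict

def hierarchical_sort(rows, sep=":"):
    """Sort (category, amount) pairs hierarchically, largest groups first."""
    # Largest amount seen under every category prefix,
    # e.g. ("Household",) or ("Household", "Utilities").
    prefix_max = defaultdict(lambda: float("-inf"))
    split_rows = []
    for category, amount in rows:
        path = tuple(part.strip() for part in category.split(sep))
        split_rows.append((path, category, amount))
        for depth in range(1, len(path) + 1):
            prefix = path[:depth]
            prefix_max[prefix] = max(prefix_max[prefix], amount)

    def sort_key(item):
        path = item[0]
        # One (negated group maximum, name) pair per level. Negating makes
        # larger groups sort first; the name keeps sibling groups with equal
        # maxima from interleaving (an addition for this sketch).
        return tuple(
            (-prefix_max[path[:depth]], path[depth - 1])
            for depth in range(1, len(path) + 1)
        )

    ordered = sorted(split_rows, key=sort_key)
    return [(category, amount) for _, category, amount in ordered]

rows = [
    ("Transport", 5000), ("Transport : Car", 4900), ("Transport : Train", 100),
    ("Household", 1100), ("Household : Utilities", 600),
    ("Household : Utilities : Water", 400), ("Household : Utilities : Electric", 200),
    ("Household : Cleaning", 100), ("Household : Cleaning : Bathroom", 75),
    ("Household : Cleaning : Kitchen", 25), ("Household : Rent", 400),
    ("Living", 250), ("Living : Other", 150), ("Living : Food", 100),
]

# Feed the rows in reverse to show that the input order does not matter.
sorted_rows = hierarchical_sort(list(reversed(rows)))
for category, amount in sorted_rows:
    print(category, amount)
```

On the sample data this puts "Household : Rent" ahead of "Household : Cleaning", the same difference from the requested order that the pandas output above shows, because Rent (400) is larger than anything in the Cleaning group (100).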
If you don't mind adding an extra column, you can extract the main category from the category and then sort by main category, amount and category, i.e.:
df['main_category'] = df.category.str.extract(r'^([^ ]+)')
df.sort_values(['main_category', 'amount', 'category'], ascending=False)[['category', 'amount']]
Output:
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
11 Living 250
12 Living : Other 150
13 Living : Food 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
10 Household : Rent 400
6 Household : Utilities : Electric 200
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
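Note that this only works if your main categories are single words without spaces. Otherwise you will need to do it differently, i.e. extract everything up to the first colon and strip the trailing space: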
df['main_category'] = df.category.str.extract(r'^([^:]+)')
df['main_category'] = df.main_category.str.rstrip()
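A small sketch of the difference between the two patterns, using an invented multi-word category ("Eating Out" is not part of the original data):

```python
import pandas as pd

# Illustrative categories; "Eating Out" is a made-up multi-word main category.
categories = pd.Series(["Eating Out", "Eating Out : Lunch", "Transport : Car"])

# First pattern: stops at the first space, so a multi-word name is cut short.
first_word = categories.str.extract(r'^([^ ]+)', expand=False)

# Second pattern: takes everything before the first colon, then drops the
# trailing space left in front of the separator.
main_category = categories.str.extract(r'^([^:]+)', expand=False).str.rstrip()

print(first_word.tolist())     # ['Eating', 'Eating', 'Transport']
print(main_category.tolist())  # ['Eating Out', 'Eating Out', 'Transport']
```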
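I've packaged @Ben.T's answer into a more generic function; hopefully this is easier to read!
Edit: I've changed the function to group by the columns cumulatively, in order, rather than one column at a time, to address a potential issue that @Ben.T pointed out in the comments.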
import pandas as pd

def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.
    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strip any white space before and after sep
    categories = categories.apply(lambda col: col.str.strip())
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by columns in order rather than one at a time
        level_by = [categories[col] for col in range(0, level+1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    # NaN first keeps each parent row above its sub-categories
    ixs = dfa.sort_values(levels, na_position='first', ascending=ascending).index
    df = df.loc[ixs].copy()
    return df
Using Python 3.7.3, pandas 0.24.2