How to efficiently get many quartiles?

I need to encode values by range: low: 0, medium: 1, high: 2, very high: 3. I am encoding by quartile. I have the following code:

    import pandas as pd
    import numpy as np

    def fun(df):
        table = df.copy()  # work on a copy of the pandas DataFrame
        N = table.shape[0]
        for header in table.columns:
            # quartile cut points for this column
            q1 = np.percentile(table[header], 25)
            q2 = np.percentile(table[header], 50)
            q3 = np.percentile(table[header], 75)
            for k in range(N):
                # .loc avoids chained assignment; assumes the default 0..N-1 index
                value = table.loc[k, header]
                if value < q1:
                    table.loc[k, header] = 0
                elif value < q2:
                    table.loc[k, header] = 1
                elif value < q3:
                    table.loc[k, header] = 2
                else:
                    table.loc[k, header] = 3
        table = table.astype(int)
        return table

Test:

    df = pd.DataFrame({
        'A': [30, 28, 32, 25, 25, 25, 22, 24, 35, 40],
        'B': [25, 30, 27, 40, 42, 40, 50, 45, 30, 25],
        'C': [25.5, 30.1, 27.3, 40.77, 25.1, 25.34, 22.11, 23.81, 33.66, 38.56],
    }, columns=['A', 'B', 'C'])

Result:

    A  B  C
    2  0  1
    2  1  2
    3  0  2
    1  2  3
    1  3  0
    1  2  1
    0  3  0
    0  3  0
    3  1  3
    3  0  3

Is there a more efficient way to do this?

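You can combine np.digitize with DataFrame.rank. Note that this bins each value by its percentile rank rather than by the np.percentile cut points, so values near a boundary can land in a different bin: row 2 of B and row 4 of C differ from the result in the question.
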
    In [569]: np.digitize(df.rank(pct=True), bins=[.25, .5, .75], right=True)
    Out[569]:
    array([[2, 0, 1],
           [2, 1, 2],
           [3, 1, 2],
           [1, 2, 3],
           [1, 3, 1],
           [1, 2, 1],
           [0, 3, 0],
           [0, 3, 0],
           [3, 1, 3],
           [3, 0, 3]], dtype=int64)

Details

    In [570]: df.rank(pct=True)
    Out[570]:
         A     B    C
    0  0.7  0.15  0.5
    1  0.6  0.45  0.7
    2  0.8  0.30  0.6
    3  0.4  0.65  1.0
    4  0.4  0.80  0.3
    5  0.4  0.65  0.4
    6  0.1  1.00  0.1
    7  0.2  0.90  0.2
    8  0.9  0.45  0.8
    9  1.0  0.15  0.9

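To see where these fractions come from, the sketch below recomputes column A by hand. It assumes the default tie handling of `rank` (`method='average'`), where tied values share the mean of the positions they occupy, and that `pct=True` divides by the number of non-missing rows. The names `a`, `ranks`, `pct` and `manual` are illustrative only.

```python
import pandas as pd

a = pd.Series([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], name="A")

# Default ranking: ties share the average of the positions they occupy.
# The three 25s occupy sorted positions 3, 4 and 5, so each gets rank 4.0.
ranks = a.rank()

# pct=True is the same ranking divided by the number of non-missing values.
pct = a.rank(pct=True)
manual = ranks / a.count()

print(pd.DataFrame({"value": a, "rank": ranks, "pct": pct, "manual": manual}))
```

Because ties share one rank, all three 25s get 0.4 and fall into the same bin.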
    In [571]: pd.DataFrame(np.digitize(df.rank(pct=True), bins=[.25, .5, .75], right=True),
                           columns=df.columns)
    Out[571]:
       A  B  C
    0  2  0  1
    1  2  1  2
    2  3  1  2
    3  1  2  3
    4  1  3  1
    5  1  2  1
    6  0  3  0
    7  0  3  0
    8  3  1  3
    9  3  0  3
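
If the codes have to match the loop in the question exactly, one option is to keep `np.percentile` for the cut points and let `np.digitize` do the comparisons one column at a time. This is a sketch rather than a tuned implementation: `encode_by_percentile` is a hypothetical helper name, and it assumes every column is numeric with no missing values.

```python
import numpy as np
import pandas as pd

def encode_by_percentile(df):
    """Encode each column as 0..3 using its own 25/50/75 percentiles."""
    codes = {}
    for col in df.columns:
        values = df[col].to_numpy()
        cuts = np.percentile(values, [25, 50, 75])
        # Default right=False: cuts[i-1] <= x < cuts[i] maps to i, so
        # x < q1 -> 0, q1 <= x < q2 -> 1, q2 <= x < q3 -> 2, x >= q3 -> 3.
        codes[col] = np.digitize(values, cuts)
    return pd.DataFrame(codes, index=df.index)

df = pd.DataFrame({
    'A': [30, 28, 32, 25, 25, 25, 22, 24, 35, 40],
    'B': [25, 30, 27, 40, 42, 40, 50, 45, 30, 25],
    'C': [25.5, 30.1, 27.3, 40.77, 25.1, 25.34, 22.11, 23.81, 33.66, 38.56],
}, columns=['A', 'B', 'C'])

encoded = encode_by_percentile(df)
print(encoded)
```

The comparisons are the same as in the original `if`/`elif` chain, so on the sample data this should reproduce the table from the question, including the two boundary rows where the rank-based version differs.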