Assign median values by separating data into bins
I have a data frame that I want to split into bins, assigning each bin the median of the values that fall into it.
POA Egrid
200 1.17
205 0.63
275 1.08
325 1.22
350 0.57
The result should look like this:
POA Egrid
(200,300) Median of (1.17,0.63,1.08)
(300,400) Median of (1.22,0.57)
I tried writing two loops but couldn't figure out the middle part. Any help would be appreciated.
Use pd.cut, .groupby and .transform:
import pandas as pd
df['POA'] = df['POA'].astype(int)
df['POA'] = pd.cut(df['POA'], [0,99,199, 299, 399], include_lowest=True)
df['Egrid'] = df.groupby('POA')['Egrid'].transform('median')
df = df.drop_duplicates()
df
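Edit:
pd.cut has a flag, right=False, which makes each bin closed on the left and open on the right. With it the bin edges are cleaner: instead of ending a bin at 99, you can end it at 100.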
import pandas as pd
df['POA'] = df['POA'].astype(int)
df['POA'] = pd.cut(df['POA'], [0,100,200, 300,400], include_lowest=True, right=False)
df['Egrid'] = df.groupby('POA')['Egrid'].transform('median')
df = df.drop_duplicates()
df
Output (as it looks before the drop_duplicates call; afterwards only rows 0 and 3 remain):
          POA  Egrid
0  [200, 300)  1.080
1  [200, 300)  1.080
2  [200, 300)  1.080
3  [300, 400)  0.895
4  [300, 400)  0.895
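For reference, here is a self-contained sketch of the same pd.cut / groupby / transform idea using the sample data from the question. It assumes you want to keep the original POA values, so the bin goes into a separate column; the names POA_bin, Egrid_median and summary are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"POA": [200, 205, 275, 325, 350],
                   "Egrid": [1.17, 0.63, 1.08, 1.22, 0.57]})

# Keep POA intact and store the left-closed bin in a separate (illustrative) column
df["POA_bin"] = pd.cut(df["POA"], [200, 300, 400], right=False)

# transform returns one value per original row, aligned with df's index
df["Egrid_median"] = df.groupby("POA_bin", observed=True)["Egrid"].transform("median")

# Collapse to one row per bin by dropping duplicates on the two derived columns
summary = df[["POA_bin", "Egrid_median"]].drop_duplicates().reset_index(drop=True)
print(summary)
```

Dropping duplicates on only the derived columns avoids relying on POA having been overwritten, which is what makes the plain drop_duplicates() call in the answer collapse the rows.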
This is certainly not the most efficient approach, but it works!
First, let's recreate a similar setup:
import numpy as np
import pandas as pd
# make a DataFrame like yours
df = pd.DataFrame([[200, 1.17], [205, 0.63], [275, 1.08], [325, 1.22], [350, 0.57]], columns=["POA", "Egrid"])
Then, let's add the medians:
# first, define a list of possible ranges from which you want the medians
list_of_ranges = [(200, 300), (300, 400)]
# initialize a float column named "Median" (NaN marks rows outside every range)
df["Median"] = np.nan
# apply median to the desired ranges
for a, b in list_of_ranges:
    # rows whose POA falls in the half-open range [a, b)
    in_range = (df['POA'] >= a) & (df['POA'] < b)
    # calculate the median from the desired subset of the dataframe
    median = df.loc[in_range, "Egrid"].median()
    # apply the value where the condition is respected
    df.loc[in_range, 'Median'] = median
Let me know if anything is unclear.
Do:
s=df.groupby(pd.cut(df.POA,[100,200,300])).Egrid.median().reset_index()
POA Egrid
0 (100, 200] 1.170
1 (200, 300] 0.855
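Note that with the default right-closed bins above, 200 lands in (100, 200], and 325 and 350 fall outside every bin. As a sketch of how the same one-liner could reproduce the bins the question asks for, assuming left-closed edges at 200, 300 and 400:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"POA": [200, 205, 275, 325, 350],
                   "Egrid": [1.17, 0.63, 1.08, 1.22, 0.57]})

# right=False makes the bins [200, 300) and [300, 400)
binned = pd.cut(df["POA"], [200, 300, 400], right=False)

# Group by the bin and take the median of Egrid within each one
s = df.groupby(binned, observed=True)["Egrid"].median().reset_index()
print(s)
```

This yields one row per bin, with medians of about 1.08 and 0.895 for the sample data.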
import pandas as pd
import numpy as np
# Create the dataframe
d = {'POA':[200,205,275,325,350], 'Egrid':[1.17,0.63,1.08,1.22,0.57]}
df = pd.DataFrame(data=d)
# Create the lower edges of the bins to group by (each bin is 100 wide)
bins = [100,200,300,400,500,600,700,800,900,1000]
# For loop to assign each POA to a bin
for lower_bin in bins:
    upper_bin = lower_bin + 100
    df.loc[(df['POA'] >= lower_bin) & (df['POA'] < upper_bin), 'Bin'] = f'{lower_bin} to {upper_bin}'
# Create a pandas pivot_table to summarize the results
# Displays each bin and its median value
df2 = pd.pivot_table(df, index=['Bin'], values=['Egrid'], aggfunc='median')
print(df2)
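When the bins all have the same width, as they do here, the loop can be replaced by integer division. This is a different technique from the answer above, shown only as a sketch; the width of 100 and the names width, lower and medians are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"POA": [200, 205, 275, 325, 350],
                   "Egrid": [1.17, 0.63, 1.08, 1.22, 0.57]})

width = 100  # assumed fixed bin width

# Floor each POA to the lower edge of its bin, e.g. 275 -> 200
lower = (df["POA"] // width) * width

# Build the same "lower to upper" labels as the loop version
df["Bin"] = lower.astype(str) + " to " + (lower + width).astype(str)

# Median of Egrid per bin label
medians = df.groupby("Bin")["Egrid"].median()
print(medians)
```

This only works for equal-width bins; for irregular edges pd.cut is the better fit.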