How to create a ranking variable/function for different periods in panel data?
I have a dataset, df, that looks like this:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA |
---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 |
... | ... | ... | ... | ... | ... | ... |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 |
where QTDPERCAPITA is simply Quantity/Population. I have many cities (4149, to be precise). Quantity changes every month, and so does Population.
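
For reference, a minimal sketch of how a column like QTDPERCAPITA could be derived, assuming Quantity and Population are numeric columns of a pandas DataFrame. The rows are a few of the sample rows from the table above; nothing else about the real dataset is assumed.

```python
import pandas as pd

# A few sample rows copied from the table above (the real data has 4149 cities).
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-01", "2020-02"],
    "Code": [11001, 11001, 12002, 12002],
    "City": ["Los Angeles", "Los Angeles", "Houston", "Houston"],
    "State": ["CA", "CA", "TX", "TX"],
    "Population": [5000000, 5000000, 3000000, 3000000],
    "Quantity": [100000, 125000, 150000, 100000],
})

# Per-capita quantity: element-wise division of the two columns.
df["QTDPERCAPITA"] = df["Quantity"] / df["Population"]

print(df)
```

Because both columns vary by month, the ratio has to be recomputed per row rather than per city.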
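I want to create a new variable as a ranking in the range [0, 1], where 0 is the city with the lowest QTDPERCAPITA in a given month and 1 is the city with the highest QTDPERCAPITA in that month. Essentially, I want to create a new column like this: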
Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
---|---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 | 0 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 1 |
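How can I create this column so that RANKING changes every month? I was thinking of a for loop that extracts QTDPERCAPITA for every city at each unique date and writes the result into a new column, df['RANKING'], on the row with the matching date and city.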
You can use:

    # Min-max scale the within-month rank: (rank - min) / (max - min)
    ranking = lambda x: (x.rank() - 1) / (len(x) - 1)
    # Rank between [0, 1] -> 0 the lowest, 1 the highest
    # transform (rather than apply) keeps df's index, so the result assigns cleanly
    df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].transform(ranking)
    # Rank between [1, 4149] -> 1 the lowest, 4149 the highest
    # df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank(method='dense')

Output:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
---|---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.618 | 1 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 0 |
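
To try the approach on something small, here is a self-contained sketch using the first six sample rows from the question. It assumes a reasonably recent pandas and uses `transform`, which returns a result indexed like the original frame.

```python
import pandas as pd

# Sample rows from the question (two cities, three months).
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 2,
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066],
})

# Within each month, rank the cities and rescale the rank to [0, 1].
# transform returns a Series aligned with df, so it can be assigned directly.
df["RANKING"] = (
    df.groupby("Date")["QTDPERCAPITA"]
      .transform(lambda x: (x.rank() - 1) / (len(x) - 1))
)

print(df)
```

With only two cities per month, Los Angeles gets 0 and Houston gets 1 in every month, which matches the first six rows of the output above.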
Try this:

    # get the unique months
    months = list(set(df['Date'].values))
    # initialize a new column the length of our dataframe
    rank_col = [None] * len(df)
    for month in months:
        # get all the QTDPERCAPITA values for the current month
        month_qtdpc = df.loc[df['Date'] == month, 'QTDPERCAPITA']
        # normalize such that the best city has 1 and the worst has 0 for this month
        # (named hi/lo to avoid shadowing the built-in max/min)
        hi = month_qtdpc.max()
        lo = month_qtdpc.min()
        rankings = (month_qtdpc - lo) / (hi - lo)
        # insert each value at the proper position in our list
        # (assumes df has the default 0..n-1 index; iteritems() was removed in pandas 2.0)
        for i, rank in rankings.items():
            rank_col[i] = rank
    # add it as a column to the dataframe
    df['RANKING'] = rank_col

Note: this gives you a score, not just a rank. For example, with three cities the output would be (fake data):
index | Date | City | QTDPERCAPITA | RANKING |
---|---|---|---|---|
0 | 2020-01 | Los Angeles | 0.02 | 0.0 |
1 | 2020-02 | Los Angeles | 0.025 | 0.33 |
2 | 2020-03 | Los Angeles | 0.027 | 0.0 |
3 | 2020-01 | Houston | 0.05 | 1.0 |
4 | 2020-02 | Houston | 0.033 | 1.0 |
5 | 2020-03 | Houston | 0.066 | 1.0 |
6 | 2020-01 | Denver | 0.03 | 0.33 |
7 | 2020-02 | Denver | 0.021 | 0.0 |
8 | 2020-03 | Denver | 0.056 | 0.74 |
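
The same min-max score can probably be computed without the explicit loop. The sketch below is one way to do it with `groupby(...).transform`, using the fake three-city data from the table above; it is an alternative formulation, not the code that produced that table.

```python
import pandas as pd

# The fake three-city data from the table above.
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 3,
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Denver"] * 3,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066,
                     0.03, 0.021, 0.056],
})

# Per-month minimum and maximum, broadcast back to every row of that month.
grouped = df.groupby("Date")["QTDPERCAPITA"]
month_min = grouped.transform("min")
month_max = grouped.transform("max")

# Min-max score: 0 for the lowest city of the month, 1 for the highest.
df["RANKING"] = (df["QTDPERCAPITA"] - month_min) / (month_max - month_min)

print(df)
```

Since this works on aligned Series rather than list positions, it does not depend on the frame having a default integer index.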
With Corralien's answer, you instead get a true normalized rank:
index | Date | City | QTDPERCAPITA | RANKING |
---|---|---|---|---|
0 | 2020-01 | Los Angeles | 0.02 | 0.0 |
1 | 2020-02 | Los Angeles | 0.025 | 0.5 |
2 | 2020-03 | Los Angeles | 0.027 | 0.0 |
3 | 2020-01 | Houston | 0.05 | 1.0 |
4 | 2020-02 | Houston | 0.033 | 1.0 |
5 | 2020-03 | Houston | 0.066 | 1.0 |
6 | 2020-01 | Denver | 0.03 | 0.5 |
7 | 2020-02 | Denver | 0.021 | 0.0 |
8 | 2020-03 | Denver | 0.056 | 0.5 |
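
One edge case worth keeping in mind with the rank-based formula: if a month contains only one city, `len(x) - 1` is zero and the result is 0/0, i.e. NaN. The sketch below wraps the formula in a helper, `scaled_rank` (a hypothetical name), that assigns 0.0 in that case; that choice is an assumption, not something the answers specify. It also counts only non-missing values, which differs slightly from `len(x)`. The extra 2020-04 row is made up to trigger the edge case.

```python
import numpy as np
import pandas as pd

def scaled_rank(x: pd.Series) -> pd.Series:
    """Rank within one month, rescaled to [0, 1] (hypothetical helper)."""
    n = x.count()  # number of non-missing values in this month
    if n < 2:
        # Assumption: a lone city gets 0.0 instead of 0/0 = NaN.
        return pd.Series(np.where(x.notna(), 0.0, np.nan), index=x.index)
    return (x.rank() - 1) / (n - 1)

# Fake three-city data from the table above, plus one invented
# single-city month (2020-04) for illustration only.
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 3 + ["2020-04"],
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Denver"] * 4,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066,
                     0.03, 0.021, 0.056, 0.04],
})

df["RANKING"] = df.groupby("Date")["QTDPERCAPITA"].transform(scaled_rank)

print(df)
```

For the first nine rows this gives the same 0 / 0.5 / 1 values as the table above.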