How to create a ranking variable/function for different periods in panel data?

I have a dataset, df, that looks like this:

Date Code City State Population Quantity QTDPERCAPITA
2020-01 11001 Los Angeles CA 5000000 100000 0.02
2020-02 11001 Los Angeles CA 5000000 125000 0.025
2020-03 11001 Los Angeles CA 5000000 135000 0.027
2020-01 12002 Houston TX 3000000 150000 0.05
2020-02 12002 Houston TX 3000000 100000 0.033
2020-03 12002 Houston TX 3000000 200000 0.066
... ... ... ... ... ... ...
2021-07 11001 Los Angeles CA 5500499 340000 0.062
2021-07 12002 Houston TX 3250012 211000 0.065

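where QTDPERCAPITA is simply Quantity/Population. I have many more cities (4149, to be exact).

Quantity changes every month, and so does Population.

I want to create a new variable as a ranking in the range [0, 1], where 0 is the city with the lowest QTDPERCAPITA in that month and 1 is the city with the highest QTDPERCAPITA in that month. Essentially, I want to create a new column that looks like this: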

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
... ... ... ... ... ... ... ...
2021-07 11001 Los Angeles CA 5500499 340000 0.062 0
2021-07 12002 Houston TX 3250012 211000 0.065 1

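How do I create this column so that RANKING changes every month? I was thinking of a for loop that extracts QTDPERCAPITA for each city at each unique date and creates a new column, df['RANKING'], matched on the same date and city.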

You can use:

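# Min-max scaling of the rank: (rank - min) / (max - min)
# Note: a month with a single city divides by zero and yields NaN
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)

# Rank between [0, 1] -> 0 the lowest, 1 the highest
# transform() keeps the original index, so the result aligns with df
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].transform(ranking)

# Rank between [1, 4149] -> 1 the lowest, 4149 the highest (if there are no ties)
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank(method='dense')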

Output:

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
2021-07 11001 Los Angeles CA 5500499 340000 0.062 0
2021-07 12002 Houston TX 3250012 211000 0.065 1
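
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea, using only the two cities shown in the question. The helper name scaled_rank and the choice to return 0.0 for a month that contains a single city are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question (two cities only).
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03",
             "2020-01", "2020-02", "2020-03",
             "2021-07", "2021-07"],
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Los Angeles", "Houston"],
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066, 0.062, 0.065],
})


def scaled_rank(x):
    """Rank within one month, rescaled so the lowest is 0 and the highest is 1."""
    if len(x) < 2:
        # Assumption: a month with one city gets 0.0 instead of 0/0 = NaN.
        return pd.Series(0.0, index=x.index)
    return (x.rank() - 1) / (len(x) - 1)


# transform() returns a result aligned with df's index, one value per row.
df["RANKING"] = df.groupby("Date")["QTDPERCAPITA"].transform(scaled_rank)
print(df)
```

With only two cities per month, every month ends up with one 0.0 and one 1.0; with 4149 cities the values would be spread evenly between the two.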

Try this:

# get the unique values for each month
months = list(set(df['Date'].values))

# initialize a new column the length of our dataframe
# (assumes df has the default 0..n-1 integer index)
rank_col = [None] * len(df)

for month in months:
  # get all the QTDPERCAPITA values for the current month
  month_qtdpc = df.loc[df['Date'] == month, 'QTDPERCAPITA']

  # normalize such that the best city has 1 and the worst has 0 for this month
  # (names chosen so the built-in max/min are not shadowed)
  month_max = month_qtdpc.max()
  month_min = month_qtdpc.min()
  rankings = (month_qtdpc - month_min) / (month_max - month_min)

  # insert each rank value at the proper index in our list
  # (Series.iteritems() was removed in pandas 2.0; items() is the replacement)
  for i, rank in rankings.items():
    rank_col[i] = rank

# add it as a column to the dataframe
df['RANKING'] = rank_col

Note: this gives you a score, not just a rank. For example, with three cities the output would be (fake data):

index Date City QTDPERCAPITA RANKING
0 2020-01 Los Angeles 0.02 0.0
1 2020-02 Los Angeles 0.025 0.33
2 2020-03 Los Angeles 0.027 0.0
3 2020-01 Houston 0.05 1.0
4 2020-02 Houston 0.033 1.0
5 2020-03 Houston 0.066 1.0
6 2020-01 Denver 0.03 0.33
7 2020-02 Denver 0.021 0.0
8 2020-03 Denver 0.056 0.74
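
The same per-month min-max score can probably be computed without an explicit loop. The following is a sketch under the assumption that the three-city fake data above is the whole frame; the variable names by_month, month_min, and month_max are made up for illustration.

```python
import pandas as pd

# The three-city fake data from the table above.
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 3,
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Denver"] * 3,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066, 0.03, 0.021, 0.056],
})

by_month = df.groupby("Date")["QTDPERCAPITA"]

# transform("min") / transform("max") broadcast each month's extreme
# back onto every row of that month.
month_min = by_month.transform("min")
month_max = by_month.transform("max")

# Min-max score per month. If every city in a month has the same value,
# the denominator is 0 and the score for that month is NaN.
df["RANKING"] = (df["QTDPERCAPITA"] - month_min) / (month_max - month_min)
print(df.round(2))
```

This does not depend on the frame having a default integer index, which the list-based loop does.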

Corralien's answer, by contrast, gives you a true normalized rank:

index Date City QTDPERCAPITA RANKING
0 2020-01 Los Angeles 0.02 0.0
1 2020-02 Los Angeles 0.025 0.5
2 2020-03 Los Angeles 0.027 0.0
3 2020-01 Houston 0.05 1.0
4 2020-02 Houston 0.033 1.0
5 2020-03 Houston 0.066 1.0
6 2020-01 Denver 0.03 0.5
7 2020-02 Denver 0.021 0.0
8 2020-03 Denver 0.056 0.5
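
As a rough check, the normalized ranks in this table can be reproduced on the fake data with built-in groupby methods only. This is a sketch of one possible variant, not the code from Corralien's answer; it divides by the per-month count of non-missing values rather than by len(x).

```python
import pandas as pd

# The three-city fake data from the table above.
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 3,
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Denver"] * 3,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066, 0.03, 0.021, 0.056],
})

by_month = df.groupby("Date")["QTDPERCAPITA"]

# rank() numbers the cities 1..n within each month (ties get the average rank);
# transform("count") gives n, the number of non-missing values in that month.
# Subtracting 1 from both maps the ranks onto [0, 1].
df["RANKING"] = (by_month.rank() - 1) / (by_month.transform("count") - 1)
print(df)
```

The difference from the min-max score is visible for the middle city of each month: it always gets 0.5 here, regardless of how close its value is to the top or bottom city.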