How to create a ranking variable/function for different periods in panel data?

Question

I have a dataset, df, that looks like this:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA
2020-01	11001	Los Angeles	CA	5000000	100000	0.02
2020-02	11001	Los Angeles	CA	5000000	125000	0.025
2020-03	11001	Los Angeles	CA	5000000	135000	0.027
2020-01	12002	Houston	TX	3000000	150000	0.05
2020-02	12002	Houston	TX	3000000	100000	0.033
2020-03	12002	Houston	TX	3000000	200000	0.066
...	...	...	...	...	...	...
2021-07	11001	Los Angeles	CA	5500499	340000	0.062
2021-07	12002	Houston	TX	3250012	211000	0.065

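where QTDPERCAPITA is simply Quantity/Population. I have many cities (4149, to be precise).

The quantity changes every month, and so does the population.

I want to create a new variable as a ranking in the range [0, 1], where 0 is the city with the lowest QTDPERCAPITA in a given month and 1 is the city with the highest QTDPERCAPITA in that month. Essentially, I want to create a new column like this: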

Date	Code	City	State	Population	Quantity	QTDPERCAPITA	RANKING
2020-01	11001	Los Angeles	CA	5000000	100000	0.02	0
2020-02	11001	Los Angeles	CA	5000000	125000	0.025	0
2020-03	11001	Los Angeles	CA	5000000	135000	0.027	0
2020-01	12002	Houston	TX	3000000	150000	0.05	1
2020-02	12002	Houston	TX	3000000	100000	0.033	1
2020-03	12002	Houston	TX	3000000	200000	0.066	1
...	...	...	...	...	...	...	...
2021-07	11001	Los Angeles	CA	5500499	340000	0.062	0
2021-07	12002	Houston	TX	3250012	211000	0.065	1

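How can I create this column so that RANKING is recomputed for every month? I was thinking of a for loop that extracts QTDPERCAPITA for each city at each unique date and writes it to a new column, df['RANKING'], matched on date and city.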

Answer 1

You can use:

# Min-max scaling applied to the ranks: (rank - 1) / (n - 1)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)

# Rank between [0, 1] -> 0 the lowest, 1 the highest
# transform() keeps the result aligned with df's original index
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].transform(ranking)

# Rank between [1, 4149] -> 1 the lowest, 4149 the highest (when there are no ties)
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank(method='dense')

Output:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA	RANKING
2020-01	11001	Los Angeles	CA	5000000	100000	0.02	0
2020-02	11001	Los Angeles	CA	5000000	125000	0.025	0
2020-03	11001	Los Angeles	CA	5000000	135000	0.027	0
2020-01	12002	Houston	TX	3000000	150000	0.05	1
2020-02	12002	Houston	TX	3000000	100000	0.033	1
2020-03	12002	Houston	TX	3000000	200000	0.066	1
2021-07	11001	Los Angeles	CA	5500499	340000	0.062	0
2021-07	12002	Houston	TX	3250012	211000	0.065	1
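For anyone who wants to run the approach above end to end, here is a minimal sketch that rebuilds the sample rows from the question and applies the same normalized rank per month. The helper name `normalized_rank` and the choice to give a lone city in a month a rank of 0.0 are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample rows from the question (two cities, four months).
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03",
             "2020-01", "2020-02", "2020-03",
             "2021-07", "2021-07"],
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Los Angeles", "Houston"],
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066, 0.062, 0.065],
})


def normalized_rank(x: pd.Series) -> pd.Series:
    """Scale within-month ranks to [0, 1].

    Assumption: a month with a single city gets 0.0 instead of
    dividing by zero.
    """
    if len(x) < 2:
        return pd.Series(0.0, index=x.index)
    return (x.rank() - 1) / (len(x) - 1)


# transform() returns values aligned with df's original index,
# so the result can be assigned directly as a new column.
df["RANKING"] = df.groupby("Date")["QTDPERCAPITA"].transform(normalized_rank)
print(df)
```

With only two cities per month, every month ends up with one 0.0 and one 1.0; with 4149 cities the values spread evenly between the two.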

Answer 2

Try this:

# get the unique months
months = list(set(df['Date'].values))

# initialize a new column the length of our dataframe
# (the indexing below assumes df has the default 0..n-1 index)
rank_col = [None] * len(df)

for month in months:
  # get all the QTDPERCAPITA values for the current month
  month_qtdpc = df.loc[df['Date'] == month, 'QTDPERCAPITA']

  # normalize so that the best city has 1 and the worst has 0 for this month
  # (named month_max/month_min to avoid shadowing the built-ins max and min)
  month_max = month_qtdpc.max()
  month_min = month_qtdpc.min()
  rankings = (month_qtdpc - month_min) / (month_max - month_min)

  # insert each value at the proper index in our list
  # (Series.iteritems() was removed in pandas 2.0; items() is the replacement)
  for i, rank in rankings.items():
    rank_col[i] = rank

# add it as a column to the dataframe
df['RANKING'] = rank_col

Note: this gives you min-max scaled scores, not just ranks. For example, for three cities, the output would be (fake data):

index	Date	City	QTDPERCAPITA	RANKING
0	2020-01	Los Angeles	0.02	0.0
1	2020-02	Los Angeles	0.025	0.33
2	2020-03	Los Angeles	0.027	0.0
3	2020-01	Houston	0.05	1.0
4	2020-02	Houston	0.033	1.0
5	2020-03	Houston	0.066	1.0
6	2020-01	Denver	0.03	0.33
7	2020-02	Denver	0.021	0.0
8	2020-03	Denver	0.056	0.74
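The same min-max scores can probably be obtained without an explicit loop. The sketch below is one possible vectorized variant using `groupby(...).transform`, run on the three-city fake data from the table above; it is an alternative formulation and not the code from this answer. Note that a month in which every city has the same value would produce NaN (0/0).

```python
import pandas as pd

# Fake three-city data matching the table above.
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 3,
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Denver"] * 3,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066, 0.03, 0.021, 0.056],
})

grouped = df.groupby("Date")["QTDPERCAPITA"]

# Per-month min and max, broadcast back to every row of that month.
month_min = grouped.transform("min")
month_max = grouped.transform("max")

# Min-max scaling within each month: 0 for the lowest city, 1 for the highest.
df["RANKING"] = (df["QTDPERCAPITA"] - month_min) / (month_max - month_min)
print(df.round(2))
```

Because `transform` keeps the original index, this version does not depend on the dataframe having a default 0..n-1 index.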

Corralien's rank-based answer, by contrast, gives you true normalized ranks:

index	Date	City	QTDPERCAPITA	RANKING
0	2020-01	Los Angeles	0.02	0.0
1	2020-02	Los Angeles	0.025	0.5
2	2020-03	Los Angeles	0.027	0.0
3	2020-01	Houston	0.05	1.0
4	2020-02	Houston	0.033	1.0
5	2020-03	Houston	0.066	1.0
6	2020-01	Denver	0.03	0.5
7	2020-02	Denver	0.021	0.0
8	2020-03	Denver	0.056	0.5
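To make the difference between the two outputs concrete, here is a small sketch that computes the normalized ranks on the same fake data using the built-in grouped `rank()` and a per-month count instead of a lambda. Treat it as an illustration under the assumption that every month has at least two cities and no missing values.

```python
import pandas as pd

# Fake three-city data matching the tables above.
df = pd.DataFrame({
    "Date": ["2020-01", "2020-02", "2020-03"] * 3,
    "City": ["Los Angeles"] * 3 + ["Houston"] * 3 + ["Denver"] * 3,
    "QTDPERCAPITA": [0.02, 0.025, 0.027, 0.05, 0.033, 0.066, 0.03, 0.021, 0.056],
})

grouped = df.groupby("Date")["QTDPERCAPITA"]

# Within-month rank (1 = lowest); ties receive the average rank by default.
month_rank = grouped.rank()

# Number of non-missing cities in each month, broadcast to every row.
month_count = grouped.transform("count")

# Rescale ranks from [1, n] to [0, 1].
df["RANKING"] = (month_rank - 1) / (month_count - 1)
print(df)
```

Ranks only reflect ordering, so the middle city always lands on 0.5 here, whereas min-max scores also reflect how far apart the values are. `grouped.rank(pct=True)` is a related built-in, but it divides the rank by the group size, so the lowest city gets 1/n rather than 0.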

python

pandas

ranking-functions