Creating numeric identifiers for hierarchical DataFrame columns
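I have a Pandas DataFrame with more than 10 columns of data and a few million rows.
Three of the columns form a hierarchy with three distinct levels: high, medium, and low. These three columns contain strings with no missing data. Each column is sorted lexicographically within the overall combined hierarchy, e.g. ["A…","B…","C…"] comes before ["H…","A…","B…"].
I would like to add three new integer columns: high_id, medium_id, and low_id. Each of these three X_id columns should hold a value for every DataFrame row. In the first row, every X_id column is set to 1. An X_id column is incremented whenever the value in the corresponding X column differs from the previous row, unless a higher-level value has changed, in which case X_id is reset to 1 instead.
Example pure-Python implementation: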
rows = [
["high1", "med1", "low1"],
["high1", "med1", "low1"],
["high1", "med1", "low2"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low4"],
["high1", "med2", "low5"],
["high1", "med2", "low6"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med4", "low8"],
["high2", "med5", "low9"],
["high2", "med5", "lowA"],
["high2", "med5", "lowA"],
["high2", "med6", "lowB"],
["high3", "med4", "lowC"],
["high3", "med7", "low1"],
["high3", "med7", "lowD"],
["high3", "med7", "lowE"]]
high_id, medium_id, low_id = 1, 1, 1
ids = [[high_id, medium_id, low_id]]
previous_row = rows[0]
for row in rows[1:]:
    # Compare "high"
    if previous_row[0] != row[0]:
        high_id += 1
        medium_id = 1
        low_id = 1
    # Compare "medium"
    elif previous_row[1] != row[1]:
        medium_id += 1
        low_id = 1
    # Compare "low"
    elif previous_row[2] != row[2]:
        low_id += 1
    ids.append([high_id, medium_id, low_id])
    previous_row = row
for i, v in enumerate(rows):
    print(v + ids[i])
Output:
# high, medium, low, high_id, medium_id, low_id
['high1', 'med1', 'low1', 1, 1, 1]
['high1', 'med1', 'low1', 1, 1, 1]
['high1', 'med1', 'low2', 1, 1, 2]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low4', 1, 1, 4]
['high1', 'med2', 'low5', 1, 2, 1] # medium changed; low_id reset
['high1', 'med2', 'low6', 1, 2, 2]
['high1', 'med3', 'low7', 1, 3, 1] # medium changed; low_id reset
['high1', 'med3', 'low7', 1, 3, 1]
['high1', 'med3', 'low7', 1, 3, 1]
['high1', 'med4', 'low8', 1, 4, 1] # medium changed; low_id reset
['high2', 'med5', 'low9', 2, 1, 1] # high changed; low_id, medium_id reset
['high2', 'med5', 'lowA', 2, 1, 2]
['high2', 'med5', 'lowA', 2, 1, 2]
['high2', 'med6', 'lowB', 2, 2, 1] # medium changed; low_id reset
['high3', 'med4', 'lowC', 3, 1, 1] # high changed; low_id, medium_id reset
['high3', 'med7', 'low1', 3, 2, 1] # medium changed; low_id reset
['high3', 'med7', 'lowD', 3, 2, 2]
['high3', 'med7', 'lowE', 3, 2, 3]
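Note that these columns actually consist of geographic place names, so medium and low values can in principle recur under different sequences of parent levels. (There are only a few high values, and as far as I can see none of them is repeated.)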
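To illustrate the note above, here is a minimal sketch using four (high, medium) pairs taken from the sample rows, where med4 occurs under both high1 and high3. It only shows why numbering a column on its own is not enough; pd.factorize and groupby(...).transform are used for the demonstration and are not proposed as the final solution.

```python
import pandas as pd

# Four (high, medium) pairs from the sample data; "med4" has two different parents.
df = pd.DataFrame(
    [["high1", "med3"], ["high1", "med4"], ["high3", "med4"], ["high3", "med7"]],
    columns=["high", "medium"],
)

# Numbering the medium column on its own: both "med4" rows share one code,
# and the numbering never restarts when "high" changes.
global_codes = pd.factorize(df["medium"])[0] + 1

# Numbering within each parent: the count restarts at 1 for every "high" value.
per_parent = df.groupby("high", sort=False)["medium"].transform(
    lambda s: pd.factorize(s)[0] + 1
)

print(list(global_codes))   # [1, 2, 2, 3]
print(per_parent.tolist())  # [1, 2, 1, 2]
```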
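What is the idiomatic Pandas way to add these columns, preferably with vectorized operations?
I have read many existing questions on topics such as "hierarchy", "counter", and "identifier", but could not find anything that matches this specific nested case, where the identifiers need to be "reset".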
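I don't know whether this is an idiomatic approach, but the idea is to gather, per group, the information needed to determine each ID. The logic is to group the rows, collect each group's unique values into a list, and use the position of a value in that list as its ID. I could not find a way to avoid loops, so this uses loops. It may not satisfy you, but I am posting it as one possible approach.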
import pandas as pd
import numpy as np
rows = [
["high1", "med1", "low1"],
["high1", "med1", "low1"],
["high1", "med1", "low2"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low4"],
["high1", "med2", "low5"],
["high1", "med2", "low6"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med4", "low8"],
["high2", "med5", "low9"],
["high2", "med5", "lowA"],
["high2", "med5", "lowA"],
["high2", "med6", "lowB"],
["high3", "med4", "lowC"],
["high3", "med7", "low1"],
["high3", "med7", "lowD"],
["high3", "med7", "lowE"]]
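df = pd.DataFrame(rows, columns=['high','medium','low'])
# high_id is read from the digits in the sample labels ("high1" -> 1);
# this only works because the sample names happen to contain a number
df['high_id'] = df['high'].str.extract(r'(\d+)', expand=False).astype(int)
# Unique medium values per high, and unique low values per (high, medium), as arrays
m = df.groupby('high')['medium'].unique().to_frame().reset_index()
l = df.groupby(['high','medium'])['low'].unique().to_frame().reset_index()
# After the first merge the original column is medium_x and the array column is medium_y
df = df.merge(m, on='high', how='outer')
df.rename(columns={'medium_x':'medium'}, inplace=True)
# After the second merge the original column is low_x and the array column is low_y
df = df.merge(l, on=['high','medium'], how='outer')
df.tail()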
high medium low_x high_id medium_y low_y
16 high2 med6 lowB 2 [med5, med6] [lowB]
17 high3 med4 lowC 3 [med4, med7] [lowC]
18 high3 med7 low1 3 [med4, med7] [low1, lowD, lowE]
19 high3 med7 lowD 3 [med4, med7] [low1, lowD, lowE]
20 high3 med7 lowE 3 [med4, med7] [low1, lowD, lowE]
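df['medium_id'] = 0
for i in range(len(df)):
    # Position of this row's medium value within its group's array of unique values
    con = np.where(df.loc[i,'medium'] == df.loc[i,'medium_y'])
    df.loc[i,'medium_id'] = int(con[0][0]) + 1
df['low_id'] = 0
for i in range(len(df)):
    # Position of this row's low value within its group's array of unique values
    con = np.where(df.loc[i,'low_x'] == df.loc[i,'low_y'])
    df.loc[i,'low_id'] = int(con[0][0]) + 1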
df = df[['high', 'medium', 'low_x', 'high_id', 'medium_id','low_id']]
df.columns = ['high', 'medium', 'low', 'high_id', 'medium_id','low_id']
df
high medium low high_id medium_id low_id
0 high1 med1 low1 1 1 1
1 high1 med1 low1 1 1 1
2 high1 med1 low2 1 1 2
3 high1 med1 low3 1 1 3
4 high1 med1 low3 1 1 3
5 high1 med1 low3 1 1 3
6 high1 med1 low4 1 1 4
7 high1 med2 low5 1 2 1
8 high1 med2 low6 1 2 2
9 high1 med3 low7 1 3 1
10 high1 med3 low7 1 3 1
11 high1 med3 low7 1 3 1
12 high1 med4 low8 1 4 1
13 high2 med5 low9 2 1 1
14 high2 med5 lowA 2 1 2
15 high2 med5 lowA 2 1 2
16 high2 med6 lowB 2 2 1
17 high3 med4 lowC 3 1 1
18 high3 med7 low1 3 2 1
19 high3 med7 lowD 3 2 2
20 high3 med7 lowE 3 2 3
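For comparison, the rule from the question (increment when a value differs from the previous row, reset when a higher level changes) can probably also be written without a Python-level loop over the rows. The following is only a sketch under the question's assumption that rows belonging to the same parent are contiguous; the function name add_hierarchy_ids and its levels parameter are made up for this illustration, and the sample is a shortened excerpt of the rows above.

```python
import pandas as pd


def add_hierarchy_ids(df, levels=("high", "medium", "low")):
    """Return a copy of df with one <level>_id column per hierarchy level."""
    out = df.copy()
    # True where this level, or any level above it, differs from the previous row.
    changed = pd.Series(False, index=out.index)
    parent_run = None  # run number of the level above, used as the grouping key
    for level in levels:
        col = out[level]
        # The first row compares against NaN, so it always counts as a change.
        changed = changed | col.ne(col.shift())
        flags = changed.astype("int64")
        if parent_run is None:
            ids = flags.cumsum()  # top level: never reset
        else:
            # Restart the running count whenever the parent run changes.
            ids = flags.groupby(parent_run).cumsum()
        out[f"{level}_id"] = ids
        parent_run = flags.cumsum().to_numpy()
    return out


rows = [
    ["high1", "med1", "low1"],
    ["high1", "med1", "low1"],
    ["high1", "med1", "low2"],
    ["high1", "med2", "low5"],
    ["high2", "med5", "low9"],
    ["high2", "med5", "lowA"],
    ["high3", "med4", "lowC"],
    ["high3", "med7", "low1"],
]
result = add_hierarchy_ids(pd.DataFrame(rows, columns=["high", "medium", "low"]))
print(result)
```

Like the pure-Python loop in the question, this counts runs of consecutive equal values rather than distinct values, so a value that reappeared later under the same parent would get a new ID. The loop over levels runs only three times; the per-row work is done by shift, cumsum, and groupby.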