Assign unique identifiers to dataframe rows based on a dataframe with preassigned unique identifiers
My dataframe has unique identifiers assigned based on three columns, namely [col2, col3, col4].
Dataframe_1:
col1 col2 col3 col4 col5 unique_id
1 abc bcv zxc www.com 8
2 bcd qwe rty www.@com 12
3 klp oiu ytr www.io 15
4 zxc qwe rty www.com 6
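After preprocessing, Dataframe_2 will be imported. It has the same columns as shown above, but no unique_id. Its rows must be assigned unique identifiers based on col2, col3 and col4, by reference to Dataframe_1.
If Dataframe_2 contains new rows that do not exist in Dataframe_1, they are assigned new identifiers.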
Dataframe_2:
col1 col2 col3 col4 col5
1 bcd qwe rty www.@com
2 zxc qwe rty www.com
3 abc bcv zxc www.com
4 kph hir mat www.com
Expected Dataframe_2:
col1 col2 col3 col4 col5 unique_id
1 bcd qwe rty www.@com 12
2 zxc qwe rty www.com 6
3 abc bcv zxc www.com 8
4 kph hir mat www.com 35
Row 4 does not exist in Dataframe_1, so it is assigned a new unique identifier.
First use DataFrame.merge with a left join. The on parameter is omitted, so the merge uses the columns shared by both frames, ['col2','col3','col4'], which are the ones kept in the subset of Dataframe_1. Rows without a match get missing values in unique_id, so Series.isna is used to test for them, np.arange creates a new array of identifiers starting after the maximal existing value, and DataFrame.loc assigns that array to the unique_id column:
import numpy as np

df = Dataframe_2.merge(Dataframe_1[['col2','col3','col4', 'unique_id']],
                       how='left')
# rows with no match in Dataframe_1 have NaN in unique_id
mask = df['unique_id'].isna()
# first free identifier: one past the current maximum
maximal = Dataframe_1['unique_id'].max() + 1
df.loc[mask, 'unique_id'] = np.arange(maximal, maximal + mask.sum())
df['unique_id'] = df['unique_id'].astype(int)
print(df)
col1 col2 col3 col4 col5 unique_id
0 1 bcd qwe rty www.@com 12
1 2 zxc qwe rty www.com 6
2 3 abc bcv zxc www.com 8
3 4 kph hir mat www.com 16
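For anyone who wants to run the merge approach end to end, here is a minimal self-contained sketch that rebuilds the two frames from the question and wraps the same steps in a function. The names `KEYS` and `assign_ids` are made up for this illustration, and the sketch assumes each key combination appears at most once in Dataframe_1.

```python
import numpy as np
import pandas as pd

KEYS = ["col2", "col3", "col4"]  # key columns named in the question


def assign_ids(known: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Copy unique_id from `known` by key; number unmatched rows after the current maximum."""
    out = incoming.merge(known[KEYS + ["unique_id"]], on=KEYS, how="left")
    mask = out["unique_id"].isna()            # rows with no match in `known`
    start = known["unique_id"].max() + 1      # first unused identifier
    out.loc[mask, "unique_id"] = np.arange(start, start + mask.sum())
    out["unique_id"] = out["unique_id"].astype(int)
    return out


Dataframe_1 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["abc", "bcd", "klp", "zxc"],
    "col3": ["bcv", "qwe", "oiu", "qwe"],
    "col4": ["zxc", "rty", "ytr", "rty"],
    "col5": ["www.com", "www.@com", "www.io", "www.com"],
    "unique_id": [8, 12, 15, 6],
})
Dataframe_2 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["bcd", "zxc", "abc", "kph"],
    "col3": ["qwe", "qwe", "bcv", "hir"],
    "col4": ["rty", "rty", "zxc", "mat"],
    "col5": ["www.@com", "www.com", "www.com", "www.com"],
})

result = assign_ids(Dataframe_1, Dataframe_2)
print(result["unique_id"].tolist())  # [12, 6, 8, 16]
```

Passing `on=KEYS` explicitly is a small departure from the answer, which omits `on`; it keeps the join from silently changing if the two frames ever share additional column names.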
# assign the old unique_id, matching on the key columns col2, col3, col4
keys = ['col2', 'col3', 'col4']
df2n = df2.join(df1.set_index(keys)[['unique_id']], on=keys, how='left')
# assign new unique_id values starting at max df1.unique_id + 1
id_max = df1.unique_id.max() + 1
cond = df2n['unique_id'].isnull()
null_num = cond.sum()
df2n.loc[cond, 'unique_id'] = range(id_max, id_max + null_num)
df2n['unique_id'] = df2n['unique_id'].astype(int)
print(df2n)
col1 col2 col3 col4 col5 unique_id
0 1 bcd qwe rty www.@com 12
1 2 zxc qwe rty www.com 6
2 3 abc bcv zxc www.com 8
3 4 kph hir mat www.com 16
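One case the approaches above do not cover is a new key that appears more than once in the incoming frame: each unmatched row is numbered separately, so two rows with the same new (col2, col3, col4) would end up with different identifiers. If that matters, a possible variant is to number the distinct new keys instead of the rows. This is only a sketch; `assign_ids_per_key` and the sample frames are invented for the example, and it assumes the key columns contain no missing values.

```python
import pandas as pd

KEYS = ["col2", "col3", "col4"]  # key columns named in the question


def assign_ids_per_key(known: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Reuse known ids by key; give every distinct new key one fresh id."""
    lookup = known[KEYS + ["unique_id"]].drop_duplicates(KEYS)
    out = incoming.merge(lookup, on=KEYS, how="left")
    new = out["unique_id"].isna()
    start = int(known["unique_id"].max()) + 1
    # ngroup() numbers each distinct key 0, 1, 2, ... in order of first appearance
    offsets = out.loc[new].groupby(KEYS, sort=False).ngroup()
    out.loc[new, "unique_id"] = start + offsets
    out["unique_id"] = out["unique_id"].astype(int)
    return out


known = pd.DataFrame({
    "col2": ["abc", "bcd"],
    "col3": ["bcv", "qwe"],
    "col4": ["zxc", "rty"],
    "unique_id": [8, 12],
})
incoming = pd.DataFrame({
    "col2": ["bcd", "kph", "kph", "xyz"],
    "col3": ["qwe", "hir", "hir", "uvw"],
    "col4": ["rty", "mat", "mat", "rst"],
})

labelled = assign_ids_per_key(known, incoming)
print(labelled["unique_id"].tolist())  # [12, 13, 13, 14]
```

Here both `kph hir mat` rows share identifier 13, and the next new key receives 14.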
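import math
import random
import pandas as pd

# right join keeps every row of df2, in df2's order; unmatched rows get NaN in unique_id
df3 = pd.merge(df1, df2, on=['col2', 'col3', 'col4'], how='right')

def return_unique_num(used_ids):
    # draw random integers until one is not already in use
    # (len(used_ids) + 1 candidates guarantee that at least one is free)
    unique_num = random.randint(1, len(used_ids) + 1)
    while unique_num in used_ids:
        unique_num = random.randint(1, len(used_ids) + 1)
    return unique_num

used_ids = set(df1['unique_id'])
id_col = df3.columns.get_loc('unique_id')
for i, e in enumerate(df3['unique_id']):
    if math.isnan(e):
        # replace the NaN with an unused integer and remember it,
        # so that two new rows cannot receive the same id
        new_id = return_unique_num(used_ids)
        df3.iloc[i, id_col] = new_id
        used_ids.add(new_id)
df3['unique_id'] = df3['unique_id'].astype(int)
df2['unique_id'] = df3['unique_id']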
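This assigns unique IDs to df2 based on the unique_id values in df1.
Output (the ID given to the new row is drawn at random, so the last value differs between runs):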
col1 col2 col3 col4 col5 unique_id
1 bcd qwe rty www.@com 12
2 zxc qwe rty www.com 6
3 abc bcv zxc www.com 8
4 kph hir mat www.com 35
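Since the random approach depends on never reusing an identifier, it may be worth checking the result before relying on it. The sketch below tests two properties: every key combination maps to exactly one unique_id, and no unique_id is shared by two different keys. The function name `ids_are_consistent` and the small frames are assumptions made for this example only.

```python
import pandas as pd

KEYS = ["col2", "col3", "col4"]  # key columns named in the question


def ids_are_consistent(known: pd.DataFrame, labelled: pd.DataFrame) -> bool:
    """True if keys and unique_id values are in one-to-one correspondence."""
    cols = KEYS + ["unique_id"]
    # distinct (key, id) pairs seen across both frames
    pairs = pd.concat([known[cols], labelled[cols]]).drop_duplicates()
    one_id_per_key = not pairs.duplicated(subset=KEYS).any()
    one_key_per_id = not pairs.duplicated(subset=["unique_id"]).any()
    return one_id_per_key and one_key_per_id


known = pd.DataFrame({
    "col2": ["abc", "bcd", "klp", "zxc"],
    "col3": ["bcv", "qwe", "oiu", "qwe"],
    "col4": ["zxc", "rty", "ytr", "rty"],
    "unique_id": [8, 12, 15, 6],
})
good = pd.DataFrame({
    "col2": ["bcd", "zxc", "abc", "kph"],
    "col3": ["qwe", "qwe", "bcv", "hir"],
    "col4": ["rty", "rty", "zxc", "mat"],
    "unique_id": [12, 6, 8, 35],
})
# same rows, but the new key was given an id that already belongs to another key
bad = good.assign(unique_id=[12, 6, 8, 15])

print(ids_are_consistent(known, good))  # True
print(ids_are_consistent(known, bad))   # False
```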