How to reduce this pandas dataframe join code
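I have a template in the form of a list of tuples, which I want to instantiate using DataFrame joins.

    rule = [('#1', 'X', 'Y'), ('#2', 'X', 'Z'), ('#3', 'Z', 'Y')]

I also have the instances of each component of the template, stored in a dictionary.

    rComp_substitution =
    {('#1', 'X', 'Y'):
               pred  subj  obj
        0  nationality  BART  USA,
     ('#2', 'X', 'Z'):
                pred  subj      obj
        0  placeOfBirth  BART  NEWYORK
        1     hasFather  BART   HOMMER,
     ('#3', 'Z', 'Y'):
               pred     subj  obj
        0    locatedIn  NEWYORK  USA
        1  nationality   HOMMER  USA }

The instances for each component are a pandas DataFrame with three columns. For ('#1', 'X', 'Y'), #1 corresponds to pred, X corresponds to subj, and Y corresponds to obj.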
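To make that mapping concrete, here is a small sketch that pairs each element of a rule component with the column it labels. The helper name `variable_columns` is hypothetical and used only for illustration; it is not part of the original code.

```python
# Hypothetical helper (not from the original post): map each element of a
# rule component to the DataFrame column it stands for.
COLUMNS = ("pred", "subj", "obj")

def variable_columns(rule_component):
    """Return a dict such as {'#1': 'pred', 'X': 'subj', 'Y': 'obj'}."""
    return dict(zip(rule_component, COLUMNS))

mapping = variable_columns(("#1", "X", "Y"))
print(mapping)
```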
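For example, start by instantiating ('#1', 'X', 'Y') and ('#2', 'X', 'Z'). We look for the variable that ('#1', 'X', 'Y') and ('#2', 'X', 'Z') have in common, then join the two DataFrames using that common variable, X (the subj column in both), as the key. This gives the instantiations of ('#1', 'X', 'Y'), ('#2', 'X', 'Z').

Here is my code.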
    import pandas as pd

    depth = 0

    # step 1: find the common variable
    current_subj = rule[depth][1]    # 'X'
    current_obj = rule[depth][2]     # 'Y'
    next_subj = rule[depth+1][1]     # 'X'
    next_obj = rule[depth+1][2]      # 'Z'
    if current_subj == next_subj or current_subj == next_obj:
        comVar = current_subj
    elif current_obj == next_subj or current_obj == next_obj:
        comVar = current_obj

    # step 2: build current_rComp, keyed by the common variable
    current_rComp = rComp_substitution[rule[depth]]
    unified_rComp = []
    for col in current_rComp.itertuples(index=False):
        if comVar == current_subj:
            unified_rComp.append([col.subj, [list(col)]])
        elif comVar == current_obj:
            unified_rComp.append([col.obj, [list(col)]])
    current_rComp = pd.DataFrame(unified_rComp, columns=['comVar', 'triples'])

    # step 3: build next_rComp, keyed by the common variable
    next_rComp = rComp_substitution[rule[depth+1]]
    unified_rComp = []
    for col in next_rComp.itertuples(index=False):
        if comVar == next_subj:
            unified_rComp.append([col.subj, [list(col)]])
        elif comVar == next_obj:
            unified_rComp.append([col.obj, [list(col)]])
    next_rComp = pd.DataFrame(unified_rComp, columns=['comVar', 'triples'])

    # step 4: join current_rComp and next_rComp on the common variable
    partial_proof_path = pd.merge(current_rComp, next_rComp, how='inner', on='comVar')
    print(partial_proof_path)
This code outputs:

      comVar                   triples_x                        triples_y
    0   BART  [[nationality, BART, USA]]  [[placeOfBirth, BART, NEWYORK]]
    1   BART  [[nationality, BART, USA]]     [[hasFather, BART, HOMMER]]
I think this code is too long. Is there a way to do the same thing with simpler code?
Input data:

    import pandas as pd

    rComp_substitution = {
        ('#1', 'X', 'Y'): pd.DataFrame({'pred': ['nationality'], 'subj': ['BART'], 'obj': ['USA']}),
        ('#2', 'X', 'Z'): pd.DataFrame({'pred': ['placeOfBirth', 'hasFather'], 'subj': ['BART', 'BART'], 'obj': ['NEWYORK', 'HOMMER']}),
        ('#3', 'Z', 'Y'): pd.DataFrame({'pred': ['locatedIn', 'nationality'], 'subj': ['NEWYORK', 'HOMMER'], 'obj': ['USA', 'USA']}),
    }
    rules = list(rComp_substitution.keys())
Main function:

    def merge_from_common_key(rule0, rule1):
        # Load copies so the DataFrames stored in rComp_substitution
        # keep their original pred/subj/obj column names
        df0 = rComp_substitution[rule0].copy()
        df1 = rComp_substitution[rule1].copy()
        # Rename ["pred", "subj", "obj"] by ruleN
        df0.columns = list(rule0)
        df1.columns = list(rule1)
        # Find the common key(s) and merge the two dataframes
        key = df0.columns.intersection(df1.columns).tolist()
        df = pd.merge(df0, df1, on=key)
        # Build the new dataframe; "common" holds the first shared variable
        return pd.DataFrame({"common": df[key[0]].tolist(),
                             "left": df[list(rule0)].values.tolist(),
                             "right": df[list(rule1)].values.tolist()})
Usage:

    >>> merge_from_common_key(rules[0], rules[1])
      common                      left                          right
    0   BART  [nationality, BART, USA]  [placeOfBirth, BART, NEWYORK]
    1   BART  [nationality, BART, USA]     [hasFather, BART, HOMMER]
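If the goal is to instantiate the whole template rather than one pair, the same renaming idea can probably be carried through every component in turn. The following is only a sketch under that assumption, not part of the answer above: the names `relabel` and `join_all` are hypothetical, and it relies on `pd.merge` joining on all columns the two frames share when `on` is omitted.

```python
from functools import reduce

import pandas as pd

rComp_substitution = {
    ('#1', 'X', 'Y'): pd.DataFrame({'pred': ['nationality'], 'subj': ['BART'], 'obj': ['USA']}),
    ('#2', 'X', 'Z'): pd.DataFrame({'pred': ['placeOfBirth', 'hasFather'], 'subj': ['BART', 'BART'], 'obj': ['NEWYORK', 'HOMMER']}),
    ('#3', 'Z', 'Y'): pd.DataFrame({'pred': ['locatedIn', 'nationality'], 'subj': ['NEWYORK', 'HOMMER'], 'obj': ['USA', 'USA']}),
}
rule = [('#1', 'X', 'Y'), ('#2', 'X', 'Z'), ('#3', 'Z', 'Y')]


def relabel(df, component):
    """Hypothetical helper: copy df and rename pred/subj/obj to the component's labels."""
    out = df.copy()
    out.columns = list(component)
    return out


def join_all(rule, substitutions):
    """Hypothetical helper: inner-join all components on the variables they share."""
    frames = [relabel(substitutions[comp], comp) for comp in rule]
    # With no `on` argument, pd.merge uses every column name common to both
    # frames as the join key (and raises if there is none).
    return reduce(lambda left, right: pd.merge(left, right, how='inner'), frames)


proof_paths = join_all(rule, rComp_substitution)
print(proof_paths)
```

Each row of `proof_paths` is one consistent assignment of the variables across all three components. Folding left to right assumes every component shares at least one variable with the components joined before it.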