Loop over pandas column for WMD similarity
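I have two DataFrames, each with two columns. Using WMD (Word Mover's Distance), I want to find, for every entity in the source_label column, the closest match among the entities in the target_label column. In the end I want a single DataFrame that contains all four columns for the matched entities.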
df1
,source_label,source_uri
'neuronal ceroid lipofuscinosis 8',"http://purl.obolibrary.org/obo/DOID_0110723"
'autosomal dominant distal hereditary motor neuronopathy',"http://purl.obolibrary.org/obo/DOID_0111198"
df2
,target_label,target_uri
'neuronal ceroid ',"http://purl.obolibrary.org/obo/DOID_0110748"
'autosomal dominanthereditary',"http://purl.obolibrary.org/obo/DOID_0111110"
Expected result
,source_label, target_label, source_uri, target_uri, wmd score
'neuronal ceroid lipofuscinosis 8', 'neuronal ceroid ', "http://purl.obolibrary.org/obo/DOID_0110723", "http://purl.obolibrary.org/obo/DOID_0110748", 0.98
'autosomal dominant distal hereditary motor neuronopathy', 'autosomal dominanthereditary', "http://purl.obolibrary.org/obo/DOID_0111198", "http://purl.obolibrary.org/obo/DOID_0111110", 0.65
The DataFrames are very large, so I am looking for a faster way to iterate over the two label columns. This is what I have tried so far:
list_distances = []
temp = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

entity = df1['source_label']
target = df2['target_label']

for i in tqdm(entity):
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    list_distances.append(min(temp))
# print("list_distances", list_distances)

WMD_Dataframe = pd.DataFrame({'source_label': pd.Series(entity),
                              'target_label': pd.Series(target),
                              'source_uri': df1['source_uri'],
                              'target_uri': df2['target_uri'],
                              'wmd_Score': pd.Series(list_distances)}).sort_values(by=['wmd_Score'])
WMD_Dataframe = WMD_Dataframe.reset_index()
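First, this code does not work well: the other two columns are taken straight from the DataFrames, without regard to which URI belongs to which entity.
Second, since there are millions of entities, how can I make it faster? Thanks in advance.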
Quick fix:
import numpy as np
from tqdm import tqdm

closest_neighbour_index_df2 = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

for i in tqdm(entity):
    temp = []  # reset the distances for every source entity
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    # maybe assert to make sure it's always right
    # use argmin to get the index of the closest target rather than the distance value
    closest_neighbour_index_df2.append(np.argmin(np.array(temp)))

# Add the indices from df2 to df1
df1['closest_neighbour'] = closest_neighbour_index_df2
# add information to respective row from df2 using the closest_neighbour column
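The last step, pulling the matching df2 row in next to each df1 row, could look roughly like the sketch below. It is only an illustration under a few assumptions: `closest_matches` and `toy_distance` are hypothetical names that are not part of the original post, and `toy_distance` is a stand-in for `model.wmdistance` so that the example runs without a trained embedding model. The sketch also tokenizes every label once up front rather than once per pair, which saves repeated work but does not reduce the number of distance computations.

```python
import numpy as np
import pandas as pd


def preprocess(sentence):
    # Lowercase and split on whitespace, as in the snippets above.
    return sentence.lower().split()


def closest_matches(df1, df2, distance):
    """Hypothetical helper: pair each source label with its closest target label.

    `distance` is any callable that takes two token lists and returns a number,
    for example model.wmdistance.
    """
    # Tokenize every label once instead of once per (source, target) pair.
    source_tokens = [preprocess(s) for s in df1["source_label"]]
    target_tokens = [preprocess(t) for t in df2["target_label"]]

    best_pos, best_dist = [], []
    for src in source_tokens:
        dists = np.array([distance(src, tgt) for tgt in target_tokens])
        pos = int(np.argmin(dists))  # position of the closest target in df2
        best_pos.append(pos)
        best_dist.append(float(dists[pos]))

    # Positional lookup keeps every target label paired with its own URI.
    matched = df2.iloc[best_pos]
    result = pd.DataFrame({
        "source_label": df1["source_label"].to_numpy(),
        "target_label": matched["target_label"].to_numpy(),
        "source_uri": df1["source_uri"].to_numpy(),
        "target_uri": matched["target_uri"].to_numpy(),
        "wmd_score": best_dist,
    })
    return result.sort_values("wmd_score").reset_index(drop=True)


def toy_distance(a, b):
    # Assumption: stand-in for model.wmdistance (1 - Jaccard overlap of tokens),
    # used only so this sketch runs without a word-embedding model.
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


df1 = pd.DataFrame({
    "source_label": ["neuronal ceroid lipofuscinosis 8",
                     "autosomal dominant distal hereditary motor neuronopathy"],
    "source_uri": ["http://purl.obolibrary.org/obo/DOID_0110723",
                   "http://purl.obolibrary.org/obo/DOID_0111198"],
})
df2 = pd.DataFrame({
    "target_label": ["neuronal ceroid ", "autosomal dominanthereditary"],
    "target_uri": ["http://purl.obolibrary.org/obo/DOID_0110748",
                   "http://purl.obolibrary.org/obo/DOID_0111110"],
})

result = closest_matches(df1, df2, toy_distance)
print(result)
```

With a real model you would pass `model.wmdistance` in place of `toy_distance`. The loop is still one distance call per (source, target) pair, so for millions of entities you would likely also need to cut down the candidate targets for each source before computing WMD.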