How to explode a pandas dataframe with lists to label the ones in the same row with the same ID?
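For example, I have a pandas dataframe like this:
Ignoring the "Name" column, I want a dataframe that looks like this, labeling the hashes of the same group with their "ID":
Here, we traverse each row. When we encounter "8a43", we assign ID 1 to it, and wherever we find the same hash value we assign ID 1 as well. Then we move on to the next row and encounter 79e2 and b183. We then traverse all the rows, and wherever we find these values we store their ID as 2. The issue arises when we reach "abc7". It will be assigned ID=5, because one of its hashes was previously encountered in "abc5". But I also want every row after the current one that contains "26ea" to be assigned ID=5.
I hope all this makes sense. If not, feel free to reach out to me via comment or message. I will clear it up as quickly as possible.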
Use a networkx solution to build a dictionary of common values, select the first value in Hash_Value with str[0], and use Series.map:
# if necessary, convert strings to lists
# df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')
import networkx as nx

G = nx.Graph()
for l in df['Hash_Value']:
    # each row's hashes become a path, so they end up in one component
    nx.add_path(G, l)

new = list(nx.connected_components(G))
print(new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]

# map every hash to the index of its component
mapped = {node: cid for cid, component in enumerate(new) for node in component}
df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1
print(df)
           Hash_Value   Name  ID
0              [8a43]   abc1   1
1        [79e2, b183]   abc2   2
2              [f82a]   abc3   3
3              [b183]   abc4   2
4  [eaa7, 5ea9, 1cee]   abc5   4
5              [5ea9]   abc6   4
6        [1cee, 26ea]   abc7   4
7              [79e2]   abc8   2
8              [8a43]   abc9   1
9              [26ea]  abc10   4
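For readers who want to see what the connected-components step is doing, here is a minimal sketch of the same grouping in plain Python with a small union-find, without networkx or pandas. The function name `label_groups` and the `rows` list are illustrative assumptions (the rows are copied from the output above), each row is assumed to be a non-empty list, and IDs are numbered by first appearance, which is not guaranteed to match the numbering networkx produces on other data.

```python
def label_groups(rows):
    """Return one 1-based group ID per row; rows sharing any hash share an ID."""
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups containing a and b.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every hash in a row to the row's first hash.
    for row in rows:
        for h in row:
            union(row[0], h)

    # Number the groups in order of first appearance.
    ids, labels = {}, []
    for row in rows:
        root = find(row[0])
        labels.append(ids.setdefault(root, len(ids) + 1))
    return labels


# Hypothetical input: the Hash_Value lists shown in the output above.
rows = [
    ["8a43"], ["79e2", "b183"], ["f82a"], ["b183"],
    ["eaa7", "5ea9", "1cee"], ["5ea9"], ["1cee", "26ea"],
    ["79e2"], ["8a43"], ["26ea"],
]
print(label_groups(rows))  # [1, 2, 3, 2, 4, 4, 4, 2, 1, 4]
```

Because all links are made before any ID is handed out, the result does not depend on the order in which the rows appear.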
Solution using a dictionary
import numpy as np
import pandas as pd

hashvalues = list(df['Hash_Value'])
dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert to list
    if isinstance(hashlist, str):
        hashlist = hashlist.replace('[', '').replace(']', '')
        # strip so that ", " separators do not leave leading spaces
        hashlist = [h.strip() for h in hashlist.split(',')]
        # check if the hash is unknown
        if hashlist[0] not in dic:
            # assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known, use the existing id
            k = dic[hashlist[0]]
        for h in hashlist[1:]:
            # set the id of the remaining hashes in the list
            # equal to the first hash's id
            dic[h] = k
        id_list.append(k)
    else:
        # not a string (e.g. missing value): no id
        id_list.append(np.nan)

# store the ids so that they show up in the printed frame
df['ID'] = id_list
print(df)
               Hash   Name  ID
0            [8a43]   abc1   1
1       [79e2,b183]   abc2   2
2            [f82a]   abc3   3
3            [b183]   abc4   2
4  [eaa7,5ea9,1cee]   abc5   4
5            [5ea9]   abc6   4
6       [1cee,26ea]   abc7   4
7            [79e2]   abc8   2
8            [8a43]   abc9   1
9            [26ea]  abc10   4
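One caveat about the dictionary loop is worth illustrating: it assigns IDs in a single pass, so a row that connects two groups which already received different IDs does not relabel the earlier rows. The sketch below mirrors the loop for lists that are already parsed. The function name `single_pass_ids` and the toy rows are assumptions made for illustration, not part of the original answer.

```python
def single_pass_ids(rows):
    """Single-pass ID assignment, mirroring the dictionary loop above."""
    dic, next_id, out = {}, 1, []
    for hashes in rows:
        # An unseen first hash starts a new ID.
        if hashes[0] not in dic:
            dic[hashes[0]] = next_id
            next_id += 1
        k = dic[hashes[0]]
        # The remaining hashes inherit the first hash's ID.
        for h in hashes[1:]:
            dic[h] = k
        out.append(k)
    return out


# "a" and "b" get separate IDs before the third row links them.
rows = [["a"], ["b"], ["a", "b"]]
print(single_pass_ids(rows))  # [1, 2, 1]
```

A connected-components approach such as the networkx answer would put all three rows in one group. On the sample data in this question the two approaches agree, because no row merges groups that were already numbered separately.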