How to explode a pandas DataFrame with lists so that hashes appearing in the same row are labeled with the same ID?

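For example, I have a pandas DataFrame like this:

Ignoring the "Name" column, I want a DataFrame that looks like this, labeling the hashes of the same group with their "ID":

Here, we go through each row. On encountering "8a43", we assign it ID 1, and every row with the same hash value also gets ID 1. We then move to the next row and encounter 79e2 and b183. We go through all the rows, and wherever we find these values we store their ID as 2. The problem arises when we reach "abc7". It will be assigned ID=5, because it was encountered earlier in "abc5". But I also want ID=5 assigned to every row after the current one in which I find "26ea".

I hope all this makes sense. If not, feel free to reach out via comments or a message. I'll clear it up quickly.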

Use the networkx solution to build a dictionary of common values, select the first value of each list in Hash_Value with str[0], and use Series.map:

#if necessary convert to lists
#df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')

import networkx as nx

G=nx.Graph()
for l in df['Hash_Value']:
    nx.add_path(G, l)

new = list(nx.connected_components(G))

print (new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]

mapped =  {node: cid for cid, component in enumerate(new) for node in component}

df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1

print (df)
           Hash_Value   Name  ID
0              [8a43]   abc1   1
1        [79e2, b183]   abc2   2
2              [f82a]   abc3   3
3              [b183]   abc4   2
4  [eaa7, 5ea9, 1cee]   abc5   4
5              [5ea9]   abc6   4
6        [1cee, 26ea]   abc7   4
7              [79e2]   abc8   2
8              [8a43]   abc9   1
9              [26ea]  abc10   4
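
If networkx is not available, the same connected-components grouping can be sketched with a small union-find in plain Python. This is only an illustration, not part of the answer above: the function name assign_group_ids is hypothetical, the rows are copied from the output shown, and every row is assumed to be a non-empty list of hashes.

```python
def assign_group_ids(rows):
    """Give every row the ID of the connected group its hashes belong to."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # First pass: hashes that share a row belong to the same group.
    for hashes in rows:
        for h in hashes:
            union(hashes[0], h)

    # Second pass: number the groups in order of first appearance.
    group_id = {}
    ids = []
    for hashes in rows:
        root = find(hashes[0])
        if root not in group_id:
            group_id[root] = len(group_id) + 1
        ids.append(group_id[root])
    return ids


rows = [
    ["8a43"], ["79e2", "b183"], ["f82a"], ["b183"],
    ["eaa7", "5ea9", "1cee"], ["5ea9"], ["1cee", "26ea"],
    ["79e2"], ["8a43"], ["26ea"],
]
print(assign_group_ids(rows))
```

If Hash_Value already holds lists, something like df['ID'] = assign_group_ids(df['Hash_Value']) should produce the same column as the networkx version for this data.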

Solution using a dictionary

import numpy as np
import pandas as pd

hashvalues = list(df['Hash_Value'])

dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert a string such as '[79e2,b183]' to a list of hashes
    if isinstance(hashlist, str):
        hashlist = hashlist.replace('[','').replace(']', '')
        hashlist = [h.strip() for h in hashlist.split(',')]

        # check if the first hash is unknown
        if hashlist[0] not in dic:
            # assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known, use the existing id
            k = dic[hashlist[0]]

        for h in hashlist[1:]:
            # set the id of the remaining hashes in the list
            # equal to the first hash's id
            dic[h] = k

        id_list.append(k)
    else:
        # not a string: no id can be assigned
        id_list.append(np.nan)

# store the ids in a new column
df['ID'] = id_list
print(df)

               Hash   Name  ID
0            [8a43]   abc1   1
1       [79e2,b183]   abc2   2
2            [f82a]   abc3   3
3            [b183]   abc4   2
4  [eaa7,5ea9,1cee]   abc5   4
5            [5ea9]   abc6   4
6       [1cee,26ea]   abc7   4
7            [79e2]   abc8   2
8            [8a43]   abc9   1
9            [26ea]  abc10   4
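
One caveat that may be worth checking against your own data: the dictionary loop makes a single forward pass, so a row that was numbered before a later row links its hash to another group keeps its old ID. The sketch below mirrors the loop's logic on plain lists to show this. The function name first_hash_ids and the hashes 'aaaa' and 'bbbb' are made up for illustration.

```python
def first_hash_ids(rows):
    """Dictionary approach: a row takes the ID of its first hash."""
    known, next_id, ids = {}, 1, []
    for hashes in rows:
        if hashes[0] not in known:
            # first hash not seen before: open a new ID
            known[hashes[0]] = next_id
            next_id += 1
        k = known[hashes[0]]
        for h in hashes[1:]:
            # the remaining hashes inherit the first hash's ID
            known[h] = k
        ids.append(k)
    return ids


# Hypothetical data: 'aaaa' and 'bbbb' each get an ID before row 2 links them.
rows = [["aaaa"], ["bbbb"], ["aaaa", "bbbb"], ["bbbb"]]
print(first_hash_ids(rows))  # [1, 2, 1, 1]
```

Row 1 keeps ID 2 even though 'bbbb' later joins group 1. If earlier rows also need to be relabeled, a connected-components approach handles that case; if only later rows matter, as the question describes, the dictionary loop may be enough.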