How to explode a pandas dataframe with lists to label the rows that share a hash with the same ID?

Question

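For example, I have a pandas dataframe like this:

Ignoring the "Name" column, I want a dataframe that looks like this, with the hashes of the same group labeled with their "ID":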

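Here, we traverse each row. When we encounter "8a43", we assign it ID 1, and wherever the same hash value appears, we assign ID 1 as well. Then we move on to the next row and encounter 79e2 and b183. We traverse all the rows again, and wherever we find these values, we store their ID as 2. The problem arises when we reach "abc7". It will be assigned ID=5, because one of its hashes was previously encountered in "abc5". But I also want ID=5 assigned to "26ea" wherever it appears in the rows after the current one.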
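To make the linkage between rows concrete, here is a minimal sketch that rebuilds the sample frame and lists which rows contain each hash. The data is reconstructed from the output printed in the answers, since the original table is not shown here; the names `exploded` and `rows_per_hash` are illustrative only.

```python
import pandas as pd

# Sample data reconstructed from the output printed in the answers.
df = pd.DataFrame({
    "Hash_Value": [
        ["8a43"], ["79e2", "b183"], ["f82a"], ["b183"],
        ["eaa7", "5ea9", "1cee"], ["5ea9"], ["1cee", "26ea"],
        ["79e2"], ["8a43"], ["26ea"],
    ],
    "Name": [f"abc{i}" for i in range(1, 11)],
})

# One row per (original row, hash) pair.
exploded = df.explode("Hash_Value")

# For each hash, the names of the rows that contain it.
rows_per_hash = exploded.groupby("Hash_Value")["Name"].agg(list)
print(rows_per_hash)
```

In this view "26ea" appears only in abc7 and abc10, while abc7 also holds "1cee", which it shares with abc5. That chain is why abc10 should end up in the same group as abc5.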

I hope all of this makes sense. If not, feel free to reach out through comments or messages, and I will clarify as soon as I can.

Answer 1

Use a networkx solution to build a dictionary of connected hash values, select the first value of each list in Hash_Value with str[0], and use Series.map:

#if necessary convert to lists
#df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')

import networkx as nx

G=nx.Graph()
for l in df['Hash_Value']:
    nx.add_path(G, l)

new = list(nx.connected_components(G))

print (new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]

mapped =  {node: cid for cid, component in enumerate(new) for node in component}

df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1

print (df)
           Hash_Value   Name  ID
0              [8a43]   abc1   1
1        [79e2, b183]   abc2   2
2              [f82a]   abc3   3
3              [b183]   abc4   2
4  [eaa7, 5ea9, 1cee]   abc5   4
5              [5ea9]   abc6   4
6        [1cee, 26ea]   abc7   4
7              [79e2]   abc8   2
8              [8a43]   abc9   1
9              [26ea]  abc10   4
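If networkx is not available, the same connected-components idea can be sketched with a small union-find structure. This is an alternative sketch, not part of the answer above: the function name `label_groups` is hypothetical, and it assumes every row holds a non-empty list of hashes.

```python
import pandas as pd

def label_groups(hash_lists):
    """Return one 1-based group ID per row; rows linked by shared hashes share an ID."""
    parent = {}

    def find(x):
        # Register unseen hashes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Pass 1: merge all hashes that occur together in a row.
    for hashes in hash_lists:
        for h in hashes:
            union(hashes[0], h)  # also registers single-hash rows

    # Pass 2: number the groups in order of first appearance.
    ids, result = {}, []
    for hashes in hash_lists:
        root = find(hashes[0])
        result.append(ids.setdefault(root, len(ids) + 1))
    return result

# Sample data reconstructed from the printed output above.
df = pd.DataFrame({
    "Hash_Value": [
        ["8a43"], ["79e2", "b183"], ["f82a"], ["b183"],
        ["eaa7", "5ea9", "1cee"], ["5ea9"], ["1cee", "26ea"],
        ["79e2"], ["8a43"], ["26ea"],
    ],
    "Name": [f"abc{i}" for i in range(1, 11)],
})
df["ID"] = label_groups(df["Hash_Value"])
print(df)
```

Because all merging happens before any ID is handed out, the result does not depend on the order in which linked rows appear.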

Answer 2

A solution using a dictionary:

import numpy as np
import pandas as pd

hashvalues = list(df['Hash_Value'])

dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert to list
    if isinstance(hashlist, str):
        # drop the brackets and any spaces around each hash
        hashlist = [h.strip() for h in hashlist.strip('[]').split(',')]

        # check if the hash is unknown
        if hashlist[0] not in dic:
            # Assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known use existing id
            k = dic[hashlist[0]]
            
        for h in hashlist[1:]:
            # give the remaining hashes in the row
            # the same ID as the first hash
            dic[h] = k
            
        id_list.append(k)
    else:
        id_list.append(np.nan)
    
# attach the collected IDs as a new column
df['ID'] = id_list
print(df)

         Hash_Value   Name  ID
0            [8a43]   abc1   1
1       [79e2,b183]   abc2   2
2            [f82a]   abc3   3
3            [b183]   abc4   2
4  [eaa7,5ea9,1cee]   abc5   4
5            [5ea9]   abc6   4
6       [1cee,26ea]   abc7   4
7            [79e2]   abc8   2
8            [8a43]   abc9   1
9            [26ea]  abc10   4
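One caveat worth checking with this approach: IDs are fixed row by row, so a row that is linked to an earlier group only by a later row keeps its old ID. The sketch below restates the dictionary logic as a function over lists (the name `label_first_hash` and the hashes "a1" and "b2" are made up for illustration) and shows such a case.

```python
def label_first_hash(hash_lists):
    """Dictionary approach: each row takes the ID of its first hash."""
    dic, next_id, ids = {}, 1, []
    for hashes in hash_lists:
        if hashes[0] not in dic:
            # unknown first hash: open a new group
            dic[hashes[0]] = next_id
            next_id += 1
        k = dic[hashes[0]]
        for h in hashes[1:]:
            # remaining hashes inherit (or are overwritten with) the row's ID
            dic[h] = k
        ids.append(k)
    return ids

# Hypothetical rows: the third row links "a1" and "b2" after both got their own ID.
rows = [["a1"], ["b2"], ["a1", "b2"], ["b2"]]
print(label_first_hash(rows))  # [1, 2, 1, 1]
```

The second row keeps ID 2 even though it shares "b2" with the rows labeled 1. On the sample data in the question this situation does not occur, which is why both answers print the same IDs there.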

