How to merge strings that share common substrings into groups in a pandas DataFrame in Python
I have some sample data:
import pandas as pd
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
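What I want to do is merge strings that have a substring (a comma-separated item) in common. In this example, the strings 'b,c', 'a', and 'a,c,d,e' should be merged together because they can be linked to one another: 'b,c' and 'a,c,d,e' share c, and 'a' and 'a,c,d,e' share a. 'j,k,l' and 'k,l,m' should be in one group. In the end, I would like to have something like this: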
           group
'b,c',     0
'a',       0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l',   2
'k,l,m',   2
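So I would end up with three groups, and no two groups would have a substring in common.
Right now I am trying to build a similarity data frame in which 1 means that two strings have a substring in common. Here is my code: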
import numpy as np
commonWords=1
# add one column per string, initialised to 0
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0
# set the cell to 1 when strings i and j share at least commonWords items
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
a
The result is:
   ACTIVITY  b,c  a  a,c,d,e  f,g,h,i  j,k,l  k,l,m
0       b,c    1  0        1        0      0      0
1         a    0  1        1        0      0      0
2   a,c,d,e    1  1        1        0      0      0
3   f,g,h,i    0  0        0        1      0      0
4     j,k,l    0  0        0        0      1      1
5     k,l,m    0  0        0        0      1      1
From this you can see that wherever there is a 1, the corresponding row and column should be merged together.
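One way to read the matrix above is as the adjacency matrix of a graph whose nodes are the rows, so the wanted groups are its connected components. The following is only a minimal sketch of that idea using the standard library; the helper names `overlap_matrix` and `label_components` are hypothetical, and set intersection is assumed to be an acceptable stand-in for the nested loops in the question.

```python
from collections import deque

def overlap_matrix(rows, min_common=1):
    """Return an n x n 0/1 matrix: 1 where two rows share >= min_common items."""
    item_sets = [set(r.split(',')) for r in rows]
    return [[int(len(s & t) >= min_common) for t in item_sets] for s in item_sets]

def label_components(matrix):
    """Label each row with the id of its connected component (breadth-first search)."""
    n = len(matrix)
    labels = [None] * n
    next_id = 0
    for start in range(n):
        if labels[start] is not None:
            continue  # already reached from an earlier row
        labels[start] = next_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if matrix[i][j] and labels[j] is None:
                    labels[j] = next_id
                    queue.append(j)
        next_id += 1
    return labels

rows = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
matrix = overlap_matrix(rows)
labels = label_components(matrix)
print(labels)  # [0, 0, 0, 1, 2, 2]
```

Note that 'b,c' and 'a' land in the same group even though their cell in the matrix is 0, because both are linked through 'a,c,d,e'. The pairwise matrix needs O(n^2) comparisons, so this is probably only reasonable for small frames.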
Use networkx with connected_components:
import pandas as pd
import networkx as nx
from itertools import combinations, chain
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
# split each value on ',' into a list of items
splitted = a['ACTIVITY'].str.split(',')
# create edges: every pair of items within a row (an edge connects exactly two nodes)
L2_nested = [list(combinations(l,2)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
print(L2)
[('b', 'c'), ('a', 'c'), ('a', 'd'), ('a', 'e'), ('c', 'd'),
('c', 'e'), ('d', 'e'), ('f', 'g'), ('f', 'h'), ('f', 'i'),
('g', 'h'), ('g', 'i'), ('h', 'i'), ('j', 'k'), ('j', 'l'),
('k', 'l'), ('k', 'l'), ('k', 'm'), ('l', 'm')]
# create the graph from the edge list
G=nx.Graph()
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)
# map each node (item) to the id of its connected component
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}
# assign each row the group of the first item in its split list
a['group'] = [node2id.get(x[0]) for x in splitted]
print(a)
   ACTIVITY  group
0       b,c      0
1         a      0
2   a,c,d,e      0
3   f,g,h,i      1
4     j,k,l      2
5     k,l,m      2
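One caveat with the approach above: a row holding a single item produces no pairs from combinations, so if that item appears in no other row it never enters the graph and node2id.get returns None for it. In the sample data 'a' is fine only because it also occurs in 'a,c,d,e'. As an alternative that avoids this, here is a minimal union-find sketch using plain Python; this swaps in a different technique rather than the answer's networkx method, and the function name `group_by_shared_items` and its `sep` parameter are hypothetical.

```python
def group_by_shared_items(rows, sep=','):
    """Return one group id per row; rows sharing any item (directly or via a chain) get the same id."""
    parent = {}

    def find(x):
        # register unseen items as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    firsts = []
    for row in rows:
        items = row.split(sep)
        firsts.append(items[0])
        find(items[0])  # makes sure single-item rows are registered too
        for other in items[1:]:
            union(items[0], other)  # linking to the first item is enough for connectivity

    # number the groups in order of first appearance
    group_ids = {}
    return [group_ids.setdefault(find(first), len(group_ids)) for first in firsts]

rows = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
groups = group_by_shared_items(rows)
print(groups)  # [0, 0, 0, 1, 2, 2]
```

With the data frame from the question this could be applied as a['group'] = group_by_shared_items(a['ACTIVITY']). In the networkx version, a similar effect should be achievable by also adding every item as a node before computing the components.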