Creating a weighted network from co-occurrence of hashtags in a dataframe

I have a dataframe of tweets in which one column (tweets_df.hashtags) holds the hashtags used in each tweet, stored as a list of hashtags.

>> tweets_df.hashtags

0                                       [dkpol]
1                              [dkmedier, fv19]
2                    [dkpol, dksocial, dkidræt]
3                                       [dkpol]
4        [røgfrifremtid, folketingsvalg, dkpol]
5                           [biblioteker, fv19]
6                                       [dkpol]
7                                        [fv19]
8              [dkpol, fv19, løgner, mandsling]
9                               [dkpol, valg19]
10                                [dkpol, fv19]

From this I need to create a graph object to export to Gephi. What I want is each hashtag as a node and each co-occurrence as an undirected edge between hashtags.

So far I have tried the following:

import itertools
import pandas as pd

col1 = []
col2 = []
for index, row in tweets_df.head(10).iterrows():
    hashtags = row['hashtags']
    # every unordered pair of hashtags within one tweet becomes an edge
    for n in itertools.combinations(hashtags, 2):
        col1.append(n[0])
        col2.append(n[1])
df = pd.DataFrame(list(zip(col1, col2)))

This gives me an edge list like this:

>> df
             0               1
0         dkmedier            fv19
1            dkpol        dksocial
2            dkpol         dkidræt
3         dksocial         dkidræt
4    røgfrifremtid  folketingsvalg
5    røgfrifremtid           dkpol
6   folketingsvalg           dkpol
7      biblioteker            fv19
8            dkpol            fv19
9            dkpol          løgner
10           dkpol       mandsling
11            fv19          løgner
12            fv19       mandsling
13          løgner       mandsling
14           dkpol          valg19

I then create my network with g = nx.from_pandas_edgelist(df, 0, 1).

This gives me a network with the connections I need, but it does not give me weights based on how many times the same connection occurs.

I would be very grateful if anyone could help me with this.

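from_pandas_edgelist accepts an edge_attr argument that lets you set weights. So all you need to do is add another column to your data frame that holds, for each unique pair of hashtags, the number of tweets in which the pair co-occurs, and pass that column as edge_attr. The example below uses fake data to show the idea: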
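Applied to hashtag lists like the ones in the question, the same idea might look like the sketch below. It rebuilds a few of the question's rows as sample data; the names pair_counts and edges_df, and the choice of "weight" as the column name, are assumptions made for illustration. Each pair is counted at most once per tweet, and hashtags that only ever appear alone produce no edge and therefore no node.

```python
from collections import Counter
from itertools import combinations

import networkx as nx
import pandas as pd

# Sample data: a few of the rows shown in the question.
tweets_df = pd.DataFrame({
    "hashtags": [
        ["dkpol"],
        ["dkmedier", "fv19"],
        ["dkpol", "dksocial", "dkidræt"],
        ["dkpol", "fv19", "løgner", "mandsling"],
        ["dkpol", "valg19"],
        ["dkpol", "fv19"],
    ]
})

# Count every unordered pair once per tweet. Sorting the de-duplicated
# tags makes (a, b) and (b, a) fall on the same key.
pair_counts = Counter()
for tags in tweets_df["hashtags"]:
    pair_counts.update(combinations(sorted(set(tags)), 2))

# One row per unique pair, with the co-occurrence count as the weight.
edges_df = pd.DataFrame(
    [(a, b, n) for (a, b), n in pair_counts.items()],
    columns=["source", "target", "weight"],
)

g = nx.from_pandas_edgelist(edges_df, "source", "target", edge_attr="weight")
print(g["dkpol"]["fv19"]["weight"])
```

Counting with a Counter before building the data frame avoids a separate sort-and-groupby step, but the groupby approach shown in the answer gives the same counts.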

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx

# --------------------------------------------------------------------------------
# create some fake data
nodes = 'abcde'
edges = [(nodes[ii], nodes[jj]) for ii, jj in np.random.randint(len(nodes), size=(100, 2))]

# --------------------------------------------------------------------------------
# create a data frame with columns source, target, count

# you probably don't care about which hashtag of a pair was named first,
# so before we aggregate edges, we need to sort each pair
edges = [sorted(edge) for edge in edges]

# create pandas dataframe
df = pd.DataFrame(edges, columns=['source', 'target'])

# aggregate repeated edges
df = pd.DataFrame({'count' : df.groupby(['source', 'target']).size()}).reset_index()

# --------------------------------------------------------------------------------
# create a weighted network and draw

g = nx.from_pandas_edgelist(df, source='source', target='target', edge_attr='count')

pos = nx.spring_layout(g)
nx.draw(g, pos, with_labels=True)
labels = nx.get_edge_attributes(g, 'count')
nx.draw_networkx_edge_labels(g, pos, edge_labels=labels)
plt.show()
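Since the goal in the question is an export to Gephi, here is a minimal sketch of that last step using networkx's GEXF writer. The two edges, the file name hashtags.gexf, and the temporary directory are hypothetical placeholders. The sketch stores the counts under the attribute name "weight" rather than "count", on the assumption that this is what Gephi picks up as the edge weight; networkx writes an attribute with that name as the edge's weight in the GEXF file.

```python
import os
import tempfile

import networkx as nx

# Hypothetical weighted edges; in practice they would come from the
# aggregated data frame (source, target, count).
g = nx.Graph()
g.add_weighted_edges_from([("dkpol", "fv19", 2), ("dkpol", "valg19", 1)])

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "hashtags.gexf")
nx.write_gexf(g, path)

# Reading the file back is a quick check that the weights survived.
g2 = nx.read_gexf(path)
print(g2["dkpol"]["fv19"]["weight"])
```

If the graph was built with edge_attr='count' as in the answer, renaming the column to 'weight' before calling from_pandas_edgelist should be enough to get the same result.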