Building a network with users as nodes and their sentences as targets

I am having trouble building a network from this dataset:

Node                Sentence      
Mary              I am here to help. What would you like to talk about?
Mary              What's up? I hope everything is going well in NY. I have always loved NY, the Big Apple!
John              There is the football match, tonight. Let's go to the pub!
Christopher       It is a great news! I am so happy for y'all
Catherine         Do not do that! It is extremely dangerous
Matt              I read that news. I was so happy and grateful it was not you. 
Matt              Yes, I didn't know it. It is such a surprising news! Congratulations!
Sarah             Nothing to add...
Catherine         Finally a beautiful sunny day!!!
Mary Jane         I do not think it will rain. There is the sun. It is a hot day. Very hot!

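The names should be the nodes of the network. For each node, I should create a link to the frequent words in its sentences (excluding stop words), to get more meaningful relationships. To remove stop words from the sentences I am using nltk (it does not work perfectly, but it should be fine):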

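from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Needs the NLTK stop word list, e.g. after nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Lowercase each sentence and split it on whitespace
df['Sentence'] = df['Sentence'].str.lower().str.split()
# Drop the stop words and assign the result back, otherwise it is discarded
df['Sentence'] = df['Sentence'].apply(lambda x: [item for item in x if item not in stop_words])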

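Then, for the word frequencies, I would first build a vocabulary of all terms with their corresponding frequencies, and then go back to the sentences to create (word, freq) pairs, where the word is the 'target' node and freq should be the size of the target node. This is where my difficulty shows up, because this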

word = df['Sentence'].tolist()
words = nltk.tokenize.word_tokenize(word)
word_dist = nltk.FreqDist(words)
result = pd.DataFrame(word_dist, columns=['Word', 'Frequency'])

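does not show the words with their frequencies (I am creating a new data frame to display them, rather than adding this information to my original data frame as two extra columns; the latter would be preferable). To build the network, once I have the nodes, targets and weights, I will use networkx.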

Example of the word_dist result (unsorted):

Word        Frequency
help           8
like          12
news          21
day           8
sunny         17
sun           23
football      12
pub           3
home          14
congratulations  3

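nltk.FreqDist() returns an object whose class is a subclass of collections.Counter, which is basically a dictionary. When pandas constructs a data frame and the first argument is a dictionary, each key is treated as a column and each value is expected to be a list of column values. So, as in the example below, result will be an empty data frame with two columns.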

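To build a data frame from a dictionary where each key is a row, you can simply split the dictionary into its keys and values, as in the construction of result2. The line after that sets the name of the index, if you want one.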

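import pandas as pd

# Stand-in for word_dist, with integer counts as FreqDist would store them
word_dict = {'help': 8,
 'like': 12,
 'news': 21,
 'day': 8,
 'sunny': 17,
 'sun': 23,
 'football': 12,
 'pub': 3,
 'home': 14,
 'congratulations': 3}
# Keys are read as column names; none match 'a' or 'b', so the frame is empty
result = pd.DataFrame(word_dict, columns=('a', 'b'))
# One row per key: values become the single column, keys become the index
result2 = pd.DataFrame(list(word_dict.values()), index=list(word_dict.keys()), columns=('Frequency',))
# Optional: name the index
result2.index.rename('Word', inplace=True)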
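
If the goal is the user-to-word network described in the question, the same idea can be extended roughly as follows. This is only a sketch under assumptions: the tokenize helper, the Tokens, Weight and size names, and the tiny hard-coded stop word set are all made up for illustration (the stop word set is there so the example runs without downloading NLTK data). In real code you would probably plug in the nltk stop words instead.

```python
import re
from collections import Counter

import networkx as nx
import pandas as pd

# A few rows copied from the question's dataset
df = pd.DataFrame({
    "Node": ["Mary", "Mary", "John", "Matt"],
    "Sentence": [
        "I am here to help. What would you like to talk about?",
        "What's up? I hope everything is going well in NY.",
        "There is the football match, tonight. Let's go to the pub!",
        "I read that news. I was so happy and grateful it was not you.",
    ],
})

# Hypothetical, deliberately tiny stop word list (assumption, not NLTK's list)
stop_words = {"i", "am", "to", "the", "is", "it", "in", "was", "so", "and",
              "that", "there", "what", "you", "not", "would", "about",
              "s", "let", "go", "up"}


def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]


df["Tokens"] = df["Sentence"].apply(tokenize)

# Overall frequencies: Counter.items() yields (word, count) pairs,
# so each pair becomes one row of the frame
word_counts = Counter(w for tokens in df["Tokens"] for w in tokens)
freq = pd.DataFrame(list(word_counts.items()), columns=["Word", "Frequency"])

# Edge list: one row per (user, word); Weight counts how often
# that user used that word
edges = (
    df.explode("Tokens")
      .dropna(subset=["Tokens"])
      .groupby(["Node", "Tokens"])
      .size()
      .reset_index(name="Weight")
      .rename(columns={"Tokens": "Word"})
)

# Users and words both become nodes; Weight is stored on each edge
G = nx.from_pandas_edgelist(edges, source="Node", target="Word", edge_attr="Weight")
# Attach the overall frequency to the word nodes, e.g. to scale them when drawing
nx.set_node_attributes(G, dict(word_counts), name="size")
```

Here freq is the words-and-frequencies frame the question was after, built from items() instead of passing the counter itself to the constructor. Note that user names and words share one node namespace in G, so a user whose name equals a lowercase word would be merged with that word node; prefixing one of the two groups avoids that if it matters for your data.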