Building a network with users as nodes and their sentences as targets

Question

I am having difficulty building a network from this dataset:

Node                Sentence      
Mary              I am here to help. What would you like to talk about?
Mary              What's up? I hope everything is going well in NY. I have always loved NY, the Big Apple!
John              There is the football match, tonight. Let's go to the pub!
Christopher       It is a great news! I am so happy for y'all
Catherine         Do not do that! It is extremely dangerous
Matt              I read that news. I was so happy and grateful it was not you. 
Matt              Yes, I didn't know it. It is such a surprising news! Congratulations!
Sarah             Nothing to add...
Catherine         Finally a beautiful sunny day!!!
Mary Jane         I do not think it will rain. There is the sun. It is a hot day. Very hot!

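The names should be the nodes in the network. For each node, I should create a link to the frequent words in that user's sentences (excluding stop words), to get more meaningful relationships. To remove stop words from the sentences, I am using nltk (it does not work perfectly, but it should be fine):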

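from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
# Lowercase each sentence and split it on whitespace
df['Sentence'] = df['Sentence'].str.lower().str.split()
# Drop stop words; assign the result back, because apply() returns a new Series
df['Sentence'] = df['Sentence'].apply(lambda x: [item for item in x if item not in stop_words])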
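
One likely reason the whitespace split feels imperfect is that it leaves punctuation attached to the tokens, so "help." and "help" would be counted as different words. Below is a minimal sketch of an alternative, assuming a regular-expression tokenizer and a tiny hand-written stop word set. The names `tokenize` and `STOP_WORDS` are hypothetical, and in practice the NLTK stop word list used above would replace the hard-coded set.

```python
import re

# Tiny illustrative stop word set (an assumption for this sketch);
# for real use, swap in set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"i", "am", "here", "to", "what", "would", "you",
              "about", "it", "is", "a", "the"}

def tokenize(sentence, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic runs (allowing inner apostrophes), drop stop words."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", sentence.lower())
    return [t for t in tokens if t not in stop_words]

print(tokenize("I am here to help. What would you like to talk about?"))
# ['help', 'like', 'talk']
```

With a dataframe, the same function could be applied row by row to the sentence column via `.apply(tokenize)`.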

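Then, for the word frequencies, I would first build a vocabulary of all terms with their corresponding frequencies, and then go back to the sentences to create (word, freq) pairs, where the word is the 'target' node and freq should be the size of the target node. This is where my difficulty shows up, because this: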

word = df['Sentence'].tolist()
words = nltk.tokenize.word_tokenize(word)
word_dist = nltk.FreqDist(words)
result = pd.DataFrame(word_dist, columns=['Word', 'Frequency'])

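does not show the words and their frequencies (I am creating a new dataframe to display them, rather than adding this information as two extra columns in my original dataframe; the latter would be preferable). To build the network, once I have the nodes, targets, and weights, I will use networkx.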

Example of the word_dist result (unsorted):

Word        Frequency
help           8
like          12
news          21
day           8
sunny         17
sun           23
football      12
pub           3
home          14
congratulations  3
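
The plan described in the question (user nodes linked to word nodes, with the word frequency as the target node size) could be sketched roughly as follows. This is only one possible reading of it: the sample rows are assumed to be tokenized and stop-word filtered already, the attribute names `kind`, `size`, and `weight` are arbitrary choices, and it assumes that no user name coincides with a word.

```python
from collections import Counter, defaultdict

import networkx as nx

# A few (user, tokens) rows, assumed to be tokenized and filtered already
rows = [
    ("Mary", ["help", "like", "talk"]),
    ("Mary", ["hope", "everything", "going", "well", "ny", "always", "loved", "ny"]),
    ("John", ["football", "match", "tonight", "go", "pub"]),
    ("Catherine", ["finally", "beautiful", "sunny", "day"]),
    ("Mary Jane", ["think", "rain", "sun", "hot", "day", "hot"]),
]

# Count words per user (edge weights) and over all rows (word node sizes)
per_user = defaultdict(Counter)
overall = Counter()
for user, tokens in rows:
    per_user[user].update(tokens)
    overall.update(tokens)

G = nx.Graph()
for user, counts in per_user.items():
    G.add_node(user, kind="user")
    for word, freq in counts.items():
        # The word node stores its overall frequency as a 'size' attribute
        G.add_node(word, kind="word", size=overall[word])
        # The edge weight is how often this user used the word
        G.add_edge(user, word, weight=freq)

print(G["Mary"]["ny"]["weight"])  # 2
print(G.nodes["day"]["size"])     # 2
```

When drawing, the stored `size` values could be scaled and passed as the `node_size` list to `nx.draw`.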

Answer 1

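nltk.FreqDist() returns a collections.Counter object (FreqDist is a Counter subclass), i.e. basically a dictionary. When pandas constructs a dataframe and the first argument is a dictionary, each key is treated as a column and each value is expected to be a list of column values. The columns argument then selects among those keys, and since none of the words matches the requested column names, nothing is selected. So, as in the example below, result will be an empty dataframe with two columns.

To build a dataframe from a dictionary where each key is a row, you can simply split the dictionary into keys and values, as in the construction of result2. The following line sets the name of the index, if you want one.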

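import pandas as pd

# Sample word frequencies, standing in for the FreqDist result
word_dict = {'help': 8,
 'like': 12,
 'news': 21,
 'day': 8,
 'sunny': 17,
 'sun': 23,
 'football': 12,
 'pub': 3,
 'home': 14,
 'congratulations': 3}
# Keys are treated as columns; none is named 'a' or 'b', so result is empty
result = pd.DataFrame(word_dict, columns=('a', 'b'))
# One row per key: pass the values as data and the keys as the index
result2 = pd.DataFrame(list(word_dict.values()), index=list(word_dict.keys()), columns=['Frequency'])
# Optionally give the index a name
result2.index.rename('Word', inplace=True)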
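
If a two-column layout (Word, Frequency) is preferred over a named index, another option is to pass (word, count) pairs instead of the dictionary itself. The sketch below assumes a plain collections.Counter as a stand-in for the nltk.FreqDist; since FreqDist subclasses Counter, most_common() should be available on it as well.

```python
from collections import Counter

import pandas as pd

# Stand-in for nltk.FreqDist(words); the token list is made up for illustration
word_dist = Counter(["sun", "news", "sun", "day", "news", "sun"])

# most_common() yields (word, count) pairs sorted by descending count,
# which map directly onto two named columns
result = pd.DataFrame(word_dist.most_common(), columns=["Word", "Frequency"])
print(result)
```

Each row of this frame is then a (word, frequency) pair that can be joined back to other data on the Word column.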

Tags: python, nltk, networkx, pandas