Building a network with users as nodes and their sentences as targets

I am having a hard time building a network from this dataset:
Node Sentence
Mary I am here to help. What would you like to talk about?
Mary What's up? I hope everything is going well in NY. I have always loved NY, the Big Apple!
John There is the football match, tonight. Let's go to the pub!
Christopher It is a great news! I am so happy for y'all
Catherine Do not do that! It is extremely dangerous
Matt I read that news. I was so happy and grateful it was not you.
Matt Yes, I didn't know it. It is such a surprising news! Congratulations!
Sarah Nothing to add...
Catherine Finally a beautiful sunny day!!!
Mary Jane I do not think it will rain. There is the sun. It is a hot day. Very hot!
The names should be the nodes of the network. For each node, I should create a link to the frequent words in its sentences (excluding stop words), to get more meaningful relationships.
To remove stop words from the sentences I am using nltk (it is not great, but it should be fine):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
df['Sentences'] = df['Sentences'].str.lower().str.split()
# assign the result back: apply alone does not modify the column in place
df['Sentences'] = df['Sentences'].apply(lambda x: [item for item in x if item not in stop_words])
Then, for the frequency of the words, I would first create a vocabulary of all terms with their corresponding frequencies, and then go back through the sentences to create (word, freq) pairs, where word would be the 'target' node and freq the size of the target node.
Here my difficulty is revealed, because this
word = ' '.join(df['Sentences'].tolist())
words = nltk.tokenize.word_tokenize(word)
word_dist = nltk.FreqDist(words)
result = pd.DataFrame(word_dist, columns=['Word', 'Frequency'])
does not display the words and their frequencies (I am creating a new dataframe to display them, rather than adding this information as two extra columns to my original dataframe; the latter would be preferable).
To build the network, once I obtain nodes, targets and weights, I will use networkx.
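For the networkx part, a minimal sketch of what that could look like, assuming the (user, word, frequency) triples are already available — the triples below are purely illustrative, not computed from the data:

```python
import networkx as nx

# (user, word, frequency) triples — assumed to be the output of the
# counting step described above; the numbers here are only placeholders
edges = [
    ('Mary', 'help', 1),
    ('John', 'football', 1),
    ('John', 'pub', 1),
    ('Catherine', 'sunny', 1),
]

G = nx.Graph()
for user, word, freq in edges:
    # users and words both become nodes; the frequency becomes the edge weight,
    # which can later be used as the drawn size of the target node
    G.add_edge(user, word, weight=freq)

print(G.number_of_nodes(), G.number_of_edges())
```

The weight attribute can then be read back (e.g. `G['John']['pub']['weight']`) when sizing nodes or edges for drawing.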
Example of the word_dist result (not sorted):
Word Frequency
help 8
like 12
news 21
day 8
sunny 17
sun 23
football 12
pub 3
home 14
congratulations 3
The nltk.FreqDist() class returns a collections.Counter object, which is basically a dictionary. When pandas constructs a dataframe and the first argument is a dictionary, each key is taken to be a column and each value should be a list of column values. So, as in the example below, result will be an empty dataframe with two columns. To build a dataframe from a dictionary in which each key is a row, you can simply split the dictionary into keys and values, as in the construction of result2. The next line sets the name of the index, if you want one.
import pandas as pd

word_dict = {'help': '8',
             'like': '12',
             'news': '21',
             'day': '8',
             'sunny': '17',
             'sun': '23',
             'football': '12',
             'pub': '3',
             'home': '14',
             'congratulations': '3'}

# keys are interpreted as columns, so this is an empty frame with columns a, b
result = pd.DataFrame(word_dict, columns=('a', 'b'))

# one row per key: the values become the data, the keys become the index
result2 = pd.DataFrame(word_dict.values(), index=word_dict.keys(), columns=('Frequency',))
result2.index.rename('Word', inplace=True)
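As a side note, since a Counter (and therefore a FreqDist, which subclasses it) exposes .items(), another route is to pass the (word, frequency) pairs directly, which yields the Word/Frequency columns asked about in the question. The small counter below is a stand-in for the real word_dist:

```python
import pandas as pd
from collections import Counter

# stand-in for the FreqDist from the question (FreqDist is a Counter subclass)
word_dist = Counter({'help': 8, 'like': 12, 'news': 21})

# each (word, frequency) pair becomes one row
result = pd.DataFrame(list(word_dist.items()), columns=['Word', 'Frequency'])
print(result)
```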