Tweets analysis: Get unique positive, unique negative and unique neutral words — optimised solution (Natural Language Processing)
I have a dataframe train with a column tweet_content. There is a column sentiment that gives the overall sentiment of each tweet. Many words are common across tweets of Neutral, Positive and Negative sentiment; I want to find the words that are unique to each particular sentiment.
train
tweet_content sentiment
[PM, you, rock, man] Positive
[PM, you, are, a, total, idiot, man] Negative
[PM, I, have, no, opinion, about, you, dear] Neutral
and so on. There are 30,000 rows.
P.S. Please note that each tweet, i.e. each row, is a list of words in the tweet_content column.
Expected output for the tweets above (unique_positive, unique_negative, etc. are computed over all tweets in the df. There are 30,000 rows, so unique_positive will be the list of words unique to positive sentiment across all 30,000 rows combined. Here I have just taken 3 tweets as an example):
unique_positive = [rock] #you and PM occur in Negative and Neutral tweets, man occurs in negative tweet
unique_negative = [are, a, total, idiot] #you and PM occur in Positive and Neutral tweets, man occurs in positive tweet
unique_neutral = [I, have, no, opinion, about, dear] #you and PM occur in Positive and Negative tweets
where
raw_text = [word for word_list in train['tweet_content'] for word in word_list] # list of all words
unique_Positive = words_unique('Positive', 20, raw_text) # find the 20 unique words which occur only in positive-sentiment tweets
Problem: The function below runs perfectly and finds the unique words for Positive, Neutral and Negative sentiment. But the problem is that it takes 30 minutes to run. Is there a way to optimise this function so that it runs faster?
Function to find the unique words for each sentiment:
from collections import Counter
import pandas as pd

def words_unique(sentiment, numwords, raw_words):
    '''
    Input:
        sentiment - sentiment category (e.g. 'Neutral');
        numwords - how many specific words do you want to see in the final result;
        raw_words - list of all words across all tweets
    Output:
        dataframe with the numwords unique words of the given sentiment
        (in descending order based on their counts)
    '''
    # collect every word that appears in tweets of the OTHER sentiments
    allother = []
    for item in train[train.sentiment != sentiment]['tweet_content']:
        for word in item:
            allother.append(word)
    allother = list(set(allother))

    # keep only the words that never occur in the other sentiments
    specificnonly = [x for x in raw_words if x not in allother]

    # count word frequencies within the target sentiment
    mycounter = Counter()
    for item in train[train.sentiment == sentiment]['tweet_content']:
        for word in item:
            mycounter[word] += 1

    keep = list(specificnonly)
    for word in list(mycounter):
        if word not in keep:
            del mycounter[word]

    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])
    return Unique_words
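The main cost in the function above is that `allother` and `keep` are lists, so every `x not in allother` / `word not in keep` is a linear scan; over ~30,000 rows that multiplies out to the 30-minute runtime. A minimal sketch of the same logic using sets for O(1) membership tests (the toy `train` frame and the name `words_unique_fast` are illustrative, assuming the column layout from the question):

```python
from collections import Counter
import pandas as pd

# toy stand-in for the question's 30,000-row `train` frame
train = pd.DataFrame({
    'sentiment': ['Positive', 'Negative', 'Neutral'],
    'tweet_content': [['PM', 'you', 'rock', 'man'],
                      ['PM', 'you', 'are', 'a', 'total', 'idiot', 'man'],
                      ['PM', 'I', 'have', 'no', 'opinion', 'about', 'you', 'dear']],
})

def words_unique_fast(sentiment, numwords):
    # set of every word used by the OTHER sentiments -> O(1) membership tests
    allother = set(word
                   for tweet in train.loc[train.sentiment != sentiment, 'tweet_content']
                   for word in tweet)
    # count target-sentiment words, skipping any word seen elsewhere
    mycounter = Counter(word
                        for tweet in train.loc[train.sentiment == sentiment, 'tweet_content']
                        for word in tweet
                        if word not in allother)
    return pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])

print(words_unique_fast('Positive', 20))
```

This keeps the same output shape as `words_unique` (a words/count dataframe in descending order) but replaces the quadratic list scans with a single pass per sentiment.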
This should work (add filtering on numwords etc. as per your needs):
Edit (added explanatory comments):
import pandas as pd
df = pd.DataFrame([['Positive','Positive','Negative','Neutral'],[['PM', 'you', 'rock', 'man'],['PM'],['PM', 'you', 'are', 'a', 'total', 'idiot', 'man'] ,['PM', 'I', 'have', 'no', 'opinion', 'about', 'you', 'dear']]]).T
df.columns = ['sentiment','tweet']
# join the list back to a sentence
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x))
# join all the sentences in a group (i.e. sentiment) and then get unique words
_df = df.groupby(['sentiment']).agg({'tweet':lambda x: set(" ".join(x).split(" "))})['tweet']
# group by gives one row per sentiment; groups are sorted alphabetically,
# so index 0 is Negative, 1 is Neutral, 2 is Positive
neg = _df[0]; neu = _df[1]; pos = _df[2]
# basically, A *minus* (B *union* C)
uniq_pos = pos - (neg.union(neu))
uniq_neu = neu - (pos.union(neg))
uniq_neg = neg - (pos.union(neu))
uniq_pos, uniq_neu, uniq_neg
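The set arithmetic above yields plain sets, while the question's `words_unique` also reported counts in descending order. A hedged sketch of layering counts back on top of a unique-word set (the frame and the `top_unique` helper here are illustrative, not part of the answer):

```python
from collections import Counter
import pandas as pd

# same toy data as the answer, kept as word lists
df = pd.DataFrame({
    'sentiment': ['Positive', 'Positive', 'Negative', 'Neutral'],
    'tweet': [['PM', 'you', 'rock', 'man'], ['PM'],
              ['PM', 'you', 'are', 'a', 'total', 'idiot', 'man'],
              ['PM', 'I', 'have', 'no', 'opinion', 'about', 'you', 'dear']],
})

def top_unique(sentiment, unique_set, numwords):
    # count only the words already known to be unique to this sentiment
    counts = Counter(word
                     for tweet in df.loc[df.sentiment == sentiment, 'tweet']
                     for word in tweet
                     if word in unique_set)
    return pd.DataFrame(counts.most_common(numwords), columns=['words', 'count'])

# feed it a unique-word set such as uniq_pos from the answer
print(top_unique('Positive', {'rock'}, 20))
```

Since each unique set is small relative to the corpus, this second pass is cheap and restores the ranked words/count dataframe the question asked for.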