Tweets analysis: get unique positive, unique negative and unique neutral words (optimised solution, natural language processing)

I have a dataframe train with a column tweet_content. A column sentiment gives the overall sentiment of each tweet. Many words are common to tweets of neutral, positive and negative sentiment. I want to find the words that are unique to each particular sentiment.

train

tweet_content                                sentiment 
[PM, you, rock, man]                         Positive
[PM, you, are, a, total, idiot, man]         Negative
[PM, I, have, no, opinion, about, you, dear] Neutral

and so on... there are 30,000 rows.

P.S. Please note that each tweet, i.e. each row of the tweet_content column, is a list of words.

Expected output for the tweets above (unique_positive, unique_negative, etc. are the results over all tweets in the df. There are 30,000 rows, so unique_positive will be the list of words that occur only in positive-sentiment tweets across all 30,000 rows combined. Here I have just taken 3 tweets as an example):

unique_positive = [rock] #you and PM occur in Negative and Neutral tweets, man occurs in negative tweet
unique_negative = [are, a, total, idiot] #you and PM occur in Positive and Neutral tweets, man occurs in positive tweet
unique_neutral = [I, have, no, opinion, about, dear] #you and PM occur in Positive and Negative tweets

where

raw_text = [word for word_list in train['tweet_content'] for word in word_list]  # list of all words
unique_Positive = words_unique('Positive', 20, raw_text)  # find the 20 words that occur only in positive-sentiment tweets

Question: the function below works perfectly and finds the unique words for positive, neutral and negative sentiments. The problem is that it takes 30 minutes to run. Is there any way to optimise this function so it runs faster?

Function to find the unique words for each sentiment:

from collections import Counter
import pandas as pd

def words_unique(sentiment, numwords, raw_words):
    '''
    Input:
        sentiment - sentiment category (e.g. 'Neutral');
        numwords - how many specific words you want to see in the final result;
        raw_words - list of all words across all tweets
    Output:
        dataframe giving the counts of the numwords words unique to a particular
        sentiment (in descending order of their counts).
    '''
    # collect every word that appears in tweets of any *other* sentiment
    # (note: relies on the global dataframe `train`)
    allother = []
    for item in train[train.sentiment != sentiment]['tweet_content']:
        for word in item:
            allother.append(word)
    allother = list(set(allother))

    # keep only the words that never appear in the other sentiments;
    # `x not in allother` scans a list, so this step is the main bottleneck
    specificnonly = [x for x in raw_words if x not in allother]

    # count occurrences of each word within the target sentiment
    mycounter = Counter()
    for item in train[train.sentiment == sentiment]['tweet_content']:
        for word in item:
            mycounter[word] += 1

    # drop counted words that are not unique to this sentiment
    keep = list(specificnonly)
    for word in list(mycounter):
        if word not in keep:
            del mycounter[word]

    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])

    return Unique_words
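
Almost all of the runtime of the function above goes into the two list membership tests (x not in allother and word not in keep), each of which scans a Python list once per word. As a minimal sketch of the same logic with set-based lookups (words_unique_fast is a name I made up; it assumes the same train dataframe and column names as above):

from collections import Counter
import pandas as pd

def words_unique_fast(sentiment, numwords, train):
    # words appearing in tweets of any *other* sentiment, kept as a set
    # so each membership test below is O(1) instead of a full list scan
    other_words = set(
        word
        for tweets in train.loc[train.sentiment != sentiment, 'tweet_content']
        for word in tweets
    )
    # count only the target sentiment's words that never appear elsewhere
    counts = Counter(
        word
        for tweets in train.loc[train.sentiment == sentiment, 'tweet_content']
        for word in tweets
        if word not in other_words
    )
    return pd.DataFrame(counts.most_common(numwords), columns=['words', 'count'])

On 30,000 rows this should finish in seconds rather than minutes, since the per-word cost drops from linear in the vocabulary size to constant.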

This should work (add features such as filtering to numwords as per your needs):

Edit (added explanatory comments):

import pandas as pd

df = pd.DataFrame([
    ['Positive', 'Positive', 'Negative', 'Neutral'],
    [['PM', 'you', 'rock', 'man'],
     ['PM'],
     ['PM', 'you', 'are', 'a', 'total', 'idiot', 'man'],
     ['PM', 'I', 'have', 'no', 'opinion', 'about', 'you', 'dear']],
]).T
df.columns = ['sentiment', 'tweet']
# join the list back to a sentence
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x))


# join all the sentences in a group (i.e. sentiment) and then get unique words
_df = df.groupby(['sentiment']).agg({'tweet':lambda x: set(" ".join(x).split(" "))})['tweet']
# group by gives one row per sentiment, sorted alphabetically; index by label
# rather than by position so the sentiment order cannot cause a mix-up
neg, neu, pos = _df['Negative'], _df['Neutral'], _df['Positive']

# basically, A *minus* (B *union* C)
uniq_pos = pos - (neg.union(neu))
uniq_neu = neu - (pos.union(neg))
uniq_neg = neg - (pos.union(neu))

uniq_pos, uniq_neu, uniq_neg
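
For the sample dataframe above this yields (in some order) {'rock'}, {'I', 'have', 'no', 'opinion', 'about', 'dear'} and {'are', 'a', 'total', 'idiot'}. To recover the numwords-most-frequent ranking that the original function returned, one option is to count the unique words with a Counter; top_unique below is a hypothetical helper, not part of the solution above:

from collections import Counter

# rank one sentiment's unique words by how often they occur in its tweets
def top_unique(df, sentiment, unique_words, numwords):
    counts = Counter(
        word
        for tweet in df.loc[df.sentiment == sentiment, 'tweet']
        for word in tweet.split(" ")
        if word in unique_words
    )
    return counts.most_common(numwords)

top_unique(df, 'Positive', uniq_pos, 20)  # -> [('rock', 1)]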