Positive, neutral and negative word frequency

Question

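Before the final submission I need to make a few corrections to my project. I need to count the positive, neutral and negative words in my code. I did the same thing earlier when computing word frequencies over the text, and there the output was fine.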

def gen_freq(text):
    word_list = []  # collects every word from every review

    # text is the .str accessor, so split() yields one list of words per review
    for words in text.split():
        word_list.extend(words)

    # Count how often each word occurs, most frequent first
    word_freq = pd.Series(word_list).value_counts()

    print(word_freq)  # prints the full frequency table
    return word_freq[:20]  # top 20 words

gen_freq(dataset.text.str)

I tried to do the same thing to generate the frequencies of the positive words:

def positive_freq(text):
    positive_list = []  # collects every word from every review

    # text is the .str accessor, so split() yields one list of words per review
    for words in text.split():
        positive_list.extend(words)

    # Count how often each word occurs, most frequent first
    word_freq = pd.Series(positive_list).value_counts()

    print(word_freq)  # prints the full frequency table
    return word_freq[:20]  # top 20 words

positive_freq(dataset.text.str)

I load the data with this code:

import json
import pandas as pd

# Load the reviews and flatten the nested JSON into a DataFrame
with open('reviews.json') as project_file:
    data = json.load(project_file)
dataset = pd.json_normalize(data)
print(dataset.head())

The output for the positive frequencies looks like this:

and                   136
a                     127
the                   114
iPad                  102
I                      69
                     ...
"fully                  1
didn't.                 1
would                   1
instructions...but      1
these                   1

This should not be the case, because the adjectives identified as positive are these:

Positive:
   polarity  adjectives
1  0.209881       right
1  0.209881         mad
1  0.209881        full
1  0.209881        full
1  0.209881        iPad
1  0.209881        iPad
1  0.209881         bad
1  0.209881   different
1  0.209881   wonderful
1  0.209881        much
1  0.209881  affordable
2  0.633333        stop
2  0.633333       great
2  0.633333     awesome
3  0.437143     awesome
4  0.398333         max
4  0.398333        high
4  0.398333        high
4  0.398333    Gorgeous
5  0.466667      decent
5  0.466667        easy
6  0.265146        it’s
6  0.265146      bright
6  0.265146   wonderful
6  0.265146     amazing
6  0.265146        full
6  0.265146         few
6  0.265146        such
6  0.265146      facial
6  0.265146         Big
6  0.265146        much
8  0.161979         old
8  0.161979      little
8  0.161979        Easy
8  0.161979       daily
8  0.161979      that’s
8  0.161979        late
9  0.084762         few
9  0.084762        huge
9  0.084762  storage.If
9  0.084762         few

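Also, once the frequencies are generated, I want to plot a bar chart of frequency against each word. For example, if right has a frequency of 1 and awesome has a frequency of 2, the chart should show that. The same goes for the neutral and negative words. Please help.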

Answer 1

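Your problem is that you expect the machine to know which words are positive, negative or neutral. How would it learn which words are positive from .split() alone? You first need to supply predefined lists of positive, negative and neutral words, and then, after splitting, check whether each token appears in one of them. Such lists are available from sentiment lexicons such as SentiWordNet or SentiStrength, from many other dictionaries, or from existing Python packages. Example: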

from textblob import TextBlob

sent = 'a very simple and good sample'
pos_word_list = []
neg_word_list = []
neu_word_list = []

for word in sent.split():
    testimonial = TextBlob(word)
    if testimonial.sentiment.polarity >= 0.5:
        pos_word_list.append(word)
    elif testimonial.sentiment.polarity <= -0.5:
        neg_word_list.append(word)
    else:
        neu_word_list.append(word)
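
The TextBlob loop above scores each word on its own. The other route described in this answer, checking each token against predefined word lists, could look like the sketch below. The tiny POSITIVE_WORDS and NEGATIVE_WORDS sets and the classify_tokens helper are made up for illustration; a real project would load the lists from a lexicon such as SentiWordNet.

```python
import string

# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "awesome", "wonderful"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}


def classify_tokens(text):
    """Split text on whitespace and sort each token by lexicon membership."""
    buckets = {"positive": [], "negative": [], "neutral": []}
    for token in text.split():
        # Normalise so that "Great," matches the lexicon entry "great".
        word = token.strip(string.punctuation).lower()
        if not word:
            continue  # the token was punctuation only
        if word in POSITIVE_WORDS:
            buckets["positive"].append(word)
        elif word in NEGATIVE_WORDS:
            buckets["negative"].append(word)
        else:
            buckets["neutral"].append(word)
    return buckets


result = classify_tokens("A great screen, but bad battery.")
print(result)
```

Anything not found in either set falls into the neutral bucket, so the quality of the result depends entirely on how complete the word lists are.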

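
For the bar chart asked for in the question, one option is to count the words in each sentiment list and hand the counts to matplotlib. This is a minimal sketch, assuming matplotlib is installed; word_frequencies and plot_frequencies are hypothetical helper names, and the sample list is only a handful of adjectives from the question's positive table.

```python
from collections import Counter


def word_frequencies(words):
    """Count how often each word occurs, ignoring case."""
    return Counter(word.lower() for word in words)


def plot_frequencies(freq, title):
    """Draw one bar per word, most frequent first, and return the figure."""
    # Imported here so the counting above works without matplotlib installed.
    import matplotlib.pyplot as plt

    pairs = freq.most_common()
    words = [word for word, _ in pairs]
    counts = [count for _, count in pairs]

    fig, ax = plt.subplots()
    ax.bar(words, counts)
    ax.set_title(title)
    ax.set_xlabel("word")
    ax.set_ylabel("frequency")
    fig.tight_layout()
    return fig


# Sample input: a few adjectives from the question's positive table.
positive_words = ["right", "awesome", "awesome", "full", "full", "full"]
positive_freq = word_frequencies(positive_words)
print(positive_freq.most_common())
```

Calling plot_frequencies(positive_freq, "Positive words") followed by matplotlib.pyplot.show() displays the chart. The same two calls work for the negative and neutral lists.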