Python beginner: Preprocessing a French text in Python and calculating the polarity with a lexicon
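I'm writing an algorithm in Python that processes a column of sentences and then gives the polarity (positive or negative) of each cell in that column. The script uses the lists of negative and positive words from the NRC emotion lexicon (French version). I'm having trouble writing the preprocessing function. I have already written the counting function and the polarity function, but because I had some difficulty with the preprocessing function, I'm not sure whether they work.
The positive and negative words are in the same file (the lexicon), but I exported the positive and negative words separately because I didn't know how to use the lexicon as is.
My function that counts the occurrences of positive and negative words doesn't work, and I don't know why it always returns 0. I added positive words to each sentence, so they should show up in the dataframe:
Output: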
[4 rows x 6 columns]
id Verbatim ... word_positive word_negative
0 15 Je n'ai pas bien compris si c'était destiné a ... ... 0 0
1 44 Moi aérien affable affaire agent de conservati... ... 0 0
2 45 Je affectueux affirmative te hais et la Foret ... ... 0 0
3 47 Je absurde accidentel accusateur accuser affli... ... 0 0
=>
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))
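Here is my CSV data: rows 44 and 45 contain positive words and row 47 contains more negative words, but the word_positive and word_negative columns are always empty (0). The function doesn't return the number of words, and the last column is always positive even though the last sentence is negative.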
id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue
Here is the full code:
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")
from itertools import islice
from collections import defaultdict, Counter
csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())
stopWords = set(stopwords.words('french'))
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''
    text_preprocess =[]
    text_without_stopwords= []
    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)
    text_without_stopwords= [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords
csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)
lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)
lexiconneg = open('negative.txt', 'r', encoding='utf-8')
def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)
def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text =count_occurences_pos(text, lexiconpos)
    negatives_text =count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text :
        return "positive"
    else :
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)
I'd appreciate it if you could also look over the rest of the code. Thanks.
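I found your error!
It comes from the Polarity_score function.
It's just a typo:
In your if statement, you were comparing count_occurences_Pos and count_occurences_Neg, which are functions, instead of comparing the results returned by those functions.
Your code should look like this: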
def Polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    # Call the counting functions and keep their results (integers).
    count_text_pos = count_occurences_Pos(text, lexiconPos)
    count_text_neg = count_occurences_Neg(text, lexiconNeg)
    # Compare the stored results, not the function objects.
    if count_text_pos > count_text_neg:
        return "positive"
    else:
        return "negative"
In the future, give your variables meaningful names to avoid this kind of mistake.
With clearer variable names, your function would be:
def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    # Pass each lexicon explicitly; word_list is not defined in this scope.
    positives_text = count_occurences_pos(text, lexiconpos)
    negatives_text = count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text:
        return "positive"
    else:
        return "negative"
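To make the difference between comparing functions and comparing their results concrete, here is a small self-contained sketch. The helper names count_pos and count_neg and the tiny word sets are made up for illustration; they only stand in for count_occurences_pos and count_occurences_neg.

```python
def count_pos(words):
    # Hypothetical stand-in for count_occurences_pos, with a toy positive lexicon.
    return sum(1 for w in words if w in {"affable", "affectueux"})


def count_neg(words):
    # Hypothetical stand-in for count_occurences_neg, with a toy negative lexicon.
    return sum(1 for w in words if w in {"absurde", "affreux"})


words = ["je", "affable", "affreux", "affectueux"]

# Comparing the function objects themselves is not meaningful:
# Python 3 raises TypeError for '>' between two functions.
try:
    count_pos > count_neg
    compared_functions = True
except TypeError:
    compared_functions = False

# Comparing the values returned by the calls is what the polarity test needs.
positives = count_pos(words)
negatives = count_neg(words)
label = "positive" if positives > negatives else "negative"
```

Storing each result in a clearly named variable (positives, negatives) before the if statement makes it hard to compare the wrong thing by accident.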
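Another improvement you can make in count_occurences_pos and count_occurences_neg is to use sets instead of lists. Both your text and word_list can be converted to sets, and you can retrieve the positive words with a set intersection, because membership tests on a set are faster than on a list. Note that a set intersection counts each distinct word once, not every occurrence.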
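A minimal sketch of the set idea, assuming the exported lexicon files contain one word per line (that layout is an assumption, not something the question confirms). The names load_lexicon, count_occurrences and count_distinct are hypothetical helpers, and io.StringIO stands in for the real positive.txt. One thing that may also be worth checking: in the posted code, lexiconpos is the file object returned by open(), not a collection of words, and print(lexiconpos.read()) has already consumed it, so `w in word_list` has nothing left to match. Reading the file into a set once avoids that.

```python
import io


def load_lexicon(handle):
    """Read one word per line from an open text file into a lowercase set."""
    return {line.strip().lower() for line in handle if line.strip()}


def count_occurrences(tokens, lexicon):
    """Count every token found in the lexicon, repeats included."""
    return sum(1 for token in tokens if token in lexicon)


def count_distinct(tokens, lexicon):
    """Count distinct lexicon words present in the tokens (set intersection)."""
    return len(set(tokens) & lexicon)


# Stand-in for: open('positive.txt', 'r', encoding='utf-8')
positive_file = io.StringIO("affable\naffectueux\nadmirateur\n")
lexicon_pos = load_lexicon(positive_file)

# Tokens as process_text might return them (already lemmatized and lowercased).
tokens = ["je", "affable", "affable", "affectueux", "haïr"]

total = count_occurrences(tokens, lexicon_pos)
distinct = count_distinct(tokens, lexicon_pos)
```

Which of the two counts is appropriate depends on whether a repeated word should weigh more in the polarity; the membership test against a set is fast in both cases.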