Python NLTK and Pandas - text classifier (newbie) - importing my data in a format similar to the provided example
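I am new to text classification, but I understand most of the concepts. In short, I have a list of restaurant reviews in an Excel dataset, and I want to use them as my training data. Where I am struggling is the syntax for importing both the actual review and its classification (1 = pos, 0 = neg) as part of my training dataset. I know how to do this if I create the dataset by hand as a list of tuples (i.e., what I currently have #'ed out under train). Any help is appreciated.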
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.read_excel("reviewclasses.xlsx")
# I want this to be what's in "train" below (i.e., "this is a negative review")
customerreview = df.customerreview.tolist()
# I also want this to be what's in "train" below (e.g., 0)
reviewrating = df.reviewrating.tolist()

# train = [("Great place to be when you are in Bangalore.", "1"),
#          ("The place was being renovated when I visited so the seating was limited.", "0"),
#          ("Loved the ambiance, loved the food", "1"),
#          ("The food is delicious but not over the top.", "0"),
#          ("Service - Little slow, probably because too many people.", "0"),
#          ("The place is not easy to locate", "0"),
#          ("Mushroom fried rice was spicy", "1"),
# ]

# Vocabulary: every lowercased token that appears in the training reviews
dictionary = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
# One feature dict per review: is each vocabulary word present in it?
# The review is lowercased before tokenizing so it can match the lowercased vocabulary.
t = [({word: (word in word_tokenize(x[0].lower())) for word in dictionary}, x[1]) for x in train]

# Step 4 – the classifier is trained with sample data
classifier = nltk.NaiveBayesClassifier.train(t)

test_data = "The food sucked and I couldn't wait to leave the terrible restaurant."
test_data_features = {word: (word in word_tokenize(test_data.lower())) for word in dictionary}
print(classifier.classify(test_data_features))
I figured it out. I basically just needed to combine the two lists into a list of tuples.
def merge(customerreview, reviewrating):
    # Pair each review with the rating at the same index
    merged_list = [(customerreview[i], reviewrating[i]) for i in range(0, len(customerreview))]
    return merged_list

train = merge(customerreview, reviewrating)
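For what it's worth, the same pairing can probably be written with Python's built-in zip instead of indexing by position. This is only a sketch: the two short lists stand in for the results of df.customerreview.tolist() and df.reviewrating.tolist(), build_train is a hypothetical helper name, and converting the labels to strings is an assumption made so they look like the hand-written "1"/"0" example.

```python
def build_train(reviews, ratings):
    """Pair each review text with its label as (text, label) tuples."""
    # Guard against columns of different lengths; zip would otherwise
    # silently drop the extra items.
    if len(reviews) != len(ratings):
        raise ValueError("reviews and ratings must have the same length")
    # Assumption: labels are cast to str to mirror the "1"/"0" strings
    # in the hand-written train list; drop str() to keep them as-is.
    return [(text, str(label)) for text, label in zip(reviews, ratings)]


# Stand-ins for df.customerreview.tolist() and df.reviewrating.tolist()
customerreview = [
    "Loved the ambiance, loved the food",
    "The place is not easy to locate",
]
reviewrating = [1, 0]

train = build_train(customerreview, reviewrating)
print(train)
# [('Loved the ambiance, loved the food', '1'), ('The place is not easy to locate', '0')]
```

The resulting list has the same shape as the commented-out train list, so the rest of the script should be able to use it unchanged.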