ValueError: not enough values to unpack
I'm trying to learn (in Python 3) how to do sentiment analysis for NLP, and I'm using the "UMICH SI650 - Sentiment Classification" dataset available on Kaggle: https://www.kaggle.com/c/si650winter11
At the moment I'm trying to generate the vocabulary with some loops. Here is the code:
import collections
import nltk
import os

Directory = "../Databases"

# Read training data and generate vocabulary
max_length = 0
freqs = collections.Counter()
num_recs = 0
training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
for line in training:
    if not line:
        continue
    label, sentence = line.strip().split("\t".encode())
    words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
    if len(words) > max_length:
        max_length = len(words)
    for word in words:
        freqs[word] += 1
    num_recs += 1
training.close()
I keep getting this error, which I don't fully understand:
    label, sentence = line.strip().split("\t".encode())
ValueError: not enough values to unpack (expected 2, got 1)
I tried adding

if not line:
    continue

as suggested here:
but that didn't work in my case. How can I fix this error?
Many thanks,
You should check for lines with the wrong number of fields:
if not line:
    continue
fields = line.strip().split("\t".encode())
if len(fields) != 2:
    # you could print(fields) here to help debug
    continue
label, sentence = fields
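For instance (made-up lines, not taken from the dataset), the guard tells the two cases apart like this:

good = b"1\tI love this movie\n"
bad = b"1\t\n"  # label present, but nothing after the tab
print(good.strip().split("\t".encode()))  # [b'1', b'I love this movie'] -- 2 fields, unpacks fine
print(bad.strip().split("\t".encode()))   # [b'1'] -- len(fields) != 2, so the line is skipped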
The simplest way to deal with this problem is to put the unpacking statement inside a try/except block. Something like:
try:
    label, sentence = line.strip().split("\t".encode())
except ValueError:
    print(f'Error line: {line}')
    continue
My guess is that some of your lines have nothing but whitespace after the label.
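That guess is easy to check with a made-up line: strip() removes a trailing tab along with the newline, so the split sees nothing but the label:

line = b"1\t  \n"  # hypothetical line: label, tab, then only whitespace
print(line.strip().split("\t".encode()))  # [b'1'] -- one value, hence the ValueError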
Here is a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11:
First, the context manager is your friend; use it: http://book.pythontips.com/en/latest/context_managers.html
Second, if it's a text file, avoid reading it as binary, i.e. use open(filename, 'r') rather than open(filename, 'rb'); then there's no need to mess around with str/bytes and encode/decode.
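A quick way to see the difference (the file name is assumed):

with open('training.txt', 'rb') as f:
    print(type(next(f)))  # <class 'bytes'> -- needs decode()/encode() everywhere
with open('training.txt', 'r') as f:
    print(type(next(f)))  # <class 'str'> -- ready for the tokenizer as-is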
Now:
from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
with open('training.txt', 'r') as fin:
    for line in fin:
        label, text = line.strip().split('\t')
        # Avoid lowercasing before tokenization;
        # lowercasing after tokenization is much better,
        # just in case the tokenizer uses capitalization as cues.
        word_counts.update(map(str.lower, word_tokenize(text)))

print(word_counts)
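And if you also want the max_length and num_recs bookkeeping from your original loop, here is a sketch along the same lines (file name and tab-separated layout assumed as above, with a guard for malformed lines):

from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
max_length = 0
num_recs = 0
with open('training.txt', 'r') as fin:
    for line in fin:
        fields = line.strip().split('\t')
        if len(fields) != 2:  # skip blank or malformed lines
            continue
        label, text = fields
        words = [word.lower() for word in word_tokenize(text)]
        max_length = max(max_length, len(words))
        word_counts.update(words)
        num_recs += 1

print(num_recs, max_length)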