ValueError: not enough values to unpack

I am trying to learn (in Python 3) how to do sentiment analysis for NLP, and I am using the "UMICH SI650 - Sentiment Classification" dataset available on Kaggle: https://www.kaggle.com/c/si650winter11

At the moment I am trying to generate the vocabulary with a few loops, using the following code:

    import collections
    import nltk
    import os

    Directory = "../Databases"


    # Read training data and generate vocabulary
    max_length = 0
    freqs = collections.Counter()
    num_recs = 0
    training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
    for line in training:
        if not line:
            continue
        label, sentence = line.strip().split("\t".encode())
        words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
        if len(words) > max_length:
            max_length = len(words)
        for word in words:
            freqs[word] += 1
        num_recs += 1
    training.close()

I keep getting this error, which I don't fully understand:

    label, sentence = line.strip().split("\t".encode())
    ValueError: not enough values to unpack (expected 2, got 1)

I tried adding

    if not line:
        continue

as suggested here: but that does not work in my case. How can I fix this error?

Thanks a lot,

You should check whether you have the wrong number of fields:

    if not line:
        continue
    fields = line.strip().split("\t".encode())
    if len(fields) != 2:
        # you could print(fields) here to help debug
        continue
    label, sentence = fields
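
For what it's worth, a line with no tab in it yields a single field, which is exactly what triggers the unpack error. A minimal reproduction with a hypothetical bad line:

    bad_line = b"1   \n"                            # hypothetical line: a label followed only by whitespace
    fields = bad_line.strip().split("\t".encode())
    print(fields)                                   # [b'1'] -- only one field
    label, sentence = fields                        # ValueError: not enough values to unpack (expected 2, got 1)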

The easiest way to solve this is to put the unpacking statement inside a try/except block. Something like:

    try:
        label, sentence = line.strip().split("\t".encode())
    except ValueError:
        print(f'Error line: {line}')
        continue

My guess is that some of your lines contain a label followed by nothing but whitespace.
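
A minimal sketch of how that try/except fits into the loop from your question (path and file name copied from there):

    import collections
    import os

    import nltk

    Directory = "../Databases"
    freqs = collections.Counter()
    max_length = 0
    num_recs = 0
    with open(os.path.join(Directory, "train_sentiment.txt"), 'rb') as training:
        for line in training:
            try:
                label, sentence = line.strip().split("\t".encode())
            except ValueError:
                print(f'Error line: {line}')   # skip malformed lines
                continue
            words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
            max_length = max(max_length, len(words))
            freqs.update(words)
            num_recs += 1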

Here is a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11

Firstly, context managers are your friend; use them: http://book.pythontips.com/en/latest/context_managers.html
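
A quick illustration with the file name from your question: the file is closed automatically when the with block exits, even if an exception is raised inside it.

    with open("train_sentiment.txt", 'r') as fin:
        first_line = fin.readline()
    # fin is closed here; no explicit fin.close() needed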

Secondly, if it is a text file, avoid reading it as a binary file, i.e. open(filename, 'r') instead of open(filename, 'rb'); then there is no need to mess around with str/bytes and encode/decode.
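
To see the difference (a small sketch, assuming the file is UTF-8 text):

    with open("train_sentiment.txt", 'rb') as f:
        print(type(f.readline()))   # <class 'bytes'> -> needs .decode() and b'\t'

    with open("train_sentiment.txt", 'r') as f:
        print(type(f.readline()))   # <class 'str'> -> a plain '\t' works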

Now:

    from collections import Counter

    from nltk import word_tokenize

    word_counts = Counter()
    with open('training.txt', 'r') as fin:
        for line in fin:
            fields = line.strip().split('\t')
            if len(fields) != 2:
                # skip malformed lines (e.g. a label with no tab-separated sentence)
                continue
            label, text = fields
            # Avoid lowercasing before tokenization;
            # lowercasing after tokenization is much better,
            # just in case the tokenizer uses capitalization as cues.
            word_counts.update(map(str.lower, word_tokenize(text)))

    print(word_counts)
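
And if you then want the vocabulary size or the most frequent tokens, Counter gives you those directly:

    print(len(word_counts))              # vocabulary size
    print(word_counts.most_common(10))   # the ten most frequent tokens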