Loading a JSON file using torchtext

Question

I am working with the DailyDialog dataset, which I have converted into a JSON file that looks like this:

[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]

So each item has two keys: response and message.

This is my first time using PyTorch, so I have been following some resources available online. These are the relevant snippets of my code:

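import spacy
# Field and TabularDataset live in torchtext.data in older releases
# (torchtext.legacy.data in torchtext 0.9 to 0.11)
from torchtext.data import Field, TabularDataset

# Assumption: the spaCy model name is not shown in the original snippet
spacy_en = spacy.load('en_core_web_sm')

def tokenize_en(text):
    # Split a string into a list of token strings with spaCy's tokenizer
    return [tok.text for tok in spacy_en.tokenizer(text)]

src = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)

# Map the JSON key 'response' to an attribute named 'r', processed by src
fields = {'response': ('r', src)}

# splits() returns the datasets in the order (train, validation, test)
train_data, validation_data, test_data = TabularDataset.splits(
    path='FilePath',
    train='trainset.json',
    validation='validationset.json',
    test='testset.json',
    format='json',
    fields=fields
)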

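No error is raised, but strangely, even though my JSON files contain many items, the train, test and validation datasets each contain only 1 example, as shown in the image below:

(Image showing the length of train_data, test_data and validation_data)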

I would be grateful if someone could point out the error.

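Edit: I found out that, because the file is not indented, the whole file is treated as a single text string. However, if I indent the JSON file, TabularDataset throws a JSONDecodeError, saying it can no longer decode the file. How can I get around this?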

Answer 1

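I think the code is fine and the problem is with your JSON file. Could you try removing the square brackets ("[]") at the beginning and end of the file? That may be why your Python code reads the file as a single object.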
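
To expand on the suggestion above: as far as I can tell, TabularDataset with format = 'json' parses each line of the file as its own JSON document. That would explain both symptoms: a one-line array becomes a single example, and an indented file fails because no single line is valid JSON. Removing the brackets alone would likely not be enough, since each record also has to sit on its own line without separating commas. Below is a minimal sketch of a converter that uses only the standard library. The function name array_to_json_lines and the output file name are made up for illustration.

```python
import json


def array_to_json_lines(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line.

    Returns the number of records written.
    """
    # json.load reads the whole array, whether or not the file is indented
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)

    # Write each record on its own line: no surrounding brackets,
    # no commas between records
    with open(dst_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    return len(records)


# Example call (hypothetical output name), repeated for each split:
# array_to_json_lines("FilePath/trainset.json", "FilePath/trainset_lines.json")
```

After converting each split, point train, validation and test at the converted files and the datasets should contain one example per record.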
