Loading json file using torchtext
I am working with the DailyDialog dataset, which I have converted into a JSON file that looks like this:
[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]
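So each item has two keys: response and message.
This is my first time using PyTorch, so I have been following some resources available online. These are the relevant snippets of my code: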
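import spacy
# In torchtext 0.9 to 0.11 these classes live in torchtext.legacy.data
from torchtext.data import Field, TabularDataset

# Assumption: spacy_en is an English spaCy pipeline; the model name is a guess
spacy_en = spacy.load('en_core_web_sm')

def tokenize_en(text):
    # Split a string into a list of token strings
    return [tok.text for tok in spacy_en.tokenizer(text)]

src = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

# Load only the "response" key and expose it as attribute "r"
fields = {'response': ('r', src)}

train_data, test_data, validation_data = TabularDataset.splits(
    path = 'FilePath',
    train = 'trainset.json',
    test = 'testset.json',
    validation = 'validationset.json',
    format = 'json',
    fields = fields
)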
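No errors are raised, but strangely, even though my JSON files contain many items, the train, test, and validation datasets each contain only 1 example, as shown in the image below:
[Image: the length of train_data, test_data and validation_data]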
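I would be grateful if someone could point out my mistake.
Edit: I found that because the file is not indented, the whole file is being treated as a single text string. However, if I indent the JSON file, TabularDataset throws a JSONDecodeError, indicating that it can no longer decode the file. How can I get around this?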
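I think the code is fine, but the problem is with your JSON file. Can you try removing the square brackets ("[]") at the beginning and end of the file?
That is probably why the file is being read as a single object. As far as I know, TabularDataset parses a format = 'json' file one line at a time, so each object also needs to sit on its own line, without the commas between objects.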
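If editing the files by hand is impractical, the conversion can be scripted. The following is only a minimal sketch, assuming that TabularDataset with format = 'json' expects one JSON object per line (JSON Lines). The helper name json_array_to_jsonl and the output file name are made up for illustration and are not part of torchtext.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line.

    Hypothetical helper, not a torchtext function. Returns the record count.
    """
    records = json.loads(Path(src_path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst_path, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps without indent escapes newlines inside strings,
            # so every record stays on exactly one line.
            out.write(json.dumps(record) + "\n")
    return len(records)


if __name__ == "__main__":
    import sys

    # Usage: python convert.py trainset.json trainset.jsonl
    if len(sys.argv) == 3:
        count = json_array_to_jsonl(sys.argv[1], sys.argv[2])
        print(f"wrote {count} records")
```

After converting each split this way, the same TabularDataset.splits call should work when pointed at the converted files. This would also explain the JSONDecodeError from the edit: in an indented file, a line such as "[" or "{" is not a complete JSON document on its own.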