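How to extract elements from each line in a jsonline file?
I have a jsonl file in which each line contains a sentence and the tokens found in that sentence. I want to extract the tokens from every line of the JSON Lines file, but my loop only returns the tokens from the last line.
Here is the input.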
{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is the second sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"second","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
I tried running the code below:
import jsonlines

with jsonlines.open('path/to/file') as reader:
    for obj in reader:
        data = obj['tokens']  # just extract the tokens
        data = [(i['text'], i['id']) for i in data]  # elements from the tokens
data
Actual result:
[('This', 0),
('is', 1),
('the', 2),
('first', 3),
('sentence', 4),
('.', 5)]
The result I want:
Additional question
Some tokens contain "label" instead of "id". How can I incorporate that into the code? An example:
{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}
import jsonlines

# Write one row per token: sentence number, word, 1-based token id
with open('data.csv', 'w') as f:
    print('Sentence', 'Word', 'ID', file=f)
    with jsonlines.open('path/to/file') as reader:
        for sentence_no, obj in enumerate(reader):
            data = obj['tokens']
            for i in data:
                print(sentence_no + 1, i['text'], i['id'] + 1, file=f)
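The snippet above writes its fields with print, so they end up separated by spaces rather than commas. If a true comma-separated file is wanted, one option is the standard csv module. The following is only a minimal sketch: it parses the lines with the standard json module instead of jsonlines, and the names SAMPLE and tokens_to_csv, as well as the shortened sample records, are made up for illustration.

```python
import csv
import io
import json

# Two shortened sample records in JSON Lines form (one JSON object per line).
# These are illustrative stand-ins for the input shown in the question.
SAMPLE = (
    '{"text":"This is.","tokens":[{"text":"This","id":0},{"text":"is","id":1},{"text":".","id":2}]}\n'
    '{"text":"Second one.","tokens":[{"text":"Second","id":0},{"text":"one","id":1},{"text":".","id":2}]}\n'
)


def tokens_to_csv(lines, out):
    """Write one CSV row per token: sentence number, word, 1-based token id."""
    writer = csv.writer(out)
    writer.writerow(["Sentence", "Word", "ID"])
    for sentence_no, line in enumerate(lines, start=1):
        obj = json.loads(line)
        for tok in obj["tokens"]:
            writer.writerow([sentence_no, tok["text"], tok["id"] + 1])


# Write into an in-memory buffer; a real file opened with newline='' works the same way.
buffer = io.StringIO()
tokens_to_csv(SAMPLE.splitlines(), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file object opened with open('data.csv', 'w', newline='') as the out argument.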
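Issues in the code and the changes needed:
You reassign the variable data on every iteration of the loop, so you only see the result for the last JSON line; instead, you want to extend a list each time.
You want to use enumerate on the reader iterator to get the first item of each tuple (the sentence number).
The code then becomes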
import jsonlines

data = []
# Iterate over the lines of the JSON Lines file
with jsonlines.open('file.txt') as reader:
    # Number each line (sentence) via enumerate
    for idx, obj in enumerate(reader):
        # Add this line's tokens to the result
        data.extend([(idx+1, i['text'], i['id']+1) for i in obj['tokens']])
print(data)
Or, more compactly, by using a double for-loop inside the list comprehension itself
import jsonlines
#Open the file, iterate over the tokens and make the tuples
result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(jsonlines.open('file.txt')) for i in obj['tokens']]
print(result)
The output will be
[
(1, 'This', 1),
(1, 'is', 2),
(1, 'the', 3),
(1, 'first', 4),
(1, 'sentence', 5),
(1, '.', 6),
(2, 'This', 1),
(2, 'is', 2),
(2, 'the', 3),
(2, 'second', 4),
(2, 'sentence', 5),
(2, '.', 6)
]
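For the additional question, where some tokens carry "label" instead of "id", indexing i['id'] would raise a KeyError on those tokens. One possible way to handle it is to check which key is present and fall back to the label. This is a sketch under assumptions, not a definitive answer: it reads the lines with the standard json module rather than jsonlines, the names SAMPLE_LINES and token_rows are hypothetical, and putting the label in the same tuple position as the id is just one design choice.

```python
import json

# The second record from the question's "label" example, as a single JSON line.
SAMPLE_LINES = [
    '{"text":"This is coded in python.","tokens":['
    '{"text":"This","id":0},{"text":"is","id":1},{"text":"coded","id":2},'
    '{"text":"in","id":3},{"text":"python","label":"Programming"},{"text":".","id":5}]}'
]


def token_rows(lines):
    """Return (sentence_no, word, id_or_label) tuples for every token."""
    rows = []
    for sentence_no, line in enumerate(lines, start=1):
        obj = json.loads(line)
        for tok in obj["tokens"]:
            if "id" in tok:
                value = tok["id"] + 1      # 1-based id, as in the answer above
            else:
                value = tok.get("label")   # fall back to the label (None if absent)
            rows.append((sentence_no, tok["text"], value))
    return rows


rows = token_rows(SAMPLE_LINES)
print(rows)
```

With a jsonlines reader, the same if/else on each token can be used inside the loop over obj['tokens'], since each obj is already a parsed dictionary.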