How to extract elements from a JSONL file with changing elements?
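I want to extract "text" from the tokens in a JSONL file. If a token has a label, I want to extract that as well. If it does not, I want to insert "O" as the label value.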
{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}
When there is no label, the following code extracts the text and ID from the tokens (thanks to @DeveshKumarSingh for the earlier help):
import jsonlines
#Open the file, iterate over the tokens and make the tuples
result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(jsonlines.open('file.txt')) for i in obj['tokens']]
print(result)
Expected output:
You can use dict.get to look up the label when it is present and fall back to the default value O otherwise, i.e. i.get('label','O')
import jsonlines
#Open the file, iterate over the tokens and make the tuples
result = [(idx+1, i['text'], i.get('label','O')) for idx, obj in enumerate(jsonlines.open('file.txt')) for i in obj['tokens']]
print(result)
The output will be
[(1, 'This', 'O'),
(1, 'is', 'O'),
(1, 'the', 'O'),
(1, 'first', 'O'),
(1, 'sentence', 'O'),
(1, '.', 'O'),
(2, 'This', 'O'),
(2, 'is', 'O'),
(2, 'coded', 'O'),
(2, 'in', 'O'),
(2, 'python', 'Programming'),
(2, '.', 'O')]
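If the jsonlines package is not available, the same dict.get fallback can be sketched with the standard json module. This is only a minimal sketch: the function name extract_token_labels, its default_label parameter, and the shortened sample record are assumptions made for illustration and are not part of the original post.

```python
import json


def extract_token_labels(lines, default_label="O"):
    """Return (sentence_number, token_text, label) tuples from JSONL lines.

    `lines` is any iterable of strings holding one JSON object each,
    for example an open file handle. Hypothetical helper for illustration.
    """
    rows = []
    sentence_no = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines without counting them as sentences
        sentence_no += 1
        record = json.loads(raw)
        for token in record.get("tokens", []):
            # dict.get returns default_label when the "label" key is missing
            rows.append((sentence_no, token["text"], token.get("label", default_label)))
    return rows


# Shortened sample record (assumed), modeled on the data in the question
sample = [
    '{"text": "This is coded in python.", "tokens": '
    '[{"text": "in", "id": 3}, {"text": "python", "label": "Programming"}]}',
]
print(extract_token_labels(sample))
# [(1, 'in', 'O'), (1, 'python', 'Programming')]
```

To run it against a file, pass an open handle, e.g. with open('file.txt', encoding='utf-8') as fh: rows = extract_token_labels(fh). Note that this assumes each JSON object sits on a single physical line, as the JSONL format requires.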