Tokenizing text with features in a specific format
Hello, I am trying to create tokens with certain features and arrange them in a JSON format, using the following text example:
words = ['The study of aviation safety report in the aviation industry usually relies',
'The experimental results show that compared with traditional',
'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
{"sentence": [
{
"indexSentence": 0,
"tokens": [{
"indexWord": 1,
"word": "The",
"len": 3
},
{ "indexWord": 2,
"word": "study",
"len": 5},
{"indexWord": 3,
"word": "of",
"len": 2
},
{"indexWord": 4,
"word": "aviation",
"len": 8},
...
]
},
{
"indexSentence" : 1,
"tokens" : [{
...
}]
},
....
]}
I tried the following code, without success...
t_d = {len(i):i for i in words}
[{'Lon' : len(t_d[i]),
'tex' : t_d[i],
'Sub' : [{'index' : j,
'token': [{
'word':['word: ' + j for i,j in enumerate(str(t_d[i]).split(' '))]
}],
'lenTo' : len(str(t_d[i]).split(' '))
}
],
'Sub1':[{'index' : j}]
} for j,i in enumerate(t_d)]
The solution below assumes that tokenization uses str.split to split each sentence on whitespace. It should still work with any other tokenization function.
from collections import defaultdict
words = ['The study of aviation safety report in the aviation industry usually relies',
'The experimental results show that compared with traditional',
'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
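# defaultdict(list) creates the 'sentence' list on first access
sentence = defaultdict(list)
for idx, i in enumerate(words):
    # One entry per sentence, with one dict per whitespace-separated token
    struct = {"indexSentence": idx, "tokens": [{"indexWord": idx_w,
                                                "word": w,
                                                "len": len(w)} for idx_w, w in enumerate(i.split())]}
    sentence['sentence'].append(struct)
dict(sentence)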
>>
{'sentence': [{'indexSentence': 0,
'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
{'indexWord': 1, 'word': 'study', 'len': 5},
{'indexWord': 2, 'word': 'of', 'len': 2},
{'indexWord': 3, 'word': 'aviation', 'len': 8},
{'indexWord': 4, 'word': 'safety', 'len': 6},
{'indexWord': 5, 'word': 'report', 'len': 6},
{'indexWord': 6, 'word': 'in', 'len': 2},
{'indexWord': 7, 'word': 'the', 'len': 3},
{'indexWord': 8, 'word': 'aviation', 'len': 8},
{'indexWord': 9, 'word': 'industry', 'len': 8},
{'indexWord': 10, 'word': 'usually', 'len': 7},
{'indexWord': 11, 'word': 'relies', 'len': 6}]},
{'indexSentence': 1,
'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
...
}
You can use a defaultdict to create the list first and then append the required structure to it. To mimic the JSON structure, you can return a dict.
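As a possible variation on the same idea, here is a minimal sketch that wraps the loop in a function. The name build_sentences and its tokenize and start parameters are hypothetical, introduced only for illustration. It assumes you may want to swap in another tokenizer, and that indexWord should start at 1 as in the desired output in the question (the code above starts at 0). A plain dict is used instead of defaultdict, since there is only one top-level key.

```python
import json
import re
from typing import Callable, Dict, List


def build_sentences(
    texts: List[str],
    tokenize: Callable[[str], List[str]] = str.split,  # hypothetical parameter
    start: int = 0,  # hypothetical parameter: first value of indexWord
) -> Dict[str, list]:
    """Build the {"sentence": [...]} structure for a list of texts."""
    return {
        "sentence": [
            {
                "indexSentence": idx,
                "tokens": [
                    {"indexWord": idx_w, "word": w, "len": len(w)}
                    for idx_w, w in enumerate(tokenize(text), start=start)
                ],
            }
            for idx, text in enumerate(texts)
        ]
    }


words = ['The study of aviation safety report in the aviation industry usually relies',
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

# Whitespace tokenization, with indexWord starting at 1 as in the question
result = build_sentences(words, start=1)

# json.dumps turns the dict into an actual JSON string
as_json = json.dumps(result, indent=2)

# Alternative tokenizer (an assumption, not part of the original answer):
# keep only runs of word characters, which drops punctuation such as ':'
# and splits hyphenated words
word_only = build_sentences(words, tokenize=re.compile(r"\w+").findall)
```

With the regex tokenizer, "Cases:" becomes "Cases" and "Non-formal" is split into "Non" and "formal", so choose the tokenizer according to how you want punctuation handled.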