Undo the tokenization in Python
I want to undo the tokenization that was applied to my data.
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]
Expected output:
['this is a sentence', 'this is a sentence 2']
I tried to do this with the following code block:
from nltk.tokenize.treebank import TreebankWordDetokenizer

data_untoken = []
for i, text in enumerate(data):
    data_untoken.append(text)
    data_untoken = TreebankWordDetokenizer().detokenize(text)
But I get the following error:
'str' object has no attribute 'append'
Use join():
def untokenize(data):
    for tokens in data:
        yield ' '.join(tokens)
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]
untokenized_data = list(untokenize(data))
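As for the error itself: inside the loop, data_untoken is reassigned from a list to the string returned by detokenize(), so on the second iteration .append is called on a str and fails. If you do want the Treebank detokenizer (it re-attaches punctuation, unlike a plain join), append each detokenized sentence instead of reassigning — a minimal sketch:

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

detok = TreebankWordDetokenizer()
# Build a new list of detokenized strings; never overwrite the accumulator.
data_untoken = [detok.detokenize(tokens) for tokens in data]
```

For token lists with no punctuation, this gives the same result as ' '.join.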