Undo the tokenization in Python
I want to undo the tokenization that was applied to my data.
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]
Expected output:
['this is a sentence', 'this is a sentence 2']
I tried to do this with the following code block:
from nltk.tokenize.treebank import TreebankWordDetokenizer

data_untoken = []
for i, text in enumerate(data):
    data_untoken.append(text)
    data_untoken = TreebankWordDetokenizer().detokenize(text)
But I get the following error:
'str' object has no attribute 'append'
Use join():
def untokenize(data):
    for tokens in data:
        yield ' '.join(tokens)
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]
untokenized_data = list(untokenize(data))
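As for the error itself: inside the loop, data_untoken is reassigned from a list to the string returned by detokenize(), so on the second iteration .append is called on a str and fails. If you do want the Treebank detokenizer (it re-attaches punctuation, unlike a plain join), append each detokenized sentence instead of reassigning — a minimal sketch:

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

detok = TreebankWordDetokenizer()
# Build a new list of detokenized strings; never overwrite the accumulator.
data_untoken = [detok.detokenize(tokens) for tokens in data]
```

For token lists with no punctuation, this gives the same result as ' '.join.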