How to tokenize a list of words using nltk?
I have a text dataset. It consists of many lines, and each line contains two sentences separated by a tab, like this:
this is string 1, first sentence. this is string 2, first sentence.
this is string 1, second sentence. this is string 2, second sentence.
I then split the text with this code:
#file readdata.py
from globalvariable import *
import os

class readdata:
    def dataAyat(self):
        global kalimatayat
        fo = open(os.path.join('E:\dataset','dataset.txt'),"r")
        line = []
        for line in fo.readlines():
            datatxt = line.rstrip('\n').split('\t')
            newdatatxt = [x.split('\t') for x in datatxt]
            kalimatayat.append(newdatatxt)
            print newdatatxt

readdata().dataAyat()
It works, and the output is:
[['this is string 1, first sentence.'],['this is string 2, first sentence.']]
[['this is string 1, second sentence.'],['this is string 2, second sentence.']]
What I want to do is tokenize these lists with nltk's word_tokenize, and the output I expect looks like this:
[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']]
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]
Does anyone know how to tokenize them into the output above? I would like to write a tokenize function in "tokenizer.py" and call it from "mainfile.py".
To tokenize a list of sentences, iterate over it and store the results in a list:
import nltk

data = [[['this is string 1, first sentence.'],['this is string 2, first sentence.']],
        [['this is string 1, second sentence.'],['this is string 2, second sentence.']]]

results = []
for sentence in data:
    sentence_results = []
    for s in sentence:
        # each s is a one-element list, so tokenize the string it holds
        sentence_results.append(nltk.word_tokenize(s[0]))
    results.append(sentence_results)
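If word_tokenize raises a LookupError, the NLTK "punkt" tokenizer models are probably not installed yet; downloading them once is usually enough (a minimal sketch, assuming a standard NLTK install):

import nltk
nltk.download('punkt')  # one-time download of the Punkt tokenizer models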
The tokenized result will then look like:
[[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],
['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']],
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],
['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]]
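Since you want the tokenization to live in "tokenizer.py" and be called from "mainfile.py", here is a minimal sketch of that split; the function name tokenize_dataset and the way the data is passed in are assumptions for illustration, not part of your original code:

# file tokenizer.py
import nltk

def tokenize_dataset(data):
    # data is a list of rows; each row is a list of one-element lists holding a sentence
    results = []
    for sentence in data:
        sentence_results = []
        for s in sentence:
            sentence_results.append(nltk.word_tokenize(s[0]))
        results.append(sentence_results)
    return results

# file mainfile.py
from tokenizer import tokenize_dataset

data = [[['this is string 1, first sentence.'],['this is string 2, first sentence.']],
        [['this is string 1, second sentence.'],['this is string 2, second sentence.']]]
print(tokenize_dataset(data))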