Tokenization and dtMatrix in Python with nltk

Question

I have a csv file with 2 columns: sentence and label. I want to build a document-term matrix for these sentences. I'm new to Python, and so far I have this:

import nltk
import csv
import numpy
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('my_file.csv', 'rU'), delimiter= ";",quotechar = '"')
for line in reader:
    for field in line:
        tokens = word_tokenize(field)

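But I don't know how to tokenize only one of the columns and build such a matrix from it.

I've read several threads on Stack Overflow about the same problem, but in every example I could find, the csv file had only 1 column or the text was hard-coded.

I'd appreciate any answer. Thanks in advance!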

Answer 1

Suppose your file example.csv looks like this:

label;sentence
"class1";"This is an example sentence."
"class1";"This is another example sentence."
"class2";"The third one is random."

Read the file with DictReader instead of reader, so that each row comes back as a dictionary keyed by the column names:

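import csv

# newline='' is what the csv module recommends when opening a file for reading
with open('example.csv', 'r', newline='') as f:
    reader = csv.DictReader(f, delimiter=';', quotechar='"')
    lines = list(reader)  # a list in which each row is a dictionary
sentences = [l['sentence'] for l in lines]  # keep only the sentence column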
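
To make the column selection concrete, here is a minimal sketch of the DictReader step that runs without a file on disk. The in-memory string `raw` is a stand-in I am assuming for example.csv, and the `labels` list is an addition for illustration, not part of the original answer.

```python
import csv
import io

# Hypothetical in-memory copy of example.csv (assumption: same layout as above).
raw = (
    'label;sentence\n'
    '"class1";"This is an example sentence."\n'
    '"class1";"This is another example sentence."\n'
    '"class2";"The third one is random."\n'
)

reader = csv.DictReader(io.StringIO(raw), delimiter=";", quotechar='"')
rows = list(reader)  # one dict per data row, keyed by the header names

# Pick a single column by its header name.
sentences = [row["sentence"] for row in rows]
labels = [row["label"] for row in rows]

print(sentences[0])  # This is an example sentence.
print(labels)        # ['class1', 'class1', 'class2']
```

Because both lists come from the same rows, `labels[i]` stays aligned with `sentences[i]`, and therefore with row i of any matrix built from `sentences`.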

Document-term matrix with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(lowercase=True) 
X_count = count_vect.fit_transform(sentences)

The vocabulary (a word-to-column-index dictionary) is available as count_vect.vocabulary_, and X_count is your document-term matrix:

X_count.toarray()
# [[1 0 1 1 0 0 1 0 0 1]
#  [0 1 1 1 0 0 1 0 0 1]
#  [0 0 0 1 1 1 0 1 1 0]]
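
To see where these numbers come from, the same counts can be reproduced in plain Python. This is only a sketch: the regular expression is my assumption of CountVectorizer's default tokenization (lowercased text, tokens of two or more word characters), and the alphabetical column order is assumed to match how the vectorizer numbers its vocabulary.

```python
import re
from collections import Counter

sentences = [
    "This is an example sentence.",
    "This is another example sentence.",
    "The third one is random.",
]

# Assumption: mirrors CountVectorizer's default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return TOKEN.findall(text.lower())

# One Counter per sentence: word -> number of occurrences in that sentence.
counts = [Counter(tokenize(s)) for s in sentences]

# Columns are the distinct words in alphabetical order.
vocab = sorted(set().union(*counts))

# A Counter returns 0 for missing words, which fills in the empty cells.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
# ['an', 'another', 'example', 'is', 'one', 'random', 'sentence', 'the', 'third', 'this']
# [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
# [0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
# [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
```

Reading a column header together with a row shows which word each count belongs to; for example, the fourth column is "is", the only word present in all three sentences.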

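Document-term matrix with nltk. This is similar to what scikit-learn does, but you build the vocabulary yourself and convert the sentences into a document-term matrix. Note that word_tokenize keeps the original case and treats punctuation as tokens, so the vocabulary will not be identical to the one from CountVectorizer.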

from nltk import word_tokenize
from itertools import chain, groupby
import scipy.sparse as sp

word_tokenize_matrix = [word_tokenize(sent) for sent in sentences]
vocab = set(chain.from_iterable(word_tokenize_matrix))
vocabulary = dict(zip(vocab, range(len(vocab)))) # dictionary of vocabulary to index

words_index = []
for r, words in enumerate(word_tokenize_matrix):
    for word in sorted(words):
        words_index.append((r, vocabulary.get(word), 1))

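Once you have a (row, column, value) triple for every word of every sentence, you can apply groupby to merge the triples that share the same row and column, which counts the words that occur more than once in a sentence.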

rows, cols, data = [], [], []
for gid, g in groupby(words_index, lambda x: (x[0], x[1])):
    rows.append(gid[0])
    cols.append(gid[1])
    data.append(len(list(g)))
X_count = sp.csr_matrix((data, (rows, cols)))
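
The sort-then-groupby counting above can also be written with collections.Counter, which some readers may find easier to follow. The sketch below is an alternative, not the answer's method: the hand-written token lists stand in for the output of word_tokenize (an assumption, so that it runs without nltk), and the vocabulary is sorted so the column order is reproducible.

```python
from collections import Counter

# Hypothetical pre-tokenized sentences, standing in for word_tokenize output.
tokenized = [
    ["the", "cat", "saw", "the", "dog"],
    ["the", "dog", "ran"],
]

# Sorted vocabulary -> column index, so columns are stable between runs.
distinct_words = sorted({word for sent in tokenized for word in sent})
vocabulary = {word: i for i, word in enumerate(distinct_words)}

rows, cols, data = [], [], []
for r, words in enumerate(tokenized):
    # Counter collapses repeated words into a single (word, count) pair.
    for word, n in sorted(Counter(words).items()):
        rows.append(r)
        cols.append(vocabulary[word])
        data.append(n)

print(vocabulary)  # {'cat': 0, 'dog': 1, 'ran': 2, 'saw': 3, 'the': 4}
print(rows)        # [0, 0, 0, 0, 1, 1, 1]
print(cols)        # [0, 1, 3, 4, 1, 2, 4]
print(data)        # [1, 1, 1, 2, 1, 1, 1]
```

The three lists can then be passed to sp.csr_matrix((data, (rows, cols))) exactly as in the code above; adding shape=(len(tokenized), len(vocabulary)) makes the dimensions explicit instead of inferred from the largest index.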

And with that, you have built your own document-term matrix!
