How to vectorize a dictionary of word tokens (bag of words implementation)

I'm writing my own bag-of-words algorithm, but I'm stuck. So far I've tokenized the words (a list of strings plus a user-input string) and put them into a dictionary. Now I want to create word vectors, where 0 means the word is not in the document and 1 means it is. My idea is to create a zero vector whose size corresponds to the number of unique words, then copy that base vector, update its values for each document, and store the resulting vectors in an array. That's the part I'm stuck on.

import more_itertools as mit
import re
from collections import OrderedDict

def get_vector(lexicon, text):
   
    # Creates a dictionary with initial value 0 for all unique words in the vocabulary
    zero_vector = OrderedDict((token, 0) for token in lexicon)
    corpus_tokens = list(mit.collapse(text.split()))

def BoW(corpus: list, search_doc: str):
    
    word_count = {}
    
    # Regex to grab words here because it's just a string
    search_doc_tokens = re.split(r'[-\s.,;!?]+', search_doc)
    
    # I have to do all this business here because it's a list of strings
    grab_words = [word.split() for word in corpus]
    corpus_tokens = list(mit.collapse(grab_words))
    
    # Concatenating the two lists
    vocabulary = corpus_tokens + search_doc_tokens
    
    # Filling dictionary
    for token in vocabulary:
        if token not in word_count:
            word_count[token] = 1
        else:
            word_count[token] += 1
                    
    
    # Unique words in the vocab. Used to determine the size of the zero vector
    lexicon = sorted(set(vocabulary))
    zero_vector = OrderedDict((token, 0) for token in lexicon)
    
    print(zero_vector)

documents = ["This is a text document", "This is another text document", "Get the picture?"]
BoW(documents, "hello there") 

I think you should build the lexicon dictionary from the corpus list only.

I think you could write it like this:

import more_itertools as mit
import re
from collections import OrderedDict

def get_vector(lexicon, text):
    zero_vector = OrderedDict((token, 0) for token in lexicon)
    corpus_tokens = list(mit.collapse(text.split()))
    for token in corpus_tokens:
        if token in zero_vector:
            zero_vector[token] = 1
    return zero_vector
    

def BoW(corpus: list, search_doc: str):
    
    word_count = {}
    
    # Regex to grab words here because it's just a string
    search_doc_tokens = re.split(r'[-\s.,;!?]+', search_doc)
    
    # I have to do all this business here because it's a list of strings
    grab_words = [word.split() for word in corpus]
    corpus_tokens = list(mit.collapse(grab_words))
    
    # Concatenating the two lists  (why???)
    vocabulary = corpus_tokens #  + search_doc_tokens
    
    # Filling dictionary
    for token in vocabulary:
        if token not in word_count:
            word_count[token] = 1
        else:
            word_count[token] += 1
                    
    
    # Unique words in the vocab. Used to determine the size of the zero vector
    lexicon = sorted(set(vocabulary))
    
    for text in corpus:
        text_vector = get_vector(lexicon, text)
        print(text_vector)
        
    text_vector = get_vector(lexicon, search_doc)
    print(text_vector)
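
For reference, calling it with the documents from the question should print one ordered dictionary per corpus text plus one for the search string; a sketch of the first line you'd expect (the lexicon is the sorted set of corpus tokens, so "picture?" keeps its punctuation because only str.split is used):

documents = ["This is a text document", "This is another text document", "Get the picture?"]
BoW(documents, "hello there")
# First printed vector (for "This is a text document") should look like:
# OrderedDict([('Get', 0), ('This', 1), ('a', 1), ('another', 0), ('document', 1),
#              ('is', 1), ('picture?', 0), ('text', 1), ('the', 0)])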

But it would be better if the vectors were numpy arrays rather than ordered dictionaries.

To convert the ordered dictionary, you can use something like this:

import numpy as np
tv_vec = np.array(list(test_vector.values()))
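
A minimal, self-contained sketch of that conversion, using a made-up three-word lexicon just for illustration:

import numpy as np
from collections import OrderedDict

lexicon = ["another", "document", "text"]       # hypothetical tiny lexicon
test_vector = OrderedDict((token, 0) for token in lexicon)
test_vector["text"] = 1                         # pretend "text" appears in the document

tv_vec = np.array(list(test_vector.values()))   # values come out in lexicon order
print(tv_vec)                                   # -> [0 0 1]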

Then the question is: why do you need this BoW at all? How do you want to build the final matrix from the vectorized texts? Do you want to include all the corpus texts and the search_doc together in the matrix?

EDIT:

I think you could do it like this:

    corpus_mat = np.zeros((len(lexicon), len(corpus)))
    for ind, text in enumerate(corpus):
        text_vector = get_vector(lexicon, text)
        corpus_mat[:, ind] = np.array(list(text_vector.values()))
        
    text_vector = get_vector(lexicon, search_doc)
    text_vector = np.array(list(text_vector.values()))
    return corpus_mat, text_vector

Then use corpus_mat and text_vector to compute similarity with the dot product:

cm, tv = BoW(documents, "hello there") 
print(cm.T @ tv)

The output will be 3 zeros, because the search_doc text has no words in common with the corpus texts.
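
To see non-zero similarities, the search_doc just needs to share words with the corpus; a quick sketch, assuming the BoW and get_vector versions above (the expected numbers are simply the counts of shared lexicon words per document):

documents = ["This is a text document", "This is another text document", "Get the picture?"]
cm, tv = BoW(documents, "another text document")
print(cm.T @ tv)   # expected: [2. 3. 0.] -- two shared words with the first text, three with the second, none with the third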

Maybe take a look at this post

import re
import nltk
import numpy

corpus = ["Joe waited for the train", 
"The train was late", 
"Mary and Samantha took the bus", 
"I looked for Mary and Samantha at the bus station", 
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

def word_extraction(sentence):
    # Strip non-word characters, drop a few stop words (matched case-sensitively, before lower-casing), and lower-case the rest
    ignore = ['a', "the", "is"]
    words = re.sub(r"[^\w]", " ", sentence).split()
    cleaned_text = [w.lower() for w in words if w not in ignore]
    return cleaned_text

def tokenize(sentences):
    # Build a sorted vocabulary of unique words across all sentences
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
        words = sorted(list(set(words)))
    return words

def generate_bow(allsentences):
    # Print a count vector over the vocabulary for each sentence
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i,word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1
        print("{0}\n{1}\n".format(sentence,numpy.array(bag_vector)))


generate_bow(corpus)

Result:

Joe waited for the train
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

The train was late
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1.]

Mary and Samantha took the bus
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0.]

I looked for Mary and Samantha at the bus station
[1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0.]