Document Vectorization Representation in Python
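I am trying to do sentiment analysis in Python 3, using a TF-IDF vectorizer with a bag-of-words model to vectorize the documents.
As anyone familiar with this approach will know, the resulting matrix representation is sparse.
Here is my code snippet. First, the documents.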
tweets = [('Once you get inside you will be impressed with the place.',1),('I got home to see the driest damn wings ever!',0),('An extensive menu provides lots of options for breakfast.',1),('The flair bartenders are absolutely amazing!',1),('My first visit to Hiro was a delight!',1),('Poor service, the waiter made me feel like I was stupid every time he came to the table.',0),('Loved this place.',1),('This restaurant has great food',1),
('Honeslty it did not taste THAT fresh :(',0),('Would not go back.',0),
('I was shocked because no signs indicate cash only.',0),
('Waitress was a little slow in service.',0),
('did not like at all',0),('The food, amazing.',1),
('The burger is good beef, cooked just right.',1),
('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.',0),
('The cocktails are all handmade and delicious.',1),('This restaurant has terrible food',0),
('Both of the egg rolls were fantastic.',1),('The WORST EXPERIENCE EVER.',0),
('My friend loved the salmon tartar.',1),('Which are small and not worth the price.',0),
('This is the place where I first had pho and it was amazing!!',1),
('Horrible - do not waste your time and money.',0),('Seriously flavorful delights, folks.',1),
('I loved the bacon wrapped dates.',1),('I dressed up to be treated so rudely!',0),
('We literally sat there for 20 minutes with no one asking to take our order.',0),
('you can watch them preparing the delicious food! :)',1),('In the summer, you can dine in a charming outdoor patio - so very delightful.',1)]
X_train, y_train = zip(*tweets)
And here is the code that vectorizes the documents.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)
print(vectorized)
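When I print vectorized, it does not output a normal matrix. Instead, I get this:
If I am not mistaken, this must be a sparse matrix representation. However, I cannot make sense of its format or what each item means.
There are 30 documents, which explains the 0-29 in the first column. If that is the pattern, then I guess the second column is the index of the word and the last value is the tf-idf score? This occurred to me while I was typing the question, but please correct me if I am wrong.
Could someone with experience in this help me understand it better?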
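Yes. Technically, the first two values, printed together as a tuple, give the row and column position, and the third value is the entry at that position. So the output is essentially listing the positions and values of the nonzero entries.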
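To make that reading concrete, here is a minimal sketch that walks the nonzero entries and maps each column index back to its word. It assumes scikit-learn is installed and uses a three-sentence subset of the reviews (the `docs` list and the `index_to_term` and `entries` names are only illustrative, not part of the original code). The fitted vectorizer's `vocabulary_` attribute maps each term to its column index, so inverting it recovers the word behind each column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A small illustrative subset of the reviews (assumption: any list of strings works).
docs = [
    "Loved this place.",
    "This restaurant has great food",
    "This restaurant has terrible food",
]

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(docs)  # SciPy sparse matrix, one row per document

# vocabulary_ maps term -> column index; invert it to look up a term by column.
index_to_term = {col: term for term, col in tfidfvec.vocabulary_.items()}

# COO format exposes the (row, col, value) triplets that print(vectorized) lists.
coo = vectorized.tocoo()
entries = [
    (int(r), index_to_term[int(c)], float(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]

for row, term, value in entries:
    print(f"doc {row}  term {term!r}  tf-idf {value:.4f}")

# For a small corpus, the "normal" dense matrix is available via toarray().
dense = vectorized.toarray()
print(dense.shape)  # (number of documents, vocabulary size)
```

Calling `toarray()` is fine for a toy corpus like this one, but on a large vocabulary the dense matrix is mostly zeros and can be very large, which is why the vectorizer returns a sparse matrix by default.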