Creating a TfidfVectorizer over a text column of huge pandas dataframe
I need to get a TF-IDF feature matrix from text stored in a column of a huge dataframe, loaded from a CSV file which cannot fit in memory. I am trying to iterate over the dataframe using chunks, but that yields generator objects, which is not an expected input type for TfidfVectorizer. I guess I am doing something wrong while writing the generator method ChunkIterator, shown below.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Will work only for a small dataset
csvfilename = 'data_elements.csv'
df = pd.read_csv(csvfilename)
vectorizer = TfidfVectorizer()
corpus = df['text_column'].values
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# Trying to use a generator to iterate over a huge dataframe
def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        yield chunk['text_column'].values
corpus = ChunkIterator(csvfilename)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
Can anybody suggest how to modify the ChunkIterator method above, or any other approach using the dataframe? I would like to avoid creating separate text files for each row in the dataframe. Below is some dummy CSV file data for recreating the scenario.
id,text_column,tags
001, This is the first document .,['sports','entertainment']
002, This document is the second document .,"['politics', 'asia']"
003, And this is the third one .,['europe','nato']
004, Is this the first document ?,"['sports', 'soccer']"
The method accepts generators just fine. But it needs an iterable of raw documents, i.e. strings; your generator is an iterable of numpy.ndarray objects instead. So try something like this:
def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        for document in chunk['text_column'].values:
            yield document
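With that change, the calling code from your question works as intended; a minimal sketch, assuming the same csvfilename and text_column as above:

corpus = ChunkIterator(csvfilename)
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(corpus)  # consumes the generator one document at a time
print(vectorizer.get_feature_names())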
Note, though, that I don't really understand why you are using pandas here. Just use the regular csv module, e.g.:
import csv

def doc_generator(filepath, textcol=0, skipheader=True):
    with open(filepath) as f:
        reader = csv.reader(f)
        if skipheader:
            next(reader, None)
        for row in reader:
            yield row[textcol]
So, in your case, pass 1 for textcol, e.g.:
In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: import csv
   ...: def doc_generator(filepath, textcol=0, skipheader=True):
   ...:     with open(filepath) as f:
   ...:         reader = csv.reader(f)
   ...:         if skipheader:
   ...:             next(reader, None)
   ...:         for row in reader:
   ...:             yield row[textcol]
   ...:
In [3]: vectorizer = TfidfVectorizer()
In [4]: result = vectorizer.fit_transform(doc_generator('testing.csv', textcol=1))
In [5]: result
Out[5]:
<4x9 sparse matrix of type '<class 'numpy.float64'>'
        with 21 stored elements in Compressed Sparse Row format>

In [6]: result.todense()
Out[6]:
matrix([[ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524],
        [ 0.        ,  0.6876236 ,  0.        ,  0.28108867,  0.        ,
          0.53864762,  0.28108867,  0.        ,  0.28108867],
        [ 0.51184851,  0.        ,  0.        ,  0.26710379,  0.51184851,
          0.        ,  0.26710379,  0.51184851,  0.26710379],
        [ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524]])
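One caveat: a generator can only be consumed once, so you need a fresh one for every pass over the data. A minimal sketch of that pattern, assuming a hypothetical second file new_data.csv with the same column layout as testing.csv:

# Fit on the training file; this exhausts the first generator.
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(doc_generator('testing.csv', textcol=1))

# The vocabulary is now fixed, so new documents can also be streamed
# through a fresh generator (new_data.csv is hypothetical).
new_features = vectorizer.transform(doc_generator('new_data.csv', textcol=1))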