Count items across all columns using a pandas method
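I have this DataFrame, and I can use a vectorizer to get the count of each item per row. But this only works for a single column (e.g. col1). How can I apply it to the whole DataFrame (all 3 columns)?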
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread"],
["Rice", "Milk", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']
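CV = CountVectorizer()
cv_matrix = CV.fit_transform(df['col1'].values)
# pd.SparseDataFrame was removed in pandas 1.0; the sparse accessor replaces it
ndf = pd.DataFrame.sparse.from_spmatrix(cv_matrix)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
ndf.columns = CV.get_feature_names_out()
# from_spmatrix leaves no NaN behind, so the earlier fillna("0") is not needed
X = ndf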
The result for a single column is correct:
apple rice
0 1 0
1 0 1
2 1 0
3 0 1
4 1 0
Expected result for all columns:
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 0 0
3 0 1 0 2 0
4 1 0 1 1 0
Is there another way to get the same result?
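You can stack the DataFrame and get dummies, then take the max per original row index (use sum instead of max if you want counts rather than 0/1 indicators):
# .max(level=0) was removed in pandas 2.0; group on index level 0 instead
pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()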
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 0 1
3 0 0 0 1 1
4 1 1 0 1 0
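The table above holds 0/1 indicators, while the expected output in the question counts Milk twice in row 3. A minimal sketch of the sum variant mentioned in this answer follows; the variable name `counts` is only illustrative, and the `dtype=int` argument is an assumption made so that recent pandas versions return integers instead of booleans.

```python
import pandas as pd

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread"],
    ["Rice", "Milk", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list, columns=["col1", "col2", "col3"])

# Stack into one long Series indexed by (row, column), one-hot encode it,
# then add the indicators back up per original row (index level 0).
counts = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).sum()
print(counts)
```

Swapping `.sum()` for `.max()` gives back the indicator table shown above.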
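Alternatively, call get_dummies on the whole DataFrame with an empty prefix and group the resulting duplicate labels along the column axis:
# .max(level=0, axis=1) was removed in pandas 2.0; transpose, group on the index, transpose back
pd.get_dummies(df, prefix='', prefix_sep='', dtype=int).T.groupby(level=0).max().T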
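To make the column-axis grouping easier to follow, here is a short sketch that splits it into two steps. The names `wide` and `item_counts` are illustrative, and `sum()` is used in place of `max()` on the assumption that counts are wanted, as in the question's expected output.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Apple", "Bread", "Fridge"],
        ["Rice", "Bread", "Milk"],
        ["Apple", "Rice", "Bread"],
        ["Rice", "Milk", "Milk"],
        ["Apple", "Bread", "Milk"],
    ],
    columns=["col1", "col2", "col3"],
)

# With an empty prefix, each source column contributes plain item names,
# so labels such as "Bread" or "Milk" appear more than once.
wide = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

# Collapse the duplicate labels: transpose, group on the (now row) labels,
# aggregate, and transpose back. sum() yields counts; max() would yield flags.
item_counts = wide.T.groupby(level=0).sum().T
print(item_counts)
```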
You can build a single column by joining all the existing columns and then apply CountVectorizer to it. See the sample code below:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread"],
["Rice", "Milk", "Milk"],
["Red Chillies", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']
vocab = set(df.values.flatten())
v = [i.lower() for i in vocab]
df['new'] = df.apply(' '.join, axis=1)
So your new DataFrame will look like this:
col1 col2 col3 new
0 Apple Bread Fridge Apple Bread Fridge
1 Rice Bread Milk Rice Bread Milk
2 Apple Rice Bread Apple Rice Bread
3 Rice Milk Milk Rice Milk Milk
4 Red Chillies Bread Milk Red Chillies Bread Milk
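Now you can apply CountVectorizer to the new column, as follows:
# pass the lowercased vocabulary v, because CountVectorizer lowercases its input by default;
# ngram_range lets multi-word items such as "red chillies" match
CV = CountVectorizer(vocabulary=v, ngram_range=(1, 5))
cv_matrix = CV.fit_transform(df['new'])
You can get the desired DataFrame with:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(cv_matrix.toarray(), columns=CV.get_feature_names_out())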
bread milk fridge rice apple red chillies
0 1 0 1 0 1 0
1 1 1 0 1 0 0
2 1 0 0 1 1 0
3 0 2 0 1 0 0
4 1 1 0 0 0 1
Hope this helps!
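If creating a new column that merges all the individual columns feels like too much overhead, you can use a generator instead, which also scales better to large data.
Also, the recommended way to load a sparse matrix into a pandas DataFrame is DataFrame.sparse.from_spmatrix (see the pandas documentation for details).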
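cv = CountVectorizer()
# fit_transform is evaluated before the columns argument, so the vocabulary exists by then
pd.DataFrame.sparse.from_spmatrix(
    cv.fit_transform(' '.join(x) for x in shopping_list),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=cv.get_feature_names_out())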
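If you need to apply CountVectorizer to a DataFrame rather than a list, use this generator instead (x[0] is the row index, so it is skipped):
' '.join(x[1:]) for x in df.itertuples()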