How to count the number of words in a pandas column that holds arrays of words
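My DataFrame has a string column in which I have split sentences into words. Now I need to count the occurrences of each word and turn the words into columns, basically creating a document-term matrix.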
0 [kubernetes, client, bootstrapping, ponda]
1 [micro, insu]
2 [motor, upi]
3 [secure, app, installation]
4 [health, insu, express, credit, customer]
5 [secure, app, installation]
6 [aap, insta]
7 [loan, house, loan, customers]
Expected output:
kubernetes client bootstrapping ponda loan customers installation
0 1 1 1 1 0 0 0
1 0 0 0 0 1 0 1
2 0 2 0 0 0 0 0
3 1 1 1 1 0 0 0
My code so far:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
countvec.fit_transform(df.new)
Error:
AttributeError: 'list' object has no attribute 'lower'
If the values are lists, first join them into strings and then use CountVectorizer:
print (type(df.loc[0, 'new']))
<class 'list'>
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
counts = countvec.fit_transform(df['new'].str.join(' '))
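# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
df = pd.DataFrame(counts.toarray(), columns=countvec.get_feature_names_out())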
Another pandas-only solution uses get_dummies and sum:
df1 = pd.DataFrame(df['new'].values.tolist())
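# sum(axis=1, level=0) was removed in pandas 2.0; group the duplicate column labels instead
df = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int).T.groupby(level=0).sum().T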
print (df)
aap app bootstrapping client credit customer customers express \
0 0 0 1 1 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 0 0 0 1 1 0 1
5 0 1 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0
health house insta installation insu kubernetes loan micro motor \
0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 1 0 0 1 0
2 0 0 0 0 0 0 0 0 1
3 0 0 0 1 0 0 0 0 0
4 1 0 0 0 1 0 0 0 0
5 0 0 0 1 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0
7 0 1 0 0 0 0 2 0 0
ponda secure upi
0 1 0 0
1 0 0 0
2 0 0 1
3 0 1 0
4 0 0 0
5 0 1 0
6 0 0 0
7 0 0 0
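The same counts can also be sketched with Series.explode, which turns each list element into its own row while keeping the original row label. This is only an illustrative alternative, not part of the answer above; the sample data and the names words_df, exploded and dtm are assumptions.

```python
import pandas as pd

# Illustrative data in the same shape as the question's 'new' column
words_df = pd.DataFrame({'new': [
    ['kubernetes', 'client', 'bootstrapping', 'ponda'],
    ['micro', 'insu'],
    ['motor', 'upi'],
    ['secure', 'app', 'installation'],
    ['health', 'insu', 'express', 'credit', 'customer'],
    ['secure', 'app', 'installation'],
    ['aap', 'insta'],
    ['loan', 'house', 'loan', 'customers'],
]})

# One row per word; the index repeats the original row label
exploded = words_df['new'].explode()

# Count each word per original row, then pivot the words into columns
dtm = exploded.groupby(level=0).value_counts().unstack(fill_value=0)
print(dtm)
```

Rows whose list is empty produce no counts and would therefore be missing from the result, so they may need to be added back with reindex.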
To use CountVectorizer the way you are calling it, your DataFrame needs to look like this:
string
0 kubernetes client bootstrapping ponda
1 micro insu
2 motor upi
3 secure app installation
4 health insu express credit customer
5 secure app installation
6 aap insta
7 loan house loan customers
At the moment, it looks like this:
stringList
0 [kubernetes, client, bootstrapping, ponda]
1 [micro, insu]
2 [motor, upi]
3 [secure, app, installation]
4 [health, insu, express, credit, customer]
5 [secure, app, installation]
6 [aap, insta]
7 [loan, house, loan, customers]
Here is how to convert it into the form you need so that you can use CountVectorizer.
Here is a reproducible example:
import pandas as pd
df = pd.DataFrame([[['kubernetes', 'client', 'bootstrapping', 'ponda']], [['micro', 'insu']], [['motor', 'upi']],[['secure', 'app', 'installation']],[['health', 'insu', 'express', 'credit', 'customer']],[['secure', 'app', 'installation']],[['aap', 'insta']],[['loan', 'house', 'loan', 'customers']]])
df.columns = ['new']
I named the column new, just as it is in your original DataFrame.
df['string'] = ""
I create an empty column in which I will store the joined words from each list.
for i in df.index:
    # join the words of this row's list into one space-separated string
    df.at[i, 'string'] = " ".join(df.at[i, 'new'])
I iterate row by row, join the items of each list with " ", and store the result in the string column.
df.drop(['new'], axis = 1, inplace = True)
The column containing the lists of strings is no longer needed, so I drop it.
Your DataFrame is now in the shape you need, and you can use CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
counts = countvec.fit_transform(df['string'])
vocab = pd.DataFrame(counts.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab.columns = countvec.get_feature_names_out()
print(vocab)
This gives:
aap app bootstrapping client credit customer customers express \
0 0 0 1 1 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 0 0 0 1 1 0 1
5 0 1 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0
health house insta installation insu kubernetes loan micro motor \
0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 1 0 0 1 0
2 0 0 0 0 0 0 0 0 1
3 0 0 0 1 0 0 0 0 0
4 1 0 0 0 1 0 0 0 0
5 0 0 0 1 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0
7 0 1 0 0 0 0 2 0 0
ponda secure upi
0 1 0 0
1 0 0 0
2 0 0 1
3 0 1 0
4 0 0 0
5 0 1 0
6 0 0 0
7 0 0 0
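If scikit-learn is not otherwise needed, the per-row counting can also be sketched with collections.Counter from the standard library, since each row is already a list of words. This is a hedged illustration rather than the method used above; the sample data and the names lists_df and dtm are assumptions.

```python
from collections import Counter

import pandas as pd

# Illustrative data: a column of word lists, as in the question
lists_df = pd.DataFrame({'new': [['micro', 'insu'],
                                 ['secure', 'app', 'installation'],
                                 ['loan', 'house', 'loan', 'customers']]})

# Counter maps each word in a row to its number of occurrences;
# a list of such mappings becomes one DataFrame row per document
dtm = pd.DataFrame([Counter(words) for words in lists_df['new']],
                   index=lists_df.index)

# Words absent from a row come out as NaN, so fill with 0 and restore integers
dtm = dtm.fillna(0).astype(int)
print(dtm)
```

Unlike CountVectorizer, this keeps the words exactly as written and orders the columns by first appearance rather than alphabetically.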