"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]" 如果 col B 包含字符串,则设置 col A 值
"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]" setting col A value if col B contains string
I have a dataframe (called corpus) with one column (tweet) and 2 rows:
['check, tihs, out, this, bear, love, jumping, on, this, plant']
['i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
I have a list of the unique words in that column (called vocab):
['check',
'tihs',
'out',
'this',
'bear',
'love',
'jumping',
'on',
'plant',
'i',
'can',
't',
'the',
'noise',
'from',
'that',
'power',
'it',
'make',
'me',
'jump']
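I want to add a new column for each word in vocab. Every value in a new column should be zero, unless the tweet contains the word, in which case the value in that word's column should be 1.
So I tried running the code below: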
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
...and got the following error:
"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"
How can I check whether a tweet contains the word and, if so, set the value of that word's new column to 1?
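The values in your corpus['tweet'] are lists, each one a singleton holding a single string. .str.contains returns NaN for values that are not strings, so the mask you pass to .loc is all NaN rather than boolean. You might want to do: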
# turn tweets into strings
corpus["tweet"] = [x[0] for x in corpus['tweet']]
# one-hot-encode
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
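A minimal reproduction of the NaN mask, assuming the tweets are stored as single-element lists as in the question (the variable names mask_on_lists and mask_on_strings are only illustrative):

```python
import pandas as pd

# Assumed shape of the data: each cell is a list holding one string
corpus = pd.DataFrame({
    "tweet": [
        ["check, tihs, out, this, bear, love, jumping, on, this, plant"],
        ["i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump"],
    ]
})

# .str.contains yields NaN for every non-string value,
# so this result cannot serve as a boolean mask
mask_on_lists = corpus["tweet"].str.contains("bear")
print(mask_on_lists.isna().all())

# After unwrapping each list, the same call yields a proper boolean mask
mask_on_strings = corpus["tweet"].str[0].str.contains("bear")
print(mask_on_strings.tolist())
```

Here .str[0] does the same unwrapping as the list comprehension above.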
But this may not be what you want, because contains matches any substring. For example, 'this girl goes to school' would get a 1 in both the 'is' and 'this' columns.
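If you would rather keep the str.contains loop, one possible workaround (a sketch, not something the answer itself proposes) is to wrap each word in regex word boundaries. re.escape is used on the assumption that vocab words might contain regex metacharacters:

```python
import re

import pandas as pd

# Illustrative sentences, not taken from the question's data
tweets = pd.Series(["this girl goes to school", "he is at school"])

# Plain substring search: "is" is also found inside "this"
substring_hit = tweets.str.contains("is")

# \b anchors the match at word boundaries, so only the standalone word counts
whole_word_hit = tweets.str.contains(rf"\b{re.escape('is')}\b", regex=True)

print(substring_hit.tolist())
print(whole_word_hit.tolist())
```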
Depending on your data, you could instead do:
corpus["tweet"] = [x[0] for x in corpus['tweet']]
corpus = corpus.join(corpus['tweet'].str.get_dummies(', ')
                         .reindex(vocab, axis=1, fill_value=0)
                     )
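As a check of the get_dummies approach on the question's data (again assuming single-element lists), the sketch below shows that it matches whole comma-separated tokens, so jump and jumping stay distinct. The name encoded is illustrative:

```python
import pandas as pd

corpus = pd.DataFrame({
    "tweet": [
        ["check, tihs, out, this, bear, love, jumping, on, this, plant"],
        ["i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump"],
    ]
})
vocab = ["check", "tihs", "out", "this", "bear", "love", "jumping", "on",
         "plant", "i", "can", "t", "the", "noise", "from", "that", "power",
         "it", "make", "me", "jump"]

# Unwrap the single-element lists into plain strings
corpus["tweet"] = corpus["tweet"].str[0]

# Split on ", " and build one 0/1 indicator column per distinct token;
# reindex orders the columns like vocab and fills absent words with 0
dummies = corpus["tweet"].str.get_dummies(", ").reindex(vocab, axis=1, fill_value=0)
encoded = corpus.join(dummies)

print(encoded[["this", "jump", "jumping"]])
```

Repeated words such as this in the first tweet still produce a 1, not a count.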
You can do it like this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

l = ['check, this, out, this, bear, love, jumping, on, this, plant',
     'i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
vect = CountVectorizer()
# count how often each token occurs in each tweet
X = pd.DataFrame(vect.fit_transform(l).toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X.columns = vect.get_feature_names_out()
Output:
   bear  can  check  from  it  jump  ...  out  plant  power  that  the  this
0     1    0      1     0   0     0  ...    1      1      0     0    0     3
1     1    1      0     1   1     1  ...    0      1      1     1    1     0