Return the list of each word in a pandas cell and the total count of that word in the entire column
I have a pandas DataFrame, df, that looks like this:
column1
0 apple is a fruit
1 fruit sucks
2 apple tasty fruit
3 fruits what else
4 yup apple map
5 fire in the hole
6 that is true
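I want to generate a column2 that lists each word in the row together with that word's total count across the entire column. So the output would look something like this: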
column1 column2
0 apple is a fruit [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1 fruit sucks [('fruit', 3),('sucks', 1)]
I tried using sklearn, but could not achieve the above. I need help.
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['column1'])
Here is one way to get the result you want, although it avoids sklearn entirely:
import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the whole column, split on whitespace
    total_words = " ".join(datr).split()
    # process each row
    for text in datr:
        # words of this row; non-word characters are treated as separators
        word_list = re.sub(r"[^\w]", " ", text).split()
        # pair each word with its count over the whole column
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            total_row.append((word, count))
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df, 'column1')
df
column1 column2
0 apple is a fruit [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1 fruit sucks [(fruit, 3), (sucks, 1)]
2 apple tasty fruit [(apple, 3), (tasty, 1), (fruit, 3)]
3 fruits what else [(fruits, 1), (what, 1), (else, 1)]
4 yup apple map [(yup, 1), (apple, 3), (map, 1)]
5 fire in the hole [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6 that is true [(that, 1), (is, 2), (true, 1)]
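As a variation on the function above, the column can be counted once with collections.Counter instead of calling list.count for every word. This is only a sketch, not the code from the answer: the helper names tokenize and counts_with_counter are made up for illustration, and it assumes every cell in column1 is a string.

```python
import re
from collections import Counter

import pandas as pd

df = pd.DataFrame({"column1": [
    "apple is a fruit",
    "fruit sucks",
    "apple tasty fruit",
    "fruits what else",
    "yup apple map",
    "fire in the hole",
    "that is true",
]})

def tokenize(text):
    # hypothetical helper: runs of word characters count as words
    return re.findall(r"\w+", text)

def counts_with_counter(data, column):
    # tokenize every row once
    rows = [tokenize(text) for text in data[column]]
    # count each word over the whole column in a single pass
    totals = Counter(word for row in rows for word in row)
    # pair each word in a row with its column-wide total
    return [[(word, totals[word]) for word in row] for row in rows]

df["column2"] = counts_with_counter(df, "column1")
```

Because the same tokenizer is used for the rows and for the totals, punctuation cannot make the two disagree.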
I don't know whether you can do this with scikit-learn, but you can write a function and then use apply() to apply it to your DataFrame or Series.
Here is how you could go about it for your example:
import pandas as pd

test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns=['A'])

def a_function(row):
    # split the text of this row into words
    splitted_row = str(row.values[0]).split()
    word_occurences = []
    for word in splitted_row:
        # str.count treats word as a pattern and counts every match in each cell
        column_occurences = test.A.str.count(word).sum()
        word_occurences.append((word, column_occurences))
    return word_occurences

test.apply(a_function, axis=1)
# Output
0 [(apple, 2), (is, 1), (a, 4), (fruit, 3)]
1 [(fruit, 3), (sucks, 1)]
2 [(apple, 2), (tasty, 1), (fruit, 3)]
dtype: object
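As you can see, the main problem is that test.A.str.count(word) counts every occurrence of word wherever the pattern assigned to word appears in the string, including inside other words. That is why "a" is shown as occurring 4 times. This should be easy to fix with some regex (which I'm not very good at).
Or, if you are willing to lose some words, you can use this workaround inside the function above: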
if word not in ['a', 'is']:  # you can add more useless words here
    word_occurences.append((word, column_occurences))
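For the regex fix mentioned above, one possible sketch is to wrap each word in \b word boundaries so that only whole words match, and to pass it through re.escape so that any special characters in the word are taken literally. The function name whole_word_counts is made up for illustration, and this assumes the words start and end with word characters (a token such as "c++" would not be matched reliably by \b).

```python
import re

import pandas as pd

test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns=['A'])

def whole_word_counts(text):
    word_occurences = []
    for word in str(text).split():
        # \b on both sides keeps "a" from matching inside "apple" or "tasty"
        pattern = r"\b" + re.escape(word) + r"\b"
        column_occurences = int(test.A.str.count(pattern).sum())
        word_occurences.append((word, column_occurences))
    return word_occurences

result = test.A.apply(whole_word_counts)
```

With this pattern "a" is counted once for the three-row example, and no words have to be dropped.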