Lookup multiple words in a sentence in a Dataframe and convert to a sum of scores
I have the following dataframe:
Sentence
0 Cat is a big lion
1 Dogs are descendants of wolf
2 Elephants are pachyderm
3 Pachyderm animals include rhino, Elephants and hippopotamus
I need to write Python code that looks at the words in the sentences above and, for each sentence, computes the sum of the word scores taken from the following separate dataframe:
Name Score
cat 1
dog 2
wolf 2
lion 3
elephants 5
rhino 4
hippopotamus 5
For example, for row 0 the score would be 1 (cat) + 3 (lion) = 4.
I would like to produce output like the following:
Sentence Value
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and hippopotamus 14
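For reference, a minimal, reproducible setup for the two frames (the variable names are an assumption chosen to match the first answer below, which calls the sentences df1 and the scores df2; the later answers call the same data df / scores or df / df1):

import pandas as pd

# Sentences to be scored
df1 = pd.DataFrame({'Sentence': [
    'Cat is a big lion',
    'Dogs are descendants of wolf',
    'Elephants are pachyderm',
    'Pachyderm animals include rhino, Elephants and hippopotamus',
]})

# Word scores
df2 = pd.DataFrame({
    'Name': ['cat', 'dog', 'wolf', 'lion', 'elephants', 'rhino', 'hippopotamus'],
    'Score': [1, 2, 2, 3, 5, 4, 5],
})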
As a first pass, you can try a split-and-map based approach, then compute the score per sentence with a groupby:
# Split each sentence on whitespace/punctuation, stack to one word per row, lowercase
v = df1['Sentence'].str.split(r'[\s.!?,]+', expand=True).stack().str.lower()

# Map each word to its score, then sum per original sentence (index level 0);
# unmatched words map to NaN and are ignored by the sum
df1['Value'] = (
    v.map(df2.set_index('Name')['Score'])
     .groupby(level=0).sum()
     .fillna(0)
     .astype(int))

df1
Sentence Value
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4 # 4 only if df2 has 'dogs' rather than 'dog' (s/dog/dogs)
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and... 14
If you use nltk, you may need to download some data first:
import nltk
nltk.download('punkt')
Then set up a stemmer and tokenizer:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
Build a convenient lookup dictionary keyed by the stemmed names:
m = dict(zip(map(ps.stem, scores.Name), scores.Score))
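Keying the dictionary on stems is what resolves the dog/dogs mismatch noted in the previous answer: NLTK's PorterStemmer lowercases by default and strips plural endings, so the token in the sentence and the name in the score table reduce to the same key. A quick check, under that default-lowercasing assumption:

ps.stem('Dogs') == ps.stem('dog')   # expected: True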
Then compute the scores:
def f(s):
    # Tokenize, stem each token, and look up its score; filter(None, ...) drops
    # tokens whose stem is not in the dictionary (m.get returns None for them)
    return sum(filter(None, map(m.get, map(ps.stem, word_tokenize(s)))))

df.assign(Score=[*map(f, df.Sentence)])
Sentence Score
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and... 14
Try str.findall with a regex built from the names, using re and the re.I flag:
import re

# Here df holds the sentences and df1 holds the scores.
# Find every name (case-insensitively) in each sentence, then sum their scores.
df.Sentence.str.findall(df1.Name.str.cat(sep='|'), flags=re.I)\
  .map(lambda x: sum([df1.loc[df1.Name == str.lower(y), 'Score'].values for y in x])[0])
Out[49]:
0 4
1 4
2 5
3 14
Name: Sentence, dtype: int64
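As a more defensive sketch of the same regex idea (not part of the answer above): wrapping the alternation in \b word boundaries keeps a short name such as rhino from matching inside a longer word, and a lowercase dict avoids a DataFrame lookup per match.

import re

pattern = r'\b(?:' + '|'.join(map(re.escape, df1.Name)) + r')\b'
lookup = dict(zip(df1.Name.str.lower(), df1.Score))

df.Sentence.str.findall(pattern, flags=re.I)\
  .map(lambda words: sum(lookup[w.lower()] for w in words))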