How do I assign points/score based on word detection in a dataframe?
I'm new to Python and trying to learn word detection. I have a DataFrame containing transcribed text:
sharina['transcript']
Out[25]:
0 thank you for calling my name is Tiffany and we want to let you know this call is recorded...
1 Maggie
2 through the time
3 that you can find I have a question about a claim and our contact is..
4 three to like even your box box and thank you for your help...
I wrote a function to detect words in it:
def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get lines from the file, along with their line numbers, that contain any string from the list"""
    line_number = 0
    list_of_results = []
    # Open the file in read-only mode (use the file_name argument, not a hard-coded path)
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if it contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    # If a string is found, append it along with the line number and the line
                    list_of_results.append((string_to_search, line_number, line.rstrip()))
    # Return list of tuples containing matched string, line number and line where the string was found
    return list_of_results

# Search for the given strings in the file 'sharina.csv'
matched_lines = search_multiple_strings_in_file('sharina.csv', ['recorded', 'thank'])
print('Total Matched lines : ', len(matched_lines))
for elem in matched_lines:
    print('Word = ', elem[0], ' :: Line Number = ', elem[1], ' :: Line = ', elem[2])
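For example, I want to assign a score when certain words are detected in the DataFrame:
If the word 'recorded' is mentioned = 7 points
If the word 'thank' is mentioned = 5 points
The output should then give the total points/score, which is 12 in this case. How can I do this?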
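Since you mention that you already have a DataFrame, this can be done fairly simply with Series.str.extractall. First we build the capture group, which is the '|'.join of all the words, wrapped in parentheses. This gives you every matched word in a single-column DataFrame whose index shows the row each match came from. The index also has a 'match' level that numbers the matches within each row; it isn't important here.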
words = ['recorded', 'thank']
pat = '(' + '|'.join(words) + ')'
# '(recorded|thank)'

df['transcript'].str.extractall(pat)
#                 0
#   match
# 0 0         thank    # 'thank' on row 0
#   1      recorded
# 4 0         thank    # 'thank' also on row 4
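One caveat with a plain '|'.join pattern is that it matches substrings (so 'thank' would also be found inside 'thankful') and treats any regex metacharacters in the words literally as regex syntax. A minimal sketch of a stricter pattern, assuming whole-word matching is what you want; the sample Series `s` is made up for illustration:

```python
import re
import pandas as pd

# Hypothetical sample data, not the asker's real transcripts.
s = pd.Series(['thank you, this call is recorded',
               'thankful for the recording',
               'no match here'])

words = ['recorded', 'thank']

# re.escape protects against regex metacharacters in the words;
# \b limits matches to whole words, so 'thankful' is not counted as 'thank'.
pat = r'\b(' + '|'.join(re.escape(w) for w in words) + r')\b'

matches = s.str.extractall(pat)
found = matches[0].tolist()
print(found)
```

Only the first row produces matches here; 'thankful' and 'recording' are skipped because of the word boundaries.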
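If you want to assign points, a dict is a good way to organize them, with the words as keys and the points as values. You can then build the pattern by joining the keys and get the points by mapping the values: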
d = {'thank': 5, 'recorded': 7}
pat = '(' + '|'.join(d.keys()) + ')'
df1 = df['transcript'].str.extractall(pat).rename(columns={0: 'word'})
df1['points'] = df1['word'].map(d)
#              word  points
#   match
# 0 0         thank       5
#   1      recorded       7
# 4 0         thank       5
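If a score per transcript row is more useful than one grand total, the points can be summed over the first index level. This is a sketch rather than part of the original answer; the shortened transcripts and the name `row_scores` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical, shortened transcripts for illustration.
df = pd.DataFrame({'transcript': [
    'thank you for calling this call is recorded',
    'Maggie',
    'thank you for your help',
]})

d = {'thank': 5, 'recorded': 7}
pat = '(' + '|'.join(d.keys()) + ')'

df1 = df['transcript'].str.extractall(pat).rename(columns={0: 'word'})
df1['points'] = df1['word'].map(d)

# Sum the points per original row (index level 0);
# rows without any match are missing from df1, so reindex fills them with 0.
row_scores = (df1.groupby(level=0)['points'].sum()
                 .reindex(df.index, fill_value=0))
print(row_scores.tolist())
```

Note that this counts every occurrence, so a word repeated within one row adds its points each time.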
If you only want to count each word once, use drop_duplicates:
df1.drop_duplicates('word').points.sum()
#12
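For reuse, the steps can be wrapped in a small helper. This is only a sketch: the function name `total_score` is hypothetical, and it assumes each scored word should count once no matter how many rows mention it.

```python
import re
import pandas as pd

def total_score(transcripts, points):
    """Sum the points of every distinct scored word found anywhere in `transcripts`."""
    # Build the capture group from the dict keys, escaping regex metacharacters.
    pat = '(' + '|'.join(re.escape(w) for w in points) + ')'
    found = transcripts.str.extractall(pat)[0]
    # unique() keeps each detected word once, mirroring drop_duplicates.
    return sum(points[w] for w in found.unique())

df = pd.DataFrame({'transcript': [
    'thank you for calling my name is Tiffany and we want to let you know this call is recorded',
    'Maggie',
    'through the time',
    'that you can find I have a question about a claim and our contact is',
    'three to like even your box box and thank you for your help',
]})

score = total_score(df['transcript'], {'thank': 5, 'recorded': 7})
print(score)
```

With the sample data this yields 12: 'thank' appears in two rows but contributes its 5 points only once.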
Setup data:
import pandas as pd

df = pd.DataFrame({'transcript': [
    'thank you for calling my name is Tiffany and we want to let you know this call is recorded',
    'Maggie',
    'through the time',
    'that you can find I have a question about a claim and our contact is',
    'three to like even your box box and thank you for your help',
]})