How do I assign points/score based on word detection in a DataFrame?

I am new to Python and am trying to learn word detection. I have a DataFrame containing text:

sharina['transcript']
Out[25]: 
0      thank you for calling my name is Tiffany and we want to let you know this call is recorded...
1                                                Maggie 
2                                  through the time 
3      that you can find I have a question about a claim and our contact is..
4                       three to like even your box box and thank you for your help...

I wrote a function to detect words in it:

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    line_number = 0
    list_of_results = []
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append((string_to_search, line_number, line.rstrip()))
 
    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results

# search for given strings in the file 'sharina.csv'

matched_lines = search_multiple_strings_in_file('sharina.csv', ['recorded','thank'])
 
print('Total Matched lines : ', len(matched_lines))
for elem in matched_lines:
    print('Word = ', elem[0], ' :: Line Number = ', elem[1], ' :: Line = ', elem[2])

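For example, I want to assign a score if certain words are detected in the DataFrame:

If the word 'recorded' is mentioned = 7 points; if the word 'thank' is mentioned = 5 points.

The output should then give the total points/score, which is 12 in this case. How can I do this?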

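Since you mention that you already have a DataFrame:

This can be done relatively simply with Series.str.extractall. First, we create the capture group, which is the '|'.join of all the words, enclosed in parentheses. This gives you every matched word in a single column whose index indicates the row it came from. The result also has a 'match' index level that numbers the matches found on each row, which doesn't matter in this case.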

words = ['recorded', 'thank']
pat = '(' + '|'.join(words) + ')'
#'(recorded|thank)'

df['transcript'].str.extractall(pat)
#                0
#  match          
#0 0         thank     # `'thank'` on line 0
#  1      recorded
#4 0         thank     # `'thank'` also on line 4
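Note that a plain alternation matches substrings, so 'thank' would also be found inside 'thankful'. If that is not wanted, the pattern can be built more defensively. The following is a minimal sketch, not part of the original answer: it assumes whole-word matching is desired, uses `re.escape` and `\b` as additions, and runs on a small made-up Series `s`.

```python
import re
import pandas as pd

words = ['recorded', 'thank']

# Made-up sample data for illustration only
s = pd.Series(['thank you, this call is recorded',
               'thankful for the recording',
               'no match here'])

# Escape each word (in case it contains regex metacharacters) and wrap the
# capture group in word boundaries so 'thank' does not match in 'thankful'.
pat = r'\b(' + '|'.join(re.escape(w) for w in words) + r')\b'

matches = s.str.extractall(pat)
print(matches)
```

Only row 0 produces matches here; 'thankful' and 'recording' on row 1 are skipped because of the word boundaries.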

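For scoring, a dict is a good way to organize this: the keys are the words and the values are the points. You can then build the pattern by joining the keys and get the points by mapping the values: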

d = {'thank': 5, 'recorded': 7}
pat = '(' + '|'.join(d.keys()) + ')'

df1 = df['transcript'].str.extractall(pat).rename(columns={0: 'word'})
df1['points'] = df1['word'].map(d)
#             word  points
#  match                  
#0 0         thank       5
#  1      recorded       7
#4 0         thank       5
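If a score per transcript row is more useful than a single total, the points can be summed over the first index level. This is a sketch under the same assumptions as above; it rebuilds the sample data so it runs on its own, and the name `row_scores` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'transcript': [
    'thank you for calling my name is Tiffany and we want to let you know this call is recorded',
    'Maggie',
    'through the time',
    'that you can find I have a question about a claim and our contact is',
    'three to like even your box box and thank you for your help']})

d = {'thank': 5, 'recorded': 7}
pat = '(' + '|'.join(d.keys()) + ')'

df1 = df['transcript'].str.extractall(pat).rename(columns={0: 'word'})
df1['points'] = df1['word'].map(d)

# Sum the points per original row (index level 0), then reindex so that
# rows without any match show up with a score of 0.
row_scores = df1.groupby(level=0)['points'].sum().reindex(df.index, fill_value=0)
print(row_scores)
```

With this approach a word that appears several times in the same row is counted each time it appears.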

If you want to count each word only once, use drop_duplicates:

df1.drop_duplicates('word').points.sum()
#12
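To see the difference between counting each word once and counting every occurrence, the steps can be wrapped in a small function. This is only a sketch: `score_transcripts` and its `count_once` parameter are hypothetical names, and `re.escape` is an added precaution rather than part of the approach above.

```python
import re
import pandas as pd

def score_transcripts(transcripts, points, count_once=True):
    """Total score for a Series of transcripts (hypothetical helper).

    points maps each word to its score. With count_once=True every word
    contributes at most once across the whole Series.
    """
    pat = '(' + '|'.join(re.escape(w) for w in points) + ')'
    found = transcripts.str.extractall(pat)[0]  # column 0 = the capture group
    if count_once:
        found = found.drop_duplicates()
    return int(found.map(points).sum())

df = pd.DataFrame({'transcript': [
    'thank you for calling my name is Tiffany and we want to let you know this call is recorded',
    'Maggie',
    'through the time',
    'that you can find I have a question about a claim and our contact is',
    'three to like even your box box and thank you for your help']})

d = {'thank': 5, 'recorded': 7}

once = score_transcripts(df['transcript'], d)                         # 5 + 7
every = score_transcripts(df['transcript'], d, count_once=False)      # 5 + 7 + 5
print(once, every)
```

On the sample data, 'thank' appears on rows 0 and 4, so counting every occurrence adds a second 5 to the total.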

Data setup:

import pandas as pd

df = pd.DataFrame({'transcript': 
                   ['thank you for calling my name is Tiffany and we want to let you know this call is recorded',
                    'Maggie',
                    'through the time',
                    'that you can find I have a question about a claim and our contact is',
                    'three to like even your box box and thank you for your help']})