How can I do a python keyword search and word count across all csv files in a directory and write the results to a single csv?

I am new to python and trying to learn some libraries. I'm not sure how to upload a csv to SO, but this script works for any csv - just replace 'SwitchedProviders_TopicModel'.

My objective is to loop through all the csvs in the directory C:\Users\jj\Desktop\autotranscribe and write my python script's output for each file to a single csv.

Let's say, for example, that I have these csv files in the above folder -

'1003391793_1003391784_01bc7e411408166f7c5468f0.csv' '1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv' '1003478130_1003478103_8eef05b0820cf0ffe9a9882d.csv'

I want my python app (below) to run a word counter for every csv in the folder/directory and write the output to a dataframe like this -

csvname                                             pre existing  exclusions  limitations  fourteen
1003391793_1003391784_01bc7e411408166f7c5468f0.csv  1             2           0            1

My script -

import pandas as pd
from collections import defaultdict

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    line_number = 0
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open("SwitchedProviders_TopicModel.csv", 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append((string_to_search, line_number, line.rstrip()))
 
    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)


matched_lines, count = search_multiple_strings_in_file('SwitchedProviders_TopicModel.csv', ['pre existing', 'exclusions', 'limitations', 'fourteen'])
df = pd.DataFrame.from_dict(count, orient='index').reset_index()
df.columns = ['Word', 'Count']

print(df)

How can I accomplish this? As you can see in my script, I am only counting specific words, such as "fourteen", not counting every word.

Sample data from one of the csvs - credit to user Umar H

df = pd.read_csv('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv')
print(df.head(10).to_dict())
{'transcript': {0: 'hi thanks for calling ACCA  this is many speaking could have the pleasure speaking with ', 1: 'so ', 2: 'hi ', 3: 'I have the pleasure speaking with my name is B. as in boy E. V. D. N. ', 4: 'thanks yes and I think I have your account pulled up could you please verify your email ', 5: "sure is yeah it's on _ 00 ", 6: 'I T. O.com ', 7: 'thank you how can I help ', 8: 'all right I mean I do have an insurance with you guys I just want to cancel the insurance ', 9: 'sure I can help with that what was the reason for cancellation '}, 'confidence': {0: 0.73, 1: 0.18, 2: 0.88, 3: 0.72, 4: 0.83, 5: 0.76, 6: 0.83, 7: 0.98, 8: 0.89, 9: 0.95}, 'from': {0: 1.69, 1: 1.83, 2: 2.06, 3: 2.13, 4: 2.36, 5: 2.98, 6: 3.17, 7: 3.65, 8: 3.78, 9: 3.93}, 'to': {0: 1.83, 1: 2.06, 2: 2.13, 3: 2.36, 4: 2.98, 5: 3.17, 6: 3.65, 7: 3.78, 8: 3.93, 9: 4.14}, 'speaker': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'Negative': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.116, 9: 0.0}, 'Neutral': {0: 0.694, 1: 1.0, 2: 1.0, 3: 0.802, 4: 0.603, 5: 0.471, 6: 1.0, 7: 0.366, 8: 0.809, 9: 0.643}, 'Positive': {0: 0.306, 1: 0.0, 2: 0.0, 3: 0.198, 4: 0.397, 5: 0.529, 6: 0.0, 7: 0.634, 8: 0.075, 9: 0.357}, 'compound': {0: 0.765, 1: 0.0, 2: 0.0, 3: 0.5719, 4: 0.7845, 5: 0.5423, 6: 0.0, 7: 0.6369, 8: -0.1779, 9: 0.6124}}
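For the counting step by itself, a minimal pandas sketch, assuming every csv has a 'transcript' column like the sample above (str.count interprets its argument as a regex, hence the re.escape):

import re
import pandas as pd

keywords = ['pre existing', 'exclusions', 'limitations', 'fourteen']
df = pd.read_csv('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv')

# Sum keyword occurrences over every row of the transcript column;
# rows with NaN transcripts are skipped by .sum().
counts = {kw: int(df['transcript'].str.count(re.escape(kw)).sum())
          for kw in keywords}
print(counts)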

Steps -

  1. Define the input path
  2. Collect all the CSV files
  3. Count the keywords
  4. Build one result dict, adding each file name and its counter dict.
  5. Finally, convert the resulting dict to a dataframe and transpose it. (Fill NaN values with 0 if needed)

from collections import defaultdict
from pathlib import Path

import pandas as pd

inp_dir = Path(r'C:/Users/jj/Desktop/Bulk_Wav_Completed')  # input directory


def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line_number, line in enumerate(read_obj, start=1):
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append(
                        (string_to_search, line_number, line.rstrip()))

    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)


result = {}
for csv_file in inp_dir.glob('**/*.csv'):
    print(csv_file) # for debugging
    matched_lines, count = search_multiple_strings_in_file(csv_file, ['nation', 'nation wide', 'trupanion', 'pet plan', 'best', 'embrace', 'healthy paws', 'pet first', 'pet partners', 'lemon',
                                                                      'AKC', 'akc', 'kennel club', 'club', 'american kennel', 'american', 'lemonade',
                                                                      'kennel', 'figo', 'companion protect', 'true companion',
                                                                      'true panion', 'trusted pals', 'partners', 'partner',
                                                                      'wagmo', 'vagmo', 'bivvy', 'bivy', 'bee', '4paws', 'paws',
                                                                      'pets best', 'pet best'])
    print(count)  # for debugging
    result[csv_file.name] = count
df = pd.DataFrame(result).T.fillna(0).astype(int)

Output -

       exclusions  limitations  pre existing
1.csv           1            3             1
2.csv           1            3             1
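To finish the "write to a single csv" part of the question, the combined dataframe can be written straight out (the output file name here is just a placeholder):

# Persist the per-file keyword counts as one csv;
# index_label names the file-name index column.
df.to_csv('keyword_counts.csv', index_label='csvname')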

Since you have tagged pandas, we can use .str.extractall to search for the words and row numbers.

You could extend the function and add some error handling (e.g. what happens if 'transcript' does not exist in a given csv file) - see the sketch after the usage example below.

from pathlib import Path
import pandas as pd

def get_files_to_parse(start_dir : str) -> list:
    files = list(Path(start_dir).glob('*.csv'))
    return files

def search_multiple_files(list_of_paths : list,key_words : list) -> pd.DataFrame:
    dfs = []
    for file in list_of_paths:
        df = pd.read_csv(file)
        word_df = df['transcript'].str.extractall(f"({'|'.join(key_words)})")\
                        .droplevel(1, 0)\
                        .reset_index()\
                        .rename(columns={'index': f"{file.parent}_{file.stem}"})\
                        .set_index(0).T
        dfs.append(word_df)
    return pd.concat(dfs)

Usage.

Using your sample dataframe (I added a couple of keywords from your list)

files = get_files_to_parse(r'target\dir\folder')


[WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv'),
 WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c_copy.csv')]

search_multiple_files(files,['pre existing', 'exclusions','limitations','fourteen'])
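And a sketch of the error handling suggested above - an assumed extension, not part of the original answer - that skips unreadable csvs and csvs without a 'transcript' column:

def search_multiple_files_safe(list_of_paths : list, key_words : list) -> pd.DataFrame:
    dfs = []
    for file in list_of_paths:
        try:
            df = pd.read_csv(file)
        except (pd.errors.ParserError, pd.errors.EmptyDataError) as err:
            print(f"skipping {file}: {err}")  # unreadable or empty csv
            continue
        if 'transcript' not in df.columns:
            print(f"skipping {file}: no 'transcript' column")
            continue
        word_df = df['transcript'].str.extractall(f"({'|'.join(key_words)})")\
                        .droplevel(1, 0)\
                        .reset_index()\
                        .rename(columns={'index': f"{file.parent}_{file.stem}"})\
                        .set_index(0).T
        dfs.append(word_df)
    if not dfs:
        return pd.DataFrame()  # nothing readable matched
    return pd.concat(dfs)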