Python, Pandas: Faster File Search than os.path?

I have a pandas df with file names that need to be searched/matched in a directory tree.

I've been using the code below, but it crashes with larger directory structures. I record whether or not each file is present in two lists.

found = []
missed = []

for target_file in df_files['Filename']:
    
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)

I've read that scandir is quicker and will handle larger directory trees. If that's true, how might this be rewritten?

My attempt:

found = []
missed = []

for target_file in df_files['Filename']:
    
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
            
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)

This runs (fast), but everything ends up in the "missed" list.

Scan your directory only once and convert it to a dataframe.

Example on my venv directory:

import pandas as pd
import pathlib

DIRECTORY_TREE = pathlib.Path('./venv').resolve()
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])

df_files = pd.DataFrame({'Filename': ['__init__.py']})

Now you can use merge to look up the file names from df_files in df_path:

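# Left merge keeps every requested file name; unmatched names get NaN in 'Directory'.
# count() skips NaN, so a name that is absent from the tree reports Found == 0.
out = (df_files.merge(df_path, on='Filename', how='left')
               .groupby('Filename')['Directory'].count().to_frame('Found'))
# 'Missed' here is the number of scanned files that do not carry this name.
out['Missed'] = len(df_path) - out['Found']
print(out.reset_index())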

# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
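
If you would rather keep the found/missed lists from the question and stay with os.scandir, the same "scan once" idea can be sketched as below. Two things to note about the attempt in the question: os.scandir only lists a single directory level (it does not recurse the way os.walk does), and name is an attribute of the entry, not a method. The helper names index_tree and find_files are made up for this illustration, and skipping symlinked directories is an assumption on my part, not something the question asked for.

```python
import os
from collections import defaultdict


def index_tree(root):
    """Scan the tree under root once and map each file name to its full paths."""
    index = defaultdict(list)
    stack = [root]  # directories still to visit (iterative, so no recursion limit)
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)  # descend later
                elif entry.is_file():
                    # entry.name is an attribute, not a method
                    index[entry.name].append(entry.path)
    return index


def find_files(targets, root):
    """Return (found_paths, missed_names) for the requested file names."""
    index = index_tree(root)
    found, missed = [], []
    for name in targets:
        if name in index:
            found.extend(index[name])  # one name may exist in several directories
        else:
            missed.append(name)  # recorded once per name, not once per directory
    return found, missed
```

With the names from the question this would be called as found, missed = find_files(df_files['Filename'], DIRECTORY_TREE). Each lookup is then a dictionary access rather than a fresh walk of the tree, which is where the time goes in the original loop.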