Python, Pandas: Faster File Search than os.path?
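I have a pandas df with file names that need to be searched/matched in a directory tree.
I've been using the following, but it crashes on larger directory structures. I record whether each file is present or not in two lists.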
found = []
missed = []
for target_file in df_files['Filename']:
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)
I've read that scandir is faster and can handle larger directory trees. If that's true, how could this be rewritten?
My attempt:
found = []
missed = []
for target_file in df_files['Filename']:
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)
This runs (quickly), but everything ends up in the 'missed' list.
Scan the directory only once and convert it to a dataframe.
Example on my venv directory:
import pandas as pd
import pathlib
DIRECTORY_TREE = pathlib.Path('./venv').resolve()
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])
df_files = pd.DataFrame({'Filename': ['__init__.py']})
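As an aside on the os.scandir attempt in the question: os.scandir lists only a single directory level and does not recurse, and `DirEntry.name` is an attribute rather than a method. A minimal sketch of a recursive scan that yields the same (directory, filename) pairs as the glob above follows; `iter_files` is a hypothetical helper name, not part of any library.

```python
import os

def iter_files(root):
    """Yield (directory, filename) for every regular file under root."""
    stack = [os.fspath(root)]  # directories still to be scanned
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # Descend into real directories only, so symlink loops are avoided
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file():
                    # .name is an attribute, not a method
                    yield current, entry.name
```

The result can be passed to the same constructor, for example `pd.DataFrame(list(iter_files(DIRECTORY_TREE)), columns=['Directory', 'Filename'])`. Whether this is faster than `glob('**/*')` on a given tree would need to be measured.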
Now you can use merge to look up the file names from df_files in df_path:
out = (df_files.merge(df_path, on='Filename', how='left')
               .value_counts('Filename').to_frame('Found'))
out['Missed'] = len(df_path) - out['Found']
print(out.reset_index())
# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
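If you want the `found` and `missed` lists from the question rather than counts, one option is the `indicator=True` argument of `merge`. With `how='left'`, a name that is absent from the tree still produces one row (with a missing Directory), so counting rows alone cannot tell the two cases apart. A minimal sketch, where the small frames are made-up stand-ins for `df_path` and `df_files`:

```python
import os
import pandas as pd

# Made-up stand-ins for the df_path and df_files frames built earlier
df_path = pd.DataFrame({
    'Directory': ['/proj/a', '/proj/b', '/proj/b'],
    'Filename': ['__init__.py', '__init__.py', 'setup.py'],
})
df_files = pd.DataFrame({'Filename': ['__init__.py', 'missing.txt']})

# indicator=True adds a '_merge' column: 'both' for a match, 'left_only' otherwise
merged = df_files.merge(df_path, on='Filename', how='left', indicator=True)
is_found = merged['_merge'] == 'both'

# Full path of every match (a name can match in several directories)
found = [os.path.join(d, f)
         for d, f in zip(merged.loc[is_found, 'Directory'],
                         merged.loc[is_found, 'Filename'])]
# Target names that appear nowhere in the scanned tree
missed = merged.loc[~is_found, 'Filename'].tolist()

print('Found: ', len(found), 'Missed: ', len(missed))
```

Here `missed` holds each unmatched name once, whereas the loop in the question appended a name for every directory that did not contain it.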