Filter all files in directory for words that match multiple regexes
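I'm trying to filter all the files in my directory (pdf, txt, csv, ipynb, etc.) for words that match my regular expressions. So far I've written a program that can read csv and pdf files (shown below), but the else statement that is meant to read all other file types keeps giving me an error (shown at the bottom). Did I put the wrong thing after the else: statement? I've tried everything and it still doesn't work.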
import glob
import re
import PyPDF2
#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)
#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
#Search for Locations
regex3 = re.compile(r"([A-Z]\w+), ([A-Z]{2})")
for file in folder_contents:
    if re.search(r".*(?=pdf$)",file):
        #this is pdf
        with open(file, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            pageObj = pdfReader.getPage(0)
            read_file = pageObj.extractText()
        #print("{}".format(file))
    elif re.search(r".*(?=csv$)",file):
        #this is csv
        with open(file,"r+",encoding="utf-8") as csv:
            read_file = csv.read()
    else:
        with open(file,"rt", encoding='latin-1') as allOtherFiles:
            continue
    if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
        print ("YES, This file containts PHI")
        print(file)
    else:
        print("No, This file DOES NOT contain PHI")
        print(file)
I get an error message saying IsADirectoryError: [Errno 21] Is a directory. Do you know why this error appears every time I run the code?
---------------------------------------------------------------------------
IsADirectoryError Traceback (most recent call last)
<ipython-input-40-fdb88fbf61ab> in <module>()
29 read_file = csv.read()
30 else:
---> 31 with open(file,"rt", encoding='latin-1') as allOtherFiles:
32 continue
33 if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
IsADirectoryError: [Errno 21] Is a directory: '/home/jupyter_shared_notebooks'
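Could you try changing the statement
with open(file,"rt") as allOtherFiles:
to
with open(file,"rt", encoding='latin-1') as allOtherFiles:
Then run the code again and see whether you hit the same error. If it still fails, we will have to try other encodings.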
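To illustrate why latin-1 is a forgiving choice for "all other files", here is a minimal sketch. It relies only on the fact that latin-1 maps every possible byte value to a character, so decoding cannot raise UnicodeDecodeError, whereas utf-8 rejects some byte sequences. The variable names are made up for this example.

```python
# Every byte value from 0 to 255, standing in for arbitrary file contents.
raw = bytes(range(256))

# latin-1 assigns one character to each byte, so this always succeeds.
text = raw.decode("latin-1")

# utf-8 is stricter: bytes such as 0x80 are not valid on their own.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(text), utf8_ok)
```

The trade-off is that latin-1 may show the wrong characters for non-ASCII text that was really saved in another encoding, although ASCII content such as the phone number pattern is unaffected.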
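Edit:
To resolve your next error:
IsADirectoryError: [Errno 21] Is a directory: /home/e136320/jupyter_shared_notebooks
This is caused by an entry in your folder named jupyter_shared_notebooks, which is a directory rather than a regular file.
Python's open() cannot open a directory, and glob returns directories along with files, so the else branch ends up trying to open it and raises this error.
To work around this, you can try skipping anything without a file extension:
    if '.' not in file:
        continue
    else:
        with open(file,"rt", encoding='latin-1') as allOtherFiles:
            read_file = allOtherFiles.read()
        #rest of your code here
Note that the '.' check is only a heuristic: it also skips regular files that have no extension, and it does not skip directories whose path contains a dot. Checking os.path.isfile(file) is a more direct test. Also note that the continue inside your original with block skips the rest of the loop body, so read_file is never set for those files; reading the file as above fixes that.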
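For reference, here is a minimal sketch of the loop with an explicit directory check instead of the '.' heuristic. The three patterns are taken from the question; the helper names read_text, contains_phi and scan_folder are hypothetical, and the PDF branch is left out so that the sketch needs only the standard library.

```python
import os
import re

# Patterns copied from the question.
EMAIL = re.compile(r'\S+@\S+')
PHONE = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
LOCATION = re.compile(r'([A-Z]\w+), ([A-Z]{2})')


def read_text(path):
    """Read a file as text; latin-1 decoding accepts any byte sequence."""
    with open(path, "rt", encoding="latin-1") as handle:
        return handle.read()


def contains_phi(text):
    """Return True if any of the three patterns matches the text."""
    return any(rx.search(text) for rx in (EMAIL, PHONE, LOCATION))


def scan_folder(folder_path):
    """Map each regular file in folder_path to True or False."""
    results = {}
    for name in sorted(os.listdir(folder_path)):
        path = os.path.join(folder_path, name)
        if not os.path.isfile(path):
            # Directories such as jupyter_shared_notebooks are skipped here,
            # which avoids the IsADirectoryError from open().
            continue
        results[path] = contains_phi(read_text(path))
    return results
```

Calling scan_folder("/home/") would then return a dictionary you can loop over to print the YES/NO lines from the question. In the real program, PDF files would still need the PyPDF2 branch, since reading them as latin-1 text would not give you the extracted page text.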