How do I keep only the files in a directory that contain a specific string?
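I am trying to open all the HTML files in a directory, read them, and keep only the HTML files that contain the phrase "apples and oranges."
I tried opening each file in the directory and then applying the BeautifulSoup function to it.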
import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
directory = "/directorypath"
for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)):
            soup = BeautifulSoup(files, 'html.parser')
            print(soup.prettify())
            soup.find_all('apples and oranges')
            filename.close()
My expected result is to see only the files in the directory that contain the phrase "apples and oranges."
The error message says:
  File "soupy4.py", line 14, in <module>
    filename.close()
AttributeError: 'str' object has no attribute 'close'
marshiehmacbook:board marcyshieh$ python3 soupy4.py
Traceback (most recent call last):
  File "soupy4.py", line 11, in <module>
    soup = BeautifulSoup(files, 'html.parser')
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 300, in __init__
    markup, from_encoding, exclude_encodings=exclude_encodings)):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/builder/_htmlparser.py", line 240, in prepare_markup
    exclude_encodings=exclude_encodings)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/dammit.py", line 374, in __init__
    for encoding in self.detector.encodings:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/dammit.py", line 265, in encodings
    self.markup, self.is_html)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/dammit.py", line 323, in find_declared_encoding
    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
TypeError: expected string or bytes-like object
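I think the problem is that you never actually read the file. In soup = BeautifulSoup(files, 'html.parser'), files is not a string; it is the list of file names that os.walk yields for the current directory. The filename.close() call fails for a similar reason: filename is a str, and a file opened in a with block is closed automatically, so that line can be dropped.
You need to read the file first and then pass its contents to BeautifulSoup: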
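import os
import fnmatch
from bs4 import BeautifulSoup

directory = "/directorypath"

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        # Open the file and read its contents; the with block closes it automatically.
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.prettify())
        # find_all('apples and oranges') would look for a tag with that name,
        # so check the document's text instead.
        if 'apples and oranges' in soup.get_text():
            print('Found apples and oranges in %s' % filename)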
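To see why the original call raised a TypeError, it may help to look at what os.walk actually yields. The sketch below builds a throwaway directory (the file names a.html and b.txt are made up for illustration) and prints the type of the third item in the tuple:

```python
import os
import tempfile

# Build a temporary directory with two made-up files so the example is self-contained.
with tempfile.TemporaryDirectory() as directory:
    for name in ("a.html", "b.txt"):
        with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
            f.write("<p>apples and oranges</p>")

    # Collect the (dirpath, dirs, files) tuples before the directory is cleaned up.
    walked = list(os.walk(directory))

dirpath, dirs, files = walked[0]
print(type(files).__name__)  # list
print(sorted(files))         # ['a.html', 'b.txt']
```

Each entry in files is a plain file name, not file contents, which is why it has to be joined with dirpath, opened, and read before any parsing happens.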
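In fact, if all you are doing is checking whether the phrase is in the file, you don't need BeautifulSoup. Once the file has been read in, check whether the phrase is in the text: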
import os
import fnmatch

directory = "/directorypath"
remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
            if 'apples and oranges' in html:
                print('Found apples and oranges.')
            else:
                remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print('REMOVED: %s' % each)
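Since os.remove is irreversible, one possible variation is to separate finding the files from deleting them and to default to a dry run. This is only a sketch: the function names find_files_without_phrase and remove_files, the dry_run flag, and the choice of pathlib with UTF-8 decoding are assumptions for illustration, not part of the answer above.

```python
from pathlib import Path


def find_files_without_phrase(directory, phrase, pattern="*.html"):
    """Return the files under directory matching pattern whose text lacks phrase."""
    missing = []
    for path in sorted(Path(directory).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" is an assumption: it keeps undecodable bytes
        # from aborting the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        if phrase not in text:
            missing.append(path)
    return missing


def remove_files(paths, dry_run=True):
    """Report each path, and delete it only when dry_run is False."""
    for path in paths:
        if dry_run:
            print(f"WOULD REMOVE: {path}")
        else:
            path.unlink()
            print(f"REMOVED: {path}")


# Hypothetical usage, with the placeholder path from the question:
# to_remove = find_files_without_phrase("/directorypath", "apples and oranges")
# remove_files(to_remove)                 # dry run, only prints
# remove_files(to_remove, dry_run=False)  # actually deletes
```

Running the dry run first gives a chance to confirm the list before anything is deleted.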