How to conditionally skip HTML files that don't contain tables in pd.read_html()?

Question

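I want to iterate over several directories on my local machine, each of which contains an HTML file. I have stored the path to each file in a list, and now I want to loop over the files and read each one with something like pd.read_html to extract the table information. However, some files contain no tables, which raises ValueError: No tables found. The error itself is expected; I just need help with the logic to skip those files.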

I have read the pd.DataFrame (here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and pd.read_html (here: https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.read_html.html) documentation, but could not find the logic I am looking for.

Here is what I have so far:

from pathlib import Path
import pandas as pd

# initialize the path
p = Path('C:/path/to/directories/')

# glob all html file paths into a list of paths
html_paths = list(p.glob('**/*.html'))

Now that I have a list of paths, I want to iterate over it and read each file with pd.read_html. I can do that easily with the following code:

# initialize empty data frame to append pd.read_html() output to
html_files = pd.DataFrame()

# iterate over each file and read in using pandas
for p in html_paths:
    html_files.append(pd.read_html(str(p)))

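However, because some of my HTML files contain no tables, the for loop raises an error when it reaches them. I want a way to skip the files that have no tables while reading, so the loop keeps appending the remaining files instead of breaking.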

Answer 1

You can simply do:

for p in html_paths:
    try:
        html_files = html_files.append(pd.read_html(str(p)))
    except ValueError:
        # this file has no tables, so skip it and move on
        pass
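Note that DataFrame.append was removed in pandas 2.0, so on recent versions the loop above needs pd.concat instead. Below is a minimal sketch of the same try/except idea written that way. The names collect_tables, reader, frames, and skipped are hypothetical, and the reader parameter is an assumption added only so the loop can be exercised without real HTML files; by default it is pd.read_html.

```python
import pandas as pd


def collect_tables(paths, reader=pd.read_html):
    """Concatenate every table found in `paths`, skipping files with no tables.

    `collect_tables` and `reader` are illustrative names, not part of pandas.
    """
    frames = []   # one DataFrame per table found
    skipped = []  # paths that raised ValueError
    for path in paths:
        try:
            # read_html returns a list of DataFrames, one per table in the file
            tables = reader(str(path))
        except ValueError:
            # pandas raises ValueError("No tables found") for table-less files
            skipped.append(path)
            continue
        frames.extend(tables)
    # pd.concat raises on an empty list, so fall back to an empty frame
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, skipped
```

With the list from the question, this could be called as html_files, skipped = collect_tables(html_paths). Keeping the skipped paths makes it easy to check afterwards which files were passed over. One caveat: except ValueError also swallows any other ValueError raised while parsing, so inspecting the skipped list is worthwhile if results look incomplete.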

python

pandas

try-except