Repairing a broken HTML table extracted with BS4 in Python
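I am parsing HTML tables from administrative documents. This is tricky because the HTML is often broken, which leads to poorly constructed tables. Here is an example of a table I loaded into a pandas DataFrame: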
0 1 2 3 4 5 \
0 NaN NaN NaN NaN NaN NaN
1 Name NaN Age NaN NaN Position
2 Aylwin Lewis NaN NaN 59.0 NaN NaN
3 John Morlock NaN NaN 58.0 NaN NaN
4 Matthew Revord NaN NaN 50.0 NaN NaN
5 Charles Talbot NaN NaN 48.0 NaN NaN
6 Nancy Turk NaN NaN 49.0 NaN NaN
7 Anne Ewing NaN NaN 49.0 NaN NaN
6
0 NaN
1 NaN
2 Chairman, Chief Executive Officer and President
3 Senior Vice President, Chief Operations Officer
4 Senior Vice President, Chief Legal Officer, Ge...
5 Senior Vice President and Chief Financial Officer
6 Senior Vice President, Chief People Officer an...
7 Senior Vice President, New Shop Development
I wrote the following Python code to try to repair the table:
#dropping empty rows
df = df.dropna(how='all', axis=0)
#keeping only columns that have at least 2 non-null values
df = df.dropna(thresh=2, axis=1)
#resetting dataframe index
df = df.reset_index(drop=True)
#set found_name variable to stop checking once the header row is found
found_name = 0
#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():
    #only check if we have not found a header row yet
    if found_name == 0:
        #convert the row to string
        text_row = str(row)
        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row", row.Index, " as header.")
            #changing column names
            df.columns = df.iloc[row.Index]
            #dropping the rows up to and including the header row
            df = df.iloc[row.Index + 1:]
            #changing found_name to 1
            found_name = 1
#reindex
df = df.reset_index(drop=True)
print("Attempted to clean dataframe:")
print(df)
This is the table I get:
0 Name NaN NaN
0 Aylwin Lewis 59.0 Chairman, Chief Executive Officer and President
1 John Morlock 58.0 Senior Vice President, Chief Operations Officer
2 Matthew Revord 50.0 Senior Vice President, Chief Legal Officer, Ge...
3 Charles Talbot 48.0 Senior Vice President and Chief Financial Officer
4 Nancy Turk 49.0 Senior Vice President, Chief People Officer an...
5 Anne Ewing 49.0 Senior Vice President, New Shop Development
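My main problem is that the headers "Age" and "Position" have disappeared because they are not aligned with their columns. I am using this script to parse many tables, so I cannot fix them manually. How can I repair the data at this point?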
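To see why the two headers vanish, it can help to count the non-null cells in each column. The snippet below is a minimal sketch: the `raw` frame is a hand-built, shortened stand-in for the parsed table (an assumption, not the real parser output). The columns that hold only "Age" or "Position" contain a single value each, so `dropna(thresh=2, axis=1)` discards them before the header row is ever inspected.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hand-built stand-in for the parsed table, shortened to two data rows.
raw = pd.DataFrame([
    [nan, nan, nan, nan, nan, nan, nan],
    ["Name", nan, "Age", nan, nan, "Position", nan],
    ["Aylwin Lewis", nan, nan, 59.0, nan, nan,
     "Chairman, Chief Executive Officer and President"],
    ["John Morlock", nan, nan, 58.0, nan, nan,
     "Senior Vice President, Chief Operations Officer"],
])

# Number of non-null cells in each column.
non_null_counts = raw.notna().sum()
print(non_null_counts)

# Columns that survive the threshold (at least 2 non-null values).
kept_columns = list(raw.dropna(thresh=2, axis=1).columns)
print(kept_columns)
```

In this stand-in, only columns 0, 3 and 6 survive, which matches the three columns in the output above.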
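Don't drop the nearly empty columns at the start, because we need them later. Once the header row containing "Name" is found, collect all of its non-empty elements. Then, after dropping the empty columns from the remaining data, set those elements as the column headers.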
import pandas as pd

#dropping empty rows
df = df.dropna(how='all', axis=0)
#resetting dataframe index
df = df.reset_index(drop=True)
#set found_name variable to stop checking once the header row is found
found_name = 0
#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():
    #only check if we have not found a header row yet
    if found_name == 0:
        #convert the row to string
        text_row = str(row)
        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row", row.Index, " as header.")
            #collect column names (the first element is the row index, so skip it)
            headers = [c for c in row if not pd.isnull(c)][1:]
            #dropping the rows up to and including the header row
            df = df.iloc[row.Index + 1:]
            #dropping columns that are entirely empty
            df = df.dropna(how='all', axis=1)
            #setting column names, padding with 'col' or truncating to fit
            df.columns = (headers + ['col'] * (len(df.columns) - len(headers)))[:len(df.columns)]
            #changing found_name to 1
            found_name = 1
#reindex
df = df.reset_index(drop=True)
print("Attempted to clean dataframe:")
print(df)
Result:
Name Age Position
0 Aylwin Lewis 59.0 Chairman, Chief Executive Officer and President
1 John Morlock 58.0 Senior Vice President, Chief Operations Officer
2 Matthew Revord 50.0 Senior Vice President, Chief Legal Officer, Ge...
3 Charles Talbot 48.0 Senior Vice President and Chief Financial Officer
4 Nancy Turk 49.0 Senior Vice President, Chief People Officer an...
5 Anne Ewing 49.0 Senior Vice President, New Shop Development
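Since the script has to run over many tables, the same idea could be wrapped in a reusable function. The following is only a sketch under stated assumptions: `promote_header_row`, its `marker` and `filler` parameters, and the hand-built `raw` frame are hypothetical names introduced for illustration, and the frame is a shortened stand-in for a parsed table.

```python
import numpy as np
import pandas as pd


def promote_header_row(df, marker="Name", filler="col"):
    """Use the first row containing `marker` as the header (hypothetical helper)."""
    # Drop fully empty rows so row positions are contiguous.
    df = df.dropna(how="all", axis=0).reset_index(drop=True)
    for pos in range(len(df)):
        cells = df.iloc[pos].dropna()
        if any(marker in str(cell) for cell in cells):
            # Non-empty cells of the header row, in left-to-right order.
            headers = [str(cell) for cell in cells]
            # Data below the header, without the fully empty columns.
            body = df.iloc[pos + 1:].dropna(how="all", axis=1)
            # Pad with filler names or truncate so the lengths match.
            missing = len(body.columns) - len(headers)
            names = (headers + [filler] * max(missing, 0))[:len(body.columns)]
            body.columns = names
            return body.reset_index(drop=True)
    # No header row found: return the row-cleaned frame unchanged.
    return df


nan = np.nan
# Hand-built stand-in for the parsed table, shortened to two data rows.
raw = pd.DataFrame([
    [nan, nan, nan, nan, nan, nan, nan],
    ["Name", nan, "Age", nan, nan, "Position", nan],
    ["Aylwin Lewis", nan, nan, 59.0, nan, nan,
     "Chairman, Chief Executive Officer and President"],
    ["John Morlock", nan, nan, 58.0, nan, nan,
     "Senior Vice President, Chief Operations Officer"],
])

cleaned = promote_header_row(raw)
print(cleaned)
```

This relies on the headers and the data columns appearing in the same left-to-right order. If a table has a data column without a header somewhere in the middle, the names would shift, so the result is worth spot-checking on a few real tables.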