Shaping pandas read_html results into a simpler structure
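I am hoping someone can suggest how to create a pandas DataFrame that contains only the text from the second column, without rows 1 and 2 or the left-hand column. The solution needs to work across multiple similar tables.
I thought that pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4'),
which builds a list of DataFrames from the HTML (skipping 2 rows), would be the way to go, but the resulting data structure is too confusing for this newbie to understand or reshape into something simpler.
Can anyone work with the resulting structure, or recommend another way to refine the data, so that I end up with a single column containing only the text I need?
Sample table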
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th colspan="2">
Learning Outcomes
</th>
</tr>
<tr>
<td class="info" colspan="2">
On successful completion of this module the learner will be able to:
</td>
</tr>
<tr>
<td style="width:10%;">
LO1
</td>
<td>
Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
</td>
</tr>
<tr>
<td style="width:10%;">
LO2
</td>
<td>
Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
</td>
</tr>
<tr>
<td style="width:10%;">
LO3
</td>
<td>
Understand the various formats in which information in relation to transactions or events is recorded and classified.
</td>
</tr>
<tr>
<td style="width:10%;">
LO4
</td>
<td>
Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the posting of recorded information to the T accounts in the Nominal Ledger.
</td>
</tr>
<tr>
<td style="width:10%;">
LO5
</td>
<td>
Prepare and present the financial statements of a Sole Trader in prescribed format from a Trial Balance accompanies by notes with additional information.
</td>
</tr>
</table>
Option 1
Use iloc
pd.read_html returns a list of DataFrames, so pick the table you want first (here [0]) and then let iloc drop the first column:
pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1:]
Explanation
...iloc[:, 1:]
#       ^  ^
#       |   \
#  take all   take the columns from
#  rows       position 1 onward
You can also select just the single column:
pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1]
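To make the difference between the two iloc selections concrete, here is a minimal sketch on a toy frame. The data and the column name "outcome" are made up for illustration and are not taken from the parsed table.

```python
import pandas as pd

# Toy frame standing in for one parsed table (hypothetical data).
df = pd.DataFrame({0: ["LO1", "LO2"], 1: ["First outcome", "Second outcome"]})

# A slice (1:) keeps the result two-dimensional: a DataFrame with one column.
as_frame = df.iloc[:, 1:]

# A plain integer (1) collapses the result to one dimension: a Series.
as_series = df.iloc[:, 1]

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# If a one-column DataFrame with a readable header is wanted,
# the Series can be converted back and named in one step.
outcomes = as_series.to_frame(name="outcome")
print(outcomes)
```

Which form is better probably depends on what happens next: a Series is convenient for string methods, while a one-column DataFrame is easier to concatenate with other tables.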
Working code that I ran:
import pandas as pd
htm = """<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th colspan="2">
Learning Outcomes
</th>
</tr>
<tr>
<td class="info" colspan="2">
On successful completion of this module the learner will be able to:
</td>
</tr>
<tr>
<td style="width:10%;">
LO1
</td>
<td>
Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
</td>
</tr>
<tr>
<td style="width:10%;">
LO2
</td>
<td>
Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
</td>
</tr>
<tr>
<td style="width:10%;">
LO3
</td>
<td>
Understand the various formats in which information in relation to transactions or events is recorded and classified.
</td>
</tr>
<tr>
<td style="width:10%;">
LO4
</td>
<td>
Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the posting of recorded information to the T accounts in the Nominal Ledger.
</td>
</tr>
<tr>
<td style="width:10%;">
LO5
</td>
<td>
Prepare and present the financial statements of a Sole Trader in prescribed format from a Trial Balance accompanies by notes with additional information.
</td>
</tr>
</table> """
pd.read_html(htm,skiprows=2, flavor='bs4')[0].iloc[:, 1:]
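Since the question asks for something that copes with several similar tables, here is one possible sketch rather than a definitive recipe. It assumes every row of interest has a first cell that looks like "LO" followed by digits, and it filters on that label instead of relying on skiprows, because header handling in read_html can differ between pandas versions. The HTML fragment, the helper name outcome_column and the column name "outcome" are all hypothetical. The flavor argument is left out so pandas uses whichever parser is installed, and the string is wrapped in StringIO because recent pandas releases warn when literal HTML is passed directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment with two tables shaped like the sample above.
html = """
<table>
<tr><th colspan="2">Learning Outcomes</th></tr>
<tr><td colspan="2">On successful completion the learner will be able to:</td></tr>
<tr><td>LO1</td><td>First outcome of module A.</td></tr>
<tr><td>LO2</td><td>Second outcome of module A.</td></tr>
</table>
<table>
<tr><th colspan="2">Learning Outcomes</th></tr>
<tr><td colspan="2">On successful completion the learner will be able to:</td></tr>
<tr><td>LO1</td><td>First outcome of module B.</td></tr>
</table>
"""


def outcome_column(table: pd.DataFrame) -> pd.Series:
    """Return the second cell of every row whose first cell looks like 'LO<digits>'."""
    labels = table.iloc[:, 0].astype(str).str.strip()
    is_outcome = labels.str.match(r"LO\d+$")  # assumption: labels follow this pattern
    return table.loc[is_outcome].iloc[:, 1]


# read_html returns one DataFrame per <table> found in the input.
tables = pd.read_html(StringIO(html))

# Stack the text column of every table into a single one-column DataFrame.
outcomes = pd.concat(
    [outcome_column(t) for t in tables], ignore_index=True
).to_frame(name="outcome")

print(outcomes)
```

If the labels in the real pages do not follow the "LO<digits>" pattern, the regular expression would need adjusting, or the rows could be dropped by position as in the iloc approach above.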