Issues extracting data from an HTML table and printing to CSV using Python
I am trying to scrape a list of stock symbols from the IB website, but I am having trouble extracting the table information from the HTML.
If I use:
import requests
from bs4 import BeautifulSoup
website_url = requests.get('https://www.interactivebrokers.com/en/index.php?f=2222&exch=mexi&showcategories=STK#productbuffer').text
soup = BeautifulSoup(website_url, 'lxml')
My_table = soup.find('div', {'class': 'table-responsive no-margin'})
print(My_table)
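This captures the HTML, but when I try to use the result with the code below, it does not work. As a workaround, I copied the HTML table data by hand and parsed that instead.
I have the following code: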
import pandas as pd
from bs4 import BeautifulSoup
html_string = """
<div class="table-responsive no-margin">
<table width="100%" cellpadding="0" cellspacing="0" border="0"
class="table table-striped table-bordered">
<thead>
<tr>
<th width="15%" align="left" valign="middle"
class="table_subheader">IB Symbol</th>
<th width="55%" align="left" valign="middle" class="table_subheader">Product Description
<span class="text-small">(click link for more details)</span></th>
<th width="15%" align="left" valign="middle" class="table_subheader">Symbol</th>
<th width="15%" align="left" valign="middle" class="table_subheader">Currency</th>
</tr>
</thead>
<tbody>
<tr>
<td>0JN9N</td>
<td><a href="javascript:NewWindow('https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=GEN&conid=189723078','Details','600','600','custom','front');" class="linkexternal">DSV AS</a></td>
<td>0JN9N</td>
<td>MXN</td>
</tr>
<tr>
<td>0QBON</td>
<td><a href="javascript:NewWindow('https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=GEN&conid=189723075','Details','600','600','custom','front');" class="linkexternal">COLOPLAST-B</a></td>
<td>0QBON</td>
<td>MXN</td>
</tr>
<tr>
<td>0R87N</td>
<td><a href="javascript:NewWindow('https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=GEN&conid=195567802','Details','600','600','custom','front');" class="linkexternal">ASSA ABLOY AB-B</a></td>
<td>0R87N</td>
<td>MXN</td>
</tr>
</tbody>
</table>"""
soup = BeautifulSoup(html_string, 'lxml')  # Parse the HTML as a string
table = soup.find_all('table')[0]  # Grab the first table
new_table = pd.DataFrame(columns=range(0, 4), index=[0])  # I know the size
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1
print(new_table)
But it only shows the last row.
If I remove the last part and add the following:
soup = BeautifulSoup(html, 'lxml')
table = soup.find("div")
# The first tr contains the field names.
headings = [th.get_text().strip() for th in
            table.find("tr").find_all("th")]
print(headings)
datasets = []
for row in table.find_all("tr")[1:]:
    df = pd.DataFrame(headings, (td.get_text() for td in
                                 row.find_all("td")))
    datasets.append(df)
print(datasets)
df.to_csv('Path_to_file\test1.csv')
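It sees the rest of the items, but the format is completely wrong, and the CSV contains only the last item of the list.
How can I extract the HTML table details directly from the website and write them to a CSV in the format of the first image?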
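You can remove row_marker = 0 and let enumerate supply the row index instead. In the original loop row_marker is never incremented, so every row overwrites row 0: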
for row_marker, row in enumerate(table.find_all('tr')):
    columns = row.find_all('td')
    try:
        new_table.loc[row_marker] = [column.get_text() for column in columns]
    except ValueError:
        # Rows without <td> cells (such as the header row) give an empty list,
        # which cannot be assigned to a four-column row, so skip them.
        continue
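As an alternative to filling a pre-sized DataFrame cell by cell, the rows can be collected into a plain list first and turned into a DataFrame in one step, which also makes it straightforward to write every row to the CSV at once. The following is only a minimal sketch under some assumptions: SAMPLE_HTML is a shortened stand-in for the table in the question, table_to_dataframe is a hypothetical helper name, the built-in 'html.parser' is used in place of 'lxml' to avoid an extra dependency, and an in-memory buffer stands in for the real output path.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (hypothetical sample input).
SAMPLE_HTML = """
<div class="table-responsive no-margin">
<table>
<thead>
<tr><th>IB Symbol</th><th>Product Description
<span class="text-small">(click link for more details)</span></th><th>Symbol</th><th>Currency</th></tr>
</thead>
<tbody>
<tr><td>0JN9N</td><td><a href="#">DSV AS</a></td><td>0JN9N</td><td>MXN</td></tr>
<tr><td>0QBON</td><td><a href="#">COLOPLAST-B</a></td><td>0QBON</td><td>MXN</td></tr>
<tr><td>0R87N</td><td><a href="#">ASSA ABLOY AB-B</a></td><td>0R87N</td><td>MXN</td></tr>
</tbody>
</table>
</div>
"""


def table_to_dataframe(html):
    """Return the first <table> in `html` as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Collapse internal whitespace so multi-line header cells stay on one line.
    headings = [" ".join(th.get_text().split()) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=headings)


df = table_to_dataframe(SAMPLE_HTML)
buffer = io.StringIO()          # stands in for a file path such as "test1.csv"
df.to_csv(buffer, index=False)  # a single call writes the header and every row
csv_text = buffer.getvalue()
print(csv_text)
```

To apply this to the live page, the text returned by requests.get(...).text could be passed to table_to_dataframe in place of SAMPLE_HTML, assuming the page serves the table in its static HTML, and a real file path could replace the buffer in to_csv.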