Python pandas: parse an HTML table to get hidden values and links
This is a snippet of the page I am trying to parse with pandas in Python:
<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>
</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
<td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff2</span> <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
<td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff</span> <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>
<tr class='odd'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
<td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span> <span class='gen' title='extra_info2'>stuff3</span> </td></tr>
</table>
The table has hidden values (in the header 6 and header 9 columns); the extra information is visible when you hover over those cells.
When I try pandas, this is what I have:
import pandas as pd
# Read the saved page and let pandas extract every table it finds.
with open("/root/Downloads/adad.html", "r") as content_file:
    f = content_file.read()
dfs = pd.read_html(f)
dfs
What I would like to get is:
[ header1info header2info header3info header4info header5info header6info header7 header8 header9info
0 value1 stuff stuff stuff stuff stuff(extra_info) stuff link1(http://link1) stuff(extra_info) stuff2(extra_info2) out(http://out)
link2(http://link2)
link3(http://link3)
1 value2 stuff2 stuff2 stuff2 stuff2 stuff2 stuff2 link4(http://link4) stuff(extra_info) stuff(extra_info2) out2(http://out)
link5(http://link5)
2 value3 stuff3 stuff3 stuff3 stuff3 stuff3 stuff3 link6(http://link6) stuff3(extra_info) stuff3(extra_info2)]
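Is this possible with pandas? If so, how can I get the expected output?
Sorry, I am not an expert in pandas. I am not sure whether there are other ways to parse this information. The only thing I could think of is splitting the lines and pulling out the pieces I need, but you can imagine how finicky that would be...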
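Short answer: no.
pd.read_html() only reads the text that the HTML renders, not the elements together with their attributes. To get what you want, you will probably have to use an HTML parser such as bs4, find the table with class='gene', and then iterate over the <tr> and <td> elements inside it. The code looks like this: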
import pandas as pd
from bs4 import BeautifulSoup
source = r"""<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>
</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
<td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff2</span> <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
<td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff</span> <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>
<tr class='odd'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
<td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span> <span class='gen' title='extra_info2'>stuff3</span> </td></tr>
</table>"""
soup = BeautifulSoup(source, 'html.parser')
# Locate the target table and collect its rows.
table = soup.find_all("table", {"class": "gene"})
trs = table[0].find_all("tr")
# The first row holds the column headers.
headers = []
for th in trs[0].find_all("th"):
    headers.append(th.text)
rows = []
for i in range(1, len(trs)):
    tds = []
    for td in trs[i].find_all("td"):
        a = td.find_all("a")
        spans = td.find_all("span")
        inputs = td.find_all("input")
        ret = ""
        if len(a) != 0 or len(spans) != 0 or len(inputs) != 0:
            # Links: visible text followed by the href in parentheses.
            if len(a) != 0:
                for link in a:
                    ret += link.text + '('+link['href']+') '
            # Spans: visible text followed by the hover title, when present.
            if len(spans) != 0:
                for span in spans:
                    if span.has_attr('title'):
                        ret += span.text + '('+span['title']+') '
            # Inputs: keep only the value of hidden fields.
            if len(inputs) != 0:
                for inp in inputs:
                    if inp.has_attr('value'):
                        if inp.has_attr('type'):
                            if inp['type'] == "hidden":
                                ret += inp['value']
        else:
            # Plain cell: keep its text, or mark it as missing when empty.
            ret = td.text if td.text != '' and td.text != '\n' else "NaN"
        tds.append(ret)
    rows.append(tds)
df = pd.DataFrame(rows, columns = headers)
df
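If installing bs4 is not an option, the same idea (walk the cells and keep the attributes that read_html() drops) can be sketched with only the standard library's html.parser. This is a rough illustration, not a drop-in replacement: the class name AnnotatedTableParser and the shortened, well-formed sample table are made up for this example, and unlike bs4 this sketch does not repair the unclosed <a> tags found in the real page.

```python
from html.parser import HTMLParser


class AnnotatedTableParser(HTMLParser):
    """Hypothetical helper: collects table rows as lists of cell strings,
    appending href/title values in parentheses and keeping hidden input values."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self._row = None     # cells of the row currently being read
        self._cell = None    # text pieces of the cell currently being read
        self._pending = []   # href/title of the <a>/<span> elements still open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell, self._pending = [], []
        elif self._cell is None:
            return  # ignore anything outside a cell
        elif tag == "a":
            self._pending.append(attrs.get("href"))
        elif tag == "span":
            self._pending.append(attrs.get("title"))
        elif tag == "input" and attrs.get("type") == "hidden":
            self._cell.append(attrs.get("value") or "")

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a cell.
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("a", "span") and self._cell is not None and self._pending:
            note = self._pending.pop()
            if note and self._cell:
                # Attach the annotation to the text that was just read.
                self._cell[-1] += "(" + note + ")"
        elif tag in ("td", "th") and self._row is not None and self._cell is not None:
            self._row.append(" ".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened, well-formed stand-in for the table in the question (an assumption).
sample = """
<table class='gene'>
<tr><th>header1</th><th>header6</th><th>header8</th></tr>
<tr><td><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
<td><span class='dm' title='some extra info'>stuff</span></td>
<td><a href='http://www.link1'>link1</a><br><a href='http://www.link2'>link2</a></td></tr>
</table>
"""

parser = AnnotatedTableParser()
parser.feed(sample)
header, *body = parser.rows
print(header)
print(body)
```

The resulting header and body lists can then be handed to pd.DataFrame(body, columns=header), just like rows and headers in the bs4 version above.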