Python pandas: parse an HTML table to get hidden values and links

This is a fragment of the page I am trying to parse with pandas in Python:
<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>

</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>



<tr class='even'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
    <td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span>  <span class='gen' title='extra_info2'>stuff2</span>   <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>

<tr class='even'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
    <td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span>  <span class='gen' title='extra_info2'>stuff</span>   <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>

<tr class='odd'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
    <td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span>  <span class='gen' title='extra_info2'>stuff3</span>  </td></tr>

</table>

The table contains hidden values (in the header6 and header9 columns); their information only shows up when you hover the mouse over them.

Here is what I tried with pandas:

import pandas as pd

with open("/root/Downloads/adad.html", "r") as content_file:
    f = content_file.read()
dfs = pd.read_html(f)
dfs

What I would like to get is:

[   header1info    header2info    header3info    header4info    header5info    header6info         header7          header8                header9info
0   value1         stuff          stuff          stuff          stuff          stuff(extra_info)   stuff            link1(http://link1)    stuff(extra_info) stuff2(extra_info2) out(http://out)
                                                                                                                    link2(http://link2)
                                                                                                                    link3(http://link3)
1   value2         stuff2         stuff2         stuff2         stuff2         stuff2              stuff2           link4(http://link4)    stuff(extra_info) stuff(extra_info2) out2(http://out)
                                                                                                                    link5(http://link5)  
2   value3         stuff3        stuff3          stuff3         stuff3         stuff3              stuff3           link6(http://link6)    stuff3(extra_info) stuff3(extra_info2)]

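Is this possible with pandas? If so, how can I get the desired output?

Sorry, I am not a pandas expert, and I am not sure whether there is another way to parse this information. The only thing I could think of is splitting the lines and picking out the pieces I need, but you can imagine how fiddly that would be...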

Short answer: no.

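pd.read_html() reads only the text rendered from the HTML, not the elements and their attributes. (pandas 1.5 and later can also return link targets through the extract_links argument, but that still does not expose title attributes or hidden input values.) To get what you want, you will probably have to use an HTML parser such as bs4, find the table with class='gene', and then iterate over the <tr> and <td> elements inside it. The code looks like this: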

import pandas as pd
from bs4 import BeautifulSoup

source = r"""<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>

</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>



<tr class='even'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
    <td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span>  <span class='gen' title='extra_info2'>stuff2</span>   <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>

<tr class='even'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
    <td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span>  <span class='gen' title='extra_info2'>stuff</span>   <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>

<tr class='odd'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
    <td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span>  <span class='gen' title='extra_info2'>stuff3</span>  </td></tr>

</table>"""

# Parse the page with Python's built-in HTML parser
soup = BeautifulSoup(source, 'html.parser')

# find_all is the current name of the legacy findAll alias
table = soup.find_all("table", {"class": "gene"})
trs = table[0].find_all("tr")

# The first row holds the column headers
headers = [th.text for th in trs[0].find_all("th")]

rows = []
for tr in trs[1:]:
    tds = []
    for td in tr.find_all("td"):
        a = td.find_all("a")
        spans = td.find_all("span")
        inputs = td.find_all("input")
        ret = ""
        if a or spans or inputs:
            # Links: visible text followed by the href
            for link in a:
                ret += link.text + '(' + link['href'] + ') '
            # Spans: visible text followed by the hover title, if there is one
            for span in spans:
                if span.has_attr('title'):
                    ret += span.text + '(' + span['title'] + ') '
            # Hidden form inputs: keep only the value
            for inp in inputs:
                if inp.get('type') == "hidden" and inp.has_attr('value'):
                    ret += inp['value']
        else:
            # Plain cell: keep its text, or mark it as missing
            ret = td.text if td.text not in ('', '\n') else "NaN"
        tds.append(ret)
    rows.append(tds)

df = pd.DataFrame(rows, columns=headers)

df
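
One thing to note about the loop above: it collects all links in a cell before its spans, so in the last column out(http://www.out) ends up in front of stuff(extra_info), whereas the desired output lists them in page order. Below is a minimal sketch of a variant that visits the elements in document order instead. The helper name describe_cell and the shortened sample table are made up for illustration; they are not part of the original page or of bs4.

```python
from bs4 import BeautifulSoup


def describe_cell(td):
    """Hypothetical helper: annotate a cell's links, titled spans and hidden inputs."""
    parts = []
    # find_all with a list of tag names yields matches in document order
    for el in td.find_all(["a", "span", "input"]):
        text = el.get_text(strip=True)
        if el.name == "a" and el.has_attr("href"):
            parts.append(f"{text}({el['href']})")
        elif el.name == "span" and el.has_attr("title"):
            parts.append(f"{text}({el['title']})")
        elif el.name == "input" and el.get("type") == "hidden":
            parts.append(el.get("value", ""))
    # Plain cells have no annotated parts, so fall back to their visible text
    return " ".join(parts) if parts else td.get_text(strip=True)


# Shortened stand-in for the page in the question (an assumption, not the real page)
sample = """
<table class='gene'>
<tr><th>header1<br>info</th><th>header2</th><th>header3</th></tr>
<tr>
<td><form action='get.php'><input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
<td>stuff</td>
<td><span class='gen' title='extra_info'>stuff</span> <a href='http://www.out' title='Link to out'><span class='dbs'>out</span></a></td>
</tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table", class_="gene")
header_row, *body_rows = table.find_all("tr")

headers = [th.get_text(strip=True) for th in header_row.find_all("th")]
rows = [[describe_cell(td) for td in tr.find_all("td")] for tr in body_rows]

print(headers)
print(rows)
```

The resulting headers and rows can be passed to pd.DataFrame(rows, columns=headers) in the same way as in the code above.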