Text classification from HTML with BeautifulSoup

I have obtained the HTML source of a page and parsed it with BeautifulSoup, using 'html5lib' as the parser.

I have something like this:

<div class="V0h1Ob-haAclf OPZbO-KE6vqe o0s21d-HiaYvf" jsaction="mouseover:pane.wfvdle40;mouseout:pane.wfvdle40" jsan="7.V0h1Ob-haAclf,7.OPZbO-KE6vqe,7.o0s21d-HiaYvf,0.jsaction" jstcache="824">
    <a aria-label="Muzeum Londynu" class="a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd" href="https://www.google.com/maps/place/Muzeum+Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&amp;hl=pl&amp;rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825"></a>
    <div class="CJY91c-jRmmHf-aVTXAb-haAclf-WFkMr" jstcache="826"></div>
    <div aria-label="Muzeum Londynu" class="MVVflb-haAclf V0h1Ob-haAclf-d6wfac MVVflb-haAclf-uxVfW-hSRGPd" jsan="7.MVVflb-haAclf,7.V0h1Ob-haAclf-d6wfac,7.MVVflb-haAclf-uxVfW-hSRGPd,0.aria-label" jstcache="827">
        <div class="CJY91c-jRmmHf-aVTXAb-haAclf-bIWrp" jstcache="828"></div>
        <div class="lI9IFe">
            <div class="CJY91c-jRmmHf-aVTXAb-haAclf-HSrbLb" jstcache="829">
                <div class="RnEfrd-jRmmHf-HSrbLb B9Hcub-QFlW2" jsan="t-pdDsP4P8DQQ,7.RnEfrd-jRmmHf-HSrbLb,7.B9Hcub-QFlW2" jstcache="933">
                    <button jstcache="842" style="display:none"></button>
                    <div class="Z8fK3b" jsan="7.Z8fK3b,t-MjeqqY5XOdM" jstcache="843"> 
                        <div class="CUwbzc-content gm2-body-2"> <div class="qBF1Pd-haAclf">
                            <div class="qBF1Pd gm2-subtitle-alt-1" jsan="7.qBF1Pd,7.gm2-subtitle-alt-1,t-u3p6PfXaXm4" jstcache="845">
                                <span jstcache="858">Muzeum Londynu</span> 
                            </div>
                            <h1 jstcache="846" style="display:none"></h1> 
                            <span class="RnEfrd-jRmmHf-HSrbLb-title-Btuy5e-haAclf"></span> 
                        </div> 
                        <div class="section-subtitle-extension" jstcache="847"></div> 
                        <div class="ZY2y6b-RWgCYc" jsan="7.ZY2y6b-RWgCYc,t-hEqDOx2FFV0" jstcache="848"> 
                        <div class="OEvfgc-wcwwM-haAclf"> 
                            <span class="RnEfrd-jRmmHf-HSrbLb-wPzPJb-Btuy5e-haAclf" jstcache="860"></span>
                            <span class="gm2-body-2" jsan="t-CJ3Gw1VPbAA,7.gm2-body-2" jstcache="861">
                            <span jstcache="868" style="display:none"></span>
                            <span aria-label=" 4,6-gwiazdkowy  Opinie (13 898)  " class="ZkP5Je" jsan="7.ZkP5Je,0.aria-label,0.role,t-kqtGnPs-9G0" jstcache="869" role="group">
                            <span aria-hidden="true" class="MW4etd" jsan="7.MW4etd,0.aria-hidden" jstcache="872">4,6</span>
                            <div jstcache="873" style="display:none"></div>
                            <div class="QBUL8c" jsan="7.QBUL8c" jsinstance="0" jstcache="874"></div>
                            <div class="QBUL8c" jsan="7.QBUL8c" jsinstance="1" jstcache="874"></div>
                            <div class="QBUL8c" jsan="7.QBUL8c" jsinstance="2" jstcache="874"></div>
                            <div class="QBUL8c" jsan="7.QBUL8c" jsinstance="3" jstcache="874"></div>
                            <div class="QBUL8c cXOKEb-S62Q7b" jsan="7.QBUL8c,7.cXOKEb-S62Q7b" jsinstance="*4" jstcache="874"></div> 
                            <span aria-hidden="true" class="UY7F9" jsan="7.UY7F9,0.aria-hidden" jstcache="875">(13 898)</span>
                        </span>
                     </span> 
                     <span jstcache="862" style="display:none">
                         <jsl jstcache="863" style="display:none"></jsl> 
                     </span> 
                 </div> 
             </div> 
             <div class="ZY2y6b-RWgCYc"> 
                 <span jstcache="849" style="display:none"></span> 
                 <div class="ZY2y6b-RWgCYc" jsinstance="0" jstcache="850"> 
                     <span jsinstance="0" jstcache="851">
                          <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> 
                          <span jstcache="885">Muzeum</span> <span jstcache="886" style="display:none"></span> </jsl> </span><span jsinstance="*1" jstcache="851"> <jsl jstcache="852"> <span aria-hidden="true" class="bXlT7b-hgDUwe" jsan="7.bXlT7b-hgDUwe,0.aria-hidden" jstcache="884">·</span> <span jstcache="885">150 London Wall</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div><div class="ZY2y6b-RWgCYc" jsinstance="1" jstcache="850"> <span jsinstance="*0" jstcache="851"> <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> <span jstcache="885">Historia Londynu od starożytności do dziś</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div><div class="ZY2y6b-RWgCYc" jsinstance="*2" jstcache="850"> <span jsinstance="*0" jstcache="851"> <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> <span jstcache="885">Zamknięcie: 17:00</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div> </div> </div> </div></div></div><div class="CJY91c-jRmmHf-aVTXAb-haAclf-JIbuQc" jstcache="830"></div><div class="CJY91c-jRmmHf-aVTXAb-haAclf-HiaYvf" jstcache="831"><div class="xwpmRb qisNDe" jsan="t-PLs0ILPSy_c,7.xwpmRb,7.qisNDe,5.width,5.height,5.margin-top,5.margin-bottom,5.margin-left,5.margin-right" jstcache="932" style="width: 84px; height: 84px; margin: 0px;"><div class="Vig8jf-haAclf p0Hhde" jsan="7.p0Hhde,7.Vig8jf-haAclf,5.min-width,5.min-height" jstcache="836" style="min-width:84px;min-height:84px"><img aria-hidden="true" decoding="async" src="//lh5.googleusercontent.com/proxy/tWfK1sqsGJZNlZu3WTUika5NJAu4mqKhx07Kub2ZjC_yU3PdIv3DWCKe8_cwJ3RBAUHjW5qZp3S6vGLQJ7HnYxCL_4YR4X1T3ju-ISh86JeC5Kb0KGnvp8j8Jt0vvk6Es_gdVz1AyfBfMDSN6DImwkgbwPL0RQ=w138-h92-k-no" style="position: absolute; top: 50%;left: 50%;width: 126px;height: 84px;-webkit-transform: translateY(-50%) translateX(-50%);transform: translateY(-50%) translateX(-50%);"/></div><button jstcache="837" 
style="display:none"></button><div class="badge-container"></div></div></div><div class="CJY91c-jRmmHf-aVTXAb-haAclf-hxbzzc" jstcache="832"></div></div><div class="CJY91c-jRmmHf-aVTXAb-haAclf-IoWfhc" jstcache="833"></div></div></div>

The last step was running the method .find_all('a', href=True), which gave me something like this:

[<a aria-label="Muzeum Londynu" class="a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd" href="https://www.google.com/maps/place/Muzeum+Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&amp;hl=pl&amp;rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825"></a>]

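I am trying to extract specifically the latitude and longitude, [51.5176183, -0.0967782], which are present in the href.

I tried using .href in the same way as .text, but .href returns None. Could you tell me how to extract these two values from the href?

Running .text on the HTML returns output like this: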

Museum of London         4,6(13 898)           · Museum     · 150 London Wall       · The history of London from antiquity to today       · Closing: 17:00      
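
For context on the None result: in BeautifulSoup, dotted access such as tag.href is a shortcut for finding a child tag named href, not for reading an attribute, so it returns None here. A minimal sketch of the difference follows; the snippet and its example.com URL are a trimmed-down, hypothetical stand-in for the real anchor, and 'html.parser' is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the <a> element in the page source
snippet = '<a aria-label="Muzeum Londynu" href="https://example.com/place?hl=pl&amp;rclk=1"></a>'
tag = BeautifulSoup(snippet, "html.parser").find("a", href=True)

via_dot = tag.href         # looks for a child <href> tag, so this is None
via_get = tag.get("href")  # attribute value, or None if the attribute is missing
via_key = tag["href"]      # attribute value, raises KeyError if it is missing

print(via_dot)
print(via_get)
```

Either .get('href') or tag['href'] should work on the elements returned by .find_all('a', href=True), since that filter only keeps anchors that have the attribute.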

Based on your question, I used the split() method to get the desired output.

Script

html='''
<html>
 <head>
 </head>
 <body>
  <a aria-label="Muzeum Londynu" class="a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd" href="https://www.google.com/maps/place/Muzeum+Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&amp;hl=pl&amp;rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825">
  </a>
 </body>
</html>
'''
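from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')
# print(soup.prettify())

# Read the attribute value with .get('href')
href = soup.find("a", class_="a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd").get('href')
# Last path segment, minus the query string; its final two '!'-separated
# fields are '3d<latitude>' and '4d<longitude>'
lat_part, lon_part = href.split('/')[-1].split('?')[0].split('!')[-2:]
lat_lon = [lat_part[2:], lon_part[2:]]  # drop the '3d' / '4d' prefixes
print(lat_lon)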

Output

['51.5176183', '-0.0967782']
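
If the position of the fields in the URL is not guaranteed, a regular expression may be a little more robust than chained splits. The sketch below assumes the coordinates always appear as "!3d<latitude>!4d<longitude>", which is only inferred from this one href; extract_lat_lon is a hypothetical helper name, not part of any library.

```python
import re

# Assumption: coordinates are encoded as "!3d<latitude>!4d<longitude>"
COORD_RE = re.compile(r"!3d(-?\d+(?:\.\d+)?)!4d(-?\d+(?:\.\d+)?)")

def extract_lat_lon(href):
    """Return (latitude, longitude) as floats, or None if the pattern is absent."""
    match = COORD_RE.search(href)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

href = ("https://www.google.com/maps/place/Muzeum+Londynu/data=!4m5!3m4!"
        "1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782"
        "?authuser=0&hl=pl&rclk=1")
print(extract_lat_lon(href))  # (51.5176183, -0.0967782)
```

Returning None instead of raising makes it easy to skip anchors from .find_all('a', href=True) whose href carries no coordinates.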