Using BeautifulSoup to pull multiple kml files

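I am learning Python and trying to automate a process that involves going to a site, wildcad net, and clicking on each dispatch center, which then loads a KML. I noticed that every page follows a similar format: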

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
    "http://www.w3.org/TR/html4/frameset.dtd">
    <head>
     <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
     <meta content="WildCAD (Brian Booher)" name="GENERATOR"/>
     <title>
      WCAZ-ADC
     </title>
    </head>
    <frameset rows="64,*">
     <frame name="banner" noresize="" scrolling="no" src="WCAZ-ADCtop.htm"/>
     <frameset cols="150,*">
      <frame name="contents" src="WCAZ-ADCleft.htm"/>
      <frame name="main" src="WCAZ-ADCright.htm"/>
     </frameset>
     <noframes>
      <body>
       <p>  
    <a href="http://www.wildcadmap.net/WildCAD_AZ-FDC.kml" target="map"><font size="1">Incident Map (Google Earth)</font></a>

       </p>
      </body>
     </noframes>
    </frameset>

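I figured I could do it all with BeautifulSoup and select. I would feed in each dispatch center, then use select to search for 'a' and 'href', since those should not change. The code I wrote is below. However, it doesn't seem to recognize the KML as its own variable. I'm not sure where I went wrong, and I'm a bit lost on how to troubleshoot the next steps. Any pointers in the right direction would help a lot!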

    from bs4 import BeautifulSoup
    import requests

    urls = ('http://www.wildcad.net/WCAZ-ADC.htm', 'http://www.wildcad.net/WCALAIC.htm',
           'http://www.wildcad.net/WCAR-AOC.htm','http://www.wildcad.net/WCAZ-ADC.htm'
           'http://www.wildcad.net/WCAZ-FDC.htm', 'http://www.wildcad.net/WCAZ-PDC.htm'
           'http://www.wildcad.net/WCAZ-PHC.htm', 'http://www.wildcad.net/WCAZ-SDC.htm'
           'http://www.wildcad.net/WCAZ-TDC.htm', 'http://www.wildcad.net/WCAZ-WDC.htm'
           'http://www.wildcad.net/WCBLMNOC.htm', 'http://www.wildcad.net/WCCA-ANF.htm'
           'http://www.wildcad.net/WCCA-ANF.htm', 'http://www.wildcad.net/WCCA-CNF.htm'
           'http://www.wildcad.net/WCCA-FICC.htm', 'http://www.wildcad.net/WCCA-GVCC.htm'
           'http://www.wildcad.net/WCCA-MICC.htm', 'http://www.wildcad.net/WCCA-ONCC.htm'
           'http://www.wildcad.net/WCCA-OVICC.htm', 'http://www.wildcad.net/WCCA-PNF.htm'
           'http://www.wildcad.net/WCCA-SNF.htm' , 'http://www.wildcad.net/WCCA-STF.htm'
           'http://www.wildcad.net/WCCA-YICC.htm', 'http://www.wildcad.net/WCCA-YNP.htm'
           'http://www.wildcad.net/WCCALPF.htm' , 'http://www.wildcad.net/WCCAMNF.htm'
           'http://www.wildcad.net/WCCANCIC.htm' , 'http://www.wildcad.net/WCCARICC.htm'
           'http://www.wildcad.net/WCCASQCC.htm', 'http://www.wildcad.net/WCCCICC.htm'
           'http://www.wildcad.net/WCCO-CRC.htm' , 'http://www.wildcad.net/WCCO-FTC.htm'
           'http://www.wildcad.net/WCCO-GJC.htm' , 'http://www.wildcad.net/WCCO-MTC.htm'
           'http://www.wildcad.net/WCCODRC.htm' , 'http://www.wildcad.net/WCCOPBC.htm'
           'http://www.wildcad.net/WCFL-FIC.htm' , 'http://www.wildcad.net/WCGAGIC.htm'
           'http://www.wildcad.net/WCID-CDC.htm' , 'http://www.wildcad.net/WCID-GVC.htm'
           'http://www.wildcad.net/WCID-SCC.htm', 'http://www.wildcad.net/WCIDBDC.htm'
           'http://www.wildcad.net/WCIDCIC.htm', 'http://www.wildcad.net/WCIDEIC.htm'
           'http://www.wildcad.net/WCIDPAC.htm' , 'http://www.wildcad.net/WCILILC.htm'
           'http://www.wildcad.net/WCIN-IIC.htm', 'http://www.wildcad.net/WCKY-KIC.htm'
           'http://www.wildcad.net/WCLALIC.htm', 'http://www.wildcad.net/WCMI-MIDC.htm'
           'http://www.wildcad.net/WCMN-MNCC.htm', 'http://www.wildcad.net/WCMOMOC.htm'
           'http://www.wildcad.net/WCMSMIC.htm', 'http://www.wildcad.net/WCMT-BRC.htm'
           'http://www.wildcad.net/WCMT-BZC.htm', 'http://www.wildcad.net/WCMT-DDC.htm'
           'http://www.wildcad.net/WCMT-GDC.htm' 'http://www.wildcad.net/WCMT-HDC.htm'
           'http://www.wildcad.net/WCMT-KDC.htm', 'http://www.wildcad.net/WCMT-KIC.htm'
           'http://www.wildcad.net/WCMT-LEC.htm' , 'http://www.wildcad.net/WCMT-MCC.htm'
           'http://www.wildcad.net/WCMT-MDC.htm', 'http://www.wildcad.net/WCNC-NCC.htm'
           'http://www.wildcad.net/WCNDNDC.htm' , 'http://www.wildcad.net/WCNH-NEC.htm'
           'http://www.wildcad.net/WCNM-ABC.htm' , 'http://www.wildcad.net/WCNM-ADC.htm'
           'http://www.wildcad.net/WCNM-SDC.htm', 'http://www.wildcad.net/?WildWeb=NM-SFC'
           'http://www.wildcad.net/WCNMTDC.htm', 'http://www.wildcad.net/WCNMTDC.htm'
           'http://www.wildcad.net/WCNVCNC.htm' , 'http://www.wildcad.net/WCNVECC.htm'
           'http://www.wildcad.net/WCNVEIC.htm' , 'http://www.wildcad.net/WCNVLIC.htm'
           'http://www.wildcad.net/WCNVSFC.htm', 'http://www.wildcad.net/WCOR-BIC.htm'
           'http://www.wildcad.net/WCOR-COC.htm', 'http://www.wildcad.net/WCOR-EIC.htm'
           'http://www.wildcad.net/WCOR-JDCC.htm', 'http://www.wildcad.net/WCOR-RICC.htm'
           'http://www.wildcad.net/WCOR-RVC.htm', 'http://www.wildcad.net/WCOR-VAC.htm'
           'http://www.wildcad.net/WCORBMC.htm', 'http://www.wildcad.net/WCORLFC.htm'
           'http://www.wildcad.net/WCPA-MACC.htm', 'http://www.wildcad.net/WCSC-SCC.htm'
           'http://www.wildcad.net/WCSC-SRF.htm', 'http://www.wildcad.net/WCSD-GPC.htm'
           'http://www.wildcad.net/WCTN-TNC.htm', 'http://www.wildcad.net/WCTXTIC.htm'
           'http://www.wildcad.net/WCUT-CDC.htm' , 'http://www.wildcad.net/WCUT-MFC.htm'
           'http://www.wildcad.net/WCUT-NUC.htm' , 'http://www.wildcad.net/WCUT-RFC.htm'
           'http://www.wildcad.net/WCUT-UBC.htm' , 'http://www.wildcad.net/WCVAVIC.htm'
           'http://www.wildcad.net/WCWA-CWC.htm', 'http://www.wildcad.net/WCWY-CDC.htm'
           'http://www.wildcad.net/WCWY-CPC.htm',
    result = requests.get(urls)
    doc = BeautifulSoup(result.text, 'html.parser')
    print(doc.prettify())
    for i in enumerate(soup.findAll('a')):
        _KML = urls + link.get('href')
        if _KML.endswith('.kml'):
            urls.append(_KML)

    open(_KML)

If I understand the question correctly, here is a working example:

doc='''
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
    "http://www.w3.org/TR/html4/frameset.dtd">
    <head>
     <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
     <meta content="WildCAD (Brian Booher)" name="GENERATOR"/>
     <title>
      WCAZ-ADC
     </title>
    </head>
    <frameset rows="64,*">
     <frame name="banner" noresize="" scrolling="no" src="WCAZ-ADCtop.htm"/>
     <frameset cols="150,*">
      <frame name="contents" src="WCAZ-ADCleft.htm"/>
      <frame name="main" src="WCAZ-ADCright.htm"/>
     </frameset>
     <noframes>
      <body>
       <p>  
    <a href="http://www.wildcadmap.net/WildCAD_AZ-FDC.kml" target="map"><font size="1">Incident Map (Google Earth)</font></a>
    <a href="http://www.wildcadmap.net/WildCAD_AZ-FDC.htm" target="map"><font size="1">Incident Map (Google Earth)</font></a>
       </p>
      </body>
     </noframes>
    </frameset>

'''
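from bs4 import BeautifulSoup

# Parse the sample page; html.parser ships with Python, so no extra parser is needed
soup = BeautifulSoup(doc, 'html.parser')
#print(soup.prettify())
# Look at every <a> tag and keep only the links that point to a .kml file
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.endswith('.kml'):
        kml = href
        print(kml)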

Output:

http://www.wildcadmap.net/WildCAD_AZ-FDC.kml
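
To run the same filter over many dispatch-center pages, something like the sketch below might work. Three things in the question's code would stop it before any parsing happens: requests.get expects a single URL string rather than a tuple, a tuple has no append method, and adjacent string literals with no comma between them are joined by Python into one long string. The helper names extract_kml_links and collect_kml_links, the timeout value, and the case-insensitive match on .kml are my own choices for illustration, not part of any library. I am also assuming the live pages still contain the link the way the excerpt in the question shows.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_kml_links(html, base_url):
    """Return absolute URLs for every <a href> in html that ends with .kml."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips <a> tags that have no href attribute at all
    for anchor in soup.find_all('a', href=True):
        href = anchor['href'].strip()
        if href.lower().endswith('.kml'):
            # urljoin leaves absolute hrefs unchanged and resolves relative ones
            links.append(urljoin(base_url, href))
    return links


def collect_kml_links(page_urls, timeout=10):
    """Fetch each page in turn and gather its .kml links (hypothetical helper)."""
    import requests  # imported here so extract_kml_links works without it

    found = []
    for url in page_urls:  # one request per URL, never the whole collection
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException as exc:
            # Report the failure and move on to the next dispatch center
            print(f'skipping {url}: {exc}')
            continue
        found.extend(extract_kml_links(response.text, url))
    return found
```

Calling collect_kml_links with a list such as ['http://www.wildcad.net/WCAZ-ADC.htm', 'http://www.wildcad.net/WCAZ-FDC.htm'], with a comma after every entry, should return the KML URLs found on those pages. Keeping the parsing in its own function means it can be checked against a saved HTML string, as in the example above, without touching the network.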