Python not progressing a list of links

Question

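Since I need more detailed data, I have to dig deeper into the HTML code of a website. I wrote a script that returns a list of specific links to detail pages, but I can't get Python to go through every link in this list; it always stops at the first one. What am I doing wrong?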

from BeautifulSoup import BeautifulSoup
import re
import urllib2
from lxml import html
import requests

#Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")

#Inform BeautifulSoup
soup = BeautifulSoup(html_page)

#Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    #print found links
    print link.get('href')
    #complete links
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    #print complete links
    print complete_links
#
#EVERYTHING WORKS FINE TO THIS POINT
#

page = requests.get(complete_links)
tree = html.fromstring(page.text)

#Details
name = tree.xpath('//dl[@class="services"]')

for i in name:
    print i.text_content()

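Also: can you recommend any tutorials where I can learn how to write my output to a file and clean it up, give variables proper names, and so on?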

Answer 1

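I think you want a list of links in complete_links rather than a single link. As @Pynchia and @lemonhead said, you are overwriting complete_links on every iteration of the first for loop, so by the time the loop ends it holds only one URL.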
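The overwriting can be seen in isolation with a minimal sketch. It assumes Python 3, and the hrefs and the names `hrefs`, `last_only`, and `all_links` are made up for illustration; they do not come from the question.

```python
# Hypothetical hrefs standing in for the values returned by link.get('href').
hrefs = ['/d/a.aspx', '/d/b.aspx', '/d/c.aspx']
base = 'http://www.sitetoscrape.ch'

# Rebinding one name on every pass keeps only the value from the final pass.
for href in hrefs:
    last_only = base + href

# Appending to a list keeps the value from every pass.
all_links = []
for href in hrefs:
    all_links.append(base + href)

print(last_only)
print(len(all_links))
```

Any code placed after the first loop sees a single string, which is why only one detail page is ever requested.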

You need to make two changes:

Append each link to a list, then use that list to loop over and scrape each link

# [...] Same code here

link_list = []  # must match the name used in append() and in the second loop
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    print link.get('href')
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    print complete_links
    link_list.append(complete_links)  # append new link to the list

Scrape each accumulated link in another loop

for link in link_list:
    page = requests.get(link)
    tree = html.fromstring(page.text)

    #Details
    name = tree.xpath('//dl[@class="services"]')

    for i in name:
        print i.text_content()
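The collect-then-loop pattern above can also be tried offline. The following is a rough sketch, assuming Python 3 and using only the standard library: `html.parser` stands in for BeautifulSoup, and the class `LinkCollector`, the function `collect_links`, and the sample HTML are hypothetical names and data invented for this illustration. No network request is made.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags whose href matches a pattern."""

    def __init__(self, base_url, pattern):
        super().__init__()
        self.base_url = base_url
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.pattern.search(href):
            # urljoin resolves the relative href against the base URL.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html_text, base_url, pattern):
    """Return every matching link in html_text as a list of absolute URLs."""
    parser = LinkCollector(base_url, pattern)
    parser.feed(html_text)
    return parser.links


# Made-up page content; the real script would download it instead.
sample = '''
<a href="/d/part/of/thelink/ineed.aspx?id=1">one</a>
<a href="/other/page.aspx">skip</a>
<a href="/d/part/of/thelink/ineed.aspx?id=2">two</a>
'''

link_list = collect_links(sample, 'http://www.sitetoscrape.ch',
                          r'/d/part/of/thelink/ineed\.aspx')

# Second loop: visit every accumulated link, not just one.
for link in link_list:
    print(link)
```

In the real script, the body of the final loop would hold the `requests.get(link)` and `tree.xpath(...)` calls from the answer instead of `print(link)`.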

PS: I recommend the Scrapy framework for tasks like this.

html

python

screen-scraping

data-cleaning