Collect text from web pages of a given site

There is a website I visit often to read its "best advice". Here is how I can easily extract the text I want...

import urllib.request
from bs4 import BeautifulSoup

mylist = []

myurl = 'http://www.apartmenttherapy.com/carols-east-side-cottage-house-tour-194787'
s = urllib.request.urlopen(myurl)
soup = BeautifulSoup(s, 'html.parser')

# find the "Best Advice: " label; the advice is the node that follows it
hello = soup.find(string='Best Advice: ')
mylist.append(hello.next_element)

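But how do I collect these text snippets from all the pages?


I can use this simple Google query to search all the pages...

site:http://www.apartmenttherapy.com

Is there a Google search API for Python? I am looking for a simple, one-off solution to this problem, so I would rather not install a lot of packages for this task.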

You have to use JavaScript-enabled scraping, as explained here: http://koaning.io/dynamic-scraping-with-python.html

You can start by reading the BeautifulSoup manual, and you can also learn to use the web developer tools to inspect network traffic.

Once you have done that, you may see that the list of house tours can be fetched with a GET request: http://www.apartmenttherapy.com/search?page=1&q=House+Tour&type=all

Presumably, we can then iterate from page 1 to X to get all the house index pages.
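The page-by-page iteration can be sketched as a small generator that builds one index URL per page. This is only a sketch: `index_urls` and `last_page` are hypothetical names, and the query parameters are simply the ones visible in the search URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.apartmenttherapy.com/search"


def index_urls(last_page, query="House Tour"):
    """Yield the search URL of every index page from 1 to last_page."""
    for page in range(1, last_page + 1):
        # urlencode turns the space in the query into "+"
        params = {"page": page, "q": query, "type": "all"}
        yield BASE_URL + "?" + urlencode(params)


for url in index_urls(3):
    print(url)
```

How large `last_page` has to be is not known in advance; one option is to keep requesting pages until one comes back without any links.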

Each index page gives you exactly 15 URLs to add to the list.
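Pulling the links out of one index page could look like the sketch below. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The `SimpleTeaser` class name is taken from this answer's code, while `TeaserLinkParser`, `teaser_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TeaserLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list contains SimpleTeaser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # the class attribute may hold several space-separated names
        classes = (attrs.get("class") or "").split()
        if "SimpleTeaser" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def teaser_links(html):
    """Return the teaser links found in one index page, in document order."""
    parser = TeaserLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# made-up index page fragment, for illustration only
sample = """
<a class="SimpleTeaser" href="http://example.com/house-tour-1">One</a>
<a class="nav" href="http://example.com/about">About</a>
<a class="SimpleTeaser big" href="http://example.com/house-tour-2">Two</a>
"""
print(teaser_links(sample))
```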

Once you have the complete list of URLs, you can scrape each URL to get the text of each "best advice".
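The extraction step for a single page might be sketched as follows, again with the standard library only. It assumes the advice is the first non-empty text after a `Best Advice:` label, which is how the question's code treats the page. `AdviceParser`, `best_advice` and the sample fragments are hypothetical.

```python
from html.parser import HTMLParser


class AdviceParser(HTMLParser):
    """Remember the first non-empty text that follows the 'Best Advice:' label."""

    LABEL = "Best Advice:"

    def __init__(self):
        super().__init__()
        self.seen_label = False
        self.advice = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.advice is not None:
            return
        if self.seen_label:
            self.advice = text
        elif text == self.LABEL:
            self.seen_label = True


def best_advice(html):
    """Return the advice text, or None when the page has no such label."""
    parser = AdviceParser()
    parser.feed(html)
    parser.close()
    return parser.advice


# made-up page fragments, for illustration only
page = "<p><strong>Best Advice: </strong>Keep it simple.\n\t</p>"
print(best_advice(page))               # Keep it simple.
print(best_advice("<p>No label</p>"))  # None
```

Returning `None` for pages without the label makes it easy to skip them explicitly instead of relying on an exception.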

See the code below:

import time
import requests
import random
from bs4 import BeautifulSoup

# build the list of all URLs to scrape
url_list = []
max_index = 2

for page_index in range(1, max_index):

    # get the index page
    html = requests.get("http://www.apartmenttherapy.com/search?page=" + str(page_index) + "&q=House+Tour&type=all").content

    # iterate over the teasers
    for teaser in BeautifulSoup(html, 'html.parser').find_all('a', {'class': 'SimpleTeaser'}):

        # add the link to the URL list
        url_list.append(teaser['href'])

    # sleep a little to respect server-side load
    time.sleep(random.random() / 2.)

    # stop after the first index page because this is just an example
    break  # comment out this break in production


# show the list of URLs
print(url_list)


# iterate over the URLs to get the advice
mylist = []
for url in url_list:

    # get the house tour page
    html = requests.get(url).content

    # find the "Best Advice: " label
    hello = BeautifulSoup(html, 'html.parser').find(string='Best Advice: ')

    # print which URL is being processed (the advice itself is printed at the end)
    print("advice for", url, "\n=>", end=" ")

    # try to add the text that follows the label to mylist;
    # hello is None when the page has no such label
    try:
        mylist.append(hello.next_element)
    except AttributeError:
        pass

    # sleep a little to respect server-side load
    time.sleep(random.random() / 2.)

# show the list of advice
print(mylist)

The output (from a Python 2 run, hence the u prefixes on the strings) is:

['http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229', 'http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725', 'http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896', 'http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962', 'http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440', 'http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846', 'http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080', 'http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294', 'http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667', 'http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203', 'http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878', 'http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791', 'http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295', 'http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518', 'http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764']
advice for http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229 
=> advice for http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725 
=> advice for http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896 
=> advice for http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962 
=> advice for http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440 
=> advice for http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846 
=> advice for http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080 
=> advice for http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294 
=> advice for http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667 
=> advice for http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203 
=> advice for http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878 
=> advice for http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791 
=> advice for http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295 
=> advice for http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518 
=> advice for http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764 
=> [u"If you make a bad design choice or purchase, don't be afraid to change it. Try and try again until you love it.\n\t", u" Sisal rugs. They clean up easily and they're very understated. Start with very light colors and add colors later.\n", u"Bring in what you love, add dimension and texture to your walls. Decorate as an individual and not to please your neighbor or the masses. Trends are fun but I love elements of timeless interiors. Include things from any/every decade as well as mixing styles. I'm convinced it's the hardest way to decorate without looking like you are living in a flea market stall. Scale, color, texture, and contrast are what I focus on. For me it takes some toying around, and I always consider how one item affects the next. Consider space and let things stand out by limiting what surrounds them.", u'You don\u2019t need to invest in \u201cdecor\u201d and nothing needs to match. Just decorate with the special things (books, cards, trinkets, jars, etc.) that you\u2019ve collected over the years, and be organized. I honestly think half the battle of having good home design is keeping a neat house. The other half is just displaying stuff that is special to you. Stuff that has a story and/or reminds you of people, ideas, and places that you love. One more piece of advice - the best place to buy picture frames is Goodwill. Pick a frame in decent condition, and just paint it to complement your palette. One last piece of advice\u2014 decor need not be pricey. I ALWAYS shop consignment and thrift, and then I repaint and customize as I see fit.\n', u'From my sister \u2014 to use the second bedroom as my room, as it is dark and quiet, both of which I need in order to sleep.\n', u'Collect things that you love in your travels throughout life. 
I tend to purchase ceramics when travelling, sometimes a collection of bowls\u2026 not so easy transporting in the suitcase, but no breakages yet!\n\t', u'Keep things authentic to the character of your home and to the character of your family. Then, you can never go wrong!\n\t', u'Contemporary architecture does not require contemporary furnishings.\n']
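The collected strings still carry the pages' leading and trailing whitespace (`\n`, `\t`), as the output shows. A minimal clean-up sketch, where `clean_advice` is a hypothetical helper name and the sample list is shortened from the output above:

```python
def clean_advice(items):
    """Collapse runs of whitespace and drop entries that end up empty."""
    cleaned = []
    for item in items:
        # split() with no arguments also removes tabs and newlines
        text = " ".join(str(item).split())
        if text:
            cleaned.append(text)
    return cleaned


raw = [
    "Contemporary architecture does not require contemporary furnishings.\n",
    " Sisal rugs. They clean up easily and they're very understated.\n",
    "\n\t",
]
print(clean_advice(raw))
```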