Why does my web crawler not follow the next link containing keywords?
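I wrote a simple web crawler that will eventually follow only news links and scrape the article text into a database. I am having trouble actually following the links from the source URL. This is the code so far: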
    import urlparse
    import mechanize

    url = "https://news.google.co.uk"

    def spider(root, steps):
        urls = [root]
        visited = [root]
        counter = 0
        while counter < steps:
            step_url = scrape(urls)
            urls = []
            for u in step_url:
                if u not in visited:
                    urls.append(u)
                    visited.append(u)
            counter += 1
        return visited

    def scrape(root):
        result_urls = []
        br = Browser()
        br.set_handle_robots(False)
        br.addheaders = [('User-agent', 'Chrome')]
        for url in root:
            try:
                br.open(url)
                keyWords = ['news', 'article', 'business', 'world']
                for link in br.links():
                    newurl = urlparse.urljoin(link.base_url, link.url)
                    result_urls.append(newurl)
                    [newslinks for newslinks in result_urls if newslinks in keyWords]
                    print newslinks
            except:
                print "scrape error"
        return result_urls

    print spider(url, 2)
Edit: NLTK

    for text in (parse_links_text(get_links(url), d)):
        tokenized = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokenized)
        namedEnt = nltk.ne_chunk(tagged, binary=True)
        entities = re.findall(r'NE\s(.*?)/', str(namedEnt))
        descriptives = re.findall(r'\(\'(\w*)\',\s\'JJ\w?\'', str(tagged))

and then add the results to the database.
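Mechanize is not the best tool for what you want. The code below gets all the links and uses BeautifulSoup to pull the main text from each linked page. We use a dict to map each site name to the correct CSS selector, and a regex to extract the key from the link so that the matching selector is applied: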
    import re

    import requests
    from bs4 import BeautifulSoup

    url = "https://news.google.co.uk"

    def get_links(start):
        cont = requests.get(start).content
        soup = BeautifulSoup(cont, "lxml")
        keys = ['news', 'article', 'business', 'world']
        # links are all in the a tag inside the esc-layout-table table
        # where the a tag class is article
        return [a["url"] for a in soup.select(".esc-layout-table a.article")
                if any(k in a["url"] for k in keys)]

    def parse_links_text(links, css_d):
        # use a regex to find out what page the link points to
        # so we can pull the appropriate selector from the dict
        r = re.compile(r"telegraph\.|bbc\.|dailymail\.|independent\.")
        for link in links:
            print(link)
            cont = requests.get(link).content
            soup = BeautifulSoup(cont, "lxml")
            css = r.search(link).group()
            p = [p.text for p in soup.select(css_d[css])]
            yield p

    # map each page to its correct css selector to pull the main text
    d = {"dailymail.": "p.mol-para-with-font", "telegraph.": "#mainBodyArea",
         "bbc.": "div.story-body p", "independent.": "div.text-wrapper p"}

    for text in parse_links_text(get_links(url), d):
        print(text)
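This gives you the text from the telegraph, dailymail, bbc and independent links. There is no silver bullet, no single tag that gets all the data you want; you will have to add more selectors for other pages and adjust them when the HTML changes.
A snippet of the output: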
http://www.telegraph.co.uk/news/politics/12199759/The-IDS-explosion-could-do-untold-damage-to-David-Camerons-reputation.html
[u' In a sense, David Cameron owes his job to Iain Duncan Smith. Without the abject failure of Mr Duncan Smith\u2019s leadership between 2001 and 2003, the Conservatives might not have reached the collective conclusion that a traditional Tory focus on issues such as Europe would not win an election and realise that, to use a Cameron phrase, they had to change to win. ', u' Mr Cameron\u2019s leadership is easily understood as a political reaction to Mr Duncan Smith\u2019s, but the two have more in common than is easily visible. Intellectually, there is a continuity between the two leaderships that is not often realised. Even as he was failing dismally as leader, Mr Duncan Smith was saying things about the party that Mr Cameron would endorse today. ', u'\n', u"Huge respect for IDS. Welfare reform must b done the right way. The electorate will not trust us again if we don't look after the vulnerable", u' So in his awful 2002 \u201cquiet man\u201d speech to the Conservative conference in Bournemouth, we find IDS outlining a vision of \u201ccompassionate conservatism\u201d, declaring: \u201cWe believe that the privileges of the few must be turned into the opportunities of the many.\u201d We also hear him telling the Tory faithful (and you had to be devoted to be at that miserable gathering) to acknowledge that many voters felt bitterly angry about the party\u2019s last spell in government: \u201cAll of us here want to remember the good things we did and there were many. But beyond this hall, people too often remember the hurt we caused and the anger they felt,\u201d he said. ', u' That is a decent exposition of what the Cameron team would, four years later, describe as the Tory \u201cbrand problem\u201d: the perception among some voters that the party governed for the privileged few at the direct expense of the less fortunate many. 
Changing that perception has been the most consistent objective in Mr Cameron\u2019s politics, a near-constant in a career whose successes owe much to his willingness to shift strategy and tactics according to circumstance. ', u' But \u201cdetoxifying the Tory brand\u201d is not, whatever his critics may say, simply a marketing exercise for a PM who used to work in PR. In another similarity with Mr Duncan Smith, Mr Cameron is a believer. People close to David Cameron know that what really drives and excites him is not reforming the EU (whatever he says in public, the topic bores him) or balancing the budget. Those things may dominate his Government\u2019s agenda, but friends say what raises his political passion is social reform \u2013 ensuring that people born without his privileges can share a little of the riches he has known all his life. ', u'\n', u' The origins of this feeling are hard to pinpoint with certainty, but those who have known him longest credit both his wife Samantha and their tragically short-lived first child, Ivan, with opening the eyes of a previously conventionally upper-class Conservative to the reality of life for those who suffer misfortune. ', u' So when he was, to everyone\u2019s surprise including his own, re-elected with a majority last year, the first thing Mr Cameron said was that he wished to pursue a One Nation agenda, to govern for rich and poor alike, and to make it easier for the latter to become the former. That agenda might have been recently eclipsed by Europe, and often reduced to an empty slogan, but that is where the Prime Minister\u2019s heart truly lies. For evidence, consider the series of speeches Mr Cameron gave in the early weeks of this year, focusing on social mobility, racism, and equal opportunities. ', u' I was among those who thought the speeches mostly good and impressive, though many others, including a fair few Conservatives, disagreed and took a more cynical view. 
But both admirers and critics alike would, I think, concede that Mr Cameron was genuine in his talk of social reform. And this is the agenda that Mr Duncan Smith is threatening with his softly spoken, hard-hitting words on The Andrew Marr Show \u2013 which were, arguably, more inflammatory than his incendiary resignation letter. ', u'\n', u'Goodbye, Iain Duncan Smith. Hello, Stephen Crabb. pic.twitter.com/fs5gscKCh3', u' Mr Duncan Smith says that Mr Cameron is not, in fact, seeking to make Britain one nation. He says the policies overseen by the Prime Minister \u2013 and let\u2019s remember that the Prime Minister, no matter how mighty he lets his Chancellor of the Exchequer become, is ultimately responsible for policy \u2013 are in the interests of the better-off and harmful to those without means or opportunity. More grave yet, he suggests his leader is indifferent to causing suffering among the poor and weak: \u201cIt just looks like we see this as a pot of money, that it doesn\u2019t matter because they don\u2019t vote for us.\u201d ', u' Coming from the man who spent six years running welfare policy, that is a potentially devastating assessment in political terms. Mr Duncan Smith makes a case for the prosecution of Mr Cameron\u2019s administration that Jeremy Corbyn could not fault. ', u'\n', u' But it is also intensely personal. Mr Duncan Smith is challenging the Prime Minister on the turf that Mr Cameron is most committed to claiming for his own. Can you really hope to go down in history as a great social-reforming premier when, in the assessment of your own welfare secretary, you have chosen to help the rich and fortunate by harming the poor and vulnerable? In this context, it is no surprise that Mr Cameron has reacted to Mr Duncan Smith\u2019s departure with true rage. (A hot temper and tendency to profanity are also things he shares with IDS, as I and several others can attest.) 
', u' Amid recent events, much attention is rightly being paid to the severe damage the IDS explosion has done George Osborne\u2019s already damaged hopes of the leadership. But for Mr Cameron, this is about something else, something even more important than ambition. It is about purpose. ', u' There are already many reasons for the Prime Minister to want to win his EU referendum and run his government for a few more years. But he now has another. If Mr Cameron cannot make good on his fine words about One Nation and social mobility and equality of opportunity, and thus disprove the charges Mr Duncan Smith levels against him, then his life in politics has all been for nothing. ', u'\n\nIDS career\n']
http://www.bbc.co.uk/news/uk-politics-35855616
[u'Iain Duncan Smith has warned that the government risks dividing society, in his first interview since resigning as work and pensions secretary.', u'He attacked the "desperate search for savings" focused on benefit payments to people who "don\'t vote for us".', u'And he told the BBC\'s Andrew Marr his "painful" decision was "not personal" against Chancellor George Osborne.', u'Downing Street said it was sorry to see Iain Duncan Smith go but was determined to help "everyone in our society".', u'BBC political correspondent Alan Soady said Mr Duncan Smith\'s interview - which followed his resignation over cuts to disability benefits on Friday - was an "absolutely blistering attack".', u'He added: "This was not just about his objections to one change in disability benefit, he was questioning the fundamental principles underpinning the government."', u'Mr Duncan Smith told the BBC he had supported a consultation on the changes to Personal Independence Payments but had come under "massive pressure" to deliver the savings ahead of last week\'s Budget.', u'The way the cuts were presented in the Budget had been "deeply unfair", he said, because they were "juxtaposed" with tax cuts for the wealthy.', u'He criticised the "arbitrary" decision to lower the welfare cap after the general election and suggested the government was in danger of losing "the balance of the generations", expressing his "deep concern" at a "very narrow attack on working-age benefits" while also protecting pensioner benefits.', u'If the focus on the working-age benefit budget continued, he said, "it just looks like we see this as a pot of money, that it doesn\'t matter because they don\'t vote for us".', u'Mr Duncan Smith, who said he felt he had become "semi-detached" from government, said the Conservatives had to return to being a party "that cares about even those who do not vote for us".', u'He said he cared "passionately" about "people who don\'t get the choices my children get" and "bringing 
people back in to an arena where we play daily but they do not".', u'He suggested the government was in "danger of drifting in a direction that divides society rather than unites it, and that, I think, is unfair".', u'In his interview, Mr Duncan Smith gave his version of a deteriorating relationship with the government, saying he had considered resigning last year and had "long-running" concerns about cuts imposed since May\'s general election.', u'He said the disability benefit cuts should have been part of a "much wider programme" - but after Christmas "pressure began to grow" to rush a consultation so they could feature in Wednesday\'s Budget.', u'Asked why he had not spoken out when the measures were presented to cabinet, he said he "sat silently" as he "realised the full state of what was happening" with tax cuts featuring elsewhere in the Budget.', u'After thinking "long and hard", he said he agreed to write to MPs to reassure them over the disability cuts, saying "it\'s not what it sounds like in the Budget".', u'But he said he realised in the following two days "there was no way I would able to stop this process" and resigned on Friday evening.', u'Alan Soady, BBC political correspondent', u'What pushes a cabinet minister to resign so sensationally?', u"Its origins lie partly in the rapid shift of the economic gloom-o-meter. Forecasts in December's Autumn Statement were upbeat, predicting more money rolling into the Treasury.", u'By Wednesday\'s Budget, the sunshine had turned into "storm clouds". 
They blew over Iain Duncan-Smith\'s department because welfare changes of recent years have so far brought in nothing like the savings originally projected.', u'IDS signed off on tightening the rules around Personal Independence Payments five days before the Budget, but now says he would rather have been allowed to wait so he could see who were the winners and losers.', u"As the row gathered momentum after the Budget, Education Secretary Nicky Morgan suggested the plans weren't set in stone.", u"Mr Duncan Smith's people disagreed, firmly believing the proposals were final. The following day, Downing Street suggested a U-turn was on the cards.", u"For IDS, it was the final straw, believing he was going to carry the can for a policy he claims he'd been bounced into prematurely. Others question his account - asking why he signed off the proposal in the first place if he was so against it.", u'Mr Duncan Smith spoke of his "love" for the Conservative Party and described claims he was trying to undermine David Cameron as "nonsense", saying he had had a "robust" conversation with the PM after telling him of his resignation.', u'Asked whether Mr Osborne would make a good prime minister, he added: "If he was to stand and if he was elected by the electorate, which is not just me it is everybody else, I would hope that he would."', u'A Number 10 spokesman said: "We are sorry to see Iain Duncan Smith go, but we are a \'one nation\' government determined to continue helping everyone in our society have more security and opportunity, including the most disadvantaged.', u'"That means we will deliver our manifesto commitments to make the welfare system fairer, cut taxes and ensure we have a stable economy by controlling welfare spending and living within our means."', u'He said more people were in work under this government with fewer "trapped" on unemployment benefits.', u'Former Lib Dem minister David Laws told Andrew Marr divisions between Mr Osborne and Mr Duncan Smith over 
welfare had been a "running sore throughout the last parliament".', u'He said: "George Osborne, I think it\'s fair to say, did regard the welfare budget as something of a cash cow to be squeezed in order to help to deliver deficit reduction. Iain Duncan Smith had a different view."', u"Mr Duncan Smith's resignation has divided his former ministerial team at the DWP.", u'Pensions minister Baroness Ros Altmann attacked his tenure, describing him as "exceptionally difficult" to work for, and accused him of using his resignation "to do maximum damage to the party leadership" in order to support the campaign to leave the EU.', u'But her fellow DWP minister Shailesh Vara said he was "surprised" at Baroness Altmann\'s comments, saying: "Ros\'s recollection does not accord with mine and I\'m sorry that this has all happened."', u'Disabilities minister Justin Tomlinson said the former secretary of state had "always conducted himself in a professional, dedicated and determined manner", while employment minister Priti Patel told BBC Radio 5 live it had been a "privilege" to work for him.', u'Owen Smith, Labour\'s welfare spokesman, said Mr Duncan Smith had been "very honest in explaining how George Osborne could have taken different choices" and had revealed "the fundamental unfairness at the heart of government policy".']
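The keyword test in get_links above differs from the one in the question in one important way: `newslinks in keyWords` asks whether the whole URL is equal to one of the keywords, which is never true, and the result of that list comprehension is thrown away. A minimal sketch of the substring test, written for Python 3 (unlike the Python 2 code in the question); the helper name `filter_links` and the example.com URLs are made up for illustration:

```python
def filter_links(urls, keywords):
    """Keep the URLs that contain at least one keyword as a substring."""
    return [u for u in urls if any(k in u for k in keywords)]


keywords = ['news', 'article', 'business', 'world']
sample = [
    "http://example.com/news/uk-politics",   # contains "news"
    "http://example.com/weather/today",      # contains no keyword
    "http://example.com/business/markets",   # contains "business"
]

# Membership test against the list: a full URL is never one of the keywords.
assert not any(u in keywords for u in sample)

matching = filter_links(sample, keywords)
print(matching)
```

Only the first and third URLs survive the filter, which is the behaviour the question was after.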
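You could of course just use p = [p.text for p in soup.select("p")] to select all the text in paragraph tags, but that will include a lot of data you don't want. If you are only interested in certain pages, you can also filter on whether the link matches a key in the css_d dict, using something like:

    for link in links:
        cont = requests.get(link).content
        soup = BeautifulSoup(cont, "lxml")
        css = r.search(link)
        # skip links to sites we have no selector for
        if not css:
            continue
        css = css.group()
        # look the selector up in the dict; the key itself is not a selector
        yield [p.text for p in soup.select(css_d[css])]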
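The lookup step can be tried on its own, without any network access. The following is a small sketch in Python 3, not the code from the answer: the helper `selector_for` is a hypothetical name, and building the regex from the dict keys is a design choice added here so that the pattern and the dict cannot drift apart. The selectors are copied from the dict shown earlier.

```python
import re

# Site keys and selectors copied from the dict shown earlier.
css_d = {
    "dailymail.": "p.mol-para-with-font",
    "telegraph.": "#mainBodyArea",
    "bbc.": "div.story-body p",
    "independent.": "div.text-wrapper p",
}

# Build the pattern from the dict keys; re.escape makes the dots literal.
site_re = re.compile("|".join(re.escape(key) for key in css_d))


def selector_for(link):
    """Return the CSS selector for a link, or None for an unknown site."""
    match = site_re.search(link)
    return css_d[match.group()] if match else None


print(selector_for("http://www.bbc.co.uk/news/uk-politics-35855616"))
print(selector_for("http://www.theguardian.com/uk-news/2016/mar/21/some-story"))
```

The first call finds the `bbc.` key and the second returns None. Note that the keys are plain substrings, so a link to `www.independent.ie` would also match `independent.`; anchor the keys more tightly if that matters.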
As discussed in the comments, lxml is a good tool if you want more flexibility. To get the sections we can use the following code:
    from urlparse import urljoin

    import requests
    from lxml.etree import fromstring, HTMLParser

    url = "https://news.google.co.uk"

    def get_sections(start, sections):
        '''Pulls the links for each section we pass in, i.e. World, Business etc...'''
        cont = requests.get(start).content
        xml = fromstring(cont, HTMLParser())
        # section headings are span tags with the class section-name
        secs = xml.xpath("//span[@class='section-name']")
        for sec in secs:
            _sec = sec.text.rsplit(None, 1)[0].lower().rstrip(".")
            if _sec in sections:
                yield _sec, urljoin(url, sec.xpath(".//parent::a/@href")[0])

    def get_section_links(sec_url):
        '''Get all links from individual sections.'''
        cont = requests.get(sec_url).content
        xml = fromstring(cont, HTMLParser())
        seen = set()
        for url in xml.xpath("//div[@class='section-stream-content']//a/@url"):
            if url not in seen:
                yield url
                seen.add(url)

    # set of sections we want
    s = {'business', 'world', "sports", "u.k"}

    for sec, link in get_sections(url, s):
        for sec_link in get_section_links(link):
            print(sec, sec_link)
If we run the above code we get all the links from each section. Below is a very small snippet for each section; a fair number of links is actually returned:
(u'world', 'http://www.theguardian.com/commentisfree/2016/mar/21/new-york-millionaires-who-want-taxes-raised')
(u'world', 'http://www.abc.net.au/news/2016-03-22/berg-turnbull%27s-only-real-option-was-bluff-and-bravado/7264350')
(u'world', 'http://www.swissinfo.ch/eng/reuters/australian-pm-takes-bold-gamble--sets-in-motion-july-2-poll/42037074')
(u'world', 'https://www.washingtonpost.com/news/checkpoint/wp/2016/03/21/these-are-the-new-u-s-military-bases-near-the-south-china-sea-china-isnt-impressed/')
(u'world', 'http://www.reuters.com/article/southchinasea-china-usa-idUSL3N16T3BH')
(u'world', 'http://atimes.com/2016/03/philippine-election-question-marks-sow-panic-in-south-china-sea/')
(u'world', 'http://www.manilatimes.net/what-if-china-attacks-bases-used-by-america/251946/')
(u'world', 'http://www.arabnews.com/world/news/898816')
(u'world', 'http://macaudailytimes.com.mo/koreas-seoul-north-korea-fires-five-short-range-projectiles.html')
(u'world', 'http://gulftoday.ae/portal/cb0e2530-0769-411d-9622-2e991191656b.aspx')
(u'world', 'http://38north.org/2016/03/aabrahamian032116/')
(u'u.k', 'http://www.irishnews.com/news/2016/03/22/news/judge-tells-madonna-and-richie-to-settle-rocco-dispute-458929/')
(u'u.k', 'http://www.marilynstowe.co.uk/2016/03/21/judge-urges-amicable-resolution-in-madonna-dispute-over-son/')
(u'u.k', 'http://www.mercurynews.com/celebrities/ci_29666212/judge-tells-madonna-and-guy-ritchie-get-it')
(u'u.k', 'http://www.telegraph.co.uk/news/celebritynews/madonna/12199922/Madonnas-UK-court-fight-with-Guy-Ritchie-over-son-Rocco-can-end-judge-rules.html')
(u'u.k', 'http://www.pbo.co.uk/news/boaty-mcboatface-leading-public-vote-to-name-200m-polar-research-ship-28429')
(u'u.k', 'http://www.theguardian.com/environment/shortcuts/2016/mar/21/from-bell-end-boaty-mcboatface-trouble-letting-public-name-things')
(u'u.k', 'http://www.independent.co.uk/news/uk/boaty-mcboatface-debacle-shows-the-perils-of-crowdsourcing-opinion-from-hooty-mcowlface-to-mr-a6944801.html')
(u'u.k', 'http://www.sacbee.com/news/nation-world/world/article67322252.html')
(u'u.k', 'http://www.westerndailypress.co.uk/Jury-discharged-manslaughter-case-Thomas-Orchard/story-28964162-detail/story.html')
(u'u.k', 'http://www.exeterexpressandecho.co.uk/Breaking-Thomas-Orchard-manslaughter-trial-jury/story-28963859-detail/story.html')
(u'u.k', 'http://www.theguardian.com/uk-news/2016/mar/21/thomas-orchard-trial-jury-discharged-judge-halts-proceedings')
(u'u.k', 'http://www.ft.com/cms/s/0/0bf3e966-ef57-11e5-9f20-c3a047354386.html')
(u'u.k', 'http://www.theweek.co.uk/london-mayor-election-2016/62681/london-mayor-election-2016-whos-in-the-running-as-starting-gun')
(u'business', 'https://uk.finance.yahoo.com/news/companies-may-soon-stop-reporting-162707837.html')
(u'business', 'http://www.theweek.co.uk/70785/why-youre-about-to-stop-getting-quarterly-reports-on-your-investments')
(u'business', 'http://uk.reuters.com/article/uk-starwood-hotels-m-a-marriott-idUKKCN0WN142')
(u'business', 'http://www.reuters.com/article/us-global-oil-idUSKCN0WN00I')
(u'business', 'http://www.digitallook.com/news/commodities/commodities-oil-futures-recoup-previous-sessions-losses--1087119.html')
(u'business', 'http://news.sky.com/story/1664056/new-top-dog-at-pets-at-home-as-ceo-retires')
(u'business', 'http://money.aol.co.uk/2016/03/21/sky-tv-price-hike-shock/')
(u'business', 'http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=11609694')
(u'business', 'http://www.dailymail.co.uk/sciencetech/article-3502838/The-Flying-Bum-ready-lift-World-s-largest-aircraft-Airlander-10-fitted-fins-engines-ahead-flight.html')
(u'business', 'http://www.business-standard.com/article/pti-stories/world-s-longest-aircraft-revealed-in-new-pictures-116032000569_1.html')
(u'sports', 'http://www.telegraph.co.uk/football/2016/03/21/gary-neville-consulted-roy-hodgson-on-england-delay/')
(u'sports', 'http://www.dailymail.co.uk/sport/football/article-3502767/Gary-Neville-leaving-Valencia-join-England-gritted-teeth-feels-like-La-Liga-club-giving-fans-chant-manager-now.html')
(u'sports', 'http://www.irishexaminer.com/sport/soccer/gary-neville-in-firing-line-as-valencia-lose-again-388634.html')
(u'sports', 'http://timesofindia.indiatimes.com/sports/tennis/top-stories/Male-tennis-players-should-earn-more-than-females-Djokovic/articleshow/51499959.cms')
(u'sports', 'http://www.sport24.co.za/soccer/livescoring?mid=23948674&st=football')
(u'sports', 'http://www.dispatch.com/content/stories/sports/2016/03/21/0321-serena-williams-rips-indian-wells-ceo.html')
(u'sports', 'http://www.bbc.co.uk/sport/football/35864765')
(u'sports', 'http://indianexpress.com/article/sports/football/joachim-loew-throws-max-kruse-out-of-germany-squad/')
(u'sports', 'http://www.si.com/planet-futbol/2016/03/21/max-kruse-germany-kicked-jogi-low')
(u'sports', 'http://www.dw.com/en/coach-joachim-l%C3%B6w-drops-max-kruse-from-german-national-team/a-19132035')
(u'sports', 'http://www.bbc.co.uk/sport/football/35865092')
(u'sports', 'http://news.sky.com/story/1664218')
(u'sports', 'http://www.theguardian.com/business/2016/mar/21/sports-direct-founder-mike-ashley-snubs-call-mps-parliamentary-select-committee')
(u'sports', 'http://www.mirror.co.uk/news/business/sports-direct-boss-mike-ashley-7604067')
(u'sports', 'http://www.independent.ie/sport/soccer/mike-ashley-says-he-is-wedded-to-newcastle-even-if-they-go-down-34558617.html')
(u'sports', 'http://www.heraldscotland.com/sport/14373924.Michael_Carrick_praises_performance_after_United_win_Manchester_derby/')
(u'sports', 'http://www.dorsetecho.co.uk/sport/national/14373773.Michael_Carrick_hails_vital_Manchester_derby_victory/')
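The section keys in the output above (`world`, `u.k`, `business`, `sports`) come from the terse `_sec` expression in get_sections. A short sketch of what that expression does, in Python 3. The heading texts below are assumptions for illustration (a label followed by a trailing marker token); they were not captured from the page, and `normalise_section` is a hypothetical helper name.

```python
def normalise_section(label):
    """Mirror the expression used in get_sections: drop the last
    whitespace-separated token, lowercase, strip a trailing dot."""
    return label.rsplit(None, 1)[0].lower().rstrip(".")


# Hypothetical heading texts, each ending in a separate marker token.
for label in ["World \u00bb", "U.K. \u00bb", "Business \u00bb"]:
    print(repr(label), "->", normalise_section(label))

# Caveat: a multi-word label with no trailing token loses its last word.
print(normalise_section("Top Stories"))
```

Under that assumption `U.K.` followed by a marker normalises to `u.k`, which is why the set of wanted sections contains `"u.k"` rather than `"u.k."`.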
If we just return a set from get_section_links, we can pass it to a function that parses the text:

    def get_section_links(sec_url):
        cont = requests.get(sec_url).content
        xml = fromstring(cont, HTMLParser())
        return set(xml.xpath("//div[@class='section-stream-content']//a/@url"))
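The two versions of get_section_links return the same links, but not in the same way: the generator with a `seen` set keeps the order in which links first appear on the page, while a plain set has no guaranteed order. A minimal Python 3 sketch of the difference, using a hypothetical helper `unique_in_order` and made-up URLs:

```python
def unique_in_order(urls):
    """Yield each URL once, in first-seen order, as the generator version does."""
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


links = ["http://a.example/1", "http://b.example/2", "http://a.example/1"]

# Order of first appearance is kept and the duplicate is dropped.
print(list(unique_in_order(links)))

# Both approaches contain the same members.
print(set(links) == set(unique_in_order(links)))
```

If page order matters later, for example to store articles in the order Google News ranks them, prefer the generator; if only membership matters, the set is shorter.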
So, using lxml to parse with xpaths, for the few sites we have already parsed we can add some more logic to catch variations in the markup:
    # map each page to its correct xpath to pull the main text
    d = {"dailymail.": "//div[@itemprop='articleBody']//p",
         "telegraph.": "//div[@id='mainBodyArea']//p",
         "bbc.": "//div[@class='story-body']//p",
         "independent.": "//div[@class='text-wrapper']//p",
         "www.mirror.": "//*[@class='live-now-entry' or @class='lead-entry' or @itemprop='articleBody']//p"}

    import logging
    import re

    logger = logging.getLogger(__file__)
    logging.basicConfig()
    logger.setLevel(logging.DEBUG)

    def parse_links_text(links, xpath_d):
        # use a regex to find out what page the link points to
        # so we can pull the appropriate xpath from the dict
        r = re.compile(r"telegraph\.|bbc\.|dailymail\.|independent\.|www\.mirror\.")
        for link in links:
            try:
                cont = requests.get(link).content
            except requests.exceptions.RequestException as e:
                # log the failure and move on to the next link
                logger.error(e)
                continue
            xml = fromstring(cont, HTMLParser())
            xpath = r.search(link)
            if xpath:
                p = "".join(filter(None, ("".join(p.xpath("normalize-space(.//text())"))
                                          for p in xml.xpath(xpath_d[xpath.group()]))))
                if p:
                    yield p
            else:
                logger.debug("No match for {}".format(link))
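Again, you will have to decide which sites you want to visit and find the correct xpaths to extract the main article text, but this should get you well on your way. When I have more time I will add some logic to run the requests asynchronously.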
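Until then, one possible way to overlap the downloads is a thread pool from the standard library. This is a sketch in Python 3 under stated assumptions, not the asynchronous logic the answer promises: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download such as `lambda link: requests.get(link).content` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(links, fetch, max_workers=8):
    """Call fetch(link) for every link in a thread pool.

    Yields (link, content) pairs as they complete. Links whose fetch
    raises are skipped, much like the try/except in parse_links_text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                yield link, future.result()
            except Exception:
                continue


def fake_fetch(link):
    """Stand-in downloader: fails for links containing 'bad'."""
    if "bad" in link:
        raise IOError("simulated network failure")
    return "<html>{}</html>".format(link)


sample_links = ["http://a.example", "http://bad.example", "http://b.example"]
results = dict(fetch_all(sample_links, fake_fetch))
print(sorted(results))
```

Results arrive in completion order rather than input order, so the sketch collects them into a dict keyed by link. The parsing step can then run over `results.items()` exactly as parse_links_text does over its links.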
people back in to an arena where we play daily but they do not".', u'He suggested the government was in "danger of drifting in a direction that divides society rather than unites it, and that, I think, is unfair".', u'In his interview, Mr Duncan Smith gave his version of a deteriorating relationship with the government, saying he had considered resigning last year and had "long-running" concerns about cuts imposed since May\'s general election.', u'He said the disability benefit cuts should have been part of a "much wider programme" - but after Christmas "pressure began to grow" to rush a consultation so they could feature in Wednesday\'s Budget.', u'Asked why he had not spoken out when the measures were presented to cabinet, he said he "sat silently" as he "realised the full state of what was happening" with tax cuts featuring elsewhere in the Budget.', u'After thinking "long and hard", he said he agreed to write to MPs to reassure them over the disability cuts, saying "it\'s not what it sounds like in the Budget".', u'But he said he realised in the following two days "there was no way I would able to stop this process" and resigned on Friday evening.', u'Alan Soady, BBC political correspondent', u'What pushes a cabinet minister to resign so sensationally?', u"Its origins lie partly in the rapid shift of the economic gloom-o-meter. Forecasts in December's Autumn Statement were upbeat, predicting more money rolling into the Treasury.", u'By Wednesday\'s Budget, the sunshine had turned into "storm clouds". 
They blew over Iain Duncan-Smith\'s department because welfare changes of recent years have so far brought in nothing like the savings originally projected.', u'IDS signed off on tightening the rules around Personal Independence Payments five days before the Budget, but now says he would rather have been allowed to wait so he could see who were the winners and losers.', u"As the row gathered momentum after the Budget, Education Secretary Nicky Morgan suggested the plans weren't set in stone.", u"Mr Duncan Smith's people disagreed, firmly believing the proposals were final. The following day, Downing Street suggested a U-turn was on the cards.", u"For IDS, it was the final straw, believing he was going to carry the can for a policy he claims he'd been bounced into prematurely. Others question his account - asking why he signed off the proposal in the first place if he was so against it.", u'Mr Duncan Smith spoke of his "love" for the Conservative Party and described claims he was trying to undermine David Cameron as "nonsense", saying he had had a "robust" conversation with the PM after telling him of his resignation.', u'Asked whether Mr Osborne would make a good prime minister, he added: "If he was to stand and if he was elected by the electorate, which is not just me it is everybody else, I would hope that he would."', u'A Number 10 spokesman said: "We are sorry to see Iain Duncan Smith go, but we are a \'one nation\' government determined to continue helping everyone in our society have more security and opportunity, including the most disadvantaged.', u'"That means we will deliver our manifesto commitments to make the welfare system fairer, cut taxes and ensure we have a stable economy by controlling welfare spending and living within our means."', u'He said more people were in work under this government with fewer "trapped" on unemployment benefits.', u'Former Lib Dem minister David Laws told Andrew Marr divisions between Mr Osborne and Mr Duncan Smith over 
welfare had been a "running sore throughout the last parliament".', u'He said: "George Osborne, I think it\'s fair to say, did regard the welfare budget as something of a cash cow to be squeezed in order to help to deliver deficit reduction. Iain Duncan Smith had a different view."', u"Mr Duncan Smith's resignation has divided his former ministerial team at the DWP.", u'Pensions minister Baroness Ros Altmann attacked his tenure, describing him as "exceptionally difficult" to work for, and accused him of using his resignation "to do maximum damage to the party leadership" in order to support the campaign to leave the EU.', u'But her fellow DWP minister Shailesh Vara said he was "surprised" at Baroness Altmann\'s comments, saying: "Ros\'s recollection does not accord with mine and I\'m sorry that this has all happened."', u'Disabilities minister Justin Tomlinson said the former secretary of state had "always conducted himself in a professional, dedicated and determined manner", while employment minister Priti Patel told BBC Radio 5 live it had been a "privilege" to work for him.', u'Owen Smith, Labour\'s welfare spokesman, said Mr Duncan Smith had been "very honest in explaining how George Osborne could have taken different choices" and had revealed "the fundamental unfairness at the heart of government policy".']
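You could of course just use p = [p.text for p in soup.select("p")] to select all the text inside paragraphs, but that will include a lot of data you don't want. If you are only interested in certain pages, you can also filter on whether a match is found in the css_d dict, with something like the following: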
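# fragment: this loop is the body of parse_links_text(links, css_d)
for link in links:
    cont = requests.get(link).content
    soup = BeautifulSoup(cont, "lxml")
    css = r.search(link)
    # skip links that point to a site we have no selector for
    if not css:
        continue
    # look up the selector for the matched site key, e.g. "bbc."
    css = css_d[css.group()]
    yield [p.text for p in soup.select(css)]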
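The lookup step in that loop (regex match on the link, then a dict lookup) can be tried on its own without any network access. This is only a sketch: pick_selector is a hypothetical helper name, and the two selectors in css_d are illustrative placeholders rather than verified selectors for those sites.

```python
import re

# Same site keys as the regex used in parse_links_text.
SITE_RE = re.compile(r"telegraph\.|bbc\.|dailymail\.|independent\.")

# Placeholder mapping for illustration only: site key -> CSS selector.
css_d = {
    "bbc.": "div.story-body p",
    "telegraph.": "div#mainBodyArea p",
}


def pick_selector(link, css_d):
    """Return the selector for the site the link points to, or None."""
    match = SITE_RE.search(link)
    if match is None:
        # The link points to a site we have no entry for.
        return None
    # The matched text (e.g. "bbc.") is the dict key.
    return css_d.get(match.group())


print(pick_selector("http://www.bbc.co.uk/news/uk-politics-35855616", css_d))
print(pick_selector("http://www.reuters.com/article/us-global-oil-idUSKCN0WN00I", css_d))
```

Returning None for unknown sites keeps the "skip this link" decision in the caller, which is what the `continue` in the loop does.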
As discussed in the comments, lxml is a good tool if you want more flexibility. To get the sections we can use the following code:
from urlparse import urljoin

import requests
from lxml.html import fromstring, HTMLParser

url = "https://news.google.co.uk"


def get_sections(start, sections):
    '''Pulls the links for each section we pass in, i.e. World, Business etc.'''
    cont = requests.get(start).content
    xml = fromstring(cont, HTMLParser())
    # each section heading is a span with the class section-name
    # whose parent a tag links to the section page
    secs = xml.xpath("//span[@class='section-name']")
    for sec in secs:
        # drop the last token of the label, lowercase it and strip a trailing dot
        _sec = sec.text.rsplit(None, 1)[0].lower().rstrip(".")
        if _sec in sections:
            yield _sec, urljoin(start, sec.xpath(".//parent::a/@href")[0])


def get_section_links(sec_url):
    '''Get all links from individual sections.'''
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    seen = set()
    # the article address is stored in the url attribute of each a tag
    for link in xml.xpath("//div[@class='section-stream-content']//a/@url"):
        if link not in seen:
            yield link
        seen.add(link)


# set of sections we want
s = {'business', 'world', "sports", "u.k"}
for sec, link in get_sections(url, s):
    for sec_link in get_section_links(link):
        print(sec, sec_link)
So if we run the code above we get all the links from each section. Below is a very small snippet from each section; quite a lot more links are actually returned:
(u'world', 'http://www.theguardian.com/commentisfree/2016/mar/21/new-york-millionaires-who-want-taxes-raised')
(u'world', 'http://www.abc.net.au/news/2016-03-22/berg-turnbull%27s-only-real-option-was-bluff-and-bravado/7264350')
(u'world', 'http://www.swissinfo.ch/eng/reuters/australian-pm-takes-bold-gamble--sets-in-motion-july-2-poll/42037074')
(u'world', 'https://www.washingtonpost.com/news/checkpoint/wp/2016/03/21/these-are-the-new-u-s-military-bases-near-the-south-china-sea-china-isnt-impressed/')
(u'world', 'http://www.reuters.com/article/southchinasea-china-usa-idUSL3N16T3BH')
(u'world', 'http://atimes.com/2016/03/philippine-election-question-marks-sow-panic-in-south-china-sea/')
(u'world', 'http://www.manilatimes.net/what-if-china-attacks-bases-used-by-america/251946/')
(u'world', 'http://www.arabnews.com/world/news/898816')
(u'world', 'http://macaudailytimes.com.mo/koreas-seoul-north-korea-fires-five-short-range-projectiles.html')
(u'world', 'http://gulftoday.ae/portal/cb0e2530-0769-411d-9622-2e991191656b.aspx')
(u'world', 'http://38north.org/2016/03/aabrahamian032116/')
(u'u.k', 'http://www.irishnews.com/news/2016/03/22/news/judge-tells-madonna-and-richie-to-settle-rocco-dispute-458929/')
(u'u.k', 'http://www.marilynstowe.co.uk/2016/03/21/judge-urges-amicable-resolution-in-madonna-dispute-over-son/')
(u'u.k', 'http://www.mercurynews.com/celebrities/ci_29666212/judge-tells-madonna-and-guy-ritchie-get-it')
(u'u.k', 'http://www.telegraph.co.uk/news/celebritynews/madonna/12199922/Madonnas-UK-court-fight-with-Guy-Ritchie-over-son-Rocco-can-end-judge-rules.html')
(u'u.k', 'http://www.pbo.co.uk/news/boaty-mcboatface-leading-public-vote-to-name-200m-polar-research-ship-28429')
(u'u.k', 'http://www.theguardian.com/environment/shortcuts/2016/mar/21/from-bell-end-boaty-mcboatface-trouble-letting-public-name-things')
(u'u.k', 'http://www.independent.co.uk/news/uk/boaty-mcboatface-debacle-shows-the-perils-of-crowdsourcing-opinion-from-hooty-mcowlface-to-mr-a6944801.html')
(u'u.k', 'http://www.sacbee.com/news/nation-world/world/article67322252.html')
(u'u.k', 'http://www.westerndailypress.co.uk/Jury-discharged-manslaughter-case-Thomas-Orchard/story-28964162-detail/story.html')
(u'u.k', 'http://www.exeterexpressandecho.co.uk/Breaking-Thomas-Orchard-manslaughter-trial-jury/story-28963859-detail/story.html')
(u'u.k', 'http://www.theguardian.com/uk-news/2016/mar/21/thomas-orchard-trial-jury-discharged-judge-halts-proceedings')
(u'u.k', 'http://www.ft.com/cms/s/0/0bf3e966-ef57-11e5-9f20-c3a047354386.html')
(u'u.k', 'http://www.theweek.co.uk/london-mayor-election-2016/62681/london-mayor-election-2016-whos-in-the-running-as-starting-gun')
(u'business', 'https://uk.finance.yahoo.com/news/companies-may-soon-stop-reporting-162707837.html')
(u'business', 'http://www.theweek.co.uk/70785/why-youre-about-to-stop-getting-quarterly-reports-on-your-investments')
(u'business', 'http://uk.reuters.com/article/uk-starwood-hotels-m-a-marriott-idUKKCN0WN142')
(u'business', 'http://www.reuters.com/article/us-global-oil-idUSKCN0WN00I')
(u'business', 'http://www.digitallook.com/news/commodities/commodities-oil-futures-recoup-previous-sessions-losses--1087119.html')
(u'business', 'http://news.sky.com/story/1664056/new-top-dog-at-pets-at-home-as-ceo-retires')
(u'business', 'http://money.aol.co.uk/2016/03/21/sky-tv-price-hike-shock/')
(u'business', 'http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=11609694')
(u'business', 'http://www.dailymail.co.uk/sciencetech/article-3502838/The-Flying-Bum-ready-lift-World-s-largest-aircraft-Airlander-10-fitted-fins-engines-ahead-flight.html')
(u'business', 'http://www.business-standard.com/article/pti-stories/world-s-longest-aircraft-revealed-in-new-pictures-116032000569_1.html')
(u'sports', 'http://www.telegraph.co.uk/football/2016/03/21/gary-neville-consulted-roy-hodgson-on-england-delay/')
(u'sports', 'http://www.dailymail.co.uk/sport/football/article-3502767/Gary-Neville-leaving-Valencia-join-England-gritted-teeth-feels-like-La-Liga-club-giving-fans-chant-manager-now.html')
(u'sports', 'http://www.irishexaminer.com/sport/soccer/gary-neville-in-firing-line-as-valencia-lose-again-388634.html')
(u'sports', 'http://timesofindia.indiatimes.com/sports/tennis/top-stories/Male-tennis-players-should-earn-more-than-females-Djokovic/articleshow/51499959.cms')
(u'sports', 'http://www.sport24.co.za/soccer/livescoring?mid=23948674&st=football')
(u'sports', 'http://www.dispatch.com/content/stories/sports/2016/03/21/0321-serena-williams-rips-indian-wells-ceo.html')
(u'sports', 'http://www.bbc.co.uk/sport/football/35864765')
(u'sports', 'http://indianexpress.com/article/sports/football/joachim-loew-throws-max-kruse-out-of-germany-squad/')
(u'sports', 'http://www.si.com/planet-futbol/2016/03/21/max-kruse-germany-kicked-jogi-low')
(u'sports', 'http://www.dw.com/en/coach-joachim-l%C3%B6w-drops-max-kruse-from-german-national-team/a-19132035')
(u'sports', 'http://www.bbc.co.uk/sport/football/35865092')
(u'sports', 'http://news.sky.com/story/1664218')
(u'sports', 'http://www.theguardian.com/business/2016/mar/21/sports-direct-founder-mike-ashley-snubs-call-mps-parliamentary-select-committee')
(u'sports', 'http://www.mirror.co.uk/news/business/sports-direct-boss-mike-ashley-7604067')
(u'sports', 'http://www.independent.ie/sport/soccer/mike-ashley-says-he-is-wedded-to-newcastle-even-if-they-go-down-34558617.html')
(u'sports', 'http://www.heraldscotland.com/sport/14373924.Michael_Carrick_praises_performance_after_United_win_Manchester_derby/')
(u'sports', 'http://www.dorsetecho.co.uk/sport/national/14373773.Michael_Carrick_hails_vital_Manchester_derby_victory/')
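The label clean-up inside get_sections (rsplit, lower, rstrip) is what turns a heading into keys such as u'world' and u'u.k' in the output above. A small sketch of that step in isolation, written for Python 3; normalize_section is a hypothetical helper name, and the sample labels with a trailing "»" marker are an assumption about how the headings looked, not something taken from the page:

```python
def normalize_section(label):
    """Mirror the clean-up used in get_sections.

    rsplit(None, 1)[0] drops the last whitespace-separated token
    (assumed here to be a trailing marker), then the rest is
    lowercased and a trailing dot is stripped.
    """
    return label.rsplit(None, 1)[0].lower().rstrip(".")


# Assumed sample labels; "\u00bb" is the "»" character.
for label in (u"World \u00bb", u"U.K. \u00bb", u"Business \u00bb"):
    print(normalize_section(label))
```

This is why the set of wanted sections contains "u.k" rather than "u.k." or "uk".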
If we just return a set from get_section_links, we can pass it to the function that parses the text:
def get_section_links(sec_url):
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    return set(xml.xpath("//div[@class='section-stream-content']//a/@url"))
So, using lxml to parse with xpaths, we can add a bit more logic to catch variations for the few sites we have already parsed:
import logging
import re

import requests
from lxml.html import fromstring, HTMLParser

# map each site to the xpath that pulls its main article text
d = {"dailymail.": "//div[@itemprop='articleBody']//p",
     "telegraph.": "//div[@id='mainBodyArea']//p",
     "bbc.": "//div[@class='story-body']//p",
     "independent.": "//div[@class='text-wrapper']//p",
     "www.mirror.": "//*[@class='live-now-entry' or @class='lead-entry' or @itemprop='articleBody']//p"}

logger = logging.getLogger(__file__)
logging.basicConfig()
logger.setLevel(logging.DEBUG)


def parse_links_text(links, xpath_d):
    # use a regex to find out what site the link points to
    # so we can pull the appropriate xpath from the dict
    r = re.compile(r"telegraph\.|bbc\.|dailymail\.|independent\.|www\.mirror\.")
    for link in links:
        try:
            cont = requests.get(link).content
        except requests.exceptions.RequestException as e:
            logger.error(e)
            continue
        xml = fromstring(cont, HTMLParser())
        xpath = r.search(link)
        if xpath:
            p = "".join(filter(None, ("".join(node.xpath("normalize-space(.//text())"))
                                      for node in xml.xpath(xpath_d[xpath.group()]))))
            if p:
                yield p
            else:
                # the xpath found no text, so the page layout probably differs
                logger.debug("No match for {}".format(link))
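Again, you will have to decide which sites you want to visit and find the correct xpath to extract the main article text, but this should get you well on your way. When I have more time, I will add some logic to run the requests asynchronously.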
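Until then, one way to overlap the downloads is a thread pool. The following is a sketch for Python 3 and not the asynchronous logic mentioned above: fetch_all is a hypothetical helper, and the fetch callable is passed in so the example can be run without network access (in real use it could be something like lambda u: requests.get(u, timeout=10).content).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch, max_workers=8):
    """Fetch every link concurrently.

    Returns (link, body) pairs in the same order as the input,
    leaving out links whose fetch raised an exception.
    """
    def safe_fetch(link):
        try:
            return link, fetch(link)
        except Exception:
            # A failed download should not stop the other downloads.
            return link, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, not completion order.
        results = list(pool.map(safe_fetch, links))
    return [(link, body) for link, body in results if body is not None]


def fake_fetch(link):
    """Stand-in for a real HTTP request, for demonstration only."""
    if "bad" in link:
        raise IOError("simulated network error")
    return "body of " + link


for link, body in fetch_all(["page-1", "bad-page", "page-2"], fake_fetch, max_workers=2):
    print(link, body)
```

The fetched bodies could then be handed to the parsing step, so that parse_links_text only has to deal with the lxml work.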