Python 2.7.10: trying to print text from a website using Beautiful Soup 4

Question

I want my output to look like this:

count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim

Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.

count:1 - Andre-Pierre Gignac set for Mexico move

Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.The France international is a free agent after leaving Marseille and is set to undergo a medical later today.West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.

My program:

from bs4 import BeautifulSoup
import urllib2
response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)

count=0
for tag in soup.find_all("div", {"id":"lc-commentary-posts"}):
    divTaginb = tag.find_all("div", {"class":"lc-title-container"})
    divTaginp = tag.find_all("div",{"class":"lc-post-body"})
    for tag1 in divTaginb:
        h4Tag = tag1.find_all("b")
        for tag2 in h4Tag:
            print "count:%d - "%count,
            print tag2.text
            print '\n'
            tagp = divTaginp[count].find_all('p')
            for p in tagp:
                print p
            print '\n'
            count +=1

My output:

Count:0 - ....
...
count:37 -  ICYMI: Hamburg target Celtic star Stefan Johansen as part of summer
rebuilding process


<p><strong>STEPHEN MCGOWAN:</strong>┬áBundesliga giants Hamburg have been linked
 with a move for CelticΓÇÖs PFA Scotland player of the year Stefan Johansen.</p>

<p>German newspapers claim the Norwegian features on a three-man shortlist of po
tential signings for HSV as part of their summer rebuilding process.</p>
<p>Hamburg scouts are reported to have watched Johansen during Friday nightΓÇÖs
scoreless Euro 2016 qualifier draw with Azerbaijan.</p>
<p><a href="http://www.dailymail.co.uk/sport/football/article-3128854/Hamburg-ta
rget-Celtic-star-Stefan-Johansen-summer-rebuilding-process.html"><strong>CLICK H
ERE for more</strong></a></p>


count:38 -  ICYMI: Sevilla agree deal with Chelsea to sign out-of-contract midfi
elder Gael Kakuta


<p>Sevilla have agreed a deal with Premier League champions Chelsea to sign out-
of-contract winger Gael Kakuta.</p>
<p>The French winger, who spent last season on loan in the Primera Division with
 Rayo Vallecano, will arrive in Seville on Thursday to undergo a medical with th
e back-to-back Europa League winners.</p>
<p>A statement published on Sevilla's official website confirmed the 23-year-old
's transfer would go through if 'everything goes well' in the Andalusian city.</
p>
<p><strong><a href="http://www.dailymail.co.uk/sport/football/article-3128756/Se
villa-agree-deal-Chelsea-sign-Gael-Kakuta-contract-winger-aims-resurrect-career-
Europa-League-winners.html">CLICK HERE for more</a></strong></p>


count:39 -  Good morning everybody!


<p>And welcome to <em>Sportsmail's</em> coverage of all the potential movers and
 shakers ahead of the forthcoming summer transfer window.</p>
<p>Whatever deals will be rumoured, agreed or confirmed today┬áyou can read all
about them here.</p>

The HTML of the Daily Mail page looks like this:

<div id="lc-commentary-posts"><div id="lc-id-39" class="lc-commentary-post cleared">
    <div class="lc-icons">
        <img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_bournemouth.png" class="lc-icon">
        <img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_watford.png" class="lc-icon">
        <div class="lc-post-time">18:03 </div>
    </div>
    <div class="lc-title-container">
        <h4>
            <a href="http://www.dailymail.co.uk/sport/football/article-3130092/Bournemouth-Watford-want-former-Manchester-City-midfielder.html" target="_blank"><b>Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim</b></a>
        </h4>
    </div>
    <div class="lc-post-body">
        <p><strong>SAMI MOKBEL:&nbsp;</strong>Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.</p>
<p class="mol-para-with-font">The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.</p>
<p class="mol-para-with-font"><font>Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.</font></p>
    </div>


    <img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/18/1434647000147_lc_galleryImage_TEL_AVIV_ISRAEL_JUNE_11_A.JPG">
    <b class="lc-image-caption">Abdisalam Ibrahim could return to England</b>
    <div class="lc-clear"></div>

    <ul class="lc-social">
        <li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '39', 'facebook')"></span></li>
        <li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '39', 'twitter', window.twitterVia)"></span></li>
    </ul>
</div>
<div id="lc-id-38" class="lc-commentary-post cleared">
    <div class="lc-icons">
        <img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_west_brom.png" class="lc-icon">
        <img src="http://i.mol.im/i/furniture/live_commentary/flags/60x60_mexico.png" class="lc-icon">
        <div class="lc-post-time">16:54 </div>
    </div>
    <div class="lc-title-container">
            <span><b>Andre-Pierre Gignac set for Mexico move</b></span>
    </div>
    <div class="lc-post-body">
        <p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
<p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
<p>West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.</p>
    </div>


    <img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/16/1434642784396_lc_galleryImage__FILES_A_file_picture_tak.JPG">
    <b class="lc-image-caption">Andre-Pierre Gignac is to complete a move to Mexican side Tigres</b>
    <div class="lc-clear"></div>

    <ul class="lc-social">
        <li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '38', 'facebook')"></span></li>
        <li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '38', 'twitter', window.twitterVia)"></span></li>
    </ul>
</div>

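Right now I am targeting the <b></b> inside <div class="lc-title-container">, which I can get easily. But when I target all the <p></p> tags inside <div class="lc-post-body">, I cannot get just the text I need. I tried p.text and p.strip(), but neither solved my problem.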

Error when using p.text:

count:19 -  City's pursuit of Sterling, Wilshere and Fabian Delph show a need fo
r English quality


MIKE KEEGAN: Colonial explorer Cecil Rhodes is famously reported to have once sa
id that to be an Englishman 'is to have won first prize in the lottery of life'.

Back in the 19th century, the vicar's son was no doubt preaching about the expan
ding Empire and his own experiences in Africa.
Traceback (most recent call last):
  File "app.py", line 24, in <module>
    print p.text
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position
 160: character maps to <undefined>

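When I use p.strip(), I get no output. Is there a good way to do this? Please help me find the best approach. I have been trying to get this working since this morning.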

If possible, I would rather not use any encoder or decoder, such as:

dammit = UnicodeDammit(html)
print(dammit.unicode_markup)
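
A note on the traceback above: print has to encode the Unicode text into the console's code page (cp437, according to the traceback), and cp437 has no mapping for u'\u2013' (an en dash), so the strict encoder raises UnicodeEncodeError. The following is only a minimal sketch of one workaround, assuming it is acceptable for such characters to show up as '?' on screen. It does involve an explicit encode step, which the question hoped to avoid. The helper name console_safe and the sample string are made up for illustration, and the default of 'cp437' is an assumption taken from the traceback.

```python
from __future__ import print_function, unicode_literals


def console_safe(text, encoding='cp437'):
    """Return text with characters the encoding cannot represent turned into '?'."""
    # Encode with the 'replace' error handler instead of the default 'strict',
    # then decode again so the caller still gets a text (unicode) string.
    return text.encode(encoding, 'replace').decode(encoding)


# Made-up sample containing the en dash (U+2013) named in the traceback.
sample = 'Arsenal 1\u20130 Chelsea'

try:
    sample.encode('cp437')  # what print effectively does on a cp437 console
except UnicodeEncodeError as exc:
    print('strict encoding fails:', exc.reason)

print(console_safe(sample))
```

With something like this, the loop in the program would call print console_safe(p.text) instead of print p.text. Changing the console code page is another route, but it is not sketched here.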

Answer 1

Here is my code. You should take a look at it. I was too lazy to add specific fields to the dataset, so I combined everything instead.

from bs4 import BeautifulSoup, element
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)

article_dataset = {}

# Try to make your variable names express what you are trying to do.
# Collect the article posts.
article_post_tags = soup.find_all("div", {"id":"lc-commentary-posts"})

# Set up article_dataset with the article name as its key.
for article_post_tag in article_post_tags:
    container_tags = article_post_tag.find_all("div", {"class":"lc-title-container"})
    body_tags = article_post_tag.find_all("div", {"class":"lc-post-body"})

    # Find the article name and store the matching body tag under it.
    for count, container in enumerate(container_tags):
        # We know there is only one <b> tag in our container,
        # so use find() instead of find_all().
        article_name_tag = container.find('b')

        # Our primary key is the article name; the corresponding value
        # is a dict holding the body tag.
        article_dataset[article_name_tag.text] = {'body_tag':body_tags[count]}

for article_name, details in article_dataset.items():
    content = []
    content_line_tags = details['body_tag'].find_all('p')

    # Go through each <p> tag and collect its text.
    for content_tag in content_line_tags:
        for data in content_tag.contents:  # gather the strings in our tags
            if isinstance(data, element.NavigableString):
                data = unicode(data)
            else:
                data = data.text
            content.append(data)

    # Combine the content.
    content = '\n'.join(content)

    # Add the content to our data.
    article_dataset[article_name]['content'] = content

# Remove the body_tag from our article dataset, then print each entry.
for name, details in article_dataset.items():
    del details['body_tag']

    print
    print
    print 'Article Name: ' + name
    print 'Player: ' + details['content'].split('\n')[0]
    print 'Article Summary: ' + details['content']
    print
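
The code above stores the posts in a dict, so they are not guaranteed to print in page order. If the goal is the "count:N - title" layout from the question, a more compact variant is to walk each post in turn and let get_text() strip the markup. This is only a sketch under stated assumptions: it runs on a trimmed copy of the question's HTML rather than the live page, it assumes every post has both a title container with a <b> tag and a post body, and the function name extract_posts is made up for illustration.

```python
from __future__ import print_function
from bs4 import BeautifulSoup

# Trimmed copy of one post from the HTML shown in the question.
SAMPLE_HTML = """
<div id="lc-commentary-posts">
  <div id="lc-id-38" class="lc-commentary-post cleared">
    <div class="lc-title-container">
      <span><b>Andre-Pierre Gignac set for Mexico move</b></span>
    </div>
    <div class="lc-post-body">
      <p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
      <p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
    </div>
  </div>
</div>
"""


def extract_posts(html):
    """Return a list of (title, body_text) pairs in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for post in soup.find_all('div', class_='lc-commentary-post'):
        # Assumption: each post has exactly one title <b> and one body <div>.
        title_tag = post.find('div', class_='lc-title-container').find('b')
        body_tag = post.find('div', class_='lc-post-body')
        # get_text() returns only the text nodes, without the <p>/<strong>/<a> markup.
        paragraphs = [p.get_text(' ', strip=True) for p in body_tag.find_all('p')]
        posts.append((title_tag.get_text(strip=True), ' '.join(paragraphs)))
    return posts


posts = extract_posts(SAMPLE_HTML)
for count, (title, body) in enumerate(posts):
    print('count:%d - %s' % (count, title))
    print(body)
    print()
```

As for p.strip(): on a Beautiful Soup Tag, an unknown attribute such as strip is treated as a search for a child tag named <strip>, which does not exist here, so it does not behave like the string method. get_text(strip=True) is the call that trims whitespace. Printing the result on a cp437 console can still raise the UnicodeEncodeError from the question if the text contains characters that code page lacks.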
