Extracting text from an HTML document using BeautifulSoup in Python

I am trying to use BeautifulSoup to extract the text of the parent comments on songmeanings.com from the following HTML:

<div class="text" id="comment-73014911864">
 <strong class="title">
  General Comment
 </strong>
 This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
 <br/>
 <br/>
 (a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
 <br/>
 (b) He has talent. He can actually rap. I don't think d12 is any good. =/
 <br/>
 <br/>
 Anyway. I love this song and I'm getting his new CD right now... hehe.
 <br/>
 -Sarah
 <div class="sign">
  <a class="author" href="/profiles/view/17067478/" id="userprofile-17067478" rel="me nofollow" title="xoDonnieDarko">
   xoDonnieDarko
  </a>
  <em class="date">
   on December 06, 2005
  </em>
  <a href="/songs/view/3530822107858560012/?&amp;specific_com=73014911864#comments" id="specific_com-73014911864" rel="nofollow" title="Permalink">
   Link
  </a>
 </div>
 <ul class="answers">
  <li>
   <div class="title">
    <a class="replies close-replies" href="#" id="showreplies-73014911864" rel="nofollow" title="3 Replies">
     3 Replies
    </a>
    <span class="login">
     <a class="lightbox" href="#popup-loginform" rel="nofollow">
      Log in to reply
     </a>
    </span>
    <br>
    </br>
   </div>
   <div id="formreply-73014911864" style="display: none;">
    <!-- comment-form -->
    <form action="#" class="comment-form-reply" id="comment-form-reply-73014911864">
     <div class="area" id="reply-errors-box" style="display: none;">
      <label for="type">
      </label>
      <span id="reply-errors" style="color: #ff0000;">
       There was an error.
      </span>
     </div>
     <div class="area">
      <div class="textarea">
       <div class="holder">
        <div class="frame">
         <textarea class="frmreplycomment-73014911864" id="frmreplycomment" name="frmreplycomment">
          @xoDonnieDarko
         </textarea>
        </div>
       </div>
      </div>
     </div>
     <input id="frmreplylid" name="frmreplylid" type="hidden" value="3530822107858560012">
      <input id="frmaid" name="frmaid" type="hidden" value="94">
       <input id="frmreplycid" name="frmreplycid" type="hidden" value="73014911864">
        <input class="submit" type="submit" value="Add reply"/>
       </input>
      </input>
     </input>
    </form>
   </div>
   <div id="thesereplies-73014911864" style="display: none;">
    <div class="answer-holder" id="fullcomment-73015890665">
     <a name="comment-73015890665">
     </a>
     <div id="rating-holder-73015890665">
      <div class="numb-holder">
       <span id="com-rating-73015890665">
        <strong class="numb" id="numb-rating-73015890665">
         +1
        </strong>
       </span>
       <div class="com-whorated" id="com-whorated-73015890665" style="display: none; text-align: center;">
        <span class="processing">
        </span>
       </div>
       <div id="processing-73015890665" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
        <span class="processing">
        </span>
       </div>
      </div>
     </div>
     <div class="text">
      i agree he is the only rapper i can listen too.
      <div class="sign">
       <span id="flagspan-73015890665">
        <a class="flag" href="#" id="flag-73015890665">
         Flag
        </a>
       </span>
       <a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
        byrdman1992
       </a>
       <em class="date">
        on March 15, 2010
       </em>
      </div>
     </div>
    </div>
    <div class="answer-holder" id="fullcomment-73015961779">
     <a name="comment-73015961779">
     </a>
     <div id="rating-holder-73015961779">
      <div class="numb-holder">
       <span id="com-rating-73015961779">
        <strong class="numb" id="numb-rating-73015961779">
         0
        </strong>
       </span>
       <div class="com-whorated" id="com-whorated-73015961779" style="display: none; text-align: center;">
        <span class="processing">
        </span>
       </div>
       <div id="processing-73015961779" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
        <span class="processing">
        </span>
       </div>
      </div>
     </div>
     <div class="text">
      same her the ONLY one...and sometimes lil' wayne! lol
      <div class="sign">
       <span id="flagspan-73015961779">
        <a class="flag" href="#" id="flag-73015961779">
         Flag
        </a>
       </span>
       <a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
        dancer017
       </a>
       <em class="date">
        on August 26, 2010
       </em>
      </div>
     </div>
    </div>
    <div class="answer-holder" id="fullcomment-73016306033">
     <a name="comment-73016306033">
     </a>
     <div id="rating-holder-73016306033">
      <div class="numb-holder">
       <span id="com-rating-73016306033">
        <strong class="numb" id="numb-rating-73016306033">
         0
        </strong>
       </span>
       <div class="com-whorated" id="com-whorated-73016306033" style="display: none; text-align: center;">
        <span class="processing">
        </span>
       </div>
       <div id="processing-73016306033" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
        <span class="processing">
        </span>
       </div>
      </div>
     </div>
     <div class="text">
      <a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
       @xoDonnieDarko
      </a>
      RIttz is pretty good.. Can listen to yela and tech too.
      <div class="sign">
       <span id="flagspan-73016306033">
        <a class="flag" href="#" id="flag-73016306033">
         Flag
        </a>
       </span>
       <a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
        Heeltoehole
       </a>
       <em class="date">
        on September 05, 2015
       </em>
      </div>
     </div>
    </div>
   </div>
  </li>
 </ul>
</div>

<div class="text">
 i agree he is the only rapper i can listen too.
 <div class="sign">
  <span id="flagspan-73015890665">
   <a class="flag" href="#" id="flag-73015890665">
    Flag
   </a>
  </span>
  <a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
   byrdman1992
  </a>
  <em class="date">
   on March 15, 2010
  </em>
 </div>
</div>

<div class="text">
 same her the ONLY one...and sometimes lil' wayne! lol
 <div class="sign">
  <span id="flagspan-73015961779">
   <a class="flag" href="#" id="flag-73015961779">
    Flag
   </a>
  </span>
  <a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
   dancer017
  </a>
  <em class="date">
   on August 26, 2010
  </em>
 </div>
</div>

<div class="text">
 <a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
  @xoDonnieDarko
 </a>
 RIttz is pretty good.. Can listen to yela and tech too.
 <div class="sign">
  <span id="flagspan-73016306033">
   <a class="flag" href="#" id="flag-73016306033">
    Flag
   </a>
  </span>
  <a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
   Heeltoehole
  </a>
  <em class="date">
   on September 05, 2015
  </em>
 </div>
</div>

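With this code I can extract most of the text from the comments, but any comment that contains line breaks loses everything after the first break: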

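# Python 2 code: urllib2 and the print statement do not exist in Python 3.
import urllib2
from bs4 import BeautifulSoup

url = "http://songmeanings.com/songs/view/3530822107858560012/"
# Open the page through an opener that handles cookies.
response = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url)
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# next_sibling is only the single node that directly follows each <strong> tag.
for strong_tag in soup.find_all('strong'):
    print strong_tag.next_sibling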

This gives the output:

This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
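
The output stops at "Because," most likely because every `<br/>` ends one text node and starts another, and `next_sibling` returns only the node directly after `<strong>`. Here is a minimal sketch that lists the siblings, written in Python 3 syntax. The `snippet` string is a shortened stand-in for the markup above, used as an assumption for illustration, not the real page:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened stand-in for the comment markup shown above (illustrative assumption).
snippet = (
    '<div class="text"><strong class="title">General Comment</strong>'
    'I love it a lot. Because,<br/><br/>'
    '(a) His songs have meaning.<br/>-Sarah'
    '<div class="sign">xoDonnieDarko</div></div>'
)

soup = BeautifulSoup(snippet, "html.parser")
strong_tag = soup.find("strong")

# Describe every sibling after <strong>: text nodes are separated by <br/> tags.
siblings = [
    ("text", str(node)) if isinstance(node, NavigableString) else ("tag", node.name)
    for node in strong_tag.next_siblings
]

for kind, value in siblings:
    print(kind, repr(value))
```

The first entry is the only text that `next_sibling` can see. The remaining lines of the comment are separate text nodes further along the same sibling chain.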

What I want is:

This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,

(a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
(b) He has talent. He can actually rap. I don't think d12 is any good. =/

Anyway. I love this song and I'm getting his new CD right now... hehe.
-Sarah

How can I extract all of the text from a parent comment? Is there a better way than going through the strong tag?

I slightly modified an earlier answer (give its author an upvote!) to arrive at this solution:

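def loop_until(text, first_elem):
  # Accumulate text nodes, starting at first_elem, until the next <div> begins.
  try:
    text += first_elem.string
    if first_elem.next_element == first_elem.find_next('div'):
      return text
    else:
      # Skip the tag that follows (here a <br/>) and continue with the text after it.
      return loop_until(text, first_elem.next_element.next_element)
  except TypeError:
    # first_elem.string was None (a tag rather than a text node): return what we have.
    return text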

Call it like this:

next_elem = soup.find_all('strong')[0].next_sibling
loop_until('',next_elem)

Result:

 u"\n This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,\n \n\n (a) His songs have meaning. They're not about sex and cars and bling blingin' rims.\n \n (b) He has talent. He can actually rap. I don't think d12 is any good. =/\n \n\n Anyway. I love this song and I'm getting his new CD right now... hehe.\n \n -Sarah\n "
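
As an alternative to the recursion, the same idea can be sketched as a plain loop over the direct children of the comment `<div>`. This is only a sketch under assumptions. The helper name `parent_comment_text` is hypothetical. It uses Python 3 syntax. It treats each `<br/>` as a newline and stops at the first child `<div>` (the `sign` block in the markup above). It also assumes parent comments are the `div.text` elements whose `id` starts with `comment-`, as in the HTML in the question. The `sample` string is a shortened copy of that HTML.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def parent_comment_text(comment_div):
    """Join the text directly inside comment_div, up to its first child <div>.

    Hypothetical helper: the stop-at-first-<div> rule is an assumption
    based on the markup shown in the question.
    """
    parts = []
    for node in comment_div.children:
        if isinstance(node, NavigableString):
            # Drop the indentation and newlines that surround each text node.
            parts.append(node.strip())
        elif node.name == "br":
            # Each <br/> becomes one line break in the result.
            parts.append("\n")
        elif node.name == "div":
            # The signature block starts here; the replies come after it.
            break
        # Other tags, such as the <strong> title, are skipped.
    return "".join(parts)


# Shortened copy of the parent comment from the question.
sample = """
<div class="text" id="comment-73014911864">
 <strong class="title">
  General Comment
 </strong>
 This is a beautiful song. I love it a lot. Because,
 <br/>
 <br/>
 (a) His songs have meaning.
 <br/>
 -Sarah
 <div class="sign">
  <a class="author" href="/profiles/view/17067478/">xoDonnieDarko</a>
 </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
# Replies are also div.text, but only parent comments carry a "comment-..." id.
parents = soup.find_all("div", class_="text", id=re.compile(r"^comment-"))
texts = [parent_comment_text(div) for div in parents]

for text in texts:
    print(text)
```

Unlike the recursive version, this loop does not depend on a text node sitting between two consecutive `<br/>` tags, and it avoids deep recursion on long comments.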