Extracting text from an HTML document using BeautifulSoup in Python
I'm trying to use BeautifulSoup to extract the text of the parent comments on songmeanings.com from the following HTML:
<div class="text" id="comment-73014911864">
<strong class="title">
General Comment
</strong>
This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
<br/>
<br/>
(a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
<br/>
(b) He has talent. He can actually rap. I don't think d12 is any good. =/
<br/>
<br/>
Anyway. I love this song and I'm getting his new CD right now... hehe.
<br/>
-Sarah
<div class="sign">
<a class="author" href="/profiles/view/17067478/" id="userprofile-17067478" rel="me nofollow" title="xoDonnieDarko">
xoDonnieDarko
</a>
<em class="date">
on December 06, 2005
</em>
<a href="/songs/view/3530822107858560012/?&specific_com=73014911864#comments" id="specific_com-73014911864" rel="nofollow" title="Permalink">
Link
</a>
</div>
<ul class="answers">
<li>
<div class="title">
<a class="replies close-replies" href="#" id="showreplies-73014911864" rel="nofollow" title="3 Replies">
3 Replies
</a>
<span class="login">
<a class="lightbox" href="#popup-loginform" rel="nofollow">
Log in to reply
</a>
</span>
<br>
</br>
</div>
<div id="formreply-73014911864" style="display: none;">
<!-- comment-form -->
<form action="#" class="comment-form-reply" id="comment-form-reply-73014911864">
<div class="area" id="reply-errors-box" style="display: none;">
<label for="type">
</label>
<span id="reply-errors" style="color: #ff0000;">
There was an error.
</span>
</div>
<div class="area">
<div class="textarea">
<div class="holder">
<div class="frame">
<textarea class="frmreplycomment-73014911864" id="frmreplycomment" name="frmreplycomment">
@xoDonnieDarko
</textarea>
</div>
</div>
</div>
</div>
<input id="frmreplylid" name="frmreplylid" type="hidden" value="3530822107858560012">
<input id="frmaid" name="frmaid" type="hidden" value="94">
<input id="frmreplycid" name="frmreplycid" type="hidden" value="73014911864">
<input class="submit" type="submit" value="Add reply"/>
</input>
</input>
</input>
</form>
</div>
<div id="thesereplies-73014911864" style="display: none;">
<div class="answer-holder" id="fullcomment-73015890665">
<a name="comment-73015890665">
</a>
<div id="rating-holder-73015890665">
<div class="numb-holder">
<span id="com-rating-73015890665">
<strong class="numb" id="numb-rating-73015890665">
+1
</strong>
</span>
<div class="com-whorated" id="com-whorated-73015890665" style="display: none; text-align: center;">
<span class="processing">
</span>
</div>
<div id="processing-73015890665" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
<span class="processing">
</span>
</div>
</div>
</div>
<div class="text">
i agree he is the only rapper i can listen too.
<div class="sign">
<span id="flagspan-73015890665">
<a class="flag" href="#" id="flag-73015890665">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
byrdman1992
</a>
<em class="date">
on March 15, 2010
</em>
</div>
</div>
</div>
<div class="answer-holder" id="fullcomment-73015961779">
<a name="comment-73015961779">
</a>
<div id="rating-holder-73015961779">
<div class="numb-holder">
<span id="com-rating-73015961779">
<strong class="numb" id="numb-rating-73015961779">
0
</strong>
</span>
<div class="com-whorated" id="com-whorated-73015961779" style="display: none; text-align: center;">
<span class="processing">
</span>
</div>
<div id="processing-73015961779" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
<span class="processing">
</span>
</div>
</div>
</div>
<div class="text">
same her the ONLY one...and sometimes lil' wayne! lol
<div class="sign">
<span id="flagspan-73015961779">
<a class="flag" href="#" id="flag-73015961779">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
dancer017
</a>
<em class="date">
on August 26, 2010
</em>
</div>
</div>
</div>
<div class="answer-holder" id="fullcomment-73016306033">
<a name="comment-73016306033">
</a>
<div id="rating-holder-73016306033">
<div class="numb-holder">
<span id="com-rating-73016306033">
<strong class="numb" id="numb-rating-73016306033">
0
</strong>
</span>
<div class="com-whorated" id="com-whorated-73016306033" style="display: none; text-align: center;">
<span class="processing">
</span>
</div>
<div id="processing-73016306033" style="text-align: center; padding: 8px 8px 0px 12px; display: none;">
<span class="processing">
</span>
</div>
</div>
</div>
<div class="text">
<a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
@xoDonnieDarko
</a>
RIttz is pretty good.. Can listen to yela and tech too.
<div class="sign">
<span id="flagspan-73016306033">
<a class="flag" href="#" id="flag-73016306033">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
Heeltoehole
</a>
<em class="date">
on September 05, 2015
</em>
</div>
</div>
</div>
</div>
</li>
</ul>
</div>
<div class="text">
i agree he is the only rapper i can listen too.
<div class="sign">
<span id="flagspan-73015890665">
<a class="flag" href="#" id="flag-73015890665">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17374833/" id="userprofile-17374833" rel="me nofollow" title="byrdman1992">
byrdman1992
</a>
<em class="date">
on March 15, 2010
</em>
</div>
</div>
<div class="text">
same her the ONLY one...and sometimes lil' wayne! lol
<div class="sign">
<span id="flagspan-73015961779">
<a class="flag" href="#" id="flag-73015961779">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17418133/" id="userprofile-17418133" rel="me nofollow" title="dancer017">
dancer017
</a>
<em class="date">
on August 26, 2010
</em>
</div>
</div>
<div class="text">
<a href="/profiles/view/17067478/?mention=12eeb84af5d911243541dc3bf651fc7b" id="userprofile-17067478" rel="me nofollow" title="@xoDonnieDarko">
@xoDonnieDarko
</a>
RIttz is pretty good.. Can listen to yela and tech too.
<div class="sign">
<span id="flagspan-73016306033">
<a class="flag" href="#" id="flag-73016306033">
Flag
</a>
</span>
<a class="author" href="/profiles/view/17643918/" id="userprofile-17643918" rel="me nofollow" title="Heeltoehole">
Heeltoehole
</a>
<em class="date">
on September 05, 2015
</em>
</div>
</div>
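With this code I can extract most of the text from the comments, but any comment that contains line breaks comes back with content missing: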
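# Python 2: urllib2 and the print statement do not exist in Python 3.
import urllib2
from bs4 import BeautifulSoup

url = "http://songmeanings.com/songs/view/3530822107858560012/"
# Open the page with cookie handling enabled.
response = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url)
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# next_sibling is only the single node right after each <strong> tag.
for strong_tag in soup.find_all('strong'):
    print strong_tag.next_sibling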
This gives the output:
This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
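A minimal sketch of why only the first line comes back, written for Python 3 and assuming a trimmed-down, hypothetical stand-in for the comment markup (`SAMPLE_HTML` is not the real page). Each `<br/>` splits the comment into separate text nodes, and `next_sibling` returns just the one node that follows the `<strong>` tag:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened stand-in for the comment markup shown above.
SAMPLE_HTML = """
<div class="text" id="comment-1">
<strong class="title">General Comment</strong>
First line,
<br/>
<br/>
(a) Second line.
<br/>
-Sarah
<div class="sign"><a class="author">someone</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
strong_tag = soup.find("strong", class_="title")

# next_sibling is a single node: the text up to the first <br/>.
first_text = strong_tag.next_sibling

# next_siblings yields every following sibling, text nodes and tags alike.
sibling_kinds = [
    "text" if isinstance(node, NavigableString) else node.name
    for node in strong_tag.next_siblings
]

print(repr(first_text))
print(sibling_kinds)
```

The remaining lines of the comment are the later "text" entries in `sibling_kinds`, which `next_sibling` alone never reaches.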
What I want is:
This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,
(a) His songs have meaning. They're not about sex and cars and bling blingin' rims.
(b) He has talent. He can actually rap. I don't think d12 is any good. =/
Anyway. I love this song and I'm getting his new CD right now... hehe.
-Sarah
How can I extract all of the text from the parent comment? Is there a better way than going through the strong tags?
I slightly modified an existing answer (give its author an upvote!) to arrive at this solution:
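def loop_until(text, first_elem):
    # Append text nodes one at a time until the next element is a <div>.
    try:
        text += first_elem.string
        if first_elem.next == first_elem.find_next('div'):
            return text
        else:
            # Skip the tag in between (e.g. <br/>) and continue from the node after it.
            return loop_until(text, first_elem.next.next)
    except TypeError:
        # first_elem.string was None (not a plain text node): return what was collected.
        return text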
Call it like this:
next_elem = soup.find_all('strong')[0].next_sibling
loop_until('', next_elem)
Result:
u"\n This is a beautiful song. I love it a lot. He is the ONLY, and yes, ONLY rapper I will listen to. Because,\n \n\n (a) His songs have meaning. They're not about sex and cars and bling blingin' rims.\n \n (b) He has talent. He can actually rap. I don't think d12 is any good. =/\n \n\n Anyway. I love this song and I'm getting his new CD right now... hehe.\n \n -Sarah\n "
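As a possible alternative to the recursion, here is a minimal Python 3 sketch that walks `next_siblings` in a loop and stops at the first nested `<div>`. The helper name `parent_comment_text` and the shortened `SAMPLE_HTML` are assumptions made for illustration; the real page may need a different selector for the comment container.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical, shortened stand-in for the comment markup in the question.
SAMPLE_HTML = """
<div class="text" id="comment-1">
<strong class="title">General Comment</strong>
First line,
<br/>
<br/>
(a) Second line.
<br/>
-Sarah
<div class="sign"><a class="author">someone</a></div>
</div>
"""


def parent_comment_text(comment_div):
    """Join the text nodes that follow the title, up to the first nested <div>."""
    title = comment_div.find("strong", class_="title")
    if title is None:
        return ""
    lines = []
    for node in title.next_siblings:
        # The signature block and the replies live in nested <div>s: stop there.
        if isinstance(node, Tag) and node.name == "div":
            break
        # Keep non-empty text nodes; <br/> tags and blank strings are skipped.
        if isinstance(node, NavigableString):
            stripped = node.strip()
            if stripped:
                lines.append(stripped)
    return "\n".join(lines)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
comment = soup.find("div", class_="text")
result = parent_comment_text(comment)
print(result)
```

Stripping each text node also removes the padding whitespace visible in the result above, at the cost of dropping the blank lines between paragraphs.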