Extracting Text from an HTML Document into a List of Words
Using BeautifulSoup, I have extracted the comments on a web page from that page's HTML document. With this code I am able to print the comments:

import urllib2
from bs4 import BeautifulSoup

url = "http://songmeanings.com/songs/view/3530822107858560012/"
response = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url)
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

def loop_until(text, first_elem):
    try:
        text += first_elem.string
        if first_elem.next == first_elem.find_next('div'):
            return text
        else:
            return loop_until(text, first_elem.next.next)
    except TypeError:
        # first_elem.string was None; the function then returns None implicitly
        pass

wordList = []
for strong_tag in soup.find_all('strong'):
    next_elem = strong_tag.next_sibling
    print loop_until("", next_elem)

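Now I need to take all the words from that selection and append them to wordList. How do I do that?

Answer:

Change the last line of the loop to call append instead of printing:

    wordList.append(loop_until("", next_elem))

This adds each comment to wordList as a single string.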
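
If the goal is a flat list of individual words rather than one string per comment, each comment can be split before it is added. The following is a minimal sketch, not part of the original answer. The helper name words_from_comments, the sample comments, and the regular expression used to decide what counts as a word are all assumptions made for illustration. It also skips None entries, since loop_until returns None whenever the TypeError branch is taken.

```python
import re

def words_from_comments(comments):
    """Flatten a list of comment strings into a list of words.

    Hypothetical helper: `comments` stands in for the values returned
    by loop_until, which may include None.
    """
    word_list = []
    for comment in comments:
        if comment is None:
            # loop_until returned None for this element; nothing to split
            continue
        # Assumed definition of a word: runs of letters, digits, or apostrophes
        word_list.extend(re.findall(r"[A-Za-z0-9']+", comment))
    return word_list

# Made-up sample data in place of the scraped comments
comments = ["I love this song.", None, "It's about loss, I think."]
word_list = words_from_comments(comments)
```

Using extend instead of append is what keeps the result flat: append would add each list of words as a nested list, while extend adds the words themselves.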