python: extracting text from any website
I have got my work done so far, and it successfully extracts text from these two websites:
- http://www.tutorialspoint.com/cplusplus/index.htm
- http://www.cplusplus.com/doc/tutorial/program_structure/
But I don't know what I am doing wrong: it does not extract text from other websites, and it gives me an error when I use other links, for example:
- http://www.cmpe.boun.edu.tr/~akin/cmpe223/chap2.htm
- http://www.i-programmer.info/babbages-bag/477-trees.html
- http://www.w3schools.com/html/html_elements.asp
The error:
Traceback (most recent call last):
File "C:\Users\DELL\Desktop\python\s\fyp\data extraction.py", line 20, in
text = soup.select('.C_doc')[0].get_text()
IndexError: list index out of range
My code:
import urllib
from bs4 import BeautifulSoup
url = "http://www.i-programmer.info/babbages-bag/477-trees.html" #unsuccessful
#url = "http://www.tutorialspoint.com/cplusplus/index.htm" #works
#url = "http://www.cplusplus.com/doc/tutorial/program_structure/" #works
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style","a","<div id=\"bottom\" >"]):
    script.extract() # rip it out
# get text
#text = soup.select('.C_doc')[0].get_text()
#text = soup.select('.content')[0].get_text()
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
else:
    text = soup.select('.C_doc')[0].get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print text
fo = open('foo.txt', 'w')
fo.seek(0, 2)
line = fo.writelines( text )
fo.close()
#writing done :)
You are assuming that every website you scrape has an element with the class name content or C_doc. What if a site you scrape has neither? The fix is as follows:
text = ''
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
elif soup.select('.C_doc'):
    text = soup.select('.C_doc')[0].get_text()
if text:
    pass # put the rest of the code here
else:
    print 'text does not exist.'
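The fallback chain above generalizes naturally: try each candidate selector in order and keep the first non-empty result. A minimal, self-contained sketch of that pattern in plain Python 3 (no network or BeautifulSoup needed; `first_text` and the toy page dictionary are illustrative names, not part of the original answer):

```python
def first_text(select, selectors):
    """Return the first selector's text that matches, else None.

    `select` stands in for BeautifulSoup's soup.select: it takes a
    CSS selector string and returns a (possibly empty) list of results.
    """
    for sel in selectors:
        matches = select(sel)
        if matches:
            return matches[0]
    return None

# Toy stand-in for a parsed page that has a ".content" block but no ".C_doc".
fake_page = {".content": ["main article text"]}
result = first_text(lambda s: fake_page.get(s, []), [".content", ".C_doc"])
```

With a real page you would pass `soup.select` directly; the point is that adding a new candidate class is one list entry, not another `elif`.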
Try using:
Text = soup.findAll(text=True)
UPDATE
Here is a basic text stripper you can start from.
import urllib
from bs4 import BeautifulSoup
url = "http://www.i-programmer.info/babbages-bag/477-trees.html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for script in soup(["script", "style","a","<div id=\"bottom\" >"]):
    script.extract()
text = soup.findAll(text=True)
for p in text:
    print p
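`findAll(text=True)` returns every text node in the tree, which is why the script/style tags are extracted first. The same "walk the tree and keep only visible text" idea can be sketched with nothing but the standard library's `html.parser`, which is handy when BeautifulSoup is not installed (the `TextStripper` class and the sample markup are illustrative, not from the original answer; sketch in Python 3):

```python
from html.parser import HTMLParser

class TextStripper(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = "<html><head><style>p{}</style></head><body><p>Hello</p><script>var x=1;</script><p>World</p></body></html>"
parser = TextStripper()
parser.feed(page)
text = "\n".join(parser.chunks)
```

Unlike the raw `findAll(text=True)` list, this already drops whitespace-only nodes, so no extra filtering pass is needed.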