Extract data from a text-heavy web page using BeautifulSoup in Python
I have been trying to extract the data-rich nodes of a web page. Is there a way to extract the text from a web page?
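import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.in"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
html = BeautifulSoup(r.content, "html.parser")
print(html.title.text)  # prints the page title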
I can print the page title. Could you help me extract the text (only the text) from the page?
Thanks in advance.
Try it like this:
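import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Comment

html = urllib.request.urlopen('http://www.amazon.in').read()
soup = BeautifulSoup(html, "html.parser")
# Collect every text node in the document
texts = soup.find_all(string=True)

def visible(element):
    # Skip text that lives inside tags the browser does not render
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments; in bs4, str() of a comment drops the <!-- --> markers,
    # so a type check is used instead of a regex
    elif isinstance(element, Comment):
        return False
    return True

# filter() is lazy in Python 3, so build a list before printing
visible_texts = list(filter(visible, texts))
print(visible_texts)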
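The filter above returns a list of text nodes, many of which are only whitespace. If a single cleaned-up string is what you want, something like the following might work. This is only a sketch: `SAMPLE_HTML`, `HIDDEN_PARENTS`, and `visible_text` are names made up for illustration, and an inline HTML string stands in for the live page so the example runs offline.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used here instead of a live request
SAMPLE_HTML = """
<html><head><title>Shop</title><style>p {color: red}</style></head>
<body><!-- banner -->
<h1>Deals</h1>
<p>Fresh   offers
every day</p>
<script>var x = 1;</script>
</body></html>
"""

# Parents whose text is not shown to the reader (same list as in the answer)
HIDDEN_PARENTS = {"style", "script", "[document]", "head", "title"}

def visible_text(html):
    """Return the visible text of `html`, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for node in soup.find_all(string=True):
        # Drop comments and text inside non-rendered tags
        if isinstance(node, Comment) or node.parent.name in HIDDEN_PARENTS:
            continue
        # Collapse runs of whitespace and skip nodes that end up empty
        cleaned = " ".join(node.split())
        if cleaned:
            parts.append(cleaned)
    return "\n".join(parts)

result = visible_text(SAMPLE_HTML)
print(result)
```

For a real page you would pass the downloaded HTML to `visible_text` instead of the sample string.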
Try this:
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.in"
r = requests.get(url)
html = BeautifulSoup(r.content, "html.parser")
# get_text() returns all the text in the document as a single string
print(html.get_text())
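Depending on the Beautiful Soup version, plain `get_text()` may also return the contents of `<script>` and `<style>` tags, and it keeps the original whitespace. A possible variation is to remove those tags first and then pass `separator` and `strip` to `get_text()`. This is a sketch under assumptions: `SAMPLE_HTML` and `page_text` are illustrative names, and the inline HTML replaces the real request so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live request
SAMPLE_HTML = (
    "<html><head><title>Shop</title><script>alert('hi')</script></head>"
    "<body><h1>Deals</h1><p>Fresh <b>offers</b> today</p></body></html>"
)

def page_text(html):
    """Return the page text with script/style removed and whitespace trimmed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags whose contents are code or styling, not readable text
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # strip=True trims each text fragment; separator joins the fragments
    return soup.get_text(separator=" ", strip=True)

text = page_text(SAMPLE_HTML)
print(text)
```

Note that the page title is still part of the result here; remove the `title` tag the same way if you do not want it.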