How to extract certain parts of an HTML paragraph
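I'm new to web scraping and regular expressions, and I've run into a problem. My code produces HTML output, but I need to extract specific parts of a paragraph rather than the whole paragraph. Any help would be appreciated. Here is my code: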
import mechanize
from bs4 import BeautifulSoup
import urllib2
br = mechanize.Browser()
response = br.open("http://www.consultadni.info/index.php")
br.select_form(name="form1")
br['APE_PAT']='PATRICIO'
br['APE_MAT']='GAMARRA'
br['NOMBRES']='MARCELINA'
req=br.submit().read()
soup = BeautifulSoup(req, "lxml")
for link in soup.findAll("a"):
    sub = link.get("href")
    soup1 = BeautifulSoup(sub, "lxml")
    print soup1.find_all('p')
Output on screen:
[<p>/</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline</p>]
What I need: 30/06/1980 and 40631880
A concise way to parse the URL (Python 3):
from urllib import parse
URL = "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880"
query_parts = parse.parse_qs(parse.urlparse(URL).query)
print(query_parts["id2"][0], query_parts["dni3"][0])
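Note that the question's loop passes each href string back into BeautifulSoup, which appears to be why every link comes out wrapped in a `<p>` tag: the value is already a plain URL string and only needs URL parsing. If you want this as a reusable function, here is a minimal sketch. The name `extract_query_fields` and its `keys` parameter are my own inventions for illustration, and it assumes you only want the first value of each parameter:

```python
from urllib.parse import parse_qs, urlparse


def extract_query_fields(href, keys=("id2", "dni3")):
    """Return the first value of each key in href's query string, or None.

    Hypothetical helper for this sketch; not part of any library.
    """
    # `href or ""` guards against anchors that have no href at all (None).
    query = parse_qs(urlparse(href or "").query)
    if not all(key in query for key in keys):
        return None  # this link does not carry every requested parameter
    return tuple(query[key][0] for key in keys)


# Two of the href values shown in the question's output.
hrefs = [
    "/",
    "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568"
    "&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880",
]
for href in hrefs:
    print(extract_query_fields(href))
```

Returning `None` for links without the parameters lets the caller skip navigation links such as `/` without a separate check.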
For Python 2.7, try this approach:
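from urlparse import parse_qs, urlparse

result = set()
# href=True skips anchors that have no href attribute
for link in soup.find_all("a", href=True):
    # parse only the query string, not the whole URL
    sub = parse_qs(urlparse(link["href"]).query)
    if "id2" in sub and "dni3" in sub:
        result.add((sub["id2"][0], sub["dni3"][0]))
print result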
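For anyone on Python 3, the same loop can be sketched with only the standard library. This is an illustration under assumptions, not a drop-in replacement: `SAMPLE_HTML` is a made-up stand-in for the page returned by the form submission, and `LinkCollector` is a hypothetical helper class that plays the role of `soup.find_all("a")`:

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Assumed sample markup; the real page layout is not shown in the question.
DETAIL_HREF = (
    "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&amp;dni1=40772568"
    "&amp;dni2=12405868&amp;id1=12a40a58a68&amp;id2=30/06/1980&amp;dni3=40631880"
)
SAMPLE_HTML = '<a href="/">Home</a>' + 2 * '<a href="{0}">Details</a>'.format(DETAIL_HREF)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

result = set()
for href in collector.hrefs:
    query = parse_qs(urlparse(href).query)
    if "id2" in query and "dni3" in query:
        result.add((query["id2"][0], query["dni3"][0]))

print(result)
```

The set is there for the same reason as in the Python 2.7 version: the details link appears more than once on the page, so duplicates collapse into a single pair.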