Web crawler to extract in between the list
I am writing a web crawler in Python, and I want to grab everything between the <li> </li> tags. For example:
<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>
From that I want to:
a.) extract the date and convert it to dd/mm/yyyy format
b.) extract the number of people
from bs4 import BeautifulSoup

soup = BeautifulSoup(page1)
h2 = soup.find_all("li")
count = 0
while count < len(h2):
    print(str(h2[count].get_text().encode('ascii', 'ignore')))
    count += 1
So far I have only been able to extract the text.
Get the .text, split the string on the first occurrence of :, convert the date string to a datetime using strptime() with the existing %B %d, %Y format, then format it back to a string using strftime() with the desired %d/%m/%Y format, and extract the number with the At least (\d+) regular expression, where (\d+) is a capturing group matching one or more digits:
from datetime import datetime
import re
from bs4 import BeautifulSoup

data = '<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>'
soup = BeautifulSoup(data, 'html.parser')

# split "January 13, 1991" from the rest of the text on the first ':'
date_string, rest = soup.li.text.split(':', 1)

# parse the date and reformat it as dd/mm/yyyy
print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))

# capture the digits that follow "At least"
print(re.match(r'At least (\d+)', rest.strip()).group(1))
This prints:
13/01/1991
40
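To apply the same parsing to every <li> on the page, as in your original loop, a sketch along the following lines should work. It assumes page1 already holds the HTML of the crawled page, and it simply skips list items that don't follow the "date: At least N people" pattern:

from datetime import datetime
import re
from bs4 import BeautifulSoup

# page1 is assumed to already contain the HTML of the crawled page
soup = BeautifulSoup(page1, 'html.parser')

for li in soup.find_all('li'):
    text = li.get_text()
    if ':' not in text:
        continue  # no "date: description" separator in this item
    date_string, rest = text.split(':', 1)
    try:
        date = datetime.strptime(date_string.strip(), '%B %d, %Y')
    except ValueError:
        continue  # the part before ':' is not a "Month day, year" date
    match = re.search(r'At least (\d+)', rest)
    if match:
        print(date.strftime('%d/%m/%Y'), match.group(1))

re.search() is used instead of re.match() here so the number is still found even if a few extra words come before "At least" in a given item.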