How do I extract data from unspaced strings?
I need to extract data from four strings parsed out of BeautifulSoup. They are:
Arkansas72.21:59 AM76.29:04 AM5.22977.37:59 AM
Ashley71.93:39 AM78.78:59 AM0.53678.78:59 AM
Bradley72.64:49 AM77.28:59 AM2.41877.28:49 AM
Chicot-40.19:04 AM-40.19:04 AM2.573-40.112:09 AM
The data in the first string, for example, is Arkansas, 72.2, 1:59 AM, 76.2, 9:04 AM, 5.2, 29, 77.3, and 7:59 AM. Is there a simple way to do this?
Edit: full code
import urllib2
from bs4 import BeautifulSoup
import time

def scraper():
    # Arkansas State Plant Board Weather Web data
    url1 = 'http://170.94.200.136/weather/Inversion.aspx'
    # opens url and parses HTML into Unicode
    page1 = urllib2.urlopen(url1)
    soup1 = BeautifulSoup(page1, 'lxml')
    # print(soup1.get_text()) gives a single Unicode string of the relevant data from the url;
    # without print(), it returns everything without proper spacing
    sp1 = soup1.get_text()
    # datasp1 is the chunk with the website data in it, so the search for 'Arkansas'
    # doesn't return the header; the rest finds the substrings for the first four stations
    start1 = sp1.find('Today')
    end1 = sp1.find('new Sys.')
    datasp1 = sp1[start1:end1 - 10]
    startArkansas = datasp1.find('Arkansas')
    startAshley = datasp1.find('Ashley')
    dataArkansas = datasp1[startArkansas:startAshley - 2]
    startBradley = datasp1.find('Bradley')
    dataAshley = datasp1[startAshley:startBradley - 2]
    startChicot = datasp1.find('Chicot')
    dataBradley = datasp1[startBradley:startChicot - 2]
    startCleveland = datasp1.find('Cleveland')
    dataChicot = datasp1[startChicot:startCleveland - 2]
    print(dataArkansas)
    print(dataAshley)
    print(dataBradley)
    print(dataChicot)
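If you do end up working from the unspaced strings themselves, the fields follow a fixed order (name, low temp, time of low, high temp, time of high, wind speed, wind direction, current temp, current time), so a single regular expression can split them. This is only a sketch: it assumes temperatures always have two or three digits before the decimal point, which is what lets a run like `2977.3` disambiguate into `29` and `77.3`:

```python
import re

# Field order: name, low temp, time of low, high temp, time of high,
# wind speed, wind direction, current temp, current time.
PATTERN = re.compile(
    r'^([A-Za-z ]+?)'          # station name
    r'(-?\d{2,3}\.\d)'         # low temp (assumes 2-3 integer digits)
    r'(\d{1,2}:\d{2} [AP]M)'   # time of low
    r'(-?\d{2,3}\.\d)'         # high temp
    r'(\d{1,2}:\d{2} [AP]M)'   # time of high
    r'(\d+\.\d)'               # wind speed
    r'(\d+)'                   # wind direction
    r'(-?\d{2,3}\.\d)'         # current temp
    r'(\d{1,2}:\d{2} [AP]M)$'  # current time
)

fields = PATTERN.match('Arkansas72.21:59 AM76.29:04 AM5.22977.37:59 AM').groups()
print(fields)
# ('Arkansas', '72.2', '1:59 AM', '76.2', '9:04 AM', '5.2', '29', '77.3', '7:59 AM')
```

The same pattern handles the negative readings in the Chicot row, since the temperature groups allow a leading minus sign.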
Just to improve the way you extract the table data: I would use pandas.read_html() to read it into a dataframe, which I'm sure will be convenient to work with:
import pandas as pd
df = pd.read_html("http://170.94.200.136/weather/Inversion.aspx", attrs={"id": "MainContent_GridView1"})[0]
print(df)
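Once `read_html()` returns the DataFrame, it is easy to reshape into a per-station dict. A sketch with stand-in rows (the column names here are assumptions matching the page's headers):

```python
import pandas as pd

# Stand-in rows mimicking the scraped table (values copied from the question).
df = pd.DataFrame(
    {
        "Station": ["Arkansas", "Ashley"],
        "Low Temp (°F)": [72.2, 71.9],
        "Time Of Low": ["1:59 AM", "3:39 AM"],
    }
)

# One dict per station, keyed by station name.
records = df.set_index("Station").to_dict(orient="index")
print(records["Arkansas"]["Low Temp (°F)"])
# 72.2
```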
You need to use BeautifulSoup to parse the HTML page and retrieve your data:
from urllib2 import urlopen
from bs4 import BeautifulSoup

url1 = 'http://170.94.200.136/weather/Inversion.aspx'
# opens url and parses the HTML
page1 = urlopen(url1)
soup1 = BeautifulSoup(page1, 'lxml')
# get the table
table = soup1.find(id='MainContent_GridView1')
# find the headers
headers = [h.get_text() for h in table.find_all('th')]
# retrieve data: one dict per row, keyed by the first cell (the station name)
data = {}
tr_elems = table.find_all('tr')
for tr in tr_elems:
    tr_content = [td.get_text() for td in tr.find_all('td')]
    if tr_content:
        data[tr_content[0]] = dict(zip(headers[1:], tr_content[1:]))
print(data)
This example will display:
{
"Greene West": {
"Low Temp (\u00b0F)": "67.7",
"Time Of High": "10:19 AM",
"Wind Speed (MPH)": "0.6",
"High Temp (\u00b0F)": "83.2",
"Wind Dir (\u00b0)": "20",
"Time Of Low": "6:04 AM",
"Current Time": "10:19 AM",
"Current Temp (\u00b0F)": "83.2"
},
"Cleveland": {
"Low Temp (\u00b0F)": "70.8",
"Time Of High": "10:14 AM",
"Wind Speed (MPH)": "1.9",
[.....]
}
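The same header/row zipping works on any snapshot of the markup, so it can be tried without hitting the network. A self-contained sketch on an inline snippet (the table id and values are stand-ins copied from the answers above):

```python
from bs4 import BeautifulSoup

# Inline snippet standing in for the live page.
html = """
<table id="MainContent_GridView1">
  <tr><th>Station</th><th>Low Temp (°F)</th><th>Time Of Low</th></tr>
  <tr><td>Arkansas</td><td>72.2</td><td>1:59 AM</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="MainContent_GridView1")
headers = [th.get_text() for th in table.find_all("th")]
data = {}
for tr in table.find_all("tr"):
    cells = [td.get_text() for td in tr.find_all("td")]
    if cells:  # skips the header row, which has no <td> cells
        data[cells[0]] = dict(zip(headers[1:], cells[1:]))
print(data)
# {'Arkansas': {'Low Temp (°F)': '72.2', 'Time Of Low': '1:59 AM'}}
```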