Parse changing text in an HTML tag with BeautifulSoup (Python)
I am trying to scrape the numbers and dates from a graph on Zillow.
The URL is: https://www.zillow.com/austin-tx/home-values/
The HTML area I am working with is:
<ul class="legend-entries" id="yui_3_18_1_1_1607476788112_1009">
<li class="legend-value">Oct 2021</li>
<li class="legend-entry legend-entry-0" id="yui_3_18_1_1_1607476788112_1330">Austin $464K</li>
<li class="hide legend-entry legend-entry-1"></li>
<li class="hide legend-entry legend-entry-2"></li>
<li class="hide legend-entry legend-entry-3"></li>
<li class="hide legend-entry legend-entry-4"></li>
<li class="hide legend-entry legend-entry-5"></li>
<li class="hide legend-entry legend-entry-6"></li>
</ul>
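I am trying to parse the legend-value (Oct 2021) and legend-entry ($464K) text. However, these values only appear when you hover over a point on the graph, and the values in the HTML change whenever the mouse moves.
This is my code so far: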
import requests
from bs4 import BeautifulSoup

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

all_data = []
url = 'https://www.zillow.com/austin-tx/home-values/'

# `s` was not defined in the original snippet; a requests session is assumed here.
s = requests.Session()
r = s.get(url, headers=req_headers)
soup = BeautifulSoup(r.content, 'html.parser')

# soup.find(class_='legend-entries')
# Collect the text of every <li> inside every <ul> on the page.
for ul in soup.find_all('ul'):
    lis = ul.find_all('li')
    for elem in lis:
        all_data.append(elem.text.strip())
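I feel like this should work, but it returns nothing. The commented-out line in my code at least returns the legend-entries tag. I'm not sure how to get the text out of it.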
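The graph comes from an API call, so the legend values are most likely filled in by JavaScript in the browser and are not part of the HTML that requests downloads. You can call that API yourself and rebuild the data.
Here's how: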
from datetime import datetime

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    "X-Requested-With": "XMLHttpRequest",
}
api_url = "https://www.zillow.com/ajax/homevalues/data/timeseries.json?r=10221&m=zhvi_plus_forecast&dt=111"

# The top-level key of the JSON response mirrors the r, m and dt query parameters.
graph = requests.get(api_url, headers=headers).json()
time_ = graph["10221;zhvi_plus_forecast;111"]["data"]

for moment in time_:
    # "x" is a Unix timestamp in milliseconds; fromtimestamp() uses the local time zone.
    date = datetime.fromtimestamp(moment["x"] // 1000).date()
    value = moment["y"]
    print(f"{date} - ${value}")
Output:
2010-12-31 - 4771
2011-01-31 - 4297
2011-02-28 - 3623
2011-03-31 - 3053
2011-04-30 - 2571
2011-05-31 - 1931
2011-06-30 - 1322
2011-07-31 - 0837
2011-08-31 - 1413
2011-09-30 - 2088
2011-10-31 - 2520
2011-11-30 - 2665
2011-12-31 - 2788
2012-01-31 - 3433
2012-02-29 - 4288
2012-03-31 - 5461
and so on ...
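The date conversion above depends on the machine's local time zone, so the same timestamp can print as a different day on another computer. A minimal sketch of a time-zone-independent variant follows. The sample points are made up for illustration and only mimic the shape of the "data" list; `to_utc_date` and `format_point` are hypothetical helper names, not part of any Zillow API.

```python
from datetime import datetime, timezone

# Hypothetical sample in the same shape as graph[...]["data"]:
# "x" is a Unix timestamp in milliseconds, "y" is the value.
sample_points = [
    {"x": 1293753600000, "y": 100000},
    {"x": 1296432000000, "y": 101500},
]


def to_utc_date(ms):
    """Convert a millisecond Unix timestamp to a calendar date in UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()


def format_point(point):
    """Render one data point as 'YYYY-MM-DD - $value' with thousands separators."""
    return f"{to_utc_date(point['x'])} - ${point['y']:,}"


for point in sample_points:
    print(format_point(point))
```

To use it with the real response, pass each element of `time_` to `format_point` instead of the sample list.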
Or, you can plot it and have your own graph (who says you can't, right?).
from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    "X-Requested-With": "XMLHttpRequest",
}
api_url = "https://www.zillow.com/ajax/homevalues/data/timeseries.json?r=10221&m=zhvi_plus_forecast&dt=111"

graph = requests.get(api_url, headers=headers).json()
# One row per data point, with columns "x" (millisecond timestamp) and "y" (value).
df = pd.DataFrame(graph["10221;zhvi_plus_forecast;111"]["data"])

plt.figure(1)
# Dates on the x-axis, values on the y-axis.
plt.plot(df['x'].apply(lambda x: datetime.fromtimestamp(x // 1000).date()), df['y'])
plt.show()
Output: [image: line plot of the values over time]
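In both snippets the URL and the response key repeat the same three values (10221, zhvi_plus_forecast, 111). Below is a small sketch that builds both from one set of arguments so they cannot drift apart. It assumes the key is always the `r`, `m` and `dt` values joined by semicolons, which is inferred from this single example and not from any documentation. `build_request` and its parameter names are hypothetical, and what `r`, `m` and `dt` actually mean on Zillow's side is a guess.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.zillow.com/ajax/homevalues/data/timeseries.json"


def build_request(region_id, metric, dt):
    """Return (url, response_key) for one r/m/dt combination.

    Assumption: the response key is the three parameter values joined
    by ';', as in the single example shown in the answer.
    """
    query = urlencode({"r": region_id, "m": metric, "dt": dt})
    key = f"{region_id};{metric};{dt}"
    return f"{BASE_URL}?{query}", key


url, key = build_request(10221, "zhvi_plus_forecast", 111)
print(url)
print(key)
```

With this in place, the lookup would read `graph[key]["data"]` after fetching `url`, provided the assumed key pattern holds for the values you pass in.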