Using Python & lxml to web scrape Strava
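I want to get club activities from Strava. I initially considered using the API and C# (because that's what I know), but the API doesn't provide enough information, so I turned to the technique described here (https://twitter.com/OleksMaistrenko/status/1252251408495190018). It's a great resource and got me 90% of the way there. I'm now trying to pull more information out of the HTML, and as a complete Python/lxml newbie I don't know how.
So, to get the activity pace, this HTML: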
<li title="Pace">
"7:46"
<abbr class="unit" title="minutes per mile"> /mi</abbr>
</li>
is scraped with the following code:
activity_pace = activity.xpath(".//li[@title='Pace']")[0].text.strip()
Q1. How do I scrape this HTML to get the activity duration?
<li title="Time">
"56"
<abbr class="unit" title="minute">m</abbr>
" 26"
<abbr class="unit" title="second">s</abbr>
</li>
I tried this, but it only gets the minutes:
activity_time = activity.xpath(".//li[@title='Time']")[0].text
Q2. I'd like to get the activity title ('Morning Run' in this case). Here's the HTML:
<h3 class="entry-title activity-title" str-on="click" str-trackable-id="ChQIBTIQCIGRyLgMGAEwLDgAQABIARIECgIIBA==" str-type="self">
<div class="entry-type-icon"><span class="app-icon-wrapper "><span class="app-icon icon-run icon-dark
icon-lg"></span></span></div>
<strong>
<a href="/activities/3339847809">Morning Run</a>
</strong>
</h3>
I've worked out that I can get this block with:
activity.xpath(".//h3[@class='entry-title activity-title']")[0]
but after that I'm stumped :-(
It's not very elegant, but it can be done like this. Suppose your HTML looks like this:
activity = """
<doc>
<h3 class="entry-title activity-title" str-on="click" str-trackable-id="ChQIBTIQCIGRyLgMGAEwLDgAQABIARIECgIIBA==" str-type="self">
<div class="entry-type-icon"><span class="app-icon-wrapper "><span class="app-icon icon-run icon-dark
icon-lg"></span></span></div>
<strong>
<a href="/activities/3339847809">Morning Run</a>
</strong>
</h3>
<li title="Time">
"56"
<abbr class="unit" title="minute">m</abbr>
" 26"
<abbr class="unit" title="second">s</abbr>
</li>
</doc>"""
import lxml.html

doc = lxml.html.fromstring(activity)
sports = doc.xpath("//h3[@class='entry-title activity-title']//a/text()")
duration = doc.xpath('//li[@title="Time"]')
abbrs = doc.xpath('//abbr[@class="unit"]')
# Blank out the unit labels ("m", "s") so only the numbers remain
for abbr in abbrs:
    abbr.text = ''
for sport in sports:
    print(sport)
# Drop whitespace, then turn the adjacent quote pair between the numbers into ':'
for d in duration:
    print(d.text_content().strip().replace('\n', '').replace(' ', '').replace('""', ':'))
Output:
Morning Run
"56:26"