XPath doesn't return <table> content within <a> (<tbody> is not the issue)
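The following code uses Scrapy + scrapy-splash + Python.
I am trying to extract the upcoming matches (including team names, tournament name, and start time) from this site: https://www.hltv.org/matches
My code in the 'parse' callback is: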
match_days = response.xpath("//div[@class = 'upcoming-matches']//div[@class = 'match-day']")
for match in match_days.xpath("./a"):
    print(match.extract())
    # tournament_name = match.xpath(".//td[@class='event']//span[@class='event-name']/text()").extract_first()
    # team1_name = match.xpath(".//td[@class='team-cell'][1]//div[@class='team']/text()").extract_first()
It should give me the content of each <a> element, i.e. it should look like this, for example:
<a href="/matches/2318355/dkiss-vs-psychoactive-prowince-winner-winner-of-the-future-2017" class="a-reset block upcoming-match standard-box" data-zonedgrouping-entry-unix="1514028600000">
  <table class="table">
    <tbody>
      <tr>
        <td class="time">
          <div class="time" data-time-format="HH:mm" data-unix="1514028600000">12:30</div>
        </td>
        <td class="team-cell">
          <div class="line-align">
            <img alt="DKISS" src="https://static.hltv.org/images/team/logo/8657" class="logo" title="DKISS">
            <div class="team">DKISS</div>
          </div>
        </td>
        <td class="vs">vs</td>
        <td class="team-cell">
          <div class="team">PSYCHOACTIVE/proWince winner</div>
        </td>
        <td class="event"><img alt="Winner of the Future 2017" src="https://static.hltv.org/images/eventLogos/3464.png" class="event-logo" title="Winner of the Future 2017"><span class="event-name">Winner of the Future 2017</span></td>
        <td class="star-cell">
          <div class="map-text">bo3</div>
        </td>
      </tr>
    </tbody>
  </table>
</a>
But I only get this for each <a>:
<a href="/matches/2318355/dkiss-vs-psychoactive-prowince-winner-winner-of-the-future-2017" class="a-reset block upcoming-match standard-box" data-zonedgrouping-entry-unix="1514028600000">
</a>
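I tried it in the Scrapy shell too, with the same result.
I also checked with Chrome developer tools, where I can see the full content of each <a> in its innerHTML property.
I don't think the problem is <tbody>. I know it is sometimes omitted from the source and added by the web browser, but when I print the HTML of the page returned in "response", that part is there. (By the way, I use scrapy-splash with a Lua script to make a POST request and get the HTML page.)
Does anyone know why this happens? I have been working on this problem for the past few days without finding an answer, and I don't know what else to test to figure out why it happens when it shouldn't.
Thanks.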
Using CSS selectors is easier for me.
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.hltv.org/matches']

    def parse(self, response):
        print('url:', response.url)

        # each .match-day block holds a date headline and that day's matches
        days = response.css('.match-day')
        for day in days:
            date = day.css('.standard-headline::text').extract_first()
            print('date:', date)

            # look for tables relative to the day, not relative to <a>
            tables = day.css('table')
            for table in tables:
                time = table.css('div.time::text').extract_first()
                teams = table.css('.team::text').extract()
                event = table.css('.event-name::text').extract_first()
                placeholder = table.css('.placeholder-text-cell::text').extract_first()

                print(' time:', time)
                if len(teams) >= 2:
                    team1, team2 = teams[0], teams[1]
                    print(' teams 1:', team1)
                    print(' teams 2:', team2)
                    print(' event:', event)
                else:
                    # rows without two known teams only carry placeholder text
                    team1 = team2 = None
                    print(' placeholder:', placeholder)

                # yield an item so the feed export below has rows to write
                yield {'date': date, 'time': time, 'team1': team1, 'team2': team2,
                       'event': event, 'placeholder': placeholder}


# --- it runs without project and saves in `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file as CSV, JSON or XML
    # (newer Scrapy versions prefer the FEEDS setting over these two keys)
    'FEED_FORMAT': 'csv',  # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
Result:
url: https://www.hltv.org/matches
date: 2017-12-24
time: 03:00
teams 1: NSPR
teams 2: MiTH
event: WESG 2017 Thailand LAN
time: 06:00
teams 1: Signature
teams 2: Beyond
event: WESG 2017 Thailand LAN
time: 09:00
placeholder: WESG Thailand - Grand Final
time: 10:00
teams 1: DKISS
teams 2: Izako Boars
event: Winner of the Future 2017
date: 2017-12-26
time: 14:00
teams 1: Recca
teams 2: Signature
event: GOTV.GG Invitational #1
time: 17:00
teams 1: AGO
teams 2: Vega Squadron
event: LOOT.BET Cup 2
time: 20:00
teams 1: mousesports
teams 2: Spirit
event: LOOT.BET Cup 2
date: 2017-12-27
time: 12:00
teams 1: Singularity
teams 2: GoodJob
event: CSesport.com XMAS Cup
time: 13:00
placeholder: GOTV.GG - Semi-Final #1
time: 15:00
placeholder: GOTV.GG - Semi-Final #2
time: 15:00
teams 1: MANS NOT HOT
teams 2: VenatoreS
event: CSesport.com XMAS Cup
time: 17:00
teams 1: Heroic
teams 2: Valiance
event: LOOT.BET Cup 2
date: 2017-12-28
time: 13:00
placeholder: GOTV.GG - 3rd place decider
time: 15:00
placeholder: GOTV.GG - Grand Final
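The times printed above are the page's display text. The sample row in the question also carries a `data-unix` attribute, which looks like a Unix timestamp in milliseconds. Assuming that reading is right, a small helper can turn it into a timezone-aware datetime. This is only a sketch: the name `unix_ms_to_utc` is made up here, and in the spider the raw value could probably be read with something like `table.css('div.time::attr(data-unix)').extract_first()`.

```python
from datetime import datetime, timezone


def unix_ms_to_utc(value):
    """Convert a millisecond Unix timestamp (str or int) to an aware UTC datetime.

    Returns None when the value is missing or is not a number.
    Assumption: data-unix holds milliseconds since the epoch.
    """
    if value is None:
        return None
    try:
        millis = int(value)
    except (TypeError, ValueError):
        return None
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == '__main__':
    # value copied from the sample row in the question
    start = unix_ms_to_utc('1514028600000')
    print(start.isoformat())  # 2017-12-23T11:30:00+00:00
```

For the sample row this gives 11:30 UTC, one hour behind the 12:30 shown in the markup, which would fit a page rendered in a UTC+1 time zone.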