How to collect data into single item from multiple urls with scrapy python
In short, I want to collect the return value from the callback each time through the for loop, and then yield a single item once the loop is exhausted.
What I am trying to do is the following.
I am creating new links that correspond to clicking the tabs on https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/,
for example:
https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#ah;2
https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#over-under;2
and so on.
They are all data for the same match, so I am trying to collect the betting information into a single item.
Basically, I use a for loop over a dict to build each new link and yield a request with a callback function:
import re
import time
import urllib.parse
from collections import OrderedDict

import fake_useragent
import requests
import scrapy


class CountryLinksSpider(scrapy.Spider):
    name = 'country_links'
    allowed_domains = ['oddsportal.com']
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.create_all_tabs_links_from_url)

    def create_all_tabs_links_from_url(self, response):
        current_url = response.request.url
        _other_useful_scrape_data_dict = OrderedDict(
            [('time', '19:00'), ('day', '14'), ('month', 'May'), ('year', '22'), ('Country', 'Africa'),
             ('League', 'CAF Champions'), ('Home', 'ES Setif'), ('Away', 'Al Ahly'), ('FT1', '2'), ('FT2', '2'),
             ('FT', 'FT'), ('1H H', '1'), ('1H A', '1'), ('1HHA', 'D'), ('2H H', '1'), ('2H A', 1), ('2HHA', 'D')])
        with requests.Session() as s:
            s.headers = {
                "accept": "*/*",
                "accept-encoding": "gzip, deflate, br",
                "accept-language": "en-US,en;q=0.9,pl;q=0.8",
                "referer": 'https://www.oddsportal.com',
                "user-agent": fake_useragent.UserAgent().random
            }
            r = s.get(current_url)
        version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
        sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
        xeid = re.search(r'"id":"(.*?)"', r.text).group(1)
        xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))
        unix = int(round(time.time() * 1000))
        tabs_dict = {'#ah;2': ['5-2', 'AH full time', ['1', '2']], '#ah;3': ['5-3', 'AH 1st half', ['1', '2']],
                     '#ah;4': ['5-4', 'AH 2nd half', ['1', '2']], '#dnb;2': ['6-2', 'DNB full_time', ['1', '2']]}
        all_tabs_data = OrderedDict()
        all_tabs_data = all_tabs_data | _other_useful_scrape_data_dict
        for key, value in tabs_dict.items():
            api_url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}'
            # go to each main tab, get data from it and yield here
            single_tab_scrape_data = yield scrapy.http.Request(api_url,
                                                               callback=self.scrape_single_tab)
            # the following is what I want to do (collect all the data from all tabs into a single item)
            # all_tabs_data = all_tabs_data | single_tab_scrape_data  # (as a dict)
            # yield all_tabs_data  # yield a single dict with the scraped data from all the tabs

    def scrape_single_tab(self, response):
        # sample scraped data from the response
        scrape_dict = OrderedDict([('AH full time -0.25 closing 2', 1.59), ('AH full time -0.25 closing 1', 2.3),
                                   ('AH full time -0.25 opening 2', 1.69), ('AH full time -0.25 opening 1', 2.12),
                                   ('AH full time -0.50 opening 1', ''), ('AH full time -0.50 opening 2', '')])
        yield scrape_dict
What I have tried:
First, I tried simply returning the scraped item from the scrape_match_data function, but I could not find a way to get the return value of the callback back from the yielded request.
I also tried the following libraries:
from inline_requests import inline_requests
from twisted.internet.defer import inlineCallbacks
but I could not get them to work. I feel there must be a simpler way to merge the items scraped from the different links into one item and yield it.
Please help me solve this.
Technically, in Scrapy we have two ways to transfer data between the callbacks that we use to construct an item from multiple requests:
1. The Request meta dict:
def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        meta={'scraped_item_data': data})

def parse_details(self, response):
    scraped_data = response.meta.get('scraped_item_data')  # <- not present in your code
    ...
You probably missed calling response.meta.get('_scrape_dict') to access the data scraped in the previous callback.
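Applied to the spider from the question, it could look roughly like this. This is only a sketch that keeps the variable names from the question; the '...' stands for the scraping already done in that callback, and the 'tab' key is just illustrative:

def create_all_tabs_links_from_url(self, response):
    ...  # scrape all_tabs_data and build tabs_dict / api_url exactly as in the question
    for key, value in tabs_dict.items():
        api_url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}'
        yield scrapy.Request(
            api_url,
            callback=self.scrape_single_tab,
            meta={'all_tabs_data': dict(all_tabs_data), 'tab_name': value[1]})

def scrape_single_tab(self, response):
    # the data scraped in the previous callback travels with the response
    all_tabs_data = response.meta['all_tabs_data']
    all_tabs_data['tab'] = response.meta['tab_name']
    yield all_tabs_data  # note: this still yields one item per tab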
2. cb_kwargs, available in Scrapy 1.7 and newer:
def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        cb_kwargs={'scraped_item_data': data})

def parse_details(self, response, scraped_item_data):  # <- data from the previous request is directly accessible
    ...
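The same idea with cb_kwargs, again as a sketch using the names from the question; the scraped values arrive as plain keyword arguments instead of going through response.meta:

def create_all_tabs_links_from_url(self, response):
    ...  # scrape all_tabs_data and build tabs_dict / api_url exactly as in the question
    for key, value in tabs_dict.items():
        api_url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}'
        yield scrapy.Request(
            api_url,
            callback=self.scrape_single_tab,
            cb_kwargs={'all_tabs_data': dict(all_tabs_data), 'tab_name': value[1]})

def scrape_single_tab(self, response, all_tabs_data, tab_name):
    # all_tabs_data and tab_name are passed straight into the callback signature
    all_tabs_data['tab'] = tab_name
    yield all_tabs_data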
3. A single item from multiple responses of the same type.
The easiest way to implement this is to assign the data to a class variable.
The code will look roughly like this:
def parse(self, response):
    self.tabs_data = []
    ...
    self.tabs_number = len(tabs)  # or len(list(tabs))  # <- number of tabs
    for tab in tabs:
        yield Request(...)

def parse_details(self, response):
    scraped_tab_data = ...
    self.tabs_data.append(scraped_tab_data)
    if len(self.tabs_data) == self.tabs_number:  # when data from all tabs has been collected
        # compose one big item
        ...
        yield item
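One caveat with the class-variable approach: self.tabs_data is shared by every match the spider crawls, so with several start URLs running concurrently the tabs of different matches would end up in the same list. A variant that avoids this is to chain the tab requests one after another and carry the growing item in meta, yielding it only after the last tab. Below is a sketch under the same assumptions as above; parse_tab_odds is a hypothetical helper standing in for the odds parsing:

def create_all_tabs_links_from_url(self, response):
    all_tabs_data = {}  # match-level data, scraped here as in the question
    api_urls = []       # one feed URL per tab, built as in the question
    ...
    yield scrapy.Request(
        api_urls[0],
        callback=self.scrape_single_tab,
        meta={'item': all_tabs_data, 'pending': api_urls[1:]})

def scrape_single_tab(self, response):
    item = response.meta['item']
    item.update(self.parse_tab_odds(response))  # hypothetical odds-parsing helper
    pending = response.meta['pending']
    if pending:
        # more tabs left: request the next one and keep passing the item along
        yield scrapy.Request(
            pending[0],
            callback=self.scrape_single_tab,
            meta={'item': item, 'pending': pending[1:]})
    else:
        # last tab: yield the single combined item for this match
        yield item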