How to capture data from a website as key-value pairs using Python?
Output generated: 1
Code used to fetch the model names: 2
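import requests
from bs4 import BeautifulSoup

# `headers` was not defined in the original post; a browser-like User-Agent is assumed.
headers = {'User-Agent': 'Mozilla/5.0'}

test_link = 'https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt'
r = requests.get(test_link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
whole_data = soup.find('div', class_='fieldset-wrapper')
specifications = []
specifications_value = []
# Each pass overwrites the list, so labels and values end up in separate,
# unrelated variables (this is the problem being asked about).
for variable1 in whole_data.find_all('div', class_='field__label'):
    # print(variable1.text)
    variable1 = variable1.text
    specifications = list(variable1.split('\n'))
    # print(specifications)
for variable2 in whole_data.find_all('div', class_='field__item'):
    # print(variable2.text)
    variable2 = variable2.text
    specifications_value = list(variable2.split('\n'))
    # print(specifications_value)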
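Problem: I am getting the data, but in separate variables and separate for loops. How can I map these two variables to each other as key-value pairs, so that I can check a condition such as: if the key is Platform, then fetch only its value (Boxed Processor)?
I want to capture the data in such a way that if the 'key' is Platform, only its value (Boxed Processor) is captured, and likewise for all the other 14 labels.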
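You can loop over a list of the expected keys and use :-soup-contains to target the label node for each one. If the match is not None, select the adjacent value node and store its text; otherwise, store ''.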
import requests
from bs4 import BeautifulSoup as bs

links = ['https://www.amd.com/en/products/cpu/amd-ryzen-7-3800xt',
         'https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt']

all_keys = ['Platform', 'Product Family', 'Product Line', '# of CPU Cores',
            '# of Threads', 'Max. Boost Clock', 'Base Clock', 'Total L2 Cache', 'Total L3 Cache',
            'Default TDP', 'Processor Technology for CPU Cores', 'Unlocked for Overclocking', 'CPU Socket',
            'Thermal Solution (PIB)', 'Max. Operating Temperature (Tjmax)', 'Launch Date', '*OS Support']

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        specification = {}
        for key in all_keys:
            spec = soup.select_one(
                f'.field__label:-soup-contains("{key}") + .field__item, .field__label:-soup-contains("{key}") + .field__items .field__item')
            if spec is None:
                specification[key] = ''
            else:
                if key == '*OS Support':
                    specification[key] = [
                        i.text for i in spec.parent.select('.field__item')]
                else:
                    specification[key] = spec.text
        print(specification)
        print()
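For the narrower question of turning two already-collected lists into key-value pairs, a minimal sketch follows. It assumes the labels and values are in the same order and of equal length, which the selector-based approach does not require and which a multi-valued field such as '*OS Support' would break. The function name `pair_specs` and the sample data are hypothetical stand-ins for the scraped text.

```python
def pair_specs(labels, values):
    """Pair each label with the value at the same position.

    Assumes both lists are in the same order. Raises if the lengths differ,
    because a silent mismatch would attach values to the wrong labels.
    """
    if len(labels) != len(values):
        raise ValueError(
            f"{len(labels)} labels but {len(values)} values; cannot pair them")
    # Strip surrounding whitespace, which scraped text may carry.
    return {label.strip(): value.strip() for label, value in zip(labels, values)}


# Hypothetical sample data standing in for the scraped label/value text.
labels = ["\nPlatform\n", "# of CPU Cores"]
values = ["Boxed Processor", " 12 "]

specs = pair_specs(labels, values)
print(specs["Platform"])  # Boxed Processor
```

Once the data is in a dict, checking a single key is a plain lookup, for example `specs.get('Platform', '')`.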