Python BeautifulSoup: how to get data in JSON format?
I want to get my data in JSON format. Right now I'm collecting it into a dictionary, which is a bit confusing for me. Here is my code:
my_dict = {"job_title":[],"time_posted":[],"number_of_proposal":[],"page_link":[]};
for page_num in range(1, 12):
    time.sleep(3)
    url = (
        f'my_url').format(page_num)
    print(url)
    headers = requests.utils.default_headers()
    print(headers)
    headers.update(
        {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0', })
    print(headers)
    r = requests.get(url, headers=headers).text
    soup = BeautifulSoup(r, 'lxml')
    box = soup.select('.item__top_container⤍ListItem⤚3pRrO')
    for i in box:
        job_title = i.select('.item__title⤍ListItem⤚2FRMT')[0].text.lower()
        job_title = job_title.replace('opportunity', ' opportunity').replace(
            'urgent', ' urgent').strip()
        print(job_title)
        time_posted = i.select('time')[0].text.lower()
        remove_month_year = ["month", "year"]
        print(time_posted)
        proposal = i.select(
            '.item__info⤍ListItem⤚1ci50 li:nth-child(3)')[0].text.replace('Proposals', '').strip()
        keywords = ['scrap', 'data mining']
        if(any(key_words in job_title for key_words in keywords)):
            if(not any(remove_m_y in time_posted for remove_m_y in remove_month_year)):
                my_dict["job_title"].append(job_title)
                my_dict["time_posted"].append(time_posted)
                my_dict["number_of_proposal"].append(proposal)
                my_dict["page_link"].append(url)
My dictionary data looks like this:
{'job_title': ['web scraping of product reviews', 'yell web scraping in python', 'google business scraping',],'time_posted': ['6 days ago', '9 days ago', '3 days ago'], 'page_link': ['url1','url2','url3']}
My expected result would look like this:
{"job_title":"web scraping of product reviews","time_posted":"6 days ago","page_link":"url1"},{"job_title":"yell web scraping in python","time_posted":"9 days ago","page_link":"url2"}
You can create a dictionary for each entry:
# Just using x because it's shorter. This does not create a copy
x = my_dict
x = [{'job_title': x['job_title'][i], 'time_posted': x['time_posted'][i],
'page_link': x['page_link'][i]} for i in range(len(x['page_link']))]
>>> x
[{'job_title': 'web scraping of product reviews',
'page_link': 'url1',
'time_posted': '6 days ago'},
{'job_title': 'yell web scraping in python',
'page_link': 'url2',
'time_posted': '9 days ago'},
{'job_title': 'google business scraping',
'page_link': 'url3',
'time_posted': '3 days ago'}]
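The list above is still a Python object; to get actual JSON text it has to be serialized. Here is a minimal sketch, assuming the sample `my_dict` from the question (the names `records` and `json_text` are only illustrative). It pairs up the columns with `zip`, so no key is hard-coded, and then calls the standard library's `json.dumps`:

```python
import json

# Sample data in the question's "dictionary of lists" shape.
my_dict = {
    "job_title": ["web scraping of product reviews",
                  "yell web scraping in python",
                  "google business scraping"],
    "time_posted": ["6 days ago", "9 days ago", "3 days ago"],
    "page_link": ["url1", "url2", "url3"],
}

# zip(*columns) yields one tuple per row; pairing it with the keys builds each dict.
keys = list(my_dict)
records = [dict(zip(keys, row)) for row in zip(*my_dict.values())]

# json.dumps turns the list of dicts into a JSON array string (double-quoted keys).
json_text = json.dumps(records, indent=2)
print(json_text)
```

Note that `zip` stops at the shortest list, so a column with missing entries would silently truncate the result.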
You can change the structure with the following code:
my_list = []
for i in range(len(my_dict["job_title"])):
    my_list.append({
        "job_title": my_dict["job_title"][i],
        "time_posted": my_dict["time_posted"][i],
        "number_of_proposal": my_dict["number_of_proposal"][i],
        "page_link": my_dict["page_link"][i]
    })
A better approach is to build the list directly in the first loop, already in the shape you need at the end.
my_list = []  # create this once, before the page loop
for i in box:
    job_title = i.select('.item__title⤍ListItem⤚2FRMT')[0].text.lower()
    job_title = job_title.replace('opportunity', ' opportunity').replace(
        'urgent', ' urgent').strip()
    print(job_title)
    time_posted = i.select('time')[0].text.lower()
    remove_month_year = ["month", "year"]
    print(time_posted)
    proposal = i.select(
        '.item__info⤍ListItem⤚1ci50 li:nth-child(3)')[0].text.replace('Proposals', '').strip()
    keywords = ['scrap', 'data mining']
    if(any(key_words in job_title for key_words in keywords)):
        if(not any(remove_m_y in time_posted for remove_m_y in remove_month_year)):
            # use the variables defined above: proposal and url
            my_list.append({
                "job_title": job_title,
                "time_posted": time_posted,
                "number_of_proposal": proposal,
                "page_link": url
            })
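Once the list of dictionaries exists, saving it as a JSON file is a single call. The sketch below is only an illustration: the records (including the proposal counts) are made-up stand-ins for what the loop would collect, and the file name `jobs.json` and the temporary directory are assumptions, not part of the original code.

```python
import json
import os
import tempfile

# Stand-in for the list built inside the scraping loop (values are made up).
my_list = [
    {"job_title": "web scraping of product reviews", "time_posted": "6 days ago",
     "number_of_proposal": "12", "page_link": "url1"},
    {"job_title": "yell web scraping in python", "time_posted": "9 days ago",
     "number_of_proposal": "7", "page_link": "url2"},
]

# Hypothetical output location; a temp directory keeps the example self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "jobs.json")

# json.dump writes the list as a JSON array; ensure_ascii=False keeps non-ASCII text readable.
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(my_list, fh, ensure_ascii=False, indent=2)

# Read the file back to confirm it round-trips to the same data.
with open(out_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded == my_list)
```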
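I think the data structure you defined is the wrong one. From your expected result, I understand that you want:
{"job_title": "title 1", "time_posted": "6 days ago", ...}, {"job_title": "title2", ...}
So, a list of dictionaries. What you have now is a dictionary whose values are lists.
You have two options:
1.- Process your dictionary to get the structure you want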
final_list = []
for _ in range(len(my_dict["job_title"])):
    item_dict = {}
    for key in my_dict:
        # pop(0) removes the first item, so my_dict's lists are empty afterwards
        item_dict[key] = my_dict[key].pop(0)
    final_list.append(item_dict)
print(final_list)
# [{'job_title': 'web scraping of product reviews', 'time_posted': '6 days ago', 'page_link': 'url1'}, {'job_title': 'yell web scraping in python', 'time_posted': '9 days ago', 'page_link': 'url2'}, {'job_title': 'google business scraping', 'time_posted': '3 days ago', 'page_link': 'url3'}]
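2.- Build the list directly, as user jugi suggested; that is the best option. He answered while I was writing this, but I'm posting anyway because my option 1 is slightly different.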
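One thing to be aware of with option 1: `pop(0)` removes items from the lists inside `my_dict`, so the original dictionary is left with empty lists afterwards. If that matters, a non-destructive variant could look like the sketch below; the function name `columns_to_rows` is made up for illustration.

```python
def columns_to_rows(columns):
    """Return a list of row dicts from a dict of lists, without mutating the input."""
    # Use the shortest column so a ragged input cannot raise IndexError.
    n_rows = min((len(values) for values in columns.values()), default=0)
    return [{key: values[i] for key, values in columns.items()}
            for i in range(n_rows)]


my_dict = {
    "job_title": ["web scraping of product reviews",
                  "yell web scraping in python",
                  "google business scraping"],
    "time_posted": ["6 days ago", "9 days ago", "3 days ago"],
    "page_link": ["url1", "url2", "url3"],
}

final_list = columns_to_rows(my_dict)
print(final_list[0])
print(len(my_dict["job_title"]))  # still 3: the source dictionary is untouched
```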