How to grab a specific part of a URI?
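I have a problem that I don't know how to search for online. I'm at work and need to solve it as soon as possible.
I am using the following code to read a URI: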
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Accept-Encoding': '*',
'Accept': 'text/html',
'Accept-Language': '*'}
import requests
link = "http://data.europa.eu/esco/isco/C0110"
f = requests.get(link,headers=headers)
print(f.text)
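If you look at http://data.europa.eu/esco/isco/C0110, you will see that it contains a description of Commissioned armed forces officers.
I need to extract only the description.
The response is several thousand lines long, but the part I want is: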
<h2>Description</h2>
<pre>Commissioned armed forces officers provide leadership and management to organizational units in the armed forces and/or perform similar tasks to those performed in a variety of civilian occupations outside the armed forces. This group includes all members of the armed forces holding the rank of second lieutenant (or equivalent) or higher.
Is this possible? I have about 1000 of these URIs, so I can't do it by hand. I need the description part of each one.
Try this:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Accept-Encoding': '*',
'Accept': 'text/html',
'Accept-Language': '*'}
link = "http://data.europa.eu/esco/isco/C0110"
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find('pre')
text = result.text
# print whole text
#print(text)
# print just first paragraph
print(text.split('\n')[0])
Output:
Commissioned armed forces officers provide leadership and management to organizational units...
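If the first <pre> element is not the description, find the "Description" heading first and then take the next <pre> element after it.
result = soup.find("h2", string="Description")
# walk the siblings that follow the heading until the first <pre>
for tag in result.next_siblings:
    if tag.name == 'pre':
        text = tag.text
        print(text.split('\n')[0])
        break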
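Since the question mentions roughly 1000 URIs, the same lookup can be wrapped in a function and run in a loop. The following is only a minimal sketch under a few assumptions: every page uses the same `<h2>Description</h2>` followed by `<pre>` layout as the one shown above, and the names `HEADERS`, `extract_description`, and `fetch_descriptions` are made up for this example. It uses `find_next("pre")`, which looks at every element after the heading rather than only its siblings, so it is slightly more permissive than the loop above.

```python
import requests
from bs4 import BeautifulSoup

# Shortened headers for illustration; reuse the full dict from above if needed.
HEADERS = {"User-Agent": "Mozilla/5.0", "Accept": "text/html"}


def extract_description(html):
    """Return the first line of the <pre> after <h2>Description</h2>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", string="Description")
    if heading is None:
        return None
    pre = heading.find_next("pre")
    if pre is None:
        return None
    text = pre.get_text().strip()
    return text.split("\n")[0] if text else None


def fetch_descriptions(links, timeout=30):
    """Map each link to its description; failed requests map to None."""
    results = {}
    with requests.Session() as session:  # one session reuses the connection
        session.headers.update(HEADERS)
        for link in links:
            try:
                response = session.get(link, timeout=timeout)
                response.raise_for_status()
                results[link] = extract_description(response.text)
            except requests.RequestException:
                results[link] = None
    return results


# Offline check on a hand-written snippet (no network access needed).
sample = "<h2>Description</h2>\n<pre>First line.\nSecond line.</pre>"
print(extract_description(sample))  # First line.
```

To run it against the real data, pass your list of URIs, for example `fetch_descriptions(["http://data.europa.eu/esco/isco/C0110"])`, and check the result for `None` entries, which mark pages that failed to load or did not match the expected layout.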