How to grab a specific part of a URI?

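I have a problem that I don't know how to search for on the internet. I'm at work and need to solve it as soon as possible.

I am reading a URI with the following code: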

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept-Encoding': '*',
    'Accept': 'text/html',
    'Accept-Language': '*'}

link = "http://data.europa.eu/esco/isco/C0110"
f = requests.get(link, headers=headers)
print(f.text)

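If you look at http://data.europa.eu/esco/isco/C0110, you will see that it contains a description of Commissioned armed forces officers.

I only need to extract the description part.

The response has several thousand lines, but the part I want is: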

  <h2>Description</h2>
  <pre>Commissioned armed forces officers provide leadership and management to organizational units in the armed forces and/or perform similar tasks to those performed in a variety of civilian occupations outside the armed forces. This group includes all members of the armed forces holding the rank of second lieutenant (or equivalent) or higher.

Is this possible? I have 1000 such records, so I can't do it manually. I need the description part of each one.

Try:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept-Encoding': '*',
    'Accept': 'text/html',
    'Accept-Language': '*'}

link = "http://data.europa.eu/esco/isco/C0110"
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find('pre')
text = result.text
# print whole text
#print(text)
# print just first paragraph
print(text.split('\n')[0])

Output:

Commissioned armed forces officers provide leadership and management to organizational units...

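If the first <pre> element is not the description, find the <h2> element whose text is "Description" and then take the next <pre> element that follows it.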

result = soup.find("h2", string="Description")
for tag in result.next_siblings:
    if tag.name == 'pre':
        text = tag.text
        print(text.split('\n')[0])
        break
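
Since the question mentions roughly 1000 such URIs, the heading-based lookup above could be wrapped in a function and applied in a loop. The following is only a sketch under some assumptions: the names extract_description and fetch_descriptions, the reduced HEADERS dict, and the timeout value are hypothetical and not part of the original code, and it assumes every page marks the description with <h2>Description</h2> followed by a sibling <pre>, as in the snippet from the question.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: a reduced header set; the headers from the question work too.
HEADERS = {"Accept": "text/html"}


def extract_description(html):
    """Return the first line of the <pre> after <h2>Description</h2>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", string="Description")
    if heading is None:
        return None  # page has no Description heading
    pre = heading.find_next_sibling("pre")
    if pre is None:
        return None  # heading present, but no <pre> sibling after it
    # Keep only the first line, as the answer does with split('\n')[0].
    return pre.get_text().strip().split("\n")[0].strip()


def fetch_descriptions(links, timeout=10):
    """Map each link to its description; failed requests map to None."""
    results = {}
    with requests.Session() as session:  # reuse one connection pool
        session.headers.update(HEADERS)
        for link in links:
            try:
                response = session.get(link, timeout=timeout)
                response.raise_for_status()
            except requests.RequestException:
                results[link] = None
                continue
            results[link] = extract_description(response.text)
    return results
```

It could then be called as fetch_descriptions(["http://data.europa.eu/esco/isco/C0110"]), with the remaining URIs added to the list. Returning None instead of raising keeps one bad page from stopping the whole batch; whether that is the right trade-off depends on how the results are used.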