How can I dynamically split the text based on multiple sub-titles in a given text for every new text?

Question

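I have an original text that looks like this:

We are AMS. We are a global total workforce solutions firm; we enable organisations to thrive in an age of constant change by building, re-shaping and optimising workforces. Our Contingent Workforce Solutions (CWS) is one of our service offerings; we act as an extension of our clients' recruitment team and provide professional interim and temporary resources.

We are currently working with our client, Royal London.

Royal London is a financial services company with a difference. As the UK's largest mutual life, pensions and investment company, we are owned by our members and work for their benefit, not for shareholder profit. We are growing fast and have been recognised as one of the UK's top-rated workplaces.

Today, Royal London manages over £114 billion of funds, with around 3,500 employees working across six offices in the UK and Ireland. We strive to be experts in our specialist markets and have built a trusted brand, for which our teams have won many awards. Whatever team you are interested in joining and whatever role you play, we will help you make a difference.

We are looking for a Business Analyst for a 6-month contract based in London.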

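Purpose of the Role:

You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades

As the Business Analyst, you will be responsible for:

Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.

The skills, attributes and capabilities we are seeking from you include:

Strong communication both verbal and written
Strong teamworking within the scrum team and with other BAs and directly with business users
Significant asset management experience
Working knowledge of the key data sets that are used by an asset manager
Experience of Master Data Management tools, ideally IHS Markit EDM
Agile working experience
Ability to write user stories to detail the requirements that both the development team and the QA team will use
Strong SQL skills, ideally using Microsoft SQL Server
Experience of managing data interface mapping documentation
Familiarity with data modelling concepts
Project experience based on ETL and Data Warehousing advantageous
Technical (development) background advantageous
Have an asset management background.
Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable. This client will only accept workers operating via an engagement model.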

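If you are interested in applying for this position and meet the criteria outlined above, please click the link to apply and speak to one of our sourcing specialists now.

AMS is a Recruitment Process Outsourcing company and may, in the delivery of some of its services, be deemed to operate as an Employment Agency or an Employment Business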

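I used Beautiful Soup with the approach below to split and extract the text from the original HTML based on the sub-titles. Basically, the goal is to:

Separate the HTML extract by its bold text.
From this list of bold texts, extract those that are both bold and contain ":", which marks them as legitimate sub-titles.
Then find the positions of the first and last legitimate sub-titles in the bold list. This helps with splitting when other bold text without ":" appears below the text of the last sub-title.
Split on the condition that the last sub-title really is the last element of the bold text list; if it is not, split further to separate the sub-title's text from the remaining text.

The code below demonstrates this: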

import re

import requests
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent


def build_headers():
    # Use a random Chrome User-Agent string for the request
    ua = UserAgent()
    chrome_header = ua.chrome
    return {'User-Agent': chrome_header}


# Renamed the function so the variable below no longer shadows it
headers = build_headers()

r5 = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)

soup_description = BS(r5.text, 'html.parser')
j_description = soup_description.find('span', {'itemprop': 'description'})
# All bold texts, then only those containing ":" (the legitimate sub-titles)
j_description_subtitles = [j.text for j in j_description.find_all('strong')]
sub_titles_in_description = [el for el in j_description_subtitles if ":" in el]

total_length_of_sub_titles = len(sub_titles_in_description)
total_length_of_strong_tags = len(j_description_subtitles)
Position_of_first_sub_title = j_description_subtitles.index(sub_titles_in_description[0])
Position_of_last_sub_title = j_description_subtitles.index(sub_titles_in_description[-1])

# If the last sub-title is not the last strong tag, also split on the strong text that follows it.
# (An index can never equal the list length, so the comparison uses length - 1.)
if Position_of_last_sub_title != total_length_of_strong_tags - 1:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}| {j_description_subtitles[Position_of_last_sub_title+1]}', j_description.text)[1:Position_of_last_sub_title]
else:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}', j_description.text)[1:]

# Hard-coded for exactly three sub-titles; this is the manual part
final_dict_with_sub_t_n_prec_txt = {
    sub_titles_in_description[0]: text_after_sub_t[0],
    sub_titles_in_description[1]: text_after_sub_t[1],
    sub_titles_in_description[2]: text_after_sub_t[2]
}

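The problem is the text splitting based on the sub-titles. It is too manual, and the other approaches I tried did not work. How can I make this part dynamic, given that the number of sub-titles will vary in future texts?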
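
For reference, one way the regex split could be made independent of the number of sub-titles is sketched below. It is only a sketch on made-up text, not the approach from the question or the answer: the function name `split_on_bold` and the sample strings are hypothetical, and it assumes each bold text occurs once in the description. The pattern is built from the whole list of bold texts with `re.escape`, and the capturing group makes `re.split` return the matched titles alongside the text that follows them.

```python
import re


def split_on_bold(text, bold_texts):
    """Map each bold text containing ':' to the text that follows it."""
    bold_texts = [b for b in bold_texts if b]
    if not bold_texts:
        return {}
    # Escape every bold text so characters such as "(" are matched literally
    pattern = "(" + "|".join(re.escape(b) for b in bold_texts) + ")"
    # With a capturing group, re.split returns [before, title, body, title, body, ...]
    parts = re.split(pattern, text)
    pairs = zip(parts[1::2], parts[2::2])
    # Bold text without ":" still acts as a boundary but is dropped from the result
    return {title: body.strip() for title, body in pairs if ":" in title}


# Hypothetical sample text and bold texts
sample_text = (
    "Intro. Purpose of the Role: Work with data. "
    "Skills (required): SQL, Agile. Apply now Click the link."
)
sample_bold = ["Purpose of the Role:", "Skills (required):", "Apply now"]
result = split_on_bold(sample_text, sample_bold)
print(result)
```

Because the trailing bold text "Apply now" is part of the pattern, it ends the last section without needing a separate branch for the case where the last sub-title is not the last bold element.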

Answer 1

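You can simplify this, or make it more generic, by using CSS selectors on the elements. For example, p:has(strong:-soup-contains(":")) selects every <p> that contains a <strong> whose text includes ":". To get the text that belongs to each sub-title, use find_next_sibling():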

dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

Note: | is passed to get_text() as the separator so that you can split the result into list elements later. You could also replace it with a space: get_text(' ', strip=True)
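
As a small illustration of the separator choice, the sketch below runs both variants on a made-up `<ul>` fragment (the HTML is hypothetical and stands in for the element that follows a sub-title) and splits the `|`-joined string back into list items.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the element after a sub-title
html = "<ul><li>Strong SQL skills</li><li>Agile working experience</li></ul>"
ul = BeautifulSoup(html, "html.parser").ul

joined = ul.get_text("|", strip=True)  # list items joined by the separator
items = joined.split("|")              # back to one string per list item
spaced = ul.get_text(" ", strip=True)  # a single readable string instead

print(items)
print(spaced)
```

Splitting on `|` only works cleanly if the character does not occur in the text itself, so a different separator may be needed for other pages.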

Example

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)

soup = BeautifulSoup(r.text, 'html.parser')

data = dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

print(data)

Output

{'Purpose of the Role:': 'You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades',
 'As the Business Analyst, you will be responsible for:': 'Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.',
 'The skills, attributes and capabilities we are seeking from you include:': 'Strong communication both verbal and written|Strong teamworking within the scrum team and with other BAs and directly with business users|Significant asset management experience|Working knowledge of the key data sets that are used by an asset manager|Experience of Master Data Management tools, ideally IHS Markit EDM|Agile working experience|Ability to write user stories to detail the requirements that both the development team and the QA team will use|Strong SQL skills, ideally using Microsoft SQL Server|Experience of managing data interface mapping documentation|Familiarity with data modelling concepts|Project experience based on ETL and Data Warehousing advantageous|Technical (development) background advantageous|Have an asset management background.|Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.'}
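
find_next_sibling() returns only the single element after each sub-title. If a sub-title can be followed by several elements, a possible variation is to collect siblings until the next bold paragraph is reached. The sketch below is an assumption-laden illustration, not part of the original answer: the HTML is made up, `split_by_subtitles` is a hypothetical helper, and it assumes every bold paragraph marks a section boundary, so a paragraph with an inline bold word would end a section early.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the job description markup
SAMPLE_HTML = """
<span itemprop="description">
  <p>Intro paragraph.</p>
  <p><strong>Purpose of the Role:</strong></p>
  <p>Work with the data squad.</p>
  <p>Part of the project involves upgrades.</p>
  <p><strong>Skills:</strong></p>
  <ul><li>SQL</li><li>Agile</li></ul>
  <p><strong>Apply now</strong></p>
  <p>Click the link to apply.</p>
</span>
"""


def split_by_subtitles(html):
    soup = BeautifulSoup(html, "html.parser")
    selector = '[itemprop="description"] p:has(strong:-soup-contains(":"))'
    sections = {}
    for heading in soup.select(selector):
        parts = []
        for sibling in heading.find_next_siblings():
            # Stop at the next bold paragraph, whether or not it contains ":"
            if sibling.name == "p" and sibling.find("strong"):
                break
            parts.append(sibling.get_text("|", strip=True))
        sections[heading.get_text(strip=True)] = parts
    return sections


sections = split_by_subtitles(SAMPLE_HTML)
print(sections)
```

Each sub-title maps to a list with one entry per following element, and the bold text without ":" ("Apply now") ends the last section instead of leaking into it.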

Tags: python, string, split, list, python-re
