How to wrap each sentence within a tag with its custom mark colour?
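I am using Beautiful Soup and requests to load the HTML of a website (for example https://en.wikipedia.org/wiki/Elephant). I want to mimic this page, but I would like to colour the sentences inside the 'p' tags (paragraphs).
To do this, I use spaCy to break the text into sentences. For each sentence I select a colour (a probability-based colour from a binary deep-learning classifier, for those interested).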
def get_colorized_p(p):
    doc = nlp(p.text)  # p is the Beautiful Soup p tag
    string = '<p>'
    for sentence in doc.sents:
        # The prediction value is anything within 0 to 1.
        prediction = classify(sentence.text, model=model, pred_values=True)[1][1].numpy()
        # I am using a custom function to map the prediction to a hex colour.
        color = get_hexcolor(prediction)
        string += f'<mark style="background: {color};">{sentence.text} </mark> '
    string += '</p>'
    return string  # I create a new long string with the markup
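The question does not show get_hexcolor, so the following is only a hypothetical stand-in for readers who want to run the snippet. It assumes a simple linear blend between two RGB colours; the light end matches the #edf8fb seen in the rendered example, while the dark end (#006d2c) and the parameter names low and high are assumptions, not the asker's actual mapping.

```python
def get_hexcolor(prediction, low=(0xed, 0xf8, 0xfb), high=(0x00, 0x6d, 0x2c)):
    """Hypothetical stand-in: blend two RGB colours by a score in 0..1."""
    # Clamp so slightly out-of-range scores still give a valid colour.
    t = min(max(float(prediction), 0.0), 1.0)
    # Interpolate each channel separately and round to an integer 0..255.
    channels = [round(lo + (hi - lo) * t) for lo, hi in zip(low, high)]
    return '#{:02x}{:02x}{:02x}'.format(*channels)


print(get_hexcolor(0.0))  # light end of the scale
print(get_hexcolor(1.0))  # dark end of the scale
```

Calling float() on the prediction means a NumPy scalar, such as the one returned by .numpy() above, should be accepted as well.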
I create a new long string with the HTML markup inside the p tag. I now want to replace the 'old' element in the Beautiful Soup object.
I do this with a simple loop:
for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p':
        if len(element.text.split()) > 2:
            element = get_colorized_p(element)
This does not raise any error, but when I render the HTML file, it is displayed without the markup.
I am using Jupyter to quickly display the HTML file:
from IPython.display import display, HTML
display(HTML(html_file))
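But this does not work. I did verify the string returned by get_colorized_p: when I use it on a single p element and render it, it works fine. But I want to insert the string into the Beautiful Soup object.
I hope someone can shed some light on this problem. The replacement of the element inside the loop is what goes wrong. However, I do not know how to fix it.
An example of a rendered string, just in case: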
<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>
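The rendered string above is assembled by plain concatenation. One thing that approach leaves open is escaping: if a sentence contained characters such as & or <, the resulting markup could break. A minimal sketch of building the same kind of paragraph with escaped text follows; build_marked_paragraph and its (text, colour) input format are hypothetical names chosen for illustration and are not part of the question's code.

```python
from html import escape


def build_marked_paragraph(colored_sentences):
    """Build '<p>...</p>' from (sentence_text, hex_colour) pairs."""
    parts = ['<p>']
    for text, color in colored_sentences:
        # escape() turns &, < and > into entities so the text cannot
        # be mistaken for markup.
        parts.append(f'<mark style="background: {color};">{escape(text)}</mark> ')
    parts.append('</p>')
    return ''.join(parts)


markup = build_marked_paragraph([('Tusks & trunks are <useful>.', '#edf8fb')])
print(markup)
```

Collecting the pieces in a list and joining once is a common alternative to repeated += on a string.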
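element = get_colorized_p(element)
only rebinds the local loop variable, which is never used again and is overwritten on the next iteration of the for loop. You need to save the processed elements, for example by concatenating them into a string.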
html = ''
for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p' and len(element.text.split()) > 2:
        html += get_colorized_p(element)
    else:
        html += element.text
display(HTML(html))
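The rebinding point made in this answer can be seen without Beautiful Soup at all. The sketch below uses a plain list of strings as a stand-in for the parsed elements; the names items and processed are only illustrative.

```python
items = ['a', 'b', 'c']

# Assigning to the loop variable only changes what the name refers to
# during this iteration; the list itself is not modified.
for item in items:
    item = item.upper()

print(items)  # the original list is unchanged

# Collecting the results in a new container keeps them.
processed = [item.upper() for item in items]
print(processed)
```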
I like the idea and the colour scheme. I think the main issue is that you try to replace a tag with a string, whereas you should call replace_with() to bring a new flavour to your soup:
for element in tqdm_notebook(soup.find_all()):
    if element.name == 'p':
        if len(element.text.split()) > 2:
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))
Convert your soup back to a string and try to display it:
display(HTML(str(soup)))
In newer code, avoid the old syntax findAll() and use find_all() instead. For more information, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
from IPython.display import display, HTML
html = '''
<p>Elephants are the largest ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')
def get_colorized_p(element):
    ### processing and returning of result str
    return '<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>'
for element in soup.find_all():
    if element.name == 'p':
        if len(element.text.split()) > 2:
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))
display(HTML(str(soup)))
Not exactly the same, but very close to the behaviour in your question: