Beautiful Soup Can't Redact Phone Number with Parentheses
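I'm trying to redact phone numbers in an HTML file. I can identify all of the phone numbers easily enough, but I can't work out why I'm unable to replace the ones that contain parentheses. Example: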
import re
from bs4 import BeautifulSoup
text = '''<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (555) 111-1111 | myemail@gmail.com
Some Category / Some Name: 555-222-2222 | Record Number#:
</html>'''
soup = BeautifulSoup(text, 'html.parser')
def find_phone_numbers(text):
    phones = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", text)
    return phones

phones = find_phone_numbers(str(soup))
print(phones)

for i in phones:
    target = soup.find_all(text=re.compile(i, re.I))
    try:
        for v in target:
            v.replace_with(v.replace(i,'(XXX) XXX-XXXX'))
    except TypeError:
        pass
print(soup)
These are the results I get from running the above:
['(555) 555-5555', '(555) 111-1111', '555-222-2222']
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (555) 111-1111 | myemail@gmail.com
Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
</div></body></html>
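A slight change of approach: get all the li tags, then for each tag, if it contains a phone number, replace the number with your mask. I used a temporary variable (temp_text) just to make the code more readable.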
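all_li = soup.find_all('li')
for li in all_li:
    # Mask any phone number found in this tag's text
    temp_text = re.sub(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", '(XXX) XXX-XXXX', li.text)
    if temp_text:
        # replace_with() swaps the whole <li> element for the masked string
        li.replace_with(temp_text)
print(soup)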
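If the <li> markup needs to survive, one possible variation is to assign the masked text to li.string rather than replacing the tag itself. The sketch below is only an illustration: the helper name mask_li_phones, the constants PHONE_RE and MASK, and the two-item sample HTML are made up for this example, and it assumes each li holds plain text (assigning li.string would flatten any nested tags).

```python
import re
from bs4 import BeautifulSoup

# Same phone pattern as above, compiled once (the outer capture group is not needed for sub)
PHONE_RE = re.compile(r"(?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4}")
MASK = "(XXX) XXX-XXXX"

def mask_li_phones(soup):
    """Mask phone numbers inside <li> tags while keeping the tags themselves."""
    for li in soup.find_all("li"):
        original = li.get_text()
        masked = PHONE_RE.sub(MASK, original)
        if masked != original:
            # Setting .string replaces the tag's contents, not the tag
            li.string = masked
    return soup

# Hypothetical sample input for illustration
html = "<ul><li>Email: a@b.com</li><li>Phones: (555) 555-5555</li></ul>"
soup = mask_li_phones(BeautifulSoup(html, "html.parser"))
print(soup)
```

Like the loop above, this only touches text inside li tags, so numbers that sit elsewhere in the document are left alone.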
You can use .find_all(text=True) to get all of the text content from the soup and then run re.sub over it (this way you keep all the tags, including <li>):
for content in soup.find_all(text=True):
    s = re.sub(r'(\(?\d{3}\)?)([\s.-]*)(\d{3})([\s.-]*)(\d{4})', '(XXX) XXX-XXXX', content)
    content.replace_with(s)
print(soup)
Prints:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (XXX) XXX-XXXX</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (XXX) XXX-XXXX | myemail@gmail.com
Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
</div></body></html>
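As for why the loop in the question skips the parenthesized numbers, a likely explanation is that each found string is passed to re.compile() unescaped, so "(555)" is read as a capture group rather than as literal parentheses. The sketch below tries to show that on a single line of the sample text and then applies re.escape() as a minimal fix. The variable names and the one-paragraph HTML snippet are made up for illustration, and string= is used in place of text= on the assumption that a reasonably recent Beautiful Soup 4 is installed.

```python
import re
from bs4 import BeautifulSoup

text_node = "City, MO 11111 | (555) 111-1111 | myemail@gmail.com"
found = "(555) 111-1111"  # what re.findall returned for this line

# Unescaped: "(555)" is a group, so the pattern looks for "555 111-1111".
# The text has "555) 111-1111", so nothing matches.
unescaped = re.compile(found)
print(unescaped.search(text_node) is None)

# Escaped: parentheses are treated literally and the number is found.
escaped = re.compile(re.escape(found))
print(escaped.search(text_node) is not None)

# Minimal fix for the original loop, shown on a one-paragraph document
html = "<p>City, MO 11111 | (555) 111-1111 | myemail@gmail.com</p>"
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(string=escaped):
    node.replace_with(node.replace(found, "(XXX) XXX-XXXX"))
print(soup)
```

This would also account for the result in the question: 555-222-2222 contains no regex metacharacters, so it was the only number that matched and got replaced.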