Scrape text not contained in any element
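I'm using Beautiful Soup 4 to scrape a very poorly written website. I'm getting everything except the user's email address, which isn't inside any containing element that would distinguish it. Any ideas on how to scrape it? The next_sibling of the strong element skips right over it, as I expected.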
<div class="fieldset-wrapper">
<strong>
E-mail address:
</strong>
useremail@yahoo.com
<div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
<div class="field-items">
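I'm not sure this is the best way, but you can get the parent element, then iterate over its children and look at the ones that are not tags: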
from bs4 import BeautifulSoup
import bs4

html = '''
<div class="fieldset-wrapper">
<strong>
E-mail address:
</strong>
useremail@yahoo.com
<div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
<div class="field-items">
'''

def print_if_email(s):
    # Crude heuristic: treat any string containing '@' as an email address
    if '@' in s:
        print(s)

# Name the parser explicitly so the result doesn't depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# Iterate over all divs, you could narrow this down if you had more information
for div in soup.find_all('div'):
    # Iterate over the direct children of each matching div
    for c in div.children:
        # If it wasn't parsed as a tag, it may be a NavigableString
        if isinstance(c, bs4.element.NavigableString):
            # Some heuristic to identify email addresses if other non-tags exist
            print_if_email(c.strip())
Prints:
useremail@yahoo.com
Of course, the inner for loop and the if statement can be combined into:
for c in filter(lambda c: isinstance(c, bs4.element.NavigableString), div.children):
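For reuse, the same idea could be wrapped in a function that returns the matches instead of printing them. This is only a sketch: the name loose_email_strings is made up for illustration, the '@' test is the same crude heuristic as above, and it assumes Python 3 with the built-in html.parser backend.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def loose_email_strings(html):
    """Return stripped text nodes that sit directly inside a <div> and contain '@'."""
    soup = BeautifulSoup(html, 'html.parser')
    found = []
    for div in soup.find_all('div'):
        for child in div.children:
            # Keep only bare text nodes, never tags
            if isinstance(child, NavigableString):
                text = child.strip()
                if '@' in text:  # crude heuristic, not real validation
                    found.append(text)
    return found

sample = '''
<div class="fieldset-wrapper">
<strong>
E-mail address:
</strong>
useremail@yahoo.com
<div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
<div class="field-items">
'''
print(loose_email_strings(sample))
```

If you already know the wrapper's class, passing it to find_all would avoid scanning every div on the page.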
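I can't answer your question directly, since I've never used Beautiful Soup (so don't accept this answer!), but I just wanted to point out that if the pages are all very simple, another option might be to write your own parser using .split().
It's rather clunky, but worth considering if the pages are simple/predictable.
That is, if you know something about the overall layout of the page (e.g., the user's email is always the first email mentioned), you could write your own parser that looks for the bits before and after the '@' sign: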
# html = the entire document as a string
# return the entire document up to the '@' sign
bit_before_at_sign = html.split('@')[0]
# only useful if you know first email is the one you care about
# you could then cut out everything before username with something like this
b = bit_before_at_sign
# a very long string, we just want the last bit right before the @ sign
username = b.split(' ')[-1].split('\n')[-1].split('\r')[-1].split(';')[-1]
# add more if required, depending on how the html looks to you
# (I've just guessed some html elements that might precede the username)
# you could similarly parse the bit after the @ sign,
# html.split('@')[1]
# e.g., checking the first few characters of this
# against a known list of .tlds like '.com', '.co.uk', etc
# (remember some TLDs have more than one period, so don't just parse by '.')
# and combine with the username you already know
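Put together, the split-based idea might look roughly like the sketch below. The function and helper names are hypothetical, the delimiter lists are guesses that you would tune to the real pages, and it only ever looks at the first '@' in the document.

```python
# Guessed delimiters; adjust them to whatever actually surrounds the address
BEFORE_DELIMS = [' ', '\n', '\r', '\t', ';', '>']
AFTER_DELIMS = [' ', '\n', '\r', '\t', ';', '<']

def last_piece(text, delimiters):
    """Keep only what follows the last occurrence of each delimiter."""
    for d in delimiters:
        text = text.split(d)[-1]
    return text

def first_piece(text, delimiters):
    """Keep only what precedes the first occurrence of each delimiter."""
    for d in delimiters:
        text = text.split(d)[0]
    return text

def first_email_by_split(html):
    """Return the text around the first '@' as an address, or None if that fails."""
    if '@' not in html:
        return None
    before, after = html.split('@', 1)
    username = last_piece(before, BEFORE_DELIMS)
    domain = first_piece(after, AFTER_DELIMS)
    if not username or not domain:
        return None
    return username + '@' + domain

sample = '''
<div class="fieldset-wrapper">
<strong>
E-mail address:
</strong>
useremail@yahoo.com
<div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
'''
print(first_email_by_split(sample))
```

Cutting the domain at whitespace or '<' sidesteps the multi-period TLD problem mentioned in the comments, at the cost of trusting whatever text happens to follow the '@'.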
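If you want to narrow down the part of the document you focus on, feel free to also use the following.
To make sure the word 'e-mail' is also in the string you are parsing:
if 'email' in b.lower() or 'e-mail' in b.lower():
    pass  # do something...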
To check where the @ sign first appears in the document:
html.index('@')
# e.g., if you want to see how near this '@' symbol is to some other element you know about
# such as the word 'e-mail', or a particular div element or '</strong>'
To limit your email search to the 300 characters before/after another element you know about:
startfrom = html.index('</strong>')
html_i_will_search = html[startfrom:startfrom+300]
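A sketch of that windowed search as a function, assuming the address follows the closing </strong> tag as in the question's markup. The function name and the regular expression are illustrative assumptions; the pattern is deliberately loose and is not a full email grammar.

```python
import re

# Loose, illustrative pattern; it will accept some invalid addresses and miss some valid ones
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def email_after_marker(html, marker='</strong>', window=300):
    """Return the first email-like string within `window` characters after `marker`, or None."""
    start = html.find(marker)  # find() returns -1 where index() would raise ValueError
    if start == -1:
        return None
    match = EMAIL_RE.search(html[start:start + window])
    return match.group(0) if match else None

page = 'Contact support@site.com <strong>E-mail address:</strong> user@example.org <div>'
print(email_after_marker(page))
```

Because the search starts at the marker, an address that appears earlier on the page (such as a site-wide contact address) is never considered.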
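I think a few more minutes on Google could be useful; your task sounds like a pretty common one :)
Also make sure to consider the case where there are multiple email addresses on the page (e.g., so that you don't assign support@site.com to every user!)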
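One way to handle several addresses on a page is to collect all of them and drop the ones you know are site-wide. The sketch below assumes you can list those addresses up front; the function name, the default ignore list, and the loose pattern are all illustrative.

```python
import re

# Loose, illustrative pattern; not a full email grammar
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def user_emails(html, ignore=('support@site.com',)):
    """Return distinct email-like strings in document order, skipping ignored ones."""
    ignored = {addr.lower() for addr in ignore}
    result = []
    for addr in EMAIL_RE.findall(html):
        if addr.lower() not in ignored and addr not in result:
            result.append(addr)
    return result

page = ('Contact support@site.com\n'
        '<strong>E-mail address:</strong>\n'
        'useremail@yahoo.com\n'
        'Questions? support@site.com')
print(user_emails(page))
```

If more than one address survives the filter, that is a sign the page needs a closer look before you assign any of them to the user.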
Whichever approach you take, if you're in doubt, it might be worth checking your answer with email.utils.parseaddr() or someone else's regex checker. See previous question
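A minimal sketch of such a check with email.utils.parseaddr(). Note that parseaddr() is lenient and only parses the string, so the extra conditions here are my own assumptions about what counts as plausible, and the function name is hypothetical. It says nothing about whether the mailbox exists.

```python
from email.utils import parseaddr

def is_plausible_email(candidate):
    """Rough sanity check: the string parses back to itself and has a dotted domain."""
    _name, addr = parseaddr(candidate)
    if addr != candidate or '@' not in addr:
        return False
    domain = addr.split('@')[-1]
    return '.' in domain

print(is_plausible_email('useremail@yahoo.com'))
print(is_plausible_email('not an email'))
```

Running every scraped string through a check like this is a cheap way to catch cases where the parser grabbed stray markup or a label instead of an address.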