BS4: How to reduce find_all to a minimum (ignoring instead of extracting)

Question

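I need to ignore Comments and Doctype for the operations that follow, because I will be replacing some characters, and afterwards I will no longer be able to tell comments and the doctype apart from ordinary text.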

Minimal example

#!/usr/bin/env python3
import re
from bs4 import BeautifulSoup, Comment, Doctype


def is_toremove(element):
    return isinstance(element, Comment) or isinstance(element, Doctype)


def test1():
    html = \
    '''
    <!DOCTYPE html>
    word1 word2 word3 word4
    <!-- A comment -->
    '''
    soup = BeautifulSoup(html, features="html.parser")
    to_remove = soup.find_all(text=is_toremove)
    for element in to_remove:
        element.extract()

    # some operations needing soup.findAll
    for txt in soup.findAll(text=True):
        # some replace computations
        pass
    return soup
print(test1())

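The expected result is that "word1 word2 word3 word4" gets transformed by the replace computations. This works, but I don't think it is very efficient. I thought of doing something like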

for txt in soup.findAll(text=not is_toremove()):

to iterate only over the parts that would not be removed.
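The idea above can be sketched as follows. Note that `not is_toremove()` calls the function with no argument, which raises a TypeError before `find_all` runs. `find_all` needs a callable, so the negation has to be wrapped in a function of its own. The helper name `is_tokeep` and the compact HTML string are illustrative assumptions. The sketch also assumes a Beautiful Soup version that accepts `string=` (older releases spell it `text=`).

```python
from bs4 import BeautifulSoup, Comment, Doctype


def is_toremove(element):
    # Same predicate as in the question, written with a tuple of types.
    return isinstance(element, (Comment, Doctype))


def is_tokeep(element):
    # Hypothetical helper: the negation wrapped as a callable,
    # so find_all can apply it to each string it visits.
    return not is_toremove(element)


html = "<!DOCTYPE html><p>word1 word2</p><!-- A comment -->"
soup = BeautifulSoup(html, "html.parser")

# A single traversal; nothing is extracted from the tree.
kept = soup.find_all(string=is_tokeep)
print([str(s) for s in kept])
```

With this, the comment and the doctype stay in the tree and are simply skipped by the loop.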

So my questions are:

Is there some internal magic that makes calling findAll twice not inefficient, or
how can the two calls be merged into a single find_all?

I also tried checking the parent tag:

if not isinstance(txt, Doctype):

or

if txt.parent.name != "[document]":

for example. This did not change anything in my main program.

Answer 1

As said in the comments, if you only want plain NavigableString objects, you can do this:

from bs4 import BeautifulSoup, NavigableString


html = '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')

for visible_string in soup.find_all(text=is_string_only):
    print(visible_string)

Prints:

word1 word2 word3 word4
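The exact type check works because Comment and Doctype are subclasses of NavigableString. `isinstance` would therefore accept them, while `type(t) is NavigableString` does not. The same check also skips other subclasses such as CData. Below is a minimal sketch of doing the replacement in that single pass. The `"word"` to `"WORD"` substitution is only a stand-in for the unspecified replace computations, and `string=` is assumed to be available (older releases use `text=`).

```python
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString

# Both special node types inherit from NavigableString.
print(issubclass(Comment, NavigableString), issubclass(Doctype, NavigableString))

html = "<!DOCTYPE html><p>word1 word2</p><!-- A comment -->"
soup = BeautifulSoup(html, "html.parser")


def is_string_only(t):
    # Exact type match: excludes Comment, Doctype and other subclasses.
    return type(t) is NavigableString


# find_all returns a list-like ResultSet, so replacing nodes
# while looping over it does not disturb the iteration.
for s in soup.find_all(string=is_string_only):
    s.replace_with(s.replace("word", "WORD"))  # placeholder replacement

result = str(soup)
print(result)
```

The comment and the doctype are never touched, so no prior `extract()` pass is needed.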
