BeautifulSoup takes too much time for text extraction in Common Crawl data

Question

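I have to parse the HTML content in the Common Crawl dataset (warc.gz files). I decided to use the bs4 (BeautifulSoup) module, as most people suggest. Here is the code snippet that gets the text: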

from bs4 import BeautifulSoup

soup = BeautifulSoup(src, "lxml")
[x.extract() for x in soup.findAll(['script', 'style'])]
txt = soup.get_text().encode('utf8')

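Without bs4, one file is fully processed in 9 minutes (test case), but if I use bs4 to parse the text, the job takes about 4 hours. What is happening here? Is there a better solution than bs4? Note: bs4 is the package that contains many modules, such as BeautifulSoup.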

Answer 1

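The main time-consuming part here is the extraction of tags in the list comprehension. Using Python regular expressions or lxml, you can do it as shown below.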

import re

# DOTALL lets '.' span newlines; IGNORECASE also matches <SCRIPT>
script_pat = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)

# to find all script tags
script_pat.findall(src)

# do your stuff
print(script_pat.sub('', src))
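Since the question is after the text itself and also strips style tags, the regex idea can be pushed a little further. The following is only a rough sketch: the names `BLOCK_PAT`, `TAG_PAT` and `strip_markup` and the sample markup are made up for illustration, and regular expressions are not a real HTML parser, so malformed markup, comments and entities such as `&amp;` are not handled.

```python
import re

# Match whole <script>...</script> and <style>...</style> blocks,
# case-insensitively and across newlines (hypothetical pattern names)
BLOCK_PAT = re.compile(r'<(script|style)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)
# Match any remaining tag
TAG_PAT = re.compile(r'<[^>]+>')


def strip_markup(src):
    """Drop script/style blocks, then all other tags, and normalize whitespace."""
    without_blocks = BLOCK_PAT.sub(' ', src)
    text = TAG_PAT.sub(' ', without_blocks)
    return ' '.join(text.split())


sample = (
    "<p>Hello</p>\n"
    "<SCRIPT type='text/javascript'>\nvar x = 1;\n</SCRIPT>\n"
    "<style>p {color: red}</style><p>World</p>"
)
result = strip_markup(sample)
print(result)
```

This avoids building a parse tree at all, which is where it may save time, at the cost of correctness on unusual pages.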

Using lxml, you can do it like this:

from lxml import html
et = html.fromstring(src)

# remove the script elements together with their contents
# (drop_tag() would keep the script text in the document)
for x in et.xpath('//script'):
    x.drop_tree()

# do your stuff
print(html.tostring(et))
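To get plain text rather than serialized HTML, as the question's `get_text()` call does, the lxml approach can be extended roughly as follows. This is a minimal sketch: the helper name `extract_text` and the sample markup are assumptions for illustration, and it assumes `src` is a non-empty HTML string (`html.fromstring` raises an error on empty input, so real crawl records may need a guard).

```python
from lxml import html


def extract_text(src):
    """Return the text of an HTML string, skipping script/style bodies."""
    tree = html.fromstring(src)
    # xpath() returns a list, so removing nodes while looping over it is safe
    for node in tree.xpath('//script | //style'):
        # drop_tree() removes the element and its children, keeping the tail text
        node.drop_tree()
    return tree.text_content()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
text = extract_text(sample)
print(text)
```

If the result needs to be bytes, as in the question, `text.encode('utf8')` can be applied afterwards.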

python

beautifulsoup

amazon-web-services

common-crawl

bs4