RE works in pythex but doesn't work in Python
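I am doing an assignment where I need to scrape information from live websites.
For this I am using https://www.nintendo.com/games/nintendo-switch-bestsellers and need to scrape the game titles, the prices, and then the image sources. The titles work, but the prices and image sources just return empty lists, even though the same patterns return the right answers when I put them through pythex.
Here is my code: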
from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen
game_html_source = urlopen\
('https://www.nintendo.com/games/nintendo-switch-bestsellers').\
read().decode("UTF-8")
# game titles - working
game_title = findall(r'<h3 class="b3">([A-Z a-z:0-9]+)</h3>', game_html_source)
print(game_title)
# game prices - returning empty list
game_prices = findall(r'<p class="b3 row-price">($[.0-9]+)</p>', game_html_source)
print(game_prices)
# game images - returning empty list
game_images = findall(r'<img alt="[A-Z a-z:]+" src=("https://media.nintendo.com/nintendo/bin/[A-Za-z0-9-\/_]+.png")>',game_html_source)
print(game_images)
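On the symptom itself, here is a minimal sketch of why the price pattern comes back empty in Python. An unescaped $ in a Python regular expression is an end-of-string anchor, not a literal dollar sign, so ($[.0-9]+) asks for digits after the end of the input and can never match. The image pattern also insists on an https:// prefix, while the src values shown in the output further down are protocol-relative (they start with //). The sample strings below are made-up stand-ins for the page markup, not real Nintendo HTML, and the relaxed image pattern is only one possible way to loosen the original.

```python
import re

# Hypothetical one-line samples standing in for the downloaded page
sample_price = '<p class="b3 row-price">$19.99</p>'
sample_img = ('<img alt="Example Game" '
              'src="//media.nintendo.com/nintendo/bin/abc123/def456.png">')

# Original pattern: the bare $ anchors at end of string, so nothing matches
unescaped = re.findall(r'<p class="b3 row-price">($[.0-9]+)</p>', sample_price)

# Escaping the dollar sign makes it a literal character
escaped = re.findall(r'<p class="b3 row-price">(\$[.0-9]+)</p>', sample_price)

# Relaxed image pattern: optional scheme, any alt text, escaped dots
images = re.findall(
    r'<img alt="[^"]*" src="((?:https:)?//media\.nintendo\.com/nintendo/bin/[\w/-]+\.png)"',
    sample_img,
)

print(unescaped)  # []
print(escaped)    # ['$19.99']
print(images)     # ['//media.nintendo.com/nintendo/bin/abc123/def456.png']
```

This only explains the empty lists; it does not make regular expressions a robust way to read HTML.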
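Parsing HTML with regular expressions has too many pitfalls for reliable processing. BeautifulSoup and other HTML parsers work by building a complete document data structure, which you then navigate to extract the interesting bits. That approach is thorough and comprehensive, but if there is erroneous HTML anywhere in the source, even in a part you don't care about, it can defeat the parsing process. Pyparsing takes a middle approach: you define mini-parsers that match just the bits you want and skip over everything else (this also simplifies the post-parsing navigation). To cope with some of the variability in HTML styles, pyparsing provides a function, makeHTMLTags, which returns a pair of pyparsing expressions for the opening and closing tags: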
foo_start, foo_end = pp.makeHTMLTags('foo')
foo_start will match:
<foo>
<foo/>
<foo class='bar'>
<foo href=something_not_in_quotes>
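and many more variations of attributes and whitespace.
The foo_start expression (like all pyparsing expressions) returns a ParseResults object. This makes it easy to access the parts of the parsed tag: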
foo_data = foo_start.parseString("<foo img='bar.jpg'>")
print(foo_data.img)
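As a runnable sketch of the behaviour described above, the snippet below feeds the four tag variants to foo_start and reads attributes back from the ParseResults. It assumes pyparsing is installed; the variants list and the parsed dictionary are names introduced here for illustration only.

```python
import pyparsing as pp

# makeHTMLTags returns (opening-tag expression, closing-tag expression)
foo_start, foo_end = pp.makeHTMLTags("foo")

variants = [
    "<foo>",
    "<foo/>",
    "<foo class='bar'>",
    "<foo href=something_not_in_quotes>",
]

# Record the attributes of interest for each variant; .get() returns None
# when the attribute was not present in the tag.
parsed = {}
for text in variants:
    tag = foo_start.parseString(text)
    parsed[text] = {"class": tag.get("class"), "href": tag.get("href")}
    print(text, "->", parsed[text])

# Attribute-style access works for names that are valid Python identifiers
foo_data = foo_start.parseString("<foo img='bar.jpg'>")
print(foo_data.img)  # bar.jpg
```

Because class is a Python keyword, that attribute has to be read with tag["class"] or tag.get("class") rather than attribute-style access.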
For your Nintendo page scraper, see the annotated source below:
import pyparsing as pp
# define expressions to match opening and closing tags <h3>
h3, h3_end = pp.makeHTMLTags("h3")
# define a specific type of <h3> tag that has the desired 'class' attribute
h3_b3 = h3().addCondition(lambda t: t['class'] == "b3")
# similar for <p>
p, p_end = pp.makeHTMLTags("p")
p_b3_row_price = p().addCondition(lambda t: t['class'] == "b3 row-price")
# similar for <img>
img, _ = pp.makeHTMLTags("img")
img_expr = img().addCondition(lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin"))
# define expressions to capture tag body for title and price - include negative lookahead for '<' so that
# tags with embedded tags are not matched
LT = pp.Literal('<')
title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)('title') + h3_end
price_expr = p_b3_row_price + ~LT + pp.SkipTo(p_end)('price') + p_end
# compose a scanner expression by '|'ing the 3 sub-expressions into one
scanner = title_expr | price_expr | img_expr
# not shown - read web page into variable 'html'
# use searchString to search through the retrieved HTML for matches
for match in scanner.searchString(html):
    if 'title' in match:
        print("Title:", match.title)
    elif 'price' in match:
        print("Price:", match.price)
    elif 'src' in match:
        print("Img src:", match.src)
    else:
        print("???", match.dump())
The first few matches printed are:
Img src: //media.nintendo.com/nintendo/bin/SF6LoN-xgX1iT617eWfBrNcWH6RQXnSh/I_IRYaBzJ61i-3hnYt_k7hVxHtqGmM_w.png
Title: Hyrule Warriors: Definitive Edition
Price: .99
Img src: //media.nintendo.com/nintendo/bin/wcfCyAd7t2N78FkGvEwCOGzVFBNQRbhy/AvG-_d4kEvEplp0mJoUew8IAg71YQveM.png
Title: Donkey Kong Country: Tropical Freeze
Price: .99
Img src: //media.nintendo.com/nintendo/bin/QKPpE587ZIA5fUhUL4nSbH3c_PpXYojl/J_Wd79pnFLX1NQISxouLGp636sdewhMS.png
Title: Wizard of Legend
Price: .99
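To try the scanner without fetching the live page, here is a self-contained sketch that runs the same expressions over a small hand-written snippet. The html string is hypothetical markup shaped like the tags the scanner targets, not a copy of the real page. This version uses t.get("class") instead of t['class'] so that tags with no class attribute are skipped rather than raising KeyError, and it collects results in a list (found, a name introduced here) instead of printing them directly.

```python
import pyparsing as pp

# Hypothetical sample markup; the real page will differ
html = """
<img alt="Example Game" src="//media.nintendo.com/nintendo/bin/abc123/def456.png">
<h3 class="b3">Example Game: Deluxe</h3>
<p class="b3 row-price">$19.99</p>
<h3 class="other">Not a game title</h3>
<p class="b3 row-price"><span>nested markup</span></p>
"""

h3, h3_end = pp.makeHTMLTags("h3")
p, p_end = pp.makeHTMLTags("p")
img, _ = pp.makeHTMLTags("img")

# Only accept tags whose attributes identify the fields of interest
h3_b3 = h3().addCondition(lambda t: t.get("class") == "b3")
p_price = p().addCondition(lambda t: t.get("class") == "b3 row-price")
img_expr = img().addCondition(
    lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin")
)

# ~LT rejects bodies that begin with another tag
LT = pp.Literal("<")
title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)("title") + h3_end
price_expr = p_price + ~LT + pp.SkipTo(p_end)("price") + p_end
scanner = title_expr | price_expr | img_expr

found = []
for match in scanner.searchString(html):
    if "title" in match:
        found.append(("title", match.title))
    elif "price" in match:
        found.append(("price", match.price))
    elif "src" in match:
        found.append(("img", match.src))

for kind, value in found:
    print(kind, value)
```

The h3 with class "other" and the price paragraph that wraps a span are both skipped, which shows the condition and the negative lookahead doing their jobs.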