Matching text in webpage content raises an index out of range error

Question

This code raises an "index out of range" error:

import os
import re

url = "http://www.jabong.com/purys-Beige-Shirts-1059637.html"
wget_data = os.popen('wget -qO- %s'% url).read()
data = re.findall(r'c999 fs12 mt10 f-bold">(.*)<\/table',wget_data)[0]
print data

Output:

Traceback (most recent call last):
  File "variable_concat.py", line 7, in <module>
    images = re.findall(r'c999 fs12 mt10 f-bold">(.*)<\/table',wget_data)[0]
IndexError: list index out of range

The text I want is one large string inside the webpage content. How can I match it with this pattern?

r'c999 fs12 mt10 f-bold">(.*)<\/table'

Answer 1

Use the BeautifulSoup parser.

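import os
from bs4 import BeautifulSoup

url = "http://www.jabong.com/purys-Beige-Shirts-1059637.html"
wget_data = os.popen('wget -qO- %s' % url).read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(wget_data, 'html.parser')
# A multi-class string passed to class_ matches the exact class attribute value
print(soup.find('table', class_="c999 fs12 mt10 f-bold").contents)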

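If you really want to use a regular expression, you need to enable the DOTALL modifier, because by default . does not match the newline character (\n). The markup you are after presumably spans several lines, so without DOTALL the pattern matches nothing, re.findall returns an empty list, and indexing it with [0] raises the IndexError. Making the quantifier non-greedy, (.*?), also stops the capture at the first </table rather than the last one on the page.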
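To see the effect in isolation, here is a small sketch that needs no network access. The sample_html string is a made-up stand-in for the downloaded page (an assumption, not the real jabong.com markup), and it is written in Python 3 syntax.

```python
import re

# Hypothetical stand-in for the downloaded page: the table spans several lines.
sample_html = (
    '<table class="c999 fs12 mt10 f-bold">\n'
    '<tr><td>Fabric</td><td>Cotton</td></tr>\n'
    '</table>'
)

pattern = r'c999 fs12 mt10 f-bold">(.*?)</table'

# Without DOTALL, "." cannot cross the "\n" after the opening tag,
# so findall returns an empty list and [0] would raise IndexError.
without_dotall = re.findall(pattern, sample_html)

# With DOTALL, "." also matches "\n", so the capture can span lines.
with_dotall = re.findall(pattern, sample_html, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)
```

Passing re.DOTALL as a flag is equivalent to putting the inline (?s) at the start of the pattern. Applied to your script, the fix looks like this: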

import os
import re

url = "http://www.jabong.com/purys-Beige-Shirts-1059637.html"
wget_data = os.popen('wget -qO- %s'% url).read()
data = re.findall(r'(?s)c999 fs12 mt10 f-bold">(.*?)<\/table',wget_data)[0]
print data
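Note that [0] will still raise IndexError whenever the page layout changes and the pattern stops matching. A slightly more defensive sketch, in Python 3, could look like the following. The helper names fetch_html and extract_table are my own (hypothetical), and using urllib.request instead of shelling out to wget is a substitution on my part, not something the original script does.

```python
import re
from urllib.request import urlopen

# Same pattern as above, compiled once with DOTALL so "." can match "\n".
TABLE_PATTERN = re.compile(r'c999 fs12 mt10 f-bold">(.*?)</table', re.DOTALL)


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper replacing wget)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_table(html):
    """Return the captured text, or None when the pattern is not found."""
    match = TABLE_PATTERN.search(html)
    return match.group(1) if match else None


# Offline check with made-up markup; no request is sent here.
print(extract_table('<table class="c999 fs12 mt10 f-bold">\n<tr></tr>\n</table>'))
print(extract_table("<html>no table here</html>"))  # None
```

With the real page you would call extract_table(fetch_html(url)) and check the result for None before using it.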

python

regex

wget

urllib