Python: BeautifulSoup filter all 'rel' values

Question

I need to extract all the 'rel' values from an HTML file, and I'm using BeautifulSoup to do it:

    for tag in self.loadtree_parser.find_all('a'):
        print tag.get('rel')

gives me

[u'0']
[u'83']
[u'84']
[u'39']
[u'24']
[u'41']

I just want the 'pure' numbers, without the strange [u'']. How can I do that?

HTML:

<a href="#" rel="0" title="" id="Ta_0">
<a href="#" rel="83" title="" id="Ta_83">
<a href="#" rel="84" title="" id="Ta_84">
<a href="#" rel="39" title="" id="Ta_39">
<a href="#" rel="24" title="" id="Ta_24">
<a href="#" rel="41" title="" id="Ta_41">

Answer 1

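It looks like print tag.get('rel') is printing a list that contains one element. When a list is printed, each element is shown in its repr form (hence the u'' and the brackets), which is not as clean as printing a scalar value. Try extracting the string with an index before printing.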

print tag.get('rel')[0]
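
For context, BeautifulSoup treats rel as a multi-valued attribute (like class), which is why tag.get('rel') comes back as a list even when there is only one value. Here is a minimal sketch of the indexing approach, written for Python 3 and assuming the bs4 package with the built-in html.parser. The sample markup is taken from the question, with closing tags added, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (closing tags added for validity).
html = """
<a href="#" rel="0" title="" id="Ta_0"></a>
<a href="#" rel="83" title="" id="Ta_83"></a>
<a href="#" rel="84" title="" id="Ta_84"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# rel is multi-valued, so tag.get("rel") returns a list such as ['83'].
# rel=True restricts the search to <a> tags that actually have a rel
# attribute, so indexing with [0] cannot hit None.
rel_values = [tag.get("rel")[0] for tag in soup.find_all("a", rel=True)]

print(rel_values)  # ['0', '83', '84']
```

The rel=True filter is a precaution rather than something the question requires. Without it, an anchor that lacks a rel attribute would make tag.get('rel') return None, and the [0] would raise a TypeError.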

Answer 2

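In Python 2, the u next to a string literal means it is a unicode string. If you cast it with str(u'123'), you get a standard string. If you want a number, you can simply convert the value with int(u'123').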

Bear in mind, though, that what you get here is a single-element list, so what you actually need to do is:

print int(tag.get('rel')[0])

Note: in Python 3 the u prefix no longer appears, because every string is unicode by default.
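
If some tags might be missing the attribute, or might carry a non-numeric rel such as nofollow, the conversion can be wrapped in a small helper. This is only a sketch: rel_to_int is a hypothetical name, not part of BeautifulSoup, and returning None for unusable values is an assumption about what you want to happen in those cases.

```python
def rel_to_int(value):
    """Convert an attribute value (list, string, or None) to an int.

    Hypothetical helper: returns None when there is nothing usable.
    """
    if value is None:
        return None
    # Multi-valued attributes arrive as a list; take the first entry.
    if isinstance(value, (list, tuple)):
        if not value:
            return None
        value = value[0]
    try:
        return int(value)
    except ValueError:
        # Non-numeric values such as 'nofollow' end up here.
        return None


print(rel_to_int(['83']))        # 83
print(rel_to_int('24'))          # 24
print(rel_to_int(None))          # None
print(rel_to_int(['nofollow']))  # None
```

In the loop from the question, this would be called as rel_to_int(tag.get('rel')).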

