如何从 alexa.com 解析 div 标签并在 django 中的 table 中显示结果

Question

我已经使用 HTMLParser 和 urllib2 成功创建了一个网络应用程序，它从 www.alexa.com/topsites/global[=41= 获取前 20 个网站] 并将结果放在 HTML table 中。我的问题是我不能遵循相同的规则并为 <div class="count"> 和 <div class="description"> 应用相同的算法。

任何人都可以帮助我使用 BS4 而不做一些片段吗？

到目前为止我的代码：

urlparse.py

import HTMLParser, urllib class MyHTMLParser(HTMLParser.HTMLParser): site_list = [] def reset(self): HTMLParser.HTMLParser.reset(self) self.in_a = False self.next_link_text_pair = None def handle_starttag(self, tag, attrs): if tag=='a': for name, value in attrs: if name=='href': self.next_link_text_pair = [value, ''] self.in_a = True break def handle_data(self, data): if self.in_a: self.next_link_text_pair[1] += data def handle_endtag(self, tag): if tag=='a': if self.next_link_text_pair is not None: if self.next_link_text_pair[0].startswith('/siteinfo/'): self.site_list.append(self.next_link_text_pair[1]) self.next_link_text_pair = None self.in_a = False if __name__=='__main__': p = MyHTMLParser() p.feed(urllib.urlopen('http://www.alexa.com/topsites/global').read()) print p.site_list[:20]

urls.py

urlpatterns = patterns('', url(r'^$', 'myapp.views.top_urls', name='home'), url(r'^admin/', include(admin.site.urls)), )

views.py

def top_urls(request): p = MyHTMLParser() p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read()) urls = p.site_list[:20] print urls return render(request, 'top_urls.html', {'urls': urls})

top_urls.html

... <tbody> {% for url in urls %} <tr> <td>Something</td> <td>{{ url }}</td> <td>something</td> </tr> {% endfor %} </tbody> ...

Answer 1

这个想法是创建某种状态机。在 starttag 事件中，我们决定稍后在 data 事件中填充站点信息的哪个数据字段。根据 <div>/<span> class 属性和 ATTR_FIELDS 映射做出决定。

例如，如果启动了 <div class="count"> 标签，那么我们将填充当前 self.site 字典的 rank 字段。

class MyHTMLParser(HTMLParser.HTMLParser):

    ATTR_FIELDS = {'count': 'rank',
                   'description': 'description', 'remainder': 'description'}

    def reset_site(self):
        self.site = {'rank': '', 'url': '', 'description': ''}
        self.in_site_listing = self.data_field = False

    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.reset_site()
        self.site_list = []

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get('class')
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        elif self.in_site_listing:
            if tag == 'a':
                if class_attr != 'moreDesc':
                    self.site['url'] = dict(attrs)['href'].replace(
                                                             '/siteinfo/', '')
            elif tag in ['div', 'span']:
                self.data_field = self.ATTR_FIELDS.get(class_attr)

    def handle_data(self, data):
        if self.data_field:
            self.site[self.data_field] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self.in_site_listing:
            self.site_list.append(self.site)
            self.reset_site()
        self.data_field = None

然后更改视图和模板：

view.py

def top_urls(request):
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
    sites = p.site_list[:20]
    return render(request, 'top_urls.html', {'sites': sites})

top_urls.html

...
<tbody>
    {% for site in sites %}
        <tr>
            <td>{{ site.rank }}</td>
            <td>{{ site.url }}</td>
            <td>{{ site.description }}</td>
        </tr>
    {% endfor %}
</tbody>
...

说明更新：

使用的变量：

self.site - 当前站点信息
self.in_site_listing' - flag is set to True if we are in the` 标签
self.data_field - 输入站点信息以添加数据
ATTR_FIELDS - <div>/<span> 类到站点信息键

关键方法是handle_starttag():

def handle_starttag(self, tag, attrs):
    # get the tag `class` attribute if any
    class_attr = dict(attrs).get('class')
    # if the tag is `<li class="site-listing">` then set the flag that we
    # should populate the site info
    if tag == 'li' and class_attr == 'site-listing':
        self.in_site_listing = True
    # we a in the site population mode
    elif self.in_site_listing:
        if tag == 'a':
            # `<li class="site-info">` contains two `<a>` tags. We should
            # use the tag withoud `class="moreDesc"` attribute to set the url
            if class_attr != 'moreDesc':
                self.site['url'] = dict(attrs)['href'].replace(
                                                         '/siteinfo/', '')
        elif tag in ['div', 'span']:
            # we are in the `<div>` or `<span>` tag. Get the `class` attribute
            # of the tag and decide which field of the site info we will
            # populate in the `handle_data()` method
            self.data_field = self.ATTR_FIELDS.get(class_attr)

所以 handle_data() 非常简单：

def handle_data(self, data):
    # if we know which field of site info should be populated
    if self.data_field:
        # append the data to this field. Site description is spread in several
        # tags this is why we append data instead of simple assigning.
        self.site[self.data_field] += data

如何从 alexa.com 解析 div 标签并在 django 中的 table 中显示结果

How to parse div tag from alexa.com and show results in table in django

django

django-templates

django-models

django-forms

django-views