Python crawler: BeautifulSoup decompose() function

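I am using Python and BeautifulSoup, and I cannot get the decompose() function to work in the crawler I am building.

The problem is as follows. I am trying to get all the specification data for a product on a website (you can see it in the page source):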

soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')
dt = soup.findAll('dt', {'class': 'product-specs--item-title'})

for i in range(0, len(dt)):

    dtRows = dt[i]
    dtRowsStrip = dtRows.text.strip()

    print(dtRows.text.strip())

    # print(dtRows)

    # dtRowsSplit = "".join(dtRowsStrip.split())
    # print(dtRowsSplit)

When I use:

> print(dtRows.text.strip())

the output I get is:

Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
Serie           


        Serie
Socket          


        Socket
Codenaam            


        Codenaam
Threads         


        Threads
Turbo Frequency         


        Turbo Frequency
Multiplier unlocked         


        Multiplier unlocked
Cache           


        Cache
Geheugencontroller          


        Geheugencontroller
etc ....

The first block of lines is correct. In the second block, I get doubled values because of the <a> tag inside each <dt> tag.

Here is an example:

<dt class="product-specs--item-title">
    <a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="Zowel AMD als Intel produceren processoren in verschillende series. Een serie is bedoeld voor bepaald gebruik. Zo zijn Core i3 processoren geschikt voor internet &amp; office werkzaamheden en Core i7 processoren voor veeleisende multitasking en gaming. Binnen een serie zijn verschillende modellen processoren verkrijgbaar. Van welke serie is deze processor onderdeel?"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
    <span>Serie</span>
</dt>

Can anyone help me remove the complete <a> tag?

Additional information:

If I use the code below:

soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')

for spec in soup.select('dt.product-specs--item-title'):
    print(spec.get_text(strip=True))

The output is as follows:

Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
SerieSerie
SocketSocket
CodenaamCodenaam
ThreadsThreads
Turbo FrequencyTurbo Frequency
Multiplier unlockedMultiplier unlocked
CacheCache
GeheugencontrollerGeheugencontroller
ProductieprocesProductieproces
Stroomverbruik maximaalStroomverbruik maximaal
KloksnelheidKloksnelheid
ProcessorkernenProcessorkernen
Type GPUType GPU

As you can see, from the second <dl> block onward I get doubled values.
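A minimal sketch of why the titles double up, assuming markup like the <dt> sample above (the inline HTML string is a trimmed copy with the title attribute shortened, not the live page): get_text() joins every text node under the tag, so the help-title span inside the <a> and the plain <span> both contribute.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <dt> sample from the question (title attribute shortened)
HTML = """
<dt class="product-specs--item-title">
    <a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="..."><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
    <span>Serie</span>
</dt>
"""

dt = BeautifulSoup(HTML, "html.parser").dt

# Every non-empty text node below the <dt>, with whitespace stripped
pieces = list(dt.stripped_strings)

# get_text(strip=True) concatenates those same pieces with no separator
joined = dt.get_text(strip=True)

print(pieces)
print(joined)
```

Both text nodes belong to the same <dt>, so any call that collects all descendant text will see the title twice.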

Addendum: Thanks, I had just found this as well. I know your code is better, but I just wanted to share my solution:

for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))

    # The value is in the <dd> that follows the enclosing <dt>
    parent = spec.find_parent('dt')
    value = parent.find_next_sibling("dd", {'class': 'product-specs--item-spec'})
    print(value.text.strip())

You just need to be more specific about which nodes you want to extract:

from urllib.request import urlopen  # Python 3 standard library
from bs4 import BeautifulSoup

# Naming the parser avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(urlopen('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html'), 'html.parser')

for spec in soup.select('div.product-specs > dl.product-specs--list > dt.product-specs--item-title'):
    print(spec.get_text(strip=True))

Prints:

Serie
Threads
Socket
Kloksnelheid

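Here we basically match only the <dt> elements of the <dl> that is a direct child of div.product-specs.

If you need to get all the product specs and avoid duplicates, go one level down to the span with class="product-specs--help-title":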
for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))

Prints:

Serie
Socket
Codenaam
Threads
Turbo Frequency
Multiplier unlocked
Cache
Geheugencontroller
Productieproces
Stroomverbruik maximaal
Kloksnelheid
Processorkernen
Type GPU
Koeler meegeleverd
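The difference between the two selectors can be sketched offline. The markup below is a hypothetical, heavily trimmed stand-in for the page (the product-specs--group wrapper in particular is an assumption, and the real nesting may differ); it only illustrates how the child combinator (>) and the descendant combinator (a space) behave.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's snippet; the wrapper div
# with class "product-specs--group" is invented for illustration.
HTML = """
<div class="product-specs">
  <dl class="product-specs--list">
    <dt class="product-specs--item-title">Merk</dt>
    <dd class="product-specs--item-spec">Intel</dd>
  </dl>
  <div class="product-specs--group">
    <dl class="product-specs--list">
      <dt class="product-specs--item-title">
        <a class="product-specs--help-icon" href="#spec_Serie"><span class="product-specs--help-title">Serie</span></a>
        <span>Serie</span>
      </dt>
      <dd class="product-specs--item-spec">Core i7</dd>
    </dl>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "A > B" matches B only when it is a direct child of A,
# so the nested <dl> is not reached.
child_only = [
    dt.get_text(strip=True)
    for dt in soup.select(
        "div.product-specs > dl.product-specs--list > dt.product-specs--item-title"
    )
]

# "A B" matches B at any depth below A; selecting the help-title span
# picks exactly one text node per title, so nothing is doubled.
help_titles = [
    span.get_text(strip=True)
    for span in soup.select(
        "div.product-specs dl.product-specs--list > "
        "dt.product-specs--item-title span.product-specs--help-title"
    )
]

print(child_only)
print(help_titles)
```

Note that the second selector only finds titles that have a help link, which is why the plain titles from the first block do not appear in its output.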

Here is how to get the specs as name: value pairs:

from urllib.request import urlopen  # Python 3 standard library
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html'), 'html.parser')

for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title'):
    # spec.span is the first <span> under this <dt>; titles without one are skipped
    name = spec.span
    if not name:
        continue

    # The value is in the <dd> that follows this <dt>
    value = spec.find_next_sibling('dd', class_='product-specs--item-spec')
    print(name.get_text(strip=True), value.get_text(strip=True))

Prints:

Serie Core i7
Socket 1150
Codenaam Haswell Refresh
Threads 8
Turbo Frequency 4400 MHz
Multiplier unlocked Ja
Cache 8 MB
Geheugencontroller DDR3-1600
Productieproces 22 nm
Stroomverbruik maximaal 88 watt
Kloksnelheid 4000 MHz
Processorkernen Quad-core
Type GPU Intel HD Graphics 4600
Koeler meegeleverd Ja
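For completeness, the decompose() route the question asked about could look like the sketch below. It is not part of the answer above: the inline HTML is a trimmed, partly hypothetical copy of the page's markup (the "Merk" row without a help link is assumed), and the specs dictionary is an illustrative name. decompose() removes a tag and its contents from the tree, so the <dt> is left with a single copy of its title.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page: one title with a help link, one without.
HTML = """
<dl class="product-specs--list">
  <dt class="product-specs--item-title">
    <a class="product-specs--help-icon js-tooltip" href="#spec_Serie"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
    <span>Serie</span>
  </dt>
  <dd class="product-specs--item-spec">Core i7</dd>
  <dt class="product-specs--item-title">Merk</dt>
  <dd class="product-specs--item-spec">Intel</dd>
</dl>
"""

soup = BeautifulSoup(HTML, "html.parser")

specs = {}
for dt in soup.select("dt.product-specs--item-title"):
    # Drop the help link (and everything inside it) if this title has one
    help_link = dt.find("a")
    if help_link is not None:
        help_link.decompose()

    # The matching value is in the next <dd> sibling; skip titles without one
    dd = dt.find_next_sibling("dd", class_="product-specs--item-spec")
    if dd is not None:
        specs[dt.get_text(strip=True)] = dd.get_text(strip=True)

print(specs)
```

Unlike selecting the help-title span, this handles titles with and without a help link in one loop, at the cost of modifying the parsed tree in place.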