从站点抓取某些字段时遇到问题

Question

我在 vba 中编写了一个脚本，使用它我可以从特定站点解析 "Company Name"、"Phone"、"Fax" 和 "Email"，但是在抓取 "Address"、"Web" 和 "Name" 的情况下，我卡住了。我在 vba 中使用 responsetext 和 split 方法编写了脚本。希望有人能告诉我一个解决方法。

这是我尝试过的：

str = Split(http.responseText, " class=""contact-details block dark"">")
y = UBound(str)
    For i = 1 To y
        Cells(x, 1) = Split(Split(str(i), "Company Name:")(1), "<")(0)
        Cells(x, 2) = Split(Split(str(i), "Phone:")(1), "<")(0)
        Cells(x, 3) = Split(Split(str(i), "Fax:")(1), "<")(0)
        Cells(x, 4) = Split(Split(str(i), "mailto:")(1), ">")(0)
        x = x + 1
    Next i

这是 html 元素内容：

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: PPEHeads Australia<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Web: <a target="_blank" href="http://www.ppeheads.com.au">http://www.ppeheads.com.au</a></p><h4>Address</h4><p>Unit 2 / 4 Reaghs Farm Road<br>MINTO<br>NSW<br>2566</p><h4>Contact</h4><p>Name: Alan Hadfield<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Email: <a href="mailto:alan@ppeheads.com.au">alan@ppeheads.com.au</a></p>
            </div>

Answer 1

下次请提供您的其余代码，因为问题可能不在您认为的地方。还好我找到了你之前的post

如果仔细观察，您的 html 元素中有 3 个 p 标签：

第一个是联系公司的详细信息，您可以通过

获得

Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)

第二个是地址详细信息，您可以通过

获得

Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)

第三个是联系人详细信息，您可以通过

获得

Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)

注意代码末尾的 (0)、(1)、(2) 更改，它给出了 p 标签的出现顺序。

我修改了您以前的代码并注释了更改，以便您可以看到不同之处：

Sub RestData()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim ele, ele2, ele3 As Object, post As Object 'declare
Dim TypeDetails() As String
Dim TypeDetails3() As String 'declare
Dim TypeDetail() As String
Dim i As Long, r As Long
With CreateObject("MSXML2.serverXMLHTTP")
    .Open "GET", "http://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG0120000508&folderid=1736", False
    .send
    html.body.innerHTML = .responseText
End With

'get all the p elements
Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)
Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)
Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)
r = 2

'split from line feed
TypeDetails() = Split(ele.innerText, Chr(10))
TypeDetails3() = Split(ele3.innerText, Chr(10))

'This part goes for Contact Company Details, notice the operator is ": ",not ":"
For i = 0 To UBound(TypeDetails())
    TypeDetail() = Split(TypeDetails(i), ": ")
    Cells(r, 1) = VBA.Trim(TypeDetail(0))
    Cells(r, 2) = VBA.Trim(TypeDetail(1))
    r = r + 1
Next i

'This part goes for Address Details, replaced new line with " " for it to be in the same line
Cells(r, 1) = "Address"
Cells(r, 2) = Replace(ele2.innerText, vbLf, " ")
r = r + 1

'This part goes for Contact Person Details
For i = 0 To UBound(TypeDetails3())
    TypeDetail() = Split(TypeDetails3(i), ": ")
    Cells(r, 1) = VBA.Trim(TypeDetail(0))
    Cells(r, 2) = VBA.Trim(TypeDetail(1))
    r = r + 1
Next i

Set html = Nothing: Set ele = Nothing: Set docs = Nothing
End Sub

希望对您有所帮助

从站点抓取某些字段时遇到问题

Got stuck scraping certain fields from a site

vba

split

responsetext

web-scraping