Web scraping this scheme

EDIT

Web page: https://eresearch.fidelity.com/eresearch/evaluate/snapshot.jhtml?symbols=RDS%2FA

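I am trying to use an HTMLDocument to extract some stock information from the HTML below (I'm pretty sure it is an anchor element with an href, but..). I thought I had it with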

doc.getElementsByTagName("a")(158).innerText

but then found that for some stocks the text is at index 158 and for others it is at index 159. I also tried

doc.getElementById("busDesc-more")

 doc.getElementsByTagName("h3")

Both seem to get me into the right neighborhood, but I don't know where to go from there.

How would an experienced scraper approach this?

HTML

<DIV id=busDesc>
<P>Google Inc., a technology company, builds products and provides services to organize the information. The company offers Google Search, which provides information online; Knowledge Graph that allows to search for things, people, or places, as well…</P>
<DIV class=spacing-div_10X0></DIV><A href="javascript:viewMore('busDesc');"><IMG src="https://scs.fidelity.com/common/application/etf/14.10/images/plus_blue.gif"> View more </A></DIV>
<DIV id=busDesc-more class=hidden>
<P>Google Inc., a technology company, builds products and provides services to organize the information. The company offers Google Search, which provides information online; Knowledge Graph that allows to search for things, people, or places, as well as builds systems that recognize speech and understand natural language; Google Now, which provides information to users when they need it; and Product Listing Ads that offer product image, price, and merchant information. It also provides AdWords, an auction-based advertising program; AdSense, which enables Websites that are part of the Google Network to deliver ads; Google Display, a display advertising network; DoubleClick Ad Exchange, a marketplace for the trading display ad space; and YouTube that offers video, interactive, and other ad formats. In addition, the company offers Android, an open source mobile software platform; hardware products, including Chromebook, Chrome, Chromecast, and Nexus devices; Google+ to share things online with people; Google Play, a cloud-based digital entertainment store for apps, music, books, and movies; Google Drive, a place for users to create, share, collaborate, and keep their stuff; and Google Wallet, a virtual wallet for in-store contactless payments. Further, it provides Google Apps, which include Gmail, Calendar, and Google Sites that are built for people to work anywhere, anytime, on any device without loss of security or control; Google Maps Application Programming Interface; and Google Earth Enterprise, a software solution for imagery and data visualization. Additionally, the company offers Google App Engine, a platform as a service offering; Google Cloud Storage; Google BigQuery for real time analytics; Google Cloud SQL for structured query language; and Google Compute Engine, an infrastructure as a service platform. It also offers mobile wireless devices, and related products and services. Google Inc. 
was founded in 1998 and is headquartered in Mountain View, California.</P>
<DIV class=spacing-div_10X0></DIV><A href="javascript:viewLess('busDesc');"><IMG src="https://scs.fidelity.com/common/application/etf/14.10/images/minus_blue.gif"> View less </A></DIV>
<DIV class=spacing-div_15X0></DIV>
<DIV class=dark-grey-hr>
<DIV class=hr-for-ie></DIV></DIV>
<DIV class=spacing-div_13X0></DIV>
<DIV class=sub-heading>
<H3>Sector (GICS®)</H3><SPAN class=right><A href="http://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&amp;sector=45">Information Technology</A></SPAN> </DIV>
<DIV class=clear-both></DIV>
<DIV class=spacing-div_13X0></DIV>
<DIV class=dark-grey-hr>
<DIV class=hr-for-ie></DIV></DIV>
<DIV class=spacing-div_13X0></DIV>
<DIV class=sub-heading>
<H3>Industry (GICS®)</H3><SPAN class=right><A href="http://eresearch.fidelity.com/eresearch/markets_sectors/sectors/industries.jhtml?tab=learn&amp;industry=451010">Internet Software &amp; Services</A></SPAN> 

GOAL

Get "Information Technology" from
<A href="http://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&amp;sector=45">Information Technology</A>

** FINAL UPDATE **

Based on Kerry's answer (and Matteo's edit) I ended up with the code below, which has worked consistently for nearly 200 stocks:

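Private Function GetAnchorTextForSubHeading(ByRef headerNbr As Integer, ByRef doc As HTMLDocument) As String

   Dim tags As IHTMLElementCollection
   Dim anchors As IHTMLElementCollection

   ' All elements with class "sub-heading"; headerNbr is a zero-based index into them
   Set tags = doc.getElementsByClassName("sub-heading")
   ' Anchors nested inside the chosen sub-heading block
   Set anchors = tags(headerNbr).getElementsByTagName("a")
   ' Text of the first anchor, e.g. "Information Technology"
   GetAnchorTextForSubHeading = anchors(0).innerText

End Function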

Assuming it is the first element with the .sub-heading class on the page, this should do it.

Set tags = doc.getElementsByClassName("sub-heading")
' getElementsByTagName returns a collection, so index into it before reading innerText
yourdata = tags(0).getElementsByTagName("A")(0).innerText

UPDATE

Based on feedback that .sub-heading is not unique, changed the code to take the first instance of .sub-heading, and fixed the typo ByID to ByClassName.
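Since .sub-heading is not unique, one possible variation is to find the block by its H3 text instead of by position. The following is only a sketch under assumptions: the function name GetAnchorTextByHeading is hypothetical (it is not from the question or the answer), and it assumes each heading and its anchor share the same parent element, as in the HTML excerpt above. It needs a reference to the Microsoft HTML Object Library, like the code in the question.

```vb
Option Explicit

' Hypothetical helper (not part of the original posts).
' Returns the text of the first anchor that shares a parent element with
' the first <h3> whose text contains headingText (case-insensitive).
' Returns an empty string when no such heading or anchor is found.
Private Function GetAnchorTextByHeading(ByVal headingText As String, ByVal doc As HTMLDocument) As String

    Dim heading As Object
    Dim anchors As Object

    For Each heading In doc.getElementsByTagName("h3")
        If InStr(1, heading.innerText, headingText, vbTextCompare) > 0 Then
            ' Assumption: the anchor lives under the same parent as the <h3>,
            ' which is the <div class=sub-heading> in the excerpt.
            Set anchors = heading.parentElement.getElementsByTagName("a")
            If anchors.Length > 0 Then
                GetAnchorTextByHeading = anchors(0).innerText
                Exit Function
            End If
        End If
    Next heading

End Function
```

With an already loaded doc, a call such as GetAnchorTextByHeading("Sector", doc) or GetAnchorTextByHeading("Industry", doc) would then not depend on how many .sub-heading blocks come earlier on the page, though it does depend on the heading wording staying the same.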