How to get at this data
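I want to scrape the three items that are highlighted and bordered in the HTML sample below. I have also highlighted some markup that looks useful.
How would you do it?
A solution
OK, this wasn't a great question, and I'm really surprised it didn't get downvoted more! Oh well, here is a breadcrumb for someone else.
Three of the four pieces of information I wanted are the inner text of span elements with known ids (e.g. $0.83 for "yfs_l10_gm150220c00036500"), so the helper function below seems to take a decent and straightforward shot at it: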
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text of the span element with the passed id, converted to a Double
'
' param doc: the source HTMLDocument
' param spanId: the id of the span element to read
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
    ' Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine As String
    sRoutine = cModule & ".GetSpanTextForId"
    CheckArgNotNothing doc, "doc"
    CheckArgNotBadString spanId, "spanId"
    ' Procedure
    Dim oSpan As HTMLSpanElement
    Set oSpan = doc.getElementById(spanId)
    Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
    ' Convert explicitly; the span text is a string such as "0.83"
    GetSpanTextForId = CDbl(oSpan.innerText)
    Exit Function
ErrHandler:
    Select Case DspErrMsg(sRoutine)
        Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
        Case Is = vbRetry: Resume 'Try again
        Case Is = vbIgnore: 'End routine
    End Select
End Function
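For comparison, the same lookup can be sketched in JavaScript. This is only an illustration and not part of the VBA module above: `getSpanNumberById` is a hypothetical name, and `doc` is assumed to be any object that exposes `getElementById` (a browser `document`, or the small stub used in the usage lines at the end).

```javascript
// Hypothetical helper: reads the text of the element with the given id
// and converts it to a number. Throws if the element is missing or the
// text is not numeric.
function getSpanNumberById(doc, spanId) {
    if (!doc || typeof doc.getElementById !== "function") {
        throw new TypeError("doc must expose getElementById");
    }
    var span = doc.getElementById(spanId);
    if (!span) {
        throw new Error("Could not find span with id: [" + spanId + "]");
    }
    // Drop a leading currency sign and thousands separators before parsing.
    var text = String(span.textContent).trim().replace(/^\$/, "").replace(/,/g, "");
    var value = Number(text);
    if (text === "" || isNaN(value)) {
        throw new Error("Span [" + spanId + "] does not contain a number: " + span.textContent);
    }
    return value;
}

// Usage with a stub standing in for a real document (an assumption for this demo).
var stubDoc = {
    getElementById: function (id) {
        return id === "yfs_b00_gm150220c00036500" ? { textContent: " 0.76 " } : null;
    }
};
var bid = getSpanNumberById(stubDoc, "yfs_b00_gm150220c00036500");
console.log(bid);
```

Failing loudly on a missing id mirrors the `Check` call in the VBA version, so a changed page layout shows up as an error rather than as a silent zero.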
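The only item that is not directly available from a span with a known id is Open Interest, which sits in a table that is the second child of an element with an id. The functions below return the text of the cell immediately after the cell containing the label I want (i.e. "Open Interest:").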
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetOpenInterest
'
' The latest available Open Interest.
'
' param doc: the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetOpenInterest(ByRef doc As HTMLDocument) As Long
    ' GetSummaryDataTable is not shown here
    Dim tbl As IHTMLTable
    Set tbl = GetSummaryDataTable(doc, 1)
    Dim k As Integer
    k = mWebScrapeHelpers.GetCellNumberForTextStartingWith(tbl, "Open Interest:")
    Check k >= 0, "Could not find a cell starting with: " & Bracket("Open Interest:")
    ' CLng rather than CInt: open interest can exceed the Integer maximum of 32,767
    GetOpenInterest = CLng(mWebScrapeHelpers.GetCellTextFromCellNumber(tbl, k + 1))
End Function
Function GetCellNumberForTextStartingWith(ByRef tbl As IHTMLTable, ByRef s As String) As Integer
    ' Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine As String
    sRoutine = cModule & ".GetCellNumberForTextStartingWith"
    CheckArgNotNothing tbl, "tbl"
    ' Procedure
    Dim tblCell As HTMLTableCell
    Dim k As Integer
    k = 0 ' zero-based index into tbl.Cells
    For Each tblCell In tbl.Cells
        ' s & "*" matches text that starts with s, as the function name promises
        If tblCell.innerText Like (s & "*") Then
            GetCellNumberForTextStartingWith = k
            Exit Function
        End If
        k = k + 1
    Next
    ' if we got here it was not found so
    GetCellNumberForTextStartingWith = -1
    Exit Function
ErrHandler:
    Select Case DspErrMsg(sRoutine)
        Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
        Case Is = vbRetry: Resume 'Try again
        Case Is = vbIgnore: 'End routine
    End Select
End Function
Function GetCellTextFromCellNumber(ByRef tbl As IHTMLTable, ByRef nbr As Integer) As String
    ' Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine As String
    sRoutine = cModule & ".GetCellTextFromCellNumber"
    CheckArgNotNothing tbl, "tbl"
    Check tbl.Cells.Length > 0, "table is empty"
    ' Cells is zero-based, so the last valid cell number is Length - 1
    Check nbr >= 0 And tbl.Cells.Length > nbr, "table only has " & tbl.Cells.Length & " cells; can't get cell number " & nbr
    ' Procedure
    GetCellTextFromCellNumber = tbl.Cells(nbr).innerText
    Exit Function
ErrHandler:
    Select Case DspErrMsg(sRoutine)
        Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
        Case Is = vbRetry: Resume 'Try again
        Case Is = vbIgnore: 'End routine
    End Select
End Function
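The label-then-next-cell lookup above can also be sketched in JavaScript on a plain array of cell texts. The names `findCellStartingWith`, `parseCount` and `getValueAfterLabel` are hypothetical, and the array is an assumption standing in for the table's cells in document order.

```javascript
// Returns the index of the first cell whose trimmed text starts with label, or -1.
function findCellStartingWith(cellTexts, label) {
    for (var i = 0; i < cellTexts.length; i++) {
        if (cellTexts[i].trim().indexOf(label) === 0) {
            return i;
        }
    }
    return -1;
}

// Parses a count such as "11,579" into the integer 11579.
function parseCount(text) {
    var digits = text.trim().replace(/,/g, "");
    if (!/^\d+$/.test(digits)) {
        throw new Error("Not a whole number: " + text);
    }
    return parseInt(digits, 10);
}

// Returns the numeric value of the cell that follows the label cell.
function getValueAfterLabel(cellTexts, label) {
    var k = findCellStartingWith(cellTexts, label);
    if (k < 0 || k + 1 >= cellTexts.length) {
        throw new Error("No value cell found after label: " + label);
    }
    return parseCount(cellTexts[k + 1]);
}

// Usage: the two cells of the Open Interest row from the sample markup.
var cells = ["Open Interest:", "11,579"];
var openInterest = getValueAfterLabel(cells, "Open Interest:");
console.log(openInterest);
```

Checking for a missing label before reading the next cell matters here: with a "not found" result of -1, reading cell `k + 1` would otherwise quietly return the first cell of the table.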
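These functions work fine, but it seems there are many different approaches that would work, including the regex parsing approach suggested in an answer. RedShift's excellent link analyzes the HTML in more depth and suggests strategies.
Cheers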
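I would probably use an XML parser to get the text content first (or something like xmlString.replace(/<[^>]+>/g, "") to replace all tags with an empty string), and then use the following regexes to extract the information you need: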
/-OPR\s+(\d+\.\d+)/
/Bid:\s+(\d+\.\d+)/
/Ask:\s+(\d+\.\d+)/
/Open Interest:\s+(\d+,\d+)/
This process can easily be done in nodejs (more info) or in any other language that supports regexes.
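The two steps above (strip the tags, then apply the four regexes) can be sketched for Node.js as follows. The sample markup is a shortened assumption based on the quote page, and the names `stripTags`, `extractQuote` and the result keys (`last`, `bid`, `ask`, `openInterest`) are hypothetical. Note that `\s+` requires at least one whitespace character between a label and its value, so this relies on the tags being separated by line breaks, as they are in the sample.

```javascript
// Shortened sample markup (an assumption; the real page is much larger).
var html = [
    '<span class="rtq_exch">',
    '<span class="rtq_dash">-</span>OPR',
    '</span>',
    '<span class="time_rtq_ticker">',
    '<span>0.83</span>',
    '</span>',
    '<th scope="row">Bid:</th>',
    '<td><span>0.76</span></td>',
    '<th scope="row">Ask:</th>',
    '<td><span>0.90</span></td>',
    '<th scope="row">Open Interest:</th>',
    '<td>11,579</td>'
].join("\n");

// The four patterns from the answer, keyed by hypothetical field names.
var patterns = {
    last: /-OPR\s+(\d+\.\d+)/,
    bid: /Bid:\s+(\d+\.\d+)/,
    ask: /Ask:\s+(\d+\.\d+)/,
    openInterest: /Open Interest:\s+(\d+,\d+)/
};

// Replaces every tag with an empty string, leaving only the text content.
function stripTags(markup) {
    return markup.replace(/<[^>]+>/g, "");
}

// Applies each pattern to the stripped text; a field is null when not found.
function extractQuote(markup) {
    var text = stripTags(markup);
    var result = {};
    Object.keys(patterns).forEach(function (name) {
        var match = text.match(patterns[name]);
        result[name] = match ? match[1] : null;
    });
    return result;
}

var quote = extractQuote(html);
console.log(quote);
```

The values come back as strings exactly as they appear on the page, so "11,579" still contains its thousands separator and would need cleaning before any arithmetic.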
Live demo:
- Wait 1 second, then remove the tags.
- Wait one more second, then find all the patterns and create a table.
var wait = true; // Set to false to execute instantly.
var elem = document.getElementById("parsingStuff");
var str = elem.textContent;
var keywords = ["-OPR", "Bid:", "Ask:", "Open Interest:"];
var output = {};
var timeout = 0;
if (wait) timeout = 1000;
setTimeout(function() { // Removing tags.
    elem.innerHTML = elem.textContent;
}, timeout);
if (wait) timeout = 2000;
setTimeout(function() { // Looking for patterns.
    for (var i = 0; i < keywords.length; i++) {
        // Backslashes are doubled because the pattern is built from a string literal.
        var match = str.match(new RegExp(keywords[i] + "\\s+(\\d+[.,]\\d+)"));
        output[keywords[i]] = match ? match[1] : "not found";
    }
    // Creating basic table of found data.
    elem.innerHTML = "";
    var table = document.createElement("table");
    for (var k in output) {
        var tr = document.createElement("tr");
        var th = document.createElement("th");
        var td = document.createElement("td");
        th.style.border = "1px solid gray";
        td.style.border = "1px solid gray";
        th.textContent = k;
        td.textContent = output[k];
        tr.appendChild(th);
        tr.appendChild(td);
        table.appendChild(tr);
    }
    elem.appendChild(table);
}, timeout);
<div id="parsingStuff">
<div class="yfi_rt_quote_summary" id="yfi_rt_quote_summary">
<div class="hd">
<div class="title">
<h2>GM Feb 2015 36.500 call (GM150220C00036500)</h2>
<span class="rtq_exch">
<span class="rtq_dash">-</span>OPR
</span>
<span class="wl_sign"></span>
</div>
</div>
<div class="yfi_rt_quote_summary_rt_top sigfig_promo_1">
<div>
<span class="time_rtq_ticker">
<span id="yfs_110_gm150220c00036500">0.83</span>
</span>
</div>
</div>
</div>
<div class="yui-u first yfi-start-content">
<div class="yfi_quote_summary">
<div id="yfi_quote_summary_data" class="rtq_table">
<table id="table1">
<tr>
<th scope="row" width="48%">Bid:</th>
<td class="yfnc_tabledata1">
<span id="yfs_b00_gm150220c00036500">0.76</span>
</td>
</tr>
<tr>
<th scope="row" width="48%">Ask:</th>
<td class="yfnc_tabledata1">
<span id="yfs_a00_gm150220c00036500">0.90</span>
</td>
</tr>
</table>
<table id="table2">
<tr>
<th scope="row" width="48%">Open Interest:</th>
<td class="yfnc_tabledata1">11,579</td>
</tr>
</table>
</div>
</div>
</div>
</div>