Grep html code between html tags containing a keyword in R
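In a file, I would like to use grep, or possibly the rm_between function from the qdapRegex package, to extract whole chunks of html code that contain a keyword, say "discount rate" for this example. Specifically, I want results that look like the following snippets: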
<P>This is a paragraph containing the words discount rate including other things.</P>
and
<TABLE width="400">
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
<tr>
<td>Discount Rate</td>
<td>10.0%</td>
</tr>
<tr>
<td>February</td>
<td></td>
</tr>
</TABLE>
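- The trick here is that it has to find "discount rate" first and then pull out the rest of the enclosing block.
- The match will always sit between <P> and </P> or between <TABLE and </TABLE>, and no other html tags.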
A good sample .txt file can be found here:
https://www.sec.gov/Archives/edgar/data/66740/0000897101-04-000425.txt
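For comparison, the regex route the question asks about can be sketched in base R. This is only a rough sketch: extract_blocks is a hypothetical helper name, the inline doc string stands in for the real filing, and it assumes every block is properly closed and that blocks of the same tag are never nested. Real filings may break those assumptions, which is why parsing the html is usually the safer choice.

```r
# Hypothetical helper: pull out every <tag ...>...</tag> block, then keep
# only the blocks that mention the keyword (case-insensitive).
extract_blocks <- function(text, tag, keyword) {
  # (?is): ignore case, and let '.' match newlines; '.*?' is lazy so each
  # match stops at the first closing tag.
  pattern <- sprintf("(?is)<%s\\b[^>]*>.*?</%s>", tag, tag)
  blocks <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # Literal, case-insensitive keyword test on each extracted block
  blocks[grepl(tolower(keyword), tolower(blocks), fixed = TRUE)]
}

# Small inline stand-in for the real file (assumption, not the SEC filing)
doc <- paste(
  "<P>This is a paragraph containing the words discount rate including other things.</P>",
  "<P>An unrelated paragraph.</P>",
  "<TABLE width=\"400\">",
  "<tr><td>Discount Rate</td><td>10.0%</td></tr>",
  "</TABLE>",
  sep = "\n"
)

p_hits <- extract_blocks(doc, "p", "discount rate")
table_hits <- extract_blocks(doc, "table", "discount rate")
```

Both steps are separate on purpose: extracting all blocks first and filtering afterwards avoids a single pattern that has to anchor on the keyword and then backtrack to the opening tag.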
You can treat the file as html and explore it as if you were scraping it with rvest:
library(rvest)
library(stringr)
# Extract the html from the file
html <- read_html('~/Downloads/0000897101-04-000425.txt')
# Get all the 'p' nodes (you can do the same for 'table')
p_nodes <- html %>% html_nodes('p')
# Get the text from each node
p_nodes_text <- p_nodes %>% html_text()
# Find the nodes that have the term you are looking for
match_indeces <- str_detect(p_nodes_text, fixed('discount rate', ignore_case = TRUE))
# Keep only the nodes with matches
# Notice that I remove the first match because rvest adds a
# 'p' node to the whole file, since it is a text file
match_p_nodes <- p_nodes[match_indeces][-1]
# If you want to see the results, you can print them like this
# (or you could send them to a file)
# seq_along() avoids iterating over 1:0 when there are no matches
for (i in seq_along(match_p_nodes)) {
  cat(paste0('Node #', i, ': ', as.character(match_p_nodes[i]), '\n\n'))
}
For the <table> tags, you don't remove the first match:
table_nodes <- html %>% html_nodes('table')
table_nodes_text <- table_nodes %>% html_text()
match_indeces_table <- str_detect(table_nodes_text, fixed('discount rate', ignore_case = TRUE))
match_table_nodes <- table_nodes[match_indeces_table]
for (i in seq_along(match_table_nodes)) {
  cat(paste0('Node #', i, ': ', as.character(match_table_nodes[i]), '\n\n'))
}
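Since the 'p' and 'table' steps are the same apart from the tag name, they could be folded into one function. The following is a minimal sketch, not part of rvest: find_nodes_with is a hypothetical helper name, and the inline html string is a stand-in for the real file.

```r
library(rvest)

# Hypothetical helper: return the html source of every <tag> node whose
# text contains `keyword`, ignoring case.
find_nodes_with <- function(doc, tag, keyword) {
  nodes <- html_nodes(doc, tag)
  hits <- grepl(tolower(keyword), tolower(html_text(nodes)), fixed = TRUE)
  as.character(nodes[hits])
}

# Small inline document (assumption, not the SEC filing)
doc <- read_html(paste0(
  "<html><body>",
  "<p>This is a paragraph containing the words discount rate.</p>",
  "<p>An unrelated paragraph.</p>",
  "<table><tr><td>Discount Rate</td><td>10.0%</td></tr></table>",
  "</body></html>"
))

p_hits <- find_nodes_with(doc, "p", "discount rate")
table_hits <- find_nodes_with(doc, "table", "discount rate")
```

With the sample filing you would still drop the first 'p' match, for the reason given in the comments above; the helper does not do that for you.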