R: Find pattern and get the values in between

Question

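I'm using readLines() to pull HTML code from a site. Nearly every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I want to extract the values between the <td> tags. I tried some patterns, for example: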

output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')

But the output only returns one value. Any idea how to do this?

Answer 1

string <- "<td>VALUE1<td>VALUE2<td>"

# One-liner: the lookbehind (?<=<td>) and lookahead (?=<td>) need perl = TRUE,
# and the backslash in \w must be doubled inside an R string.
regmatches(string, gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE))
# [[1]]
# [1] "VALUE1" "VALUE2"

# Step by step: use gregexpr() to get the match positions and the lengths.
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE)
indices
# This should be the result (further attributes omitted):
# [[1]]
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
#
# "5 15" means there are two matches: the first starts at index 5 and the
# second starts at index 15.
# "6 6" under match.length means each of the two matches is 6 characters long.

# Then pass the result to regmatches() to substring the string at these
# positions.
regmatches(string, indices)
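
Since the question reads many lines with readLines(), here is a minimal sketch of the same idea applied to a whole character vector at once. The `lines` vector is a made-up stand-in for the readLines() result, and the pattern only matches word characters, so values containing spaces or punctuation would need a different character class.

```r
# Hypothetical stand-in for the output of readLines(); the real lines
# would come from the site being scraped.
lines <- c("<td>VALUE1<td>VALUE2<td>",
           "<td>A<td>B<td>",
           "no cells here")

pattern <- "(?<=<td>)\\w+(?=<td>)"

# gregexpr() and regmatches() are vectorized over the input, so this
# returns a list with one character vector of matches per input line.
# Lines without a match give an empty character vector.
values <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))

# Flatten to a single character vector if the per-line grouping is not needed.
all_values <- unlist(values)
```

Keeping the list form is handy when each line corresponds to one table row, because the values stay grouped by the line they came from.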

Answer 2

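Have you looked at the "XML" package, which can extract tables from HTML? You may need to provide more context about the whole message you are trying to parse so that we can see whether it is suitable.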
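
For illustration only, a rough sketch of what the parser route could look like, assuming the XML package is installed and using a made-up, well-formed table fragment. The real page may be structured differently, so the XPath expression would likely need adjusting.

```r
# Assumes the XML package is available: install.packages("XML")
library(XML)

# Hypothetical sample markup, not taken from the asker's site.
html <- "<table><tr><td>VALUE1</td><td>VALUE2</td></tr></table>"

# Parse the string as HTML rather than treating it as a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Select every <td> node and pull out its text content.
cells <- xpathSApply(doc, "//td", xmlValue)
cells
```

A parser tolerates attributes, nested tags and line breaks inside cells, which the regular-expression approach does not handle.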

r

gsub