R: Find a pattern and get the values in between
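I am using readLines() to pull HTML code from a site. Almost every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I want to get the values between the <td> tags. I tried a few expressions, for example:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
But the output returns only one value. Any idea how to do this?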
string <- "<td>VALUE1<td>VALUE2<td>"
# One-liner: extract every run of word characters that sits between two <td> tags
regmatches(string, gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE))
# The same thing step by step.
# Use gregexpr() to get the start positions and lengths of the matches
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE)
# Printing indices should show (remaining attributes such as "useBytes" omitted):
# [[1]]
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
# The first vector means there are two matches: the first starts at index 5
# and the second starts at index 15.
# "match.length" means the first match has length 6 and the second match
# also has length 6.
# Then pass these match positions to regmatches() to
# substring your string at these indices
regmatches(string, indices)
# [[1]]
# [1] "VALUE1" "VALUE2"
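Since readLines() returns one string per line, the same idea can be applied to a whole character vector at once. The following is only a sketch under assumptions: the `lines` vector is made-up sample data standing in for the real page, and `[^<]+` is swapped in for `\\w+` on the guess that real cell values may contain spaces or punctuation.

```r
# Hypothetical sample data, standing in for the result of readLines()
lines <- c("<td>VALUE1<td>VALUE2<td>",
           "<td>A 1<td>B-2<td>C<td>",
           "no cells here")

# [^<]+ matches any run of characters up to the next "<", so values with
# spaces or punctuation are kept (an assumption about the real data)
pattern <- "(?<=<td>)[^<]+(?=<td>)"

# gregexpr() is vectorised over its input; regmatches() then returns a list
# with one character vector of matches per input line
values <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))

print(values)
```

Lines without any match yield an empty character vector, so the list stays aligned with the input lines; `unlist(values)` flattens everything into a single vector if the per-line grouping is not needed.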
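Have you looked at the "XML" package, which can extract tables from HTML? You may need to give more context about the whole message you are trying to parse so we can see whether it is a good fit.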